Language Variation through the Lens of Web Data
The rise of social media has produced an unprecedented quantity of user-generated data, such as text on Twitter and speech and video on YouTube. This content is often associated with demographic information (the gender, geographic location, ethnicity, and social network connections of the author), which opens up the opportunity to study language variation from a corpus-based "big data" point of view.
This class will introduce relevant technologies in machine learning, text and signal processing, and statistics, with a view towards applying these methods to study language variation. For example, can we identify when a certain linguistic feature entered a community? Which gender was responsible for the adoption of that feature? What kinds of language contact phenomena are observed in mixed and immigrant populations in the US and elsewhere? Students will gain exposure to the relevant machine learning and statistical ideas, and practice writing programs to mine and analyze linguistic data from web sources.