MEDAL Events Methods: Advanced data retrieval for corpus linguistics

Regular expressions, part-of-speech annotation, and constituency parsing

This hands-on workshop introduces methods for retrieving complex patterns from corpora using R. We will start with the basics of regular expressions (regex), which let analysts identify abstract patterns in text data by using special characters known as metacharacters.

Next, we will explore part-of-speech annotation, which helps in extracting conventional linguistic structures such as the progressive aspect and the passive voice. Following this, I will cover syntactic parsing, focusing on constituency parsing, and demonstrate how to identify patterns in parsed texts using Tregex, a query language for parsed trees. We will look at its basic use in identifying common patterns in English (e.g., phrasal coordination).

If time permits, we will also see how constituency parsing and Tregex are used to calculate syntactic complexity measures in the L2 Syntactic Complexity Analyzer (Lu, 2010). The workshop will primarily use R for text processing, and I will introduce some common and useful functions from the tidyverse, specifically the stringr package.

When? 23 September 14:15-15:45 EE and 24 September 12:15-13:45 EE

Where? This is an in-person event at the University of Tartu, in Lossi 3-425.

Click here to register!

Schedule (EE time zone):

Monday 23 September, 14:15-15:45

First session

Tuesday 24 September, 12:15-13:45

Second session

About the instructor

Dr. Akira Murakami from the University of Birmingham specializes in second language acquisition and corpus linguistics. He brings the two areas together so that developmental research in second language acquisition can benefit from large-scale corpus data.