MEDAL Summer School in Corpus Linguistics
Each day of the summer school will include a plenary lecture. Our plenary speakers are Natalia Levshina (Max Planck Institute for Psycholinguistics), Anke Lüdeling (Humboldt-Universität zu Berlin), Amanda Potts (Cardiff University) and Peter Uhrig (Technische Universität Dresden). Each lecture is followed by parallel workshops on topics in statistics, such as descriptive statistics, data visualisation, cluster analysis with random forests and multi-level models. And make sure to stay until the end, because the fifth day will include a social event connected with midsummer celebrations in Estonia!
If you are an early career researcher, we strongly encourage you to participate! You will be able to present your research in posters and flash talks, and meet with instructors to discuss your research in person. Affiliated partner mobility grants are available here, and the guidelines can be found here. Note that MEDAL mobility funding for the summer school is only open to MA/MSc students, PhD students and ECRs with a MEDAL affiliation!
After registering for the summer school, you should receive a confirmation by email.
We are looking forward to seeing you this summer in Tartu! In the meantime, please stay tuned and spread the word!
We are pleased to announce that registration for the MEDAL Summer School in Corpus Linguistics is open here. The event will take place at the University of Tartu, Estonia, from 19 to 23 June 2023. The preliminary program for the summer school can be found here. The deadline for registration is 19 May. To stay informed about our events and research, join our mailing list here or bookmark our homepage here.
Workshops and social events
Descriptions of the various workshops and social events. Workshops take place after the plenary lecture and lunch. Social events usually take place in the evenings.
Monday June 19th
Deep Learning and artificial neural networks are among the most common algorithms playing a key role in AI applications in our daily lives. However, the use of Deep Learning has been rather limited in linguistics, including research on grammatical variation. This workshop aims to fill this gap. I will first introduce the basic concepts, such as layers, weights, loss function and backpropagation. I will then present case studies in which I use neural networks to test theoretical hypotheses about the genitive alternation and the role of information structure in determining word order. I will also demonstrate how one can perform Deep Learning with the help of Keras, a convenient, intuitive API for TensorFlow.
Collecting and analysing naturalistic data from children, especially from children engaging in conversations with others in the home or community, is challenging. In this workshop we will a) cover some basic principles of collecting child language data, including how to deal with ethical and legal issues in different countries and communities, b) introduce you to the most common transcription and analysis systems, with a focus on the CHILDES programs (CHAT and CLAN; see https://childes.talkbank.org/), and c) finish by briefly introducing you to some of the new automated tools available, including LENA (https://www.lena.org/) and a new pipeline we are developing for semi-automated annotation using Whisper (https://openai.com/research/whisper).
This workshop provides an introduction to ELAN (tla.mpi.nl/tools/tla-tools/elan/), a tool used to annotate multi-media recordings. It is intended for people who have little or no experience with ELAN. We will cover terminology as well as theoretical and practical considerations necessary to set up your own set of ELAN files for a linguistics project, thus enabling you to create your own ELAN corpus of linguistically annotated audio/video recordings. In addition to completing practice exercises, participants will create their own initial ELAN files. We will also consider the strengths and weaknesses of using ELAN as a corpus search tool.
In the evening of day 1, we will gather for a social exploration of Tartu city centre from 18:00 to 19:30. Our guide will take us on a tour through Tartu city streets and history. You will have a chance to get to know each other, enjoy the sights and perhaps have a pint after the tour. Meet in front of the university main building at Ülikooli 18.
Attire: casual
Tuesday June 20th
This workshop is concerned with different types of variation and their analysis. Many corpus studies rely on annotated corpora. In most annotation tasks we find cases that are difficult to decide, and the more interesting a linguistic problem is, the more difficult a decision may be. Rather than viewing such difficult annotation tasks as a problem, we will work on topics like variation, concept building (tagsets), research questions, and modelling. We will distinguish between different types of problems, such as (a) unclear research questions, (b) unclear/vague theories, (c) pre-trained annotation procedures with unsuitable parameter sets, (d) genuine ambiguity, etc., and discuss what can be learned from each of these cases. Students are encouraged to bring their own research questions and examples.
This workshop is concerned with tracking second language learners’ development in the use of multi-word units such as collocations, binomials, lexical bundles and multi-morphemic expressions. We will critically analyse corpus-based association measures, focusing specifically on phrasal frequencies and commonly used measures of collocation strength such as mutual information, Log Dice, Delta P and lexical gravity. We will examine the extent to which learners’ proficiency levels affect their use of multi-word units through the lens of the above-mentioned measures of association. We will then explore the similarities and differences between multi-word units in morphologically isolating languages like English and multi-morphemic units in morphologically rich (i.e., agglutinating) languages like Turkish and Estonian.
This workshop introduces techniques for exploring and manipulating linguistic and other data using R and in particular the tidyverse packages, including ggplot2 for visualization, and additional packages like plotly for producing interactive graphs. The workshop also integrates ChatGPT as a coding assistant to expedite learning. Basic familiarity with R is expected, but beginners are otherwise very welcome.
In the evening of day 2, we will meet for socialising and a light reception. The reception will take place from 18:30 until 21:00 at Ülikooli Kohvik (located at Ülikooli 20). It's a great opportunity for networking and getting to know each other further.
Attire: smart casual
Wednesday June 21st
In this two-part workshop, participants will be introduced to the web-based corpus analysis tool Sketch Engine. Sketch Engine is a powerful tool that allows users to upload their own corpora in nearly any language and applies advanced part-of-speech tagging. In Part 1 of this workshop, participants will be introduced to the fundamentals of Sketch Engine, uploading their own data and applying corpus linguistic methods. In Part 2 of this workshop, participants will explore more advanced resources, including the distinctive Word Sketch feature, which makes use of part-of-speech tags and collocation to visualise the grammatical ‘behaviour’ of a lemma in a given corpus. By the end of the workshop, participants will be able to perform frequency, concordance, collocation, and keyness analysis in Sketch Engine using their own data. They will be able to describe discourses and representations of social actors and/or phenomena within the corpus (for instance, by comparing alternative phrasings) and in relation to other contexts (i.e. in comparison to reference corpora).
A large share of Estonian historical newspapers has been digitised and made available for research (roughly 25%). This can be a powerful resource for linguists as well as for historians, literary scholars, and social scientists. The workshop will demonstrate the resources available for text analytics on historical newspaper texts, particularly those offered by the National Library of Estonia. It will provide: 1) an introduction to how the materials can be accessed (via a JupyterLab environment and otherwise); 2) an overview of the tools and helpful visualisations available for planning your study; and 3) simple techniques to analyse historical texts based on keyword searches, frequency analysis, and co-occurrence patterns. Historical digitised newspapers bring a few extra technical difficulties: 1) technical errors made in digitisation (e.g. OCR errors), 2) variation in language use, 3) imbalance in the datasets. These issues will be discussed and a few solutions offered. The workshop will take 1.5 h + 1.5 h. The workshop code will be written in R, and some knowledge of R will be useful. However, changing a few parameters in the provided code is possible at a superficial level even without prior training. Estonian.
The main goal of the workshop is to introduce participants to CLARIN, the research infrastructure for language as social and cultural data. The workshop will present an overview of CLARIN and of how its resources support corpus-linguistics-based research all over Europe. After the overview, participants will be able to familiarize themselves with the CLARIN infrastructure through a few hands-on learning tasks. The second part of the workshop presents naturalistic neuroscience methods that utilize natural language processing and imaging for studying the brain basis of meaning.
Thursday June 22nd
In this workshop, participants will learn step by step how to carry out their own multimodal corpus study based on the NewsScape 2016 corpus, a collection of more than 30,000 hours of American TV News. The workshop will start with a discussion of the types of research questions that might be addressed with such a corpus approach, followed by a hands-on session introducing CQPweb and the Rapid Annotator. Students should bring a laptop with a working Internet connection and ideally headphones/earphones they can connect to their computer. Students are invited to send ideas for potential research questions via email before the workshop.
In this workshop you will learn the basics of how to use ELAN, a free annotation software, for coding and annotating multimodal communication corpora. The workshop is divided into three blocks: theoretical foundations of gesture, ELAN tutorial, and hands-on practice. Participants will first gain theoretical knowledge about different types of gestures, their structure, their interaction with speech and role in discourse. We will then use this theoretical foundation in a step-by-step tutorial in ELAN software in order to learn how to create and structure annotation tiers, segment and code gestures as well as how to use the coding for analysing and visualising data. In the final part of the workshop, you will engage in hands-on practice, applying your newly gained theoretical and practical knowledge. By the end of the workshop, you will be equipped with the skills that will enable you to conduct your own research on multimodal communication.
Analyses of corpus data often end up either taking a qualitative, manually coded approach or relying on automated methods and computer vision to extract and summarize data. However, manual coding and computer-vision-based approaches can be highly complementary and work very well together. In this workshop, I will provide an introduction to using manual coding to focus and inform automated methods, which in turn can provide a rich method of analysis. Specifically, we will cover 1) easy-to-use automated movement detection to speed up manual coding of visual signals, 2) automatically extracting movement data using manual annotations, 3) quantifying the temporal relationship between visual and linguistic or acoustic signals. The workshop will include tutorial-like walkthroughs using open code and materials, as well as open discussions of current issues and future directions.
Over the past 15 years, multilevel or mixed-effects statistical modelling has evolved from being the "new kid on the block" to becoming the gold standard for data analysis in the language sciences. With the advancements in computational implementations and researchers' growing confidence, these models have witnessed significant and rapid growth. In this workshop, we aim to revisit the fundamental principles and explore the essential requirements associated with these models. To facilitate understanding, we will provide practical examples that demonstrate their application.
Friday June 23rd
Bog hike at Selli-Sillaotsa. The 4.3 km hike follows a dirt trail in the woods, extensive boardwalks across the bog and a short stretch along a gravel road.
14:00 - bus leaves from Jakobi 1
16:30 - bus returns to Tartu
Attire: sporty or casual. Mosquitoes and other insects can sometimes be really annoying, so long sleeves and long trousers might be preferable; consider using insect repellent.
For centuries, the Midsummer holiday has brought family members of all ages together to have fun. At Midsummer celebrations, you can listen to good music, enjoy the midsummer bonfire, games and dances.
The event will take place at Raadi Park and is organised by the city of Tartu and the Estonian National Museum.
19:30 - Meet up by the huge #TARTU2024 sign on Raekoja plats to go there together
20:00 - Everyone is welcome to the shore of Lake Raadi, where the band Svjata Vatra will start with a concert
21:00 - The victory fire arrives at the party site and Tartu city fire is lit
21:30 - Live music by Svjata Vatra, Legend and Nedsaja Village Band continues
Attire: casual
Resources

They will be made available on YouTube.
Tartu Info
The easiest way to get to Tartu from abroad is via Tallinn. To reach Tartu from Tallinn airport, you can take a bus (Lux Express; buy tickets beforehand) directly from the airport or a train (Elron; tickets can be bought on the train) from Ülemiste station (a 10-minute walk from the airport). Both are excellent options. Riga is also an option, but there are only a few bus connections to/from Tartu (Lux Express; buy tickets beforehand).
Arriving late at Tallinn airport or leaving early from Tallinn airport typically requires spending the night in Tallinn, so make sure to check the bus/train schedules and find a place to stay in Tallinn, if necessary (the Mercure hotel is a five-minute walk from the airport).
Tartu is small enough that you can walk everywhere within 30 minutes. For other options, see:
city bus network (tickets can be bought from the bus driver for 2 EUR or by getting the "Ühiskaart" transport card)
"smart bike" rental system (requires the "Ühiskaart" transport card)
taxi app (get the app)
city tourism website
taxi company elektritakso 1918 (get the app)
Several rooms have been pre-booked for this event in the following two locations (up to 45 people):
Dorpat Hotel (breakfast included, ~70 EUR / night per person and rooms can also be shared; pre-booked 20 spots; use the code "summer school" via direct booking at info@dorpat.ee to get this price)
Tartu student hostel (shared kitchen, ~35 EUR / night; pre-booked 25 spots; use the code "summer school" via direct booking at info@campus.ee to get this price)
Other options include affordable Tamme or EMÜ hostels.
More exclusive options include Lydia, VSpa (with spa), and SOHO hotels.
You can check out the city tourism website for other ideas.
Coffee breaks every day and the reception on Tuesday evening are included; participants are responsible for all other meals.
Tartu has a lot of good quality, relatively affordable restaurants and bars. To get an impression, check out the city tourism website. Weekday lunch specials ("päevapraad") are listed here and here.
The summer school will take place in the Humanities building at Jakobi 2. See the program for specific times and room numbers.
Summer School Program
Tartu, 19-23 June 2023
Each day will centre around one major topic with a plenary talk, parallel workshops and other opportunities to develop your own corpus linguistics research. Descriptions of the various workshops and social events can be found here.
Plenary talks, workshops and coffee breaks all take place in the Humanities building located at Jakobi 2.
An informal event for early arrivals is planned for Sunday evening (details coming soon via email).
Day 1: Corpus and Grammar (June 19)
| 09:15 - 09:30 | Opening | Ringauditoorium | |
| 09:30 - 11:00 | Plenary talk | Corpus-based Typology: Opportunities and Challenges (Natalia Levshina, Max Planck Institute for Psycholinguistics) | Ringauditoorium |
| 11:00 - 11:30 | Coffee break | Foyer | |
| 11:30 - 12:45 | Flash talks | 3-minute poster presentations | Ringauditoorium |
| 12:45 - 14:00 | Lunch | ||
| 14:00 - 15:30 | Parallel workshops | A. Grammatical Variation and Deep Learning (Natalia Levshina) B. Collecting and Analysing Child Language Data (Caroline Rowland) C. ELAN for Beginners (Joshua Wilbur) | A. Room 438 B. Room 428 C. Room L3 - 425 |
| 15:30 - 16:00 | Coffee break | Foyer | |
| 16:00 - 17:00 | Parallel workshops continuation (see above) | ||
| 18:00 - 19:30 | Guided tour of Tartu city centre; meet in front of the university main building (Ülikooli 18) | ||
Day 2: Corpus, Semantics and Register (June 20)
| 09:15 - 09:30 | Opening | Ringauditoorium | |
| 09:30 - 11:00 | Plenary talk | Intra-Individual variation. Corpus Linguistics and Research on Register (Anke Lüdeling, Humboldt-Universität zu Berlin) | Ringauditoorium |
| 11:00 - 11:30 | Coffee break | Foyer | |
| 11:30 - 12:45 | Flash talks | Poster presentations + Consultations for students with instructors | Ringauditoorium |
| 12:45 - 14:00 | Lunch | ||
| 14:00 - 15:30 | Parallel workshops | A. Corpora and Variation. Concepts Options and Challenges (Anke Lüdeling) B. Tracking the development of multi-word and multi-morphemic expressions in learner language (Doğuş Öksüz) C. Descriptive Stats and Data Visualisation (Andres Karjus) | A. Room 438 B. Room 427 C. Room 428 |
| 15:30 - 16:00 | Coffee break | Foyer | |
| 16:00 - 17:00 | Parallel workshops continuation (see above) | ||
| 18:30 - 21:00 | Summer school reception at Ülikooli Kohvik (Ülikooli 20) | ||
Day 3: Corpus, Sociolinguistics and Discourse Analysis (June 21)
| 09:30 - 11:00 | Plenary talk | Exploring constructions of identity using Corpus-based discourse analysis (Amanda Potts, Cardiff University) | Ringauditoorium |
| 11:00 - 11:30 | Coffee break | Foyer | |
| 11:30 - 12:45 | Flash talks | Poster presentations + Consultations for students with instructors | Ringauditoorium |
| 12:45 - 14:00 | Lunch | ||
| 14:00 - 15:30 | Parallel workshops | A. Identity Analysis in SketchEngine. Basics. (Amanda Potts) B. Using Newspapers in Estonia for text analytics (Peeter Tinits) C. Using CLARIN resources for corpus linguistics (Satu Saalasti) | A. Room 438 B. Room 427 C. Room 428 |
| 15:30 - 16:00 | Coffee break | Foyer | |
| 16:00 - 17:00 | Parallel workshops continuation (see above) | ||
Day 4: Corpus and Multimodality (June 22)
| 09:30 - 11:00 | Plenary talk | Large-scale multimodal Corpus Linguistics: Concepts and Applications (Peter Uhrig, Technische Universität Dresden) | Ringauditoorium |
| 11:00 - 11:30 | Coffee break | Foyer | |
| 11:30 - 12:45 | Flash talks | Consultations for students with instructors | Ringauditoorium |
| 12:45 - 14:00 | Lunch | ||
| 14:00 - 15:30 | Parallel workshops | A. A workflow for Multimodal Corpus Research (Peter Uhrig) B. Annotation and coding of multimodal corpora in ELAN (Anita Slonimska) C. Bringing together manual coding and motion-tracking for advanced analysis of multimodal communication (James Trujillo) D. Multi-level / mixed-effect models (Petar Milin) | A. Room 438 B. Room 427 C. Room 428 |
| 15:30 - 16:00 | Coffee break | Foyer | |
| 16:00 - 17:00 | Parallel workshops continuation (see above) | ||
Day 5: Combining Corpus methods with experimental and computational approaches (June 23)
| 09:30 - 10:45 | Plenary talk | Corpus linguistics, experimental methods, computational modelling (MEDAL team: Caroline Rowland, Dagmar Divjak & Virve Vihman) | Ringauditoorium |
| 10:45 - 11:00 | Closing remarks | Foyer | |
| 11:00 - 12:00 | Final coffee break | Foyer | |
| 11:30 - 13:45 | MEDAL steering committee lunch and meeting (invite only) | ||
| 14:00 - 18:00 | Bog hike at Selli-Sillaotsa (the bus leaves from Jakobi 1) | ||
| 20:00 - 03:00 | Midsummer celebration at Raadi Park (meet up at 19:30 by the huge #TARTU2024 sign on town hall square, Raekoja plats, to go there together) | ||
Download a PDF version of the program below.