MEDAL Summer School in Corpus Linguistics

Each day of the summer school will include a plenary lecture. Our plenary speakers are Natalia Levshina (Max Planck Institute for Psycholinguistics), Anke Lüdeling (Humboldt-Universität zu Berlin), Amanda Potts (Cardiff University) and Peter Uhrig (Technische Universität Dresden). Each lecture is followed by parallel workshops on topics in statistics, such as descriptive statistics, data visualisation, cluster analysis with random forests and multi-level models. And make sure to stay until the end, because the fifth day will include a social event connected with midsummer celebrations in Estonia!
If you are an early career researcher, we strongly encourage you to participate! You will be able to present your research in posters and flash talks, and meet with instructors to discuss your research in person. Affiliated partner mobility grants are available here, and the guidelines can be found here. Note that MEDAL mobility funding for the summer school is only open to MA/MSc students, PhD students and ECRs with a MEDAL affiliation!
After registering for the summer school, you should receive a confirmation by email.
We are looking forward to seeing you this summer in Tartu! In the meantime, please stay tuned and spread the word!

We are pleased to announce that registration for the MEDAL Summer School in Corpus Linguistics is open here. The event will take place at the University of Tartu, Estonia, from 19 to 23 June 2023. The preliminary program for the summer school can be found here. The deadline for registration is 19 May. To stay informed about our events and research, join our mailing list here or bookmark our homepage here.

Click here to register

Workshops and social events

Descriptions of the various workshops and social events. Workshops take place after the plenary lecture and lunch. Social events usually take place in the evenings.

Monday June 19th

Natalia Levshina - Grammatical Variation and Deep Learning

Deep Learning and artificial neural networks are among the most common algorithms behind AI applications in our daily lives. However, the use of Deep Learning has been rather limited in linguistics, including research on grammatical variation. This workshop aims to fill this gap. I will first introduce the basic concepts, such as layers, weights, loss function and backpropagation. I will then present case studies in which I use neural networks to test theoretical hypotheses about the genitive alternation and the role of information structure in determining word order. I will also demonstrate how one can perform Deep Learning with the help of Keras, a convenient, intuitive API for TensorFlow.

Caroline Rowland - Collecting and Analysing Child Language Data

Collecting and analysing naturalistic data from children, especially from children engaging in conversations with others in the home or community, is challenging. In this workshop we will a) cover some basic principles of collecting child language data, including how to deal with ethical and legal issues in different countries and communities, b) introduce you to the most common transcription and analysis systems, with a focus on the CHILDES programs (CHAT and CLAN; see https://childes.talkbank.org/), and c) finish by briefly introducing you to some of the new automated tools available, including LENA (https://www.lena.org/) and a new pipeline we are developing for semi-automated annotation using Whisper (https://openai.com/research/whisper).

Joshua Wilbur - ELAN for Beginners

This workshop provides an introduction to ELAN (tla.mpi.nl/tools/tla-tools/elan/), a tool used to annotate multi-media recordings. It is intended for people who have little or no experience with ELAN. We will cover terminology as well as theoretical and practical considerations necessary to set up your own set of ELAN files for a linguistics project, thus enabling you to create your own ELAN corpus of linguistically annotated audio/video recordings. In addition to completing practice exercises, participants will create their own initial ELAN files. We will also consider the strengths and weaknesses of using ELAN as a corpus search tool.

Social event: guided tour of Tartu

In the evening of day 1, we will gather for a social exploration of Tartu city centre from 18:00 to 19:30. Our guide will take us on a tour through Tartu city streets and history. You will have a chance to get to know each other, enjoy the sights and perhaps have a pint after the tour. Meet in front of the university main building at Ülikooli 18.
Attire: casual

Tuesday June 20th

Anke Lüdeling - Corpora and Variation: Concepts, Opinions and Challenges

This workshop is concerned with different types of variation and how to analyse them. Many corpus studies rely on annotated corpora. In most annotation tasks we find cases that are difficult to decide, and the more interesting a linguistic problem is, the more difficult a decision may be. Rather than viewing such difficult annotation tasks as a mere obstacle, we will work on topics like variation, concept building (tagsets), research questions, and modelling. We will distinguish between different types of problems, such as (a) unclear research questions, (b) unclear/vague theories, (c) pre-trained annotation procedures with unsuitable parameter sets, (d) genuine ambiguity, etc., and discuss what can be learned from each of these cases. Students are encouraged to bring their own research questions and examples.

Doğuş Öksüz - Tracking the development of multi-word and multi-morphemic expressions in learner language

This workshop is concerned with tracking second language learners’ development in the use of multi-word units such as collocations, binomials, lexical bundles and multi-morphemic expressions. We will critically analyse corpus-based association measures, specifically focusing on phrasal frequencies and commonly used measures of collocation strength such as mutual information, Log Dice, Delta P and lexical gravity. We will examine the extent to which learners’ proficiency levels affect their use of multi-word units through the lens of the above-mentioned association measures. We will then explore the similarities and differences between multi-word units in morphologically isolating languages like English and multi-morphemic units in morphologically rich (i.e., agglutinating) languages like Turkish and Estonian.

Andres Karjus - Descriptive Stats and Data Visualisation

This workshop introduces techniques for exploring and manipulating linguistic and other data using R and in particular the tidyverse packages, including ggplot2 for visualization, and additional packages like plotly for producing interactive graphs. The workshop also integrates ChatGPT as a coding assistant to expedite learning. Basic familiarity with R is expected, but beginners are otherwise very welcome.

Social event: reception

In the evening of day 2, we will meet for socialising and a light reception. The reception will take place from 18:30 until 21:00 at Ülikooli Kohvik (located at Ülikooli 20). It's a great opportunity for networking and getting to know each other further.
Attire: smart casual

Wednesday June 21st

Amanda Potts - Identity Analysis in SketchEngine: Basics

In this two-part workshop, participants will be introduced to the web-based corpus analysis tool Sketch Engine. Sketch Engine is a powerful tool that allows users to upload their own corpora in nearly any language and applies advanced part-of-speech tagging. In Part 1 of this workshop, participants will be introduced to the fundamentals of Sketch Engine, uploading their own data and applying corpus linguistic methods. In Part 2 of this workshop, participants will explore more advanced resources, including the distinctive Word Sketch feature, which makes use of part-of-speech tags and collocation to visualise the grammatical ‘behaviour’ of a lemma in a given corpus. By the end of the workshop, participants will be able to perform frequency, concordance, collocation, and keyness analysis in Sketch Engine using their own data. They will be able to describe discourses and representations of social actors and/or phenomena within the corpus (for instance, by comparing alternative phrasing) and in relation to other contexts (i.e. in comparison to reference corpora).

Peeter Tinits - Using Newspapers in Estonia for Text Analytics

A large share of Estonian historical newspapers (roughly 25%) has been digitised and made available for research. This can be a powerful resource for linguists as well as for historians, literary scholars, and social scientists. The workshop will demonstrate the resources available for text analytics on historical newspaper texts, particularly the ones offered by the National Library of Estonia. It will provide: 1) an introduction to how the materials can be accessed (via a JupyterLab environment and otherwise); 2) an overview of the tools and helpful visualisations available to plan your study; and 3) simple techniques to analyse historical texts based on keyword searches, frequency analysis, and co-occurrence patterns. Historical digitised newspapers bring a few extra technical difficulties: 1) technical errors made in digitisation (e.g. OCR errors), 2) variation in language use, 3) imbalance in the datasets. These issues will be discussed and a few solutions offered. The workshop will take 1.5 h + 1.5 h. The code used in the workshop is written in R, and knowledge of R will be useful. However, changing a few parameters in the provided code is possible even without prior training. Estonian.

Satu Saalasti - Using CLARIN Resources for Corpus Linguistics

The main goal of the workshop is to introduce participants to CLARIN, the research infrastructure for language as social and cultural data. The workshop will present an overview of CLARIN and how its resources support corpus-linguistics-based research all over Europe. After the overview, participants will be able to familiarize themselves with the CLARIN infrastructure through a few hands-on learning tasks. The second part of the workshop presents naturalistic neuroscience methods that utilize the methods of natural language processing and imaging for studying the brain basis of meaning.

Thursday June 22nd

Peter Uhrig - Large-scale Multimodal Corpus Linguistics: Concepts and Applications

In this workshop, participants will learn step by step how to carry out their own multimodal corpus study based on the NewsScape 2016 corpus, a collection of more than 30,000 hours of American TV News. The workshop will start with a discussion of the types of research questions that might be addressed with such a corpus approach, followed by a hands-on session introducing CQPweb and the Rapid Annotator. Students should bring a laptop with a working Internet connection and ideally headphones/earphones they can connect to their computer. Students are invited to send ideas for potential research questions via email before the workshop.

Anita Slonimska - Annotation and coding of multimodal corpora in ELAN

In this workshop you will learn the basics of how to use ELAN, a free annotation software, for coding and annotating multimodal communication corpora. The workshop is divided into three blocks: theoretical foundations of gesture, ELAN tutorial, and hands-on practice. Participants will first gain theoretical knowledge about different types of gestures, their structure, their interaction with speech and role in discourse. We will then use this theoretical foundation in a step-by-step tutorial in ELAN software in order to learn how to create and structure annotation tiers, segment and code gestures as well as how to use the coding for analysing and visualising data. In the final part of the workshop, you will engage in hands-on practice, applying your newly gained theoretical and practical knowledge. By the end of the workshop, you will be equipped with the skills that will enable you to conduct your own research on multimodal communication.

James Trujillo - Bringing together manual coding and motion-tracking for advanced analysis of multimodal communication

Analyses of corpus data often end up either relying on qualitative manual coding or using automated methods and computer vision approaches to extract and summarize data. However, manual coding and computer-vision-based approaches can be highly complementary and work very well together. In this workshop, I will provide an introduction to using manual coding to focus and inform automated methods, which in turn can provide a rich method of analysis. Specifically, we will cover 1) easy-to-use automated movement detection to speed up manual coding of visual signals, 2) automatically extracting movement data using manual annotations, 3) quantifying the temporal relationship between visual and linguistic or acoustic signals. The workshop will include tutorial-like walkthroughs using open code and materials, as well as open discussions of current issues and future directions.

Petar Milin - Multi-level / Mixed-effect Models

Over the past 15 years, Multilevel or Mixed-Effect statistical modelling has evolved from being the "new kid on the block" to becoming the gold standard for data analysis in language sciences. With the advancements in computational implementations and researchers' growing confidence, these models have witnessed significant and rapid growth. In this workshop, we aim to revisit the fundamental principles and explore the essential requirements associated with these models. To facilitate understanding, we will provide practical examples that demonstrate their application.

Friday June 23rd

Social event: bog hike

The bog hike takes place at Selli-Sillaotsa. The 4.3 km hike follows a dirt trail in the woods, extensive boardwalks across the bog and a short part along a gravel road.
14:00 - bus leaves from Jakobi 1
16:30 - bus returns to Tartu
Attire: sporty or casual; note that mosquitoes and other insects can sometimes be really annoying, so long sleeves and long trousers might be preferable, and consider using insect repellent.

Social event: Midsummer celebration

For centuries, the Midsummer holiday has brought family members of all ages together to have fun. At Midsummer celebrations, you can listen to good music, enjoy the midsummer bonfire, games and dances.
The event will take place at Raadi Park and is organised by the city of Tartu and the Estonian National Museum.
19:30 - Meet up by the huge #TARTU2024 sign on Raekoja plats to go there together
20:00 - Everyone is welcome to the shore of Lake Raadi, where the band Svjata Vatra will start with a concert
21:00 - The victory fire arrives at the party site and Tartu city fire is lit
21:30 - Live music by Svjata Vatra, Legend and Nedsaja Village Band continues
Attire: casual

Resources

Where can I find the materials of the 2023 summer school?

Here!

Where can I find the video recordings of the plenaries?

They will be made available on YouTube.

Tartu Info

International travel to/from Tartu

The easiest way to get to Tartu from abroad is via Tallinn. To reach Tartu from the Tallinn airport, you can take a bus (Lux Express; buy tickets beforehand) directly from the airport or a train (Elron; tickets can be bought on the train) from Ülemiste station (a 10-minute walk from the airport). Both are excellent options. Riga is also an option, but there are only minimal bus connections to/from Tartu (Lux Express; buy tickets beforehand).

Arriving late at Tallinn airport or leaving early from Tallinn airport typically requires spending the night in Tallinn, so make sure to check the bus/train schedules and find a place to stay in Tallinn, if necessary (the Mercure hotel is a five-minute walk from the airport).

Getting around in Tartu

Tartu is small enough that you can walk everywhere within 30 minutes. For other options, see:
city bus network (tickets can be bought from the bus driver for 2 EUR or by getting the "Ühiskaart" transport card)
"smart bike" rental system (requires the "Ühiskaart" transport card)
taxi app (get the app)
city tourism website
taxi company elektritakso 1918 (get the app)

Accommodation

Several rooms have been pre-booked for this event in the following two locations (up to 45 people):
Dorpat Hotel (breakfast included, ~70 EUR / night per person and rooms can also be shared; pre-booked 20 spots; use the code "summer school" via direct booking at info@dorpat.ee to get this price)
Tartu student hostel (shared kitchen, ~35 EUR / night; pre-booked 25 spots; use the code "summer school" via direct booking at info@campus.ee to get this price)
Other options include affordable Tamme or EMÜ hostels.
More exclusive options include the Lydia, V Spa (with spa) and SOHO hotels.
 
You can check out the city tourism website for other ideas.

Food and drink

Coffee breaks every day and the reception on Tuesday evening are included; participants are responsible for all other meals.
Tartu has a lot of good quality, relatively affordable restaurants and bars. To get an impression, check out the city tourism website. Weekday lunch specials ("päevapraad") are listed here and here.

Venue

The summer school will take place in the Humanities building at Jakobi 2. See the program for specific times and room numbers.

Summer School Program

Tartu, 19-23 June 2023
Each day will centre around one major topic with a plenary talk, parallel workshops and other opportunities to develop your own corpus linguistics research. Descriptions of the various workshops and social events can be found here.

Plenary talks, workshops and coffee breaks all take place in the Humanities building located at Jakobi 2.

An informal event for early arrivals is planned for Sunday evening (details coming soon via email).

Day 1: Corpus and Grammar (June 19)

09:15 - 09:30 Opening (Ringauditoorium)
09:30 - 11:00 Plenary talk: Corpus-based Typology: Opportunities and Challenges (Natalia Levshina, Max Planck Institute for Psycholinguistics), Ringauditoorium
11:00 - 11:30 Coffee break (Foyer)
11:30 - 12:45 Flash talks: 3-minute poster presentations (Ringauditoorium)
12:45 - 14:00 Lunch
14:00 - 15:30 Parallel workshops
A. Grammatical Variation and Deep Learning (Natalia Levshina), Room 438
B. Collecting and Analysing Child Language Data (Caroline Rowland), Room 428
C. ELAN for Beginners (Joshua Wilbur), Room L3 - 425
15:30 - 16:00 Coffee break (Foyer)
16:00 - 17:00 Parallel workshops continuation (see above)
18:00 - 19:30 Guided tour of Tartu city centre; meet in front of the university main building (Ülikooli 18)

Day 2: Corpus, Semantics and Register (June 20)

09:15 - 09:30 Opening (Ringauditoorium)
09:30 - 11:00 Plenary talk: Intra-Individual variation. Corpus Linguistics and Research on Register (Anke Lüdeling, Humboldt-Universität zu Berlin), Ringauditoorium
11:00 - 11:30 Coffee break (Foyer)
11:30 - 12:45 Flash talks: Poster presentations + Consultations for students with instructors (Ringauditoorium)
12:45 - 14:00 Lunch
14:00 - 15:30 Parallel workshops
A. Corpora and Variation. Concepts, Options and Challenges (Anke Lüdeling), Room 438
B. Tracking the development of multi-word and multi-morphemic expressions in learner language (Doğuş Öksüz), Room 427
C. Descriptive Stats and Data Visualisation (Andres Karjus), Room 428
15:30 - 16:00 Coffee break (Foyer)
16:00 - 17:00 Parallel workshops continuation (see above)
18:30 - 21:00 Summer school reception at Ülikooli Kohvik (Ülikooli 20)

Day 3: Corpus, Sociolinguistics and Discourse Analysis (June 21)

09:30 - 11:00 Plenary talk: Exploring constructions of identity using Corpus-based discourse analysis (Amanda Potts, Cardiff University), Ringauditoorium
11:00 - 11:30 Coffee break (Foyer)
11:30 - 12:45 Flash talks: Poster presentations + Consultations for students with instructors (Ringauditoorium)
12:45 - 14:00 Lunch
14:00 - 15:30 Parallel workshops
A. Identity Analysis in SketchEngine. Basics. (Amanda Potts), Room 438
B. Using Newspapers in Estonia for text analytics (Peeter Tinits), Room 427
C. Using CLARIN resources for corpus linguistics (Satu Saalasti), Room 428
15:30 - 16:00 Coffee break (Foyer)
16:00 - 17:00 Parallel workshops continuation (see above)

Day 4: Corpus and Multimodality (June 22)

09:30 - 11:00 Plenary talk: Large-scale multimodal Corpus Linguistics: Concepts and Applications (Peter Uhrig, Technische Universität Dresden), Ringauditoorium
11:00 - 11:30 Coffee break (Foyer)
11:30 - 12:45 Flash talks: Consultations for students with instructors (Ringauditoorium)
12:45 - 14:00 Lunch
14:00 - 15:30 Parallel workshops
A. A workflow for Multimodal Corpus Research (Peter Uhrig), Room 438
B. Annotation and coding of multimodal corpora in ELAN (Anita Slonimska), Room 427
C. Bringing together manual coding and motion-tracking for advanced analysis of multimodal communication (James Trujillo), Room 428
D. Multi-level / mixed-effect models (Petar Milin)
15:30 - 16:00 Coffee break (Foyer)
16:00 - 17:00 Parallel workshops continuation (see above)

Day 5: Combining Corpus methods with experimental and computational approaches (June 23)

09:30 - 10:45 Plenary talk: Corpus linguistics, experimental methods, computational modelling (MEDAL team: Caroline Rowland, Dagmar Divjak & Virve Vihman), Ringauditoorium
10:45 - 11:00 Closing remarks (Foyer)
11:00 - 12:00 Final coffee break (Foyer)
11:30 - 13:45 MEDAL steering committee lunch and meeting (invite only)
14:00 - 18:00 Bog hike at Selli-Sillaotsa; the bus leaves from Jakobi 1
20:00 - 03:00 Midsummer celebration at Raadi Park; meet up at 19:30 by the huge #TARTU2024 sign on town hall square (Raekoja plats) to go there together

Download a PDF version of the program below.