On this page we collect all kinds of external resources which might be useful to early-career researchers!
Online corpora and language databases
These are links to several language repositories and corpora. Most of them are freely accessible; for others you need special permission.
- Sketch Engine
Sketch Engine is a tool to create corpora. You can also use the website to search for available corpora.
- Archive of Indigenous Languages of Latin America
This is an archive of the indigenous languages of Latin America.
- The Language Archive
This is an archive of many languages, including many endangered ones. The archive is hosted at the Max Planck Institute for Psycholinguistics.
- CLARIN Virtual Language Observatory
Here you can find metadata of most available corpora.
- Universal Dependencies
Universal Dependencies (UD) is a dependency-grammar framework for cross-linguistically consistent morphological and syntactic annotation. A treebank is a collection of sentences with morphological and syntactic annotations. UD treebanks are corpora available online for over 100 languages.
- The SpeechReporting Corpus
The SpeechReporting Corpus brings together corpora of traditional folk stories, annotated for a number of discourse phenomena using the ELAN-CorpA software and tools (Chanard 2015; Nikitina et al. 2019). It is updated regularly with newly available data, including data from new languages. All texts are transcribed, glossed, translated, and annotated.
- Pangloss Collection
The Pangloss Collection is the archive of fieldwork data from CNRS-affiliated research. It is developed by CNRS-LACITO.
- OPUS
A collection of parallel translated texts in multiple languages.
- DELAMAN
DELAMAN stands for Digital Endangered Languages and Musics Archives Network. It is an international network of archives of data on linguistic and cultural diversity, in particular on small languages and cultures under pressure.
- Bambara Reference Corpus
This Sketch Engine corpus contains texts of the Mande language Bambara. It contains about 1 million words and was built by Valentin Vydrin, Kirill Maslinsky, Jean Jacques Méric and Andrij Rovenchak.
- TalkBank
TalkBank is a project that contains many different language corpora, including CHILDES, the child language corpora.
Blogs and popular science websites
Here you can find links to interesting blogs and popular science websites.
- MPI TalkLing
This is the blog of the Max Planck Institute for Psycholinguistics.
- Novaator (Estonian)
Novaator is an Estonian popular science website.
- NEMO Kennislink (Dutch)
NEMO Kennislink is a Dutch popular science website.
Open Science resources
OSF is an open platform for sharing your data, materials and scripts!
Here you can read about the importance of pre-registration and find useful resources about it: Preregistration (cos.io)