Linguistics 290A/1: Corpora
This page contains links to lists of available corpora and
descriptions of individual corpus projects. Because of the
nature of the WWW, there is considerable overlap between some
of the lists. Some of the corpora linked to here are freely
available, others only for a fee. An asterisk next to a corpus
name indicates that we have all or part of the corpus on
corpus.linguistics.berkeley.edu.
- CHILDES: Child Language Data Exchange System
- Silfide wordlists Word frequency lists for English, French and German based on the Silfide corpora.
- Silfide
Texts in French, English, German, Danish, Italian, Spanish and
Portuguese. Some of these texts appear to be translations, and many
appear in more than one language.
- UPF corpus
Texts (law, environment, medicine, economy and IT)
in Catalan, Spanish, French, English and German. (Online tools and demos) (Information in English)
- COSMAS (A large, searchable corpus of German, including
some diachronic and spoken collections, as well as some matched
East German/West German collections. It's possible to use part of
this corpus for free, in sessions that are limited to 60 minutes.)
- COSMAS wordlist 30,000 most frequent forms in the COSMAS corpus.
- Project Gutenberg included 84 German texts as of 12/5/2000.
- LAPT & DA Word and morpheme frequency lists for German, based on 7 corpora.
- EDR Electronic Dictionary of Japanese, organized into 11 sub-dictionaries.
- IPAL Information-technology Promotion Agency
Lexicon of the Japanese Language. (The link given here is for a page with Japanese on top and English further down.)
- Tuebinger russische Korpora
Russian corpora, searchable via the web.
As of 5/2001, these consist of the Uppsala corpus and a
corpus of interviews from Russian online newspapers.
- Subscribe.Ru
Links to a subscription list about a collection of various
dictionaries. The subscription list is searchable.
- ssu.komi.com A collection of
dictionaries on all sorts of subjects. Online search of a
collection of science fiction stories.
- vault.agava.ru Another collection of dictionaries (work
in progress).
- Yugoslav Corpus
~700,000 words of modern Yugoslav fiction, representing all
of the Serbo-Croatian-speaking areas of the former Yugoslavia.
(Archived at CRL.)
Emily M. Bender
Last modified: Thu May 31 08:05:48 2001