The Mainz Corpus Collection

Corpora of Present-Day British English

  • Lancaster-Oslo/Bergen Corpus of British English (LOB), 1 million words
  • Freiburg LOB Corpus of British English (FLOB), 1 million words
  • London-Lund Corpus of Spoken English (LLC), 1 million words
  • British National Corpus (BNC), 100 million words
  • The Daily Mail and The Mail on Sunday 1993-2000, 198 million words
  • The Daily Telegraph and The Sunday Telegraph 1991-2000, 2002, 2004, 422 million words
  • The Guardian and The Observer 1990-2005, 629 million words
  • The Independent 1992-1994, 2002-2005
  • The Times and The Sunday Times 1990-2004, 709 million words

Corpora of Present-Day American English

  • Standard Corpus of Present-Day Edited American English (BROWN), 1 million words
  • Freiburg BROWN Corpus of American English (FROWN), 1 million words
  • American National Corpus (ANC), 12.5 million words
  • Corpus of Contemporary American English (COCA), offline data, 440 million words
  • Corpus of Spoken Professional American English (CSPAE), 2 million words
  • Switchboard corpus of American telephone conversations
  • The Buckeye Corpus of conversational speech, 300,000 words
  • The Denver Post, 32 million words
  • The Detroit Free Press 1992-1995, 91 million words
  • The Los Angeles Times 1992-1995, 550 million words
  • The New York Times 2001, 52 million words
  • The Washington Times and Insight on the News 1990-1992, 84 million words
  • Time Almanac 1920s-1990s, 3.4 million words; 1989-1994, 11 million words

Historical Corpora

  • Early English Prose Fiction (EEPF)
  • 18th Century Fiction (ECF)
  • 19th Century Fiction (NCF)
  • Changing Times 1785-1985, 11.5 million words
  • Chaucer Corpus, 500,000 words
  • Corpus of Historical American English (COHA), offline data, 377 million words
  • Early American Fiction
  • Lampeter Corpus
  • Oxford English Dictionary, Version 1.13
  • Penn-Helsinki Parsed Corpus of Middle English, 1.3 million words
  • The Bible in English
  • The Helsinki Corpus of English Texts

Other Varieties of English

  • Australian Corpus of English (ACE)
  • Corpus of Global Web-Based English (GloWbE), offline data, 1.8 billion words
  • Kolhapur Corpus of written Indian English, 1 million words
  • Wellington Corpus of Written New Zealand English (WC)
  • Wellington Corpus of Spoken New Zealand English (WSC)

German Corpora

  • Die Zeit, 25.5 million words
  • TAZ 1986-1999, 170.5 million words

Corpora of Language in Politics

  • Corpus of Political Speeches (CORPS I), 2.3 million words
  • Corpus of Political Speeches (CORPS Release II), 8 million words
  • Hansard Reports, House of Commons 1991, 1992
  • Hansard Reports, House of Lords 1992
  • Political Tweets of Trump and US-Senators (PoTTUS), 25 million words

Learner Corpora

Learner corpora are collections of texts produced by foreign or second language learners.
Besides serving as a resource for second language acquisition research, they can be used to identify the typical difficulties of a particular learner group (e.g. intermediate learners) or of learners with a particular native language (e.g. German learners of English).

Other Corpora

  • Child Language Data Exchange System (CHILDES), 21 million words
  • International Computer Archive of Modern and Medieval English (ICAME Collection of Corpora)
  • ICAME Collection of Corpora II