More Information about COCA and COHA

Both the Corpus of Contemporary American English (COCA) and the Corpus of Historical American English (COHA) are very useful resources for research. They can easily be accessed online, and various types of analyses can be run directly in the web interface. Both corpora contain texts from various genres such as fiction, academic writing, magazines and newspapers. COCA also includes transcripts of unscripted conversation from more than 150 TV shows. (More information about the composition of these corpora is given on the corpus websites and in various publications by Mark Davies; see the links below.)

Normalizing frequencies

Because the corpus sections differ in size, comparisons between them must use frequencies that are normalized to a common base (e.g. per million words or per thousand words). For example, suppose you searched for the word awesome in COCA and found 1133 occurrences in the spoken section and 658 occurrences in the fiction section.

To determine the number of occurrences of awesome per million words, we divide the raw frequency by the total number of words in the corpus section and multiply the result by one million.

spoken section: 1133 ÷ 95,565,075 × 1,000,000 ≈ 11.86 occurrences of awesome per million words (pmw)
fiction section: 658 ÷ 90,429,400 × 1,000,000 ≈ 7.28 occurrences of awesome per million words (pmw)
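
The same calculation can be sketched in a few lines of Python, which may be convenient when many words or sections have to be compared. This is only a minimal illustration: the function name per_million and the sections dictionary are hypothetical and not part of any COCA tool, and the counts and section sizes are simply the figures quoted above.

```python
def per_million(raw_count: int, section_size: int, base: int = 1_000_000) -> float:
    """Normalize a raw frequency to a common base (default: per million words)."""
    if section_size <= 0:
        raise ValueError("section_size must be a positive number of words")
    return raw_count / section_size * base


# Raw hits for "awesome" and section sizes in words, as quoted in the text above.
sections = {
    "spoken": (1133, 95_565_075),
    "fiction": (658, 90_429_400),
}

for name, (hits, size) in sections.items():
    print(f"{name}: {per_million(hits, size):.2f} pmw")
```

Passing a different base (for example base=1_000) gives the frequency per thousand words instead.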

More information (external links):

Davies, Mark (2012) “Expanding Horizons in Historical Linguistics with the 400 million word Corpus of Historical American English”. Corpora 7: 121–157.
Davies, Mark (2010) “The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English”. Literary and Linguistic Computing 25: 447–465.
Davies, Mark (2009) “The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights”. International Journal of Corpus Linguistics 14.2: 159–190.

YouTube tutorial: Introduction to Using COCA
Spreadsheet courses offered by the Zentrum für Datenverarbeitung (Center for Data Processing)
Part-of-speech tagging (CLAWS7)