Corpus Linguistics

What is corpus linguistics?

Corpus linguistics is a methodology that involves computer-based empirical analyses (both quantitative and qualitative) of language use by employing large, electronically available collections of naturally occurring spoken and written texts, so-called corpora.

Corpus-based studies and other empirical research have shown that speakers' intuitions often provide only limited access to the open-ended nature of language. This becomes a problem when examining infrequent linguistic structures such as lexical co-occurrence patterns, patterns of variation between grammatical constructions, word meaning, or idioms and metaphorical language.

Corpus linguistics and language variation

One topic that features prominently in our research, as well as in students' projects at Mainz University, is the question of which factors condition the choice between competing grammatical variants.

While grammar books make us believe that yet triggers the present perfect, we see the phrase "Did you vote yet?" used in U.S. election campaigns.
While standard reference works used in schools advise students to use the synthetic comparative in -er with monosyllabic adjectives, we observe native speakers using more proud and more apt rather than prouder and apter in the majority of cases.
While the 's-genitive is described as being used with persons and the of-genitive with things, linguists studying actual language use find a marked discrepancy between what is taught and what is done. Thus, such variation cannot be stigmatized as an exception or even be marked as incorrect.
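A comparison of this kind can be approximated with a few lines of code. The following Python sketch counts analytic (more + adjective) and synthetic (-er) comparatives for two adjectives. The sample sentences, the `COMPARATIVES` mapping and the function name `count_comparatives` are invented for illustration and are not taken from any corpus or tool mentioned on this page; a real study would read the text from a corpus file and would need to handle tagging, spelling variants and false hits.

```python
import re
from collections import Counter

# Invented sample sentences for illustration only; a real study would
# load text from a corpus file instead.
SAMPLE_TEXT = (
    "She was more proud of her daughter than of herself. "
    "He felt prouder than ever. "
    "No one could be more proud. "
    "A more apt comparison is hard to find."
)

# Hypothetical mapping from each adjective to its synthetic comparative.
COMPARATIVES = {"proud": "prouder", "apt": "apter"}


def count_comparatives(text, comparatives):
    """Count analytic (more + adjective) and synthetic (-er) comparatives."""
    lowered = text.lower()
    counts = Counter()
    for adjective, synthetic in comparatives.items():
        # Whole-word matches only, so "proud" does not match inside "prouder".
        analytic_pattern = rf"\bmore\s+{re.escape(adjective)}\b"
        synthetic_pattern = rf"\b{re.escape(synthetic)}\b"
        counts[(adjective, "analytic")] = len(re.findall(analytic_pattern, lowered))
        counts[(adjective, "synthetic")] = len(re.findall(synthetic_pattern, lowered))
    return counts


counts = count_comparatives(SAMPLE_TEXT, COMPARATIVES)
for (adjective, variant), n in sorted(counts.items()):
    print(f"{adjective:6} {variant:10} {n}")
```

On the invented sample, the analytic variant is more frequent for both adjectives. The figures say nothing about actual usage; they only show the shape of the comparison that a corpus-based project would carry out on authentic data.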

Corpus linguistics at Mainz University

The issue of variation poses an intriguing challenge for English teachers and researchers. While to some the task of bringing schoolbook knowledge into line with actual language use seems insurmountable, English Linguistics at Mainz University tries to offer ways out of the dilemma.

In most English linguistics classes in Mainz, students practice the collection, processing and analysis of empirical data, often by making use of corpora.
In advanced classes in particular, students will be asked to carry out corpus-based projects, sometimes involving replications and extensions of earlier case studies.
The Department of English and Linguistics offers its students a wide range of computerized corpora comprising British and American English. The Mainz Corpus Collection (MACOCO) is a continuously growing source for student research on the grammaticality, use, and historical development of language structures.

What are possible applications of corpus-based research?

Foreign language teaching: materials and syllabus design, exams testing language competence, and teaching methods
Lexicography: corpus information is used extensively, and almost all monolingual learner dictionaries are now corpus-based, e.g. the Longman Dictionary of Contemporary English
Corpus-based reference and student grammars of English:
- Biber, Douglas et al. (1999) Longman Grammar of Spoken and Written English. London: Longman.
- Biber, Douglas et al. (2002) Longman Student Grammar of Spoken and Written English. London: Longman.
- Huddleston, Rodney and Geoffrey K. Pullum (2005) A Student's Introduction to English Grammar. Cambridge: CUP.
- Huddleston, Rodney and Geoffrey K. Pullum, eds. (2002) The Cambridge Grammar of the English Language. Cambridge: CUP.
