What Is a Text Corpus?

Dan Cavallari

Last Modified Date: January 31, 2024

A text corpus is a collection of texts, written or transcribed from speech, that serves as the basis for corpus linguistics research. Storing these large banks of text allows researchers to analyze many aspects of a language. A corpus is an efficient research tool because, once the material is gathered, it can be used to investigate a range of language-related questions, including morphology, syntax, vocabulary, and pragmatics. Unlike older methods of linguistic research, a corpus lets researchers examine language as it is actually used in context, rather than how it hypothetically could be used. Linguists also have access to much larger data samples than they did when they were limited to the data they could collect themselves in a short period of time and with limited financial resources.

Corpora are typically stored on a computer, so software can be written to facilitate research. One common way to use a text corpus is to count the total number of words in the texts, then count how many times each word appears and rank the words by frequency. When words are ranked from most to least frequent, a word’s frequency tends to be roughly inversely proportional to its rank, so the second most common word appears about half as often as the most common one. This pattern is known as Zipf’s Law, and it helps describe how word frequency is distributed in a language. Frequency counts of this kind help programmers design software suited to a given language, because they make it possible to estimate how often certain words and phrases will appear as input.
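The counting and ranking described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name rank_words, the sample sentence, and the rule that a word is any run of letters or apostrophes are all made up for the example, and real corpus tools tokenize text far more carefully.

```python
import re
from collections import Counter

def rank_words(text):
    """Return (total_words, ranked), where ranked is a list of
    (rank, word, count) tuples ordered from most to least frequent."""
    # Assumption: a "word" is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    ranked = [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]
    return len(words), ranked

if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat run"
    total, ranked = rank_words(sample)
    print("total words:", total)
    for rank, word, count in ranked:
        # Under Zipf's Law, rank * count stays roughly constant in a large corpus.
        print(rank, word, count, rank * count)
```

In this toy sample, "the" ranks first with 4 of the 12 words. A sample this small is far too short to show the Zipf pattern, which only becomes visible when the same count is run over a large corpus.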

Another way to use a text corpus is to tag the specific elements in it that the researcher wants to study. For example, a researcher could tag passive constructions and then count how many times the passive voice appears in different text genres. Tagging has also been useful in creating computer programs that assist people in their daily lives. Part-of-speech tagging has been important to the development of voice recognition software. In English, for example, the same word might belong to more than one part of speech, and some multisyllabic words are stressed differently to signal which part of speech is being used. The noun “object” carries its stress on the first syllable, but the verb “object” is stressed on the second syllable. Tagging each occurrence of “object” as a noun or a verb helps a computer program both read the word aloud correctly and recognize it when it is spoken by a human.
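A minimal sketch of how tagged text might be used for both purposes follows. Everything in it is hypothetical and chosen for illustration: the three-sentence corpus, the tag names (NOUN, VERB, and so on), the genre labels, and the STRESS table with capitals marking the stressed syllable. Real tagged corpora use their own tag sets and formats.

```python
from collections import Counter

# Hypothetical hand-tagged mini-corpus: each sentence records its genre
# and a list of (token, part-of-speech tag) pairs.
TAGGED_CORPUS = [
    {"genre": "news", "tokens": [("I", "PRON"), ("object", "VERB"), ("strongly", "ADV")]},
    {"genre": "news", "tokens": [("They", "PRON"), ("object", "VERB")]},
    {"genre": "fiction", "tokens": [("The", "DET"), ("object", "NOUN"), ("fell", "VERB")]},
]

# Hypothetical pronunciation table keyed by (word, tag);
# capitals mark the stressed syllable.
STRESS = {
    ("object", "NOUN"): "OB-ject",
    ("object", "VERB"): "ob-JECT",
}

def count_tag_by_genre(corpus, word, tag):
    """Count how often `word` carries `tag` in each genre."""
    counts = Counter()
    for sentence in corpus:
        for token, token_tag in sentence["tokens"]:
            if token.lower() == word and token_tag == tag:
                counts[sentence["genre"]] += 1
    return counts

def pronounce(token, tag):
    """Pick a stress pattern from the tag; fall back to the plain token."""
    return STRESS.get((token.lower(), tag), token)

if __name__ == "__main__":
    print(dict(count_tag_by_genre(TAGGED_CORPUS, "object", "VERB")))
    print(pronounce("object", "NOUN"), pronounce("object", "VERB"))
```

The same counting function would work for any tagged feature, such as a passive-voice label, as long as the corpus records it as a tag.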

Text corpora are useful to both the study of human language and computational linguistics. They support research that helps people better understand the language humans use, which in turn helps improve the way computers process language. Great leaps have been made in voice recognition technology, allowing consumers to verbally control computers in their offices, homes, and vehicles. Continued advances may eventually allow humans to communicate with computers as naturally as they do with each other.