Full-text corpus data
Title
Full-text corpus data
Subject
Text data, Web data
Description
These full-text corpora are among the most widely used text corpora. They cover a wide range of subjects and sources, including webpages, forums, magazines, newspapers, TV and movie subtitles, and academic papers, along with Spanish and Portuguese corpora. Date ranges vary by corpus. For more information about the corpora, see the Full-text corpus data overview.
Creator
English-Corpora.org
Format
TXT, SQL
Language
English, Spanish, Portuguese
Links
MDL Licensed Data and Software page
Select "Full-Text Corpus Data"
Collection
There are 12 corpora available. For a description of each, see the overview of the corpora.
Each corpus is available in three formats: a database, a word/lemma/part-of-speech format, and a linear text format. Additionally, each corpus includes a full lexicon file and a file listing the sources used.
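As an illustration of working with the word/lemma/part-of-speech format, the following is a minimal sketch and not the provider's documented layout. It assumes each line holds one token as tab-separated fields ending in word, lemma, and part-of-speech tag. The column order, the `Token` type, the function names, and the sample tags are all hypothetical; check the actual files and the overview of the corpora before relying on it.

```python
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One corpus token (hypothetical structure for this sketch)."""
    word: str
    lemma: str
    pos: str


def read_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from tab-separated word/lemma/PoS lines.

    Assumption: the last three fields on each line are word, lemma, PoS.
    Any leading columns (for example an ID column) are ignored.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, lemma, pos = fields[-3:]
        yield Token(word, lemma, pos)


def lemma_counts(tokens: Iterable[Token]) -> Counter:
    """Count how often each lowercased lemma occurs."""
    return Counter(token.lemma.lower() for token in tokens)


# Made-up sample data standing in for a real file.
sample = "The\tthe\tat\ncorpora\tcorpus\tnn2\nare\tbe\tvbr\nlarge\tlarge\tjj\n"
tokens = list(read_tokens(io.StringIO(sample)))
counts = lemma_counts(tokens)
```

With a real file, `read_tokens` could be given an open file handle (opened with the encoding the file actually uses) in place of the in-memory sample, so that large files are streamed line by line rather than loaded whole.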
Terms of Use
The English-Corpora.org text corpora are intended for academic study, research, teaching, and administrative use at the University of Toronto. Access to the data is restricted to University of Toronto faculty, students, researchers, and staff. Commercial use of this dataset or its derivatives is strictly forbidden, and further distribution of the data or its derivatives is prohibited. The full restrictions are available here:
Restrictions on use of the corpora
Citation
English-Corpora.org, “Full-text corpus data,” MDL Data, accessed June 25, 2026, https://mdl-data.library.utoronto.ca/items/show/1815.
