Full-text corpus data

Title

Subject

Text data, Web data

Description

These full-text corpora are among the most widely used text corpora. They cover a wide range of subjects and sources, including webpages, forums, magazines, newspapers, TV and movie subtitles, and academic papers, as well as Spanish and Portuguese corpora. Date ranges vary by corpus. For more information about the corpora, see the Full-text corpus data overview.

Creator

English-Corpora.org

Format

TXT, SQL

Language

English, Spanish, Portuguese

Links

MDL Licensed Data and Software page
Select "Full-Text Corpus Data"

Collection

There are 12 different corpora available. For a description of each, please use the following link: overview of the corpora

Each corpus is available in three different formats: a database, a word/lemma/part-of-speech format, and a linear text format. Additionally, each corpus includes a full lexicon file and a file listing the sources used.

A full overview of the formats

Terms of Use

The English-Corpora.org text corpora are intended for academic study, research, teaching and administrative use at the University of Toronto. Access to the data is restricted to University of Toronto faculty, students, researchers and staff. Commercial use of this dataset or its derivatives is strictly forbidden. Further distribution of this data or its derivatives is prohibited. The full restrictions are available here:
Restrictions on use of the corpora

Notes

Citation

English-Corpora.org, “Full-text corpus data,” MDL Data, accessed June 25, 2026, https://mdl-data.library.utoronto.ca/items/show/1815.