Wiktionary:Corpora

This page is dedicated to listing collections of texts useful for the work of creating a dictionary. These collections are often known as "corpora" or less commonly "corpuses". Many of them feature functions like full-text search, term frequency information and collocation search.

For a more user-friendly introduction to some of the most prominent corpora, as well as other resources like dictionaries, see Wiktionary:Quotations/Resources. Another page, Wiktionary:Searchable external archives, focuses more specifically on archives that can reliably provide citations passing Wiktionary's criteria for inclusion.

Note that corpora that contain text in multiple languages, but in which English text makes up a significant portion, are listed in the English table below, with the word "Multilingual" included in their "Dialect" entry.

If there are any other resources that you know of which aren't listed here, please do add them or suggest them on the talk page.

English

English corpora table
Name Resource Type Size in words Size in texts Dialect Start year End year Original Medium Available Medium Genre Re-use restrictions Access restrictions Date of entry update
News on the Web (NOW) Corpus, Tagged 10^10 * 2 10^7 * 3 (Various)[1][2] 2010 Present Written, Computer, Internet Written Nonfiction, News None Free registration required 2022/10/31
iWeb: The Intelligent Web-based Corpus Corpus, Tagged 10^10 * 1 10^7 * 2 (Various)[3][2] 2017 2017 Written, Computer, Internet Written General, esp. Nonfiction None Free registration required 2022/10/31
Global Web-Based English (GloWbE) Corpus, Tagged 10^9 * 2 10^5 * 2 (Various)[1][2] 2012 2013 Written, Computer, Internet Written General, esp. Nonfiction None Free registration required 2022/10/31
Wikipedia Corpus Corpus, Tagged 10^9 * 2 10^6 * 4 (Various) 2014 2014 Written, Computer, Internet Written Nonfiction, Encyclopedia None Free registration required 2022/10/30
Coronavirus Corpus Corpus, Tagged 10^9 * 2 10^5 * 4 (Various)[1][2] 2020 Present Written, Computer, Internet Written Nonfiction, News, COVID-19 None Free registration required 2022/10/31
Corpus of Contemporary American English (COCA) Corpus, Tagged 10^9 * 1 10^5 * 5 American 1990 2019 Multimedia Written General, esp. Nonfiction None Free registration required 2023/03/27
Early English Books Online (EEBO) Corpus, Tagged 10^8 * 8 10^4 * 3 British 1470 (apprx.) 1690 (apprx.) Written, Books, Print Written General None Free registration required 2022/10/30
Early English Books Online (EEBO) TCP Corpus, Untagged - 10^4 * 6 British 1475 1700 Written, Books, Print Written General None None 2022/10/31
Early English Books Online (EEBO, V2) Corpus, Untagged 10^8 * 6 10^4 * 1 British 1470 (apprx.) 1690 (apprx.) Written, Books, Print Written General None Free registration required 2022/11/02
Filmot Library - 10^8 * 5 (Various, Multilingual) 2005 (apprx.) Present Spoken, General Audio-visual General, esp. Nonfiction None None 2022/10/30
YouGlish Library - 10^8 * 1 (Various) 2005 (apprx.) Present Spoken, Formal[4] Audio-visual Nonfiction None None 2022/10/30
TED Corpus Search Engine (TCSE) Corpus, Tagged 10^7 * 1 10^3 * 5 (Various) 2007 Present Spoken, Formal, Speeches Audio-visual Nonfiction None None 2022/10/30
Archive-It Collections Library - 10^6 * 2 (Various) 1996 Present Written, Computer, Internet Written General, esp. Nonfiction None None 2022/10/30
ACL Anthology Reference Corpus (ARC) Corpus, Tagged 10^7 * 6 10^4 * 2 (Various) 1979 2015 Written, Journals Written Nonfiction, Academic, NLP None None 2022/10/30
COVID-19 Open Research Dataset (CORD-19) Corpus, Tagged 10^9 * 3 10^5 * 7 (Various) 1922[5] 2020[6] Written, Journals Written Nonfiction, Academic None None 2022/10/30
EcoLexicon English Corpus, Tagged 10^7 * 2 10^3 * 2 (Various) 1973 2016 Written Written Nonfiction, Academic, Environment None None 2022/10/30
Lipstick Alley Social Media - - American, African 2000 Present Written, Computer, Social Media, Forum Written General, esp. Nonfiction, Celebrity News None Free registration required[7] 2023/06/23
Corpus of Regional African American Language (CORAAL) Corpus, Untagged 10^6 * 1 10^2 * 2 American, African 1968 2017 Spoken, Interviews Audio General, Anthropological interviews None None 2022/10/31
Google Trends Trends - - (Various, Multilingual) 2004 Present Written, Computer, Internet Searches Written General None None 2022/10/31
Google Ngrams Trends - 10^7 * 4 (Various, Multilingual)[8] 1400 (apprx.) Present Multimedia Written General None None 2022/10/31
Google Books Library - 10^7 * 4 (Various, Multilingual) 1400 (apprx.) Present Multimedia Written General None None 2022/10/31
Google Scholar Library - 10^8 * 1[9] (Various, Multilingual) 1700 (apprx.) Present Written, Journals Written Nonfiction, Academic; Law None None 2023/01/19
Corpus of Middle English Prose and Verse Corpus, Untagged - 10^2 * 3 Middle 1000 1500 Written, Books, Print Written General, esp. Nonfiction None None 2022/10/31
Michigan Corpus of Upper-level Student Papers (MICUSP) Corpus, Untagged 10^6 * 3 10^2 * 8 (Various, ESL[10]) 2002 2009 Written, College Work Written Nonfiction, Academic Restrictions on commercial use[11] None 2022/12/28
Michigan Corpus of Academic Spoken English (MiCASE)[12][13] Corpus, Untagged 10^6 * 2 10^2 * 2 American (mostly) 1998 2001 Spoken, Formal, Speeches Audio,[12] Written Nonfiction, Academic Restrictions on commercial use[14] None 2022/10/31
British Academic Spoken English Corpus (BASE) Corpus, Tagged 10^6 * 1 10^2 * 2 British 1998 2005 Spoken, Formal, Speeches Written Nonfiction, Academic None None 2022/11/02
British Academic Written English Corpus (BAWE) Corpus, Tagged 10^6 * 7 10^3 * 3 British 2000 2007 Written, College Work Written Nonfiction, Academic None None 2022/11/02
Public Papers of the Presidents of the United States Library - 10^2 * 1 American 1938 2002 Multimedia Written Nonfiction, Politics None None 2023/06/17
Google Groups Social Media - - (Various) 1981 2024 Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None 2024/03/20
UsenetArchives.com Social Media - 10^8 * 7 (Various) 1990 (apprx.) Present? Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None 2024/03/20
Narkive Social Media - 10^8 * 3 (Various) 1990 (apprx.) Present? Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None 2024/03/20
Europeana Library - 10^7 * 2 (Various, Multilingual) 0400 (apprx.) Present Multimedia Multimedia General None None 2022/10/31
Internet Archive Library - 10^7 * 6 (Various, Multilingual) - Present Multimedia Multimedia General None Free registration required 2022/10/31
Eighteenth Century Collections Online (ECCO) TCP Corpus, Untagged - 10^3 * 2 (Various, Multilingual) 1701 1800 Written, Books, Print Written General None None 2022/10/31
Old Bailey Corpus (OBC) 2.0 Corpus, Tagged 10^7 * 4 10^6 * 1 British (various dialects) 1720 1913 Spoken, Formal, Court Proceedings Written Nonfiction, Law, Courts, Criminal None Free registration required 2022/10/31
Old Bailey Proceedings Online Corpus, Untagged 10^8 * 1 - British (various dialects) 1674 1913 Spoken, Formal, Court Proceedings Written Nonfiction, Law, Courts, Criminal None None 2022/10/31
Royal Society Corpus (RSC) 6.0.1 Open Corpus, Tagged 10^7 * 8 10^4 * 2 British 1665 1920 Written, Journals, Print Written Nonfiction, Academic None Free registration required 2022/10/31
Royal Society Corpus (RSC) 6.0.4 Open with Topics Corpus, Tagged 10^8 * 3 10^4 * 2 British 1665 1920 Written, Journals, Print Written Nonfiction, Academic None Free registration required 2022/10/31
Twitter Social Media 10^12 * 3[15] - (Various, Multilingual) 2005 Present Written, Computer, Social Media, Twitter Written General, esp. Nonfiction None None 2022/10/31
SocialGrep (Reddit) Corpora Corpus, Untagged - 10^7 * 9 (Various) 2005 (apprx.) Present (apprx.) Written, Computer, Social Media, Reddit Written General, esp. Nonfiction None None 2022/10/31
Europarl 7 Sample, English Corpus, Tagged 10^7 * 2 10^3 * 8 International/ELF[16] 2007 2011 Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None 2022/11/01
Europarl 3, English Corpus, Tagged 10^7 * 2 10^2 * 7 International/ELF[16] 1996 2006 Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None Free registration required 2022/11/01
TARA Corpus, Tagged 10^5 * 9 10^4 * 2 British 2006 (apprx.) 2006 (apprx.) Written, Newspapers, Print Written Nonfiction, News None Free registration required 2022/11/01
British National Corpus (BNC) Corpus, Tagged 10^8 * 1 10^3 * 4 British 1960 1993 Multimedia Written General None Free registration required 2022/11/01
British National Corpus (BNC) Sampler Corpus, Tagged 10^6 * 2 10^2 * 2 British 1975 1993 Multimedia Written General None Free registration required 2022/11/01
Phrases in English (BNC)[17][18] Corpus, Tagged 10^8 * 1 10^3 * 4 British 1960 1993 Multimedia Written General None None 2023/02/12
Just The Word (BNC)[17] Corpus, Tagged 10^8 * 1 10^3 * 4 British 1960 1993 Multimedia Written General None None 2023/02/12
British English 2006 (BE06) Corpus, Tagged 10^6 * 1 10^2 * 5 British 2003 2008 Written Written General None Free registration required 2022/11/01
American English 2006 (AME06) Corpus, Tagged 10^6 * 1 10^2 * 5 American 2006 (apprx.) 2006 (apprx.) Written Written General None Free registration required 2022/11/22
Hansard Corpus (British Parliament) Corpus, Tagged 10^9 * 2 10^6 * 8 British 1803 2005 Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None Free registration required 2022/11/01
British Parliament Hansard Library - - British 1800 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None 2022/11/01
Australian Parliament Hansard Library - - Australian 1901 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None 2022/11/01
Canadian House of Commons Hansard Library - - Canadian 2002 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None 2022/11/01
New Zealand Parliament Hansard Library - - New Zealand 1854 Present Spoken, Formal, Legislative Proceedings Written Nonfiction, Law, Legislatures None None 2022/11/01
GovInfo (United States) Library - - American 1793 Present Multimedia Written Nonfiction, Law None None 2022/11/01
Transgender Usenet Archive (TUA) Corpus, Untagged - 10^5 * 4 (Various) 1994 2013 Written, Computer, Social Media, Usenet Written General, Transgender Topics None None 2022/11/01
Science Forums Social Media - 10^5 * 1 (Various) 1992 2014 Written, Computer, Social Media, BBS Written Nonfiction, Science None None 2022/11/01
TextFiles.com Library - - (Various) 1980 (apprx.) 1995 (apprx.) Multimedia Multimedia General, esp. Nonfiction, Technology None None 2022/11/01
LDS General Conference Corpus Corpus, Tagged 10^7 * 3 10^4 * 1 American 1851 Present Spoken, Formal, Speeches Written Religious, Latter Day Saints None None 2022/11/01
FidoNet Echomail Archive Social Media - - (Various) 1990 (apprx.) 2016 (apprx.) Written, Computer, Social Media, FidoNet Written General, esp. Nonfiction, Technology None None 2022/11/01
FidoNet HolySmoke Archive Library - 10^5 * 4 (Various) 1993 2004 Written, Computer, Social Media, FidoNet Written Nonfiction, Religion None None 2022/11/02
Dúchas Project Library - 10^6 * 2 Irish 1900 (apprx.) 1940 (apprx.) Multimedia Written Fiction, Folklore None None 2022/11/02
Freiburg-Brown Corpus of American English (FROWN) Corpus, Tagged 10^6 * 1 10^2 * 5 American 1992 1992 Written, Print Written General None Free registration required 2022/11/02
Brown Corpus Family Corpus, Tagged 10^6 * 1 10^3 * 2 - - - Written, Print Written General None Free registration required 2022/11/02
Brown Family (C8 tags) Corpus, Tagged 10^6 * 6 10^3 * 2 (Various) 1931 1991 Written, Print Written General None Free registration required 2022/11/02
Brown Corpus[19] Corpus, Tagged 10^6 * 1 10^3 * 1 American 1961 1961 Written, Print Written General None None 2022/11/02
Corpus of English Dialogues Corpus, Tagged 10^6 * 1 10^2 * 2 British(?) 1560 1760 Multimedia Written General, Dialogues None Free registration required 2022/11/02
Florence Early English Newspapers (FEEN) Corpus, Tagged 10^5 * 3 -[20] British(?) 1620 1649 Written, Newspapers, Print Written Nonfiction, News None None 2023/03/27
Transhistorical Corpus of Written English Corpus, Tagged 10^5 * 5 10^2 * 8 (Various) 1405 2019 Written Written General None None 2022/11/02
Linguistic Landscape Corpus Corpus, Tagged 10^6 * 5 10^2 * 6 (Various) 1997 2018 Written Written Nonfiction, Academic None Free registration required 2022/11/02
ICNALE Online[21] Corpus, Tagged 10^6 * 4 10^4 * 2 (Various, ESL[10])[22] 2007 (apprx.) 2022 (apprx.) Multimedia, College Work Multimedia Nonfiction, Academic None None 2022/11/02
European Football Championship Interpreting Corpus (EFCIC) Corpus, Tagged 10^4 * 1 10^1 * 1 - 2020 2020 Spoken, Entertainment, Interpretation, Interview Written Nonfiction, Sports None None 2022/11/02
UkWac Complete[23] Corpus, Tagged 10^9 * 2 10^6 * 3 British[2] 2005 (apprx.) 2007 (apprx.) Written, Computer, Internet Written General None None 2022/11/02
UkWac Small[23] Corpus, Tagged 10^7 * 8 10^5 * 1 British[2] 2005 (apprx.) 2007 (apprx.) Written, Computer, Internet Written General None None 2022/11/02
Postcard Archive @ Florida State University[24] Library - 10^3 * 3[25] (Various) 1829 (apprx.) 2016 (apprx.) Written, Postcards Written Nonfiction, Postcards None None 2022/11/06
PlayPhrase.me Corpus, Tagged - 10^6 * 8[26] (Various) 1970 (apprx.) Present Spoken, Entertainment, Movies Audio-visual Fiction, Movies None None 2022/11/07
European Union DGT-UD: English Corpus, Tagged 10^8 * 1 10^4 * 5 International/ELF[16] 1948 (apprx.) 2016 Written, Legislative Acts Written Nonfiction, Law, Legislatures None None 2022/11/16
Opus-MontenegrinSubs 1.0: English Corpus, Tagged 10^5 * 5 10^2 * 2 (Various) 2007 2013 Spoken, Entertainment, Television Written Fiction, Television None None 2022/11/16
Archive of Our Own (AO3) Library - 10^7 * 1 (Various) 2007 Present Written, Computer, Internet Written Fiction, Short Stories, Fan Works[27] None None 2022/11/22
SCP Foundation Library - 10^3 * 2 (Various) 2007 Present Written, Computer, Internet Written Fiction, Short Stories, Sci-Fi[27] None None 2022/11/22
NEWS-GB (British newspapers) Corpus, Tagged 10^8 * 2 - British 2004 (apprx.) 2004 (apprx.) Written, Print Written Nonfiction, News None None 2022/11/22
INTERNET-EN Corpus, Tagged 10^8 * 2 10^4 * 5 (Various) 2006 (apprx.) 2006 (apprx.) Written, Computer, Internet Written General None None 2022/11/22
BLOGS-EN (Political blogs) Corpus, Tagged 10^8 * 5 - (Various) 2008 (apprx.) 2008 (apprx.) Written, Computer, Internet Written Nonfiction, Politics None None 2022/11/22
Manually Annotated Sub-Corpus (MASC) Library[28] 10^5 * 5 10^2 * 4 American 1990 (apprx.) 2010 (apprx.) Multimedia Written General None None 2022/11/23
Lancaster Newsbooks Corpus (1654 part) Corpus, Tagged 10^5 * 9 10^2 * 2 British 1653 1654 Written, Newspaper, Print Written Nonfiction, News None Free registration required 2022/11/23
The Mail Archive Library - 10^8 * 2 (Various) 1990 Present Written, Computer, Mailing List Written Nonfiction, esp. Coding and Computers None None 2022/11/26
CataList (LISTSERV catalog)[29] Library - -[30] (Various) 1990 (apprx.) Present Written, Computer, Mailing List Written Nonfiction None None 2022/11/28
United Nations Digital Library Library - 10^5 * 7[31] (Various, International/ELF[16]) 1875[32] Present Multimedia Multimedia Nonfiction, Politics None None 2022/11/29
Genius.com Library - - (Various, Multilingual) 1900 (apprx.) Present Spoken, Entertainment, Music Written General, Music None None 2022/12/06
Chronicling America Library - - American 1777 1963 Written, Newspaper, Print Written Nonfiction, News None None 2022/12/06
Library of Congress Library - 10^6 * 3[33] (Various, Multilingual) 1470 (apprx.) Present Multimedia Multimedia General None None 2022/12/06
World Radio History Library - 10^5 * 1[34] (Various, Multilingual)[35] 1900 (apprx.) Present Written, Magazines, Print Written Nonfiction, Radio; Television; Music None None 2022/12/06
Google News Newspapers Archive Library - 10^6 * 6[36][37] (Various, Multilingual)[35] 1738 (apprx.) 2009 Written, Magazines, Print Written Nonfiction, News None None 2022/12/14
VESPA[38] Corpus 10^6 * 2 10^2 * 9 International/ESL[10] 2008 (apprx.) 2008 (apprx.) Written, College Work Written Nonfiction, Academic Restriction to non-profit educational use only[39] Free registration required 2022/12/28
I-EN (Internet English Corpus) Corpus, Tagged 10^8 * 2 - (Various) 2005 2005 Written, Computer, Internet Written Nonfiction, News? None None 2022/12/28
I-EN-CC (Internet English Creative Commons Corpus) Corpus, Tagged 10^8 * 2 - (Various) 2005 (apprx.) 2005 (apprx.) Written, Computer, Internet Written Nonfiction, News? None None 2022/12/28
Springfield! Springfield! Library - 10^5 * 2 (Various) 1910 (apprx.) Present Spoken, Entertainment, Movies and Television Written General None None 2023/03/27
Issuu Library - 10^7 * 5[40] (Various, Multilingual) 2000 (apprx.)[41] Present Written, Magazines Written Nonfiction None Free registration required for full access[42] 2023/01/19
Smithsonian Transcription Center Library - -[43] American 1400 (apprx.)[44] Present Written Written Nonfiction None None 2023/01/22
Voices Remembering Slavery: Freed People Tell Their Stories Library 10^4 * 7[36][45] 10^1 * 3[46] American, African 1932 1975[47] Spoken, Interviews Audio General, Anthropological interviews None None 2023/01/28
Born in Slavery: Slave Narratives from the Federal Writers' Project Library - 10^3 * 2[48] American, African[49] 1936 1938 Written Written Nonfiction, Biographies None None 2023/01/28
Corpus of Historical American English (COHA)[50] Corpus, Tagged 10^8 * 5 10^5 * 1 American 1820 2019 Multimedia Written General None Free registration required 2023/02/14
The TV Corpus Corpus, Tagged 10^8 * 3 10^4 * 8 (Various)[51] 1950 2017 Spoken, Entertainment, Television Written General None Free registration required 2023/03/27
The Movie Corpus Corpus, Tagged 10^8 * 2 10^4 * 3 (Various)[51] 1930 2018 Spoken, Entertainment, Movies Written General None Free registration required 2023/03/27
Corpus of American Soap Operas (CASO) Corpus, Tagged 10^8 * 1 10^4 * 2 American 2001 2012 Spoken, Entertainment, Movies Written Fiction, Television, Soap Operas None Free registration required 2023/03/27
Corpus of US Supreme Court Opinions Corpus, Tagged 10^8 * 1 10^4 * 3 American 1790 (apprx.) 2019 (apprx.)[52] Written Written Nonfiction, Law, Courts, Constitutional None Free registration required 2023/02/16
TIME Magazine Corpus Corpus, Tagged 10^8 * 1 10^5 * 3[53] American 1923 2006 Written, Magazines, Print Written Nonfiction, News None Free registration required 2023/02/16
Corpus of Online Registers of English (CORE) Corpus, Tagged 10^7 * 5 10^4 * 5 (Various)[54] 2013 (apprx.) 2016 (apprx.) Written, Computer, Internet Written General None Free registration required 2023/02/16
Strathy Corpus of Canadian English Corpus, Tagged 10^7 * 5 10^3 * 1 Canadian 1921[55] 2011[55] Multimedia Written General None Free registration required 2023/02/16
Biodiversity Heritage Library Library - 10^5 * 3[56] (Various, Multilingual) 1400 (apprx.) Present Written Written Nonfiction, Academic, Biology None None 2023/02/23
African American Writers, 1892-1912 (AAW) Corpus, Untagged 10^5 * 5 10^0 * 8 American, African 1892 1912 Written Written General None None 2023/03/15
Children's Literature (ChiLit) Corpus, Untagged 10^6 * 4 10^1 * 7 (Unclear)[57] ? ? Written Written Fiction, Children None None 2023/03/15
The Philadelphia Neighborhood Corpus of LING560 Studies (PNC)[58] Corpus 10^6 * 2 10^2 * 3 American 1972 Present?[59] Spoken, Interviews Written (Unclear) Restrictions on excerpt size[60] Yes[61] 2023/03/15
British Pathé[62] Library - 10^5 * 2 British 1896 1984 Spoken, Formal Audio-visual Nonfiction, News None? None 2023/04/06
Newspapers.com Library - 10^5 * 8 (Various)[63] 1690 Present Written, Newspaper, Print Written Nonfiction, News None Wikipedia Library access available. Paid subscription otherwise required. Free trials are available. 2023/04/30
NewspaperArchive Library - ? (Various) 1607 Present Written, Newspaper, Print Written Nonfiction, News None Wikipedia Library access available. Paid subscription otherwise required. Free trials are available. 2023/05/31
PressReader Library - ? (Various) ? Present Written, Newspaper, Print Written Nonfiction, News None Some snippets freely visible, most content requires paid subscription. Free trials are available. 2023/05/31
ProQuest Library - ? (Various) ? Present Written, Newspaper, Print Written Nonfiction, News None Wikipedia Library access available. Some snippets freely visible, most content requires paid subscription. Free trials are available. 2023/05/31
Welsh Newspapers Library - ?[64] Welsh,[65] Multilingual 1804 1919 Written, Newspaper, Print Written Nonfiction, News None? None 2023/08/08
Welsh Journals Library - ?[66] Welsh, Multilingual 1735 2007 Written, Periodicals, Print Written General None? None 2023/08/08
Crime and Punishment Database Library - - English?[67] 1730 1830 Written, Formal, Court Records Written Nonfiction, Law, Courts, Criminal None? None 2023/08/08
American Archive of Public Broadcasting Library - 10^5 * 1[68] (Various, Multilingual)[35] 1931[69] Present Spoken, esp. Formal Audio-visual General, esp. Nonfiction None None, additional content available on-site at GBH or the Library of Congress. 2023/11/01
Buckeye Speech Corpus Corpus, Tagged 10^6 * 3 10^2 * 4 American 1999 2000 Spoken, Interviews Audio, Written General, Anthropological interviews Restriction to educational and research use only Free registration required 2024/02/19
Westminster Detective Library Library 10^7 * 5[36][70] 10^4 * 2[70][71] American 1818 1891 Written, Newspapers, Print[72] Written Fiction, Short Stories, Detective Stories None None 2024/02/26
Usenet Archive (UTZOO Wiseman/Zach Barth) Social Media - 10^6 * 2[73] (Various) 1981 1991 Written, Computer, Social Media, Usenet Written General, esp. Nonfiction None None 2024/03/20
Searchids.com[74] Library[75] 10^7 * 7[76] 10^7 * 2[77] (Various) 2006 2006 Written, Computer, Internet Searches Written General Restriction to non-commercial research use only[78] None 2024/04/11

Non-English

Non-English corpora table
Name Language Language Code Resource Type Size in words Size in texts Start year End year Original medium Available medium Genre Use restrictions Access restrictions Date of entry update
Czech National Corpus[79] Czech cs Corpus, Tagged ? ? ? ? ? Multimedia General None? None 2024/04/22
Polish National Corpus[79] Polish pl Corpus, Tagged 10^9 * 2 ? ? ? ? Written General None? None 2023/02/12
Russian National Corpus[79] Russian ru Corpus, Tagged 10^9 * 2[80] 10^6 * 5[80] 1100 Present Multimedia Written General Restriction to non-commercial linguistic use only[81] None 2023/02/12
Turkish National Corpus[82] Turkish tr Corpus, Tagged? 10^7 * 5 10^3 * 6 1990 2009 Written[83] Written General Restriction to educational use only[84] Free registration required 2023/02/12
Bruno Corpus[85] Spanish es Corpus, Untagged 10^6 * 1 10^2 * 5 ? 2010 (apprx.) Written Written? General None None 2023/02/12
Braun Corpus[85] German de Corpus, Untagged 10^6 * 1 10^2 * 5 ? 2008 (apprx.) Written Written? General None None 2023/02/12
Corpus del Español: Genre/Historical Spanish es Corpus, Tagged 10^8 * 1 10^4 * 1 1200 (apprx.) 2000 (apprx.) Multimedia, esp. Written Written General None Free registration required 2023/03/24
Corpus del Español: Web/Dialects Spanish[86][87] es Corpus, Tagged 10^9 * 2 10^6 * 2 2010 (apprx.) 2014 Written, Computer, Internet Written General None Free registration required 2023/03/24
Corpus del Español: NOW Spanish[86][88] es Corpus, Tagged 10^9 * 7 10^7 * 1 2012 2019 Written, Computer, Internet Written Nonfiction, News None Free registration required 2023/03/24
Corpus del Español del Siglo XXI (CORPES)[89] Spanish es Corpus, Tagged 10^8 * 4 10^5 * 4 2001 2022 Multimedia, esp. Written Multimedia, esp. Written General None? None 2023/03/24
Lemko and Karpatska Rus’ Archive[90] Carpathian Rusyn rue Library - 10^3 * 2 1928 1989 Written, Newspaper, Print Written Nonfiction, News None? None 2024/04/22
Spauda[90] Lithuanian lt Library - ? 1886 2015 Written, Newspaper, Print Written Nonfiction, News None? None 2023/04/04
Gallica French fr Library - 10^7 ? ? Written, Newspaper, Print Written General, esp. Nonfiction, News None None 2023/05/31
RetroNews French fr Library - ? (>10^6 * 3) 1631 1951 Written, Newspaper, Print Written Nonfiction, News None None 2023/05/31
The Database of Early Cantonese Bible Cantonese yue Corpus, Untagged? - 10^0 * 7 1863 1927 Written, Religious Text Written Religious, Christianity, Bible Passages None? None 2023/12/10
The Database of Early Christian Literature Cantonese yue Corpus, Untagged? - 10^0 * 5 1845 (apprx.) 1906 Written, Books, Print Written Religious, Christianity None? None 2023/12/10

Glossary

The following is a brief explanation of how various terms are used in describing and categorizing the corpora on this page.

  • Access restrictions: Any barriers to accessing the resource's contents, such as registration or a paid subscription. A number of resources can be accessed for free through the Wikipedia Library.
  • Apprx.: "Approximately", used to indicate that a date or quantity was estimated or not exactly known when inputted, but a best guess was given.
  • Available medium: The format through which the language can be accessed in the resource, such as written text or speech recorded in a video.
  • Esp.: "Especially", used to qualify the most common quality of a corpus, even if there are notable exceptions.
  • Hyphen (-): The symbol "-" is used in tables for information about a corpus that cannot be readily determined or approximated.
  • Library: Collection of texts gathered with a wide net and without linguistics work particularly in mind. It must be possible to search the contents of these texts.
  • Original Medium: The way the language was originally produced, whether it was spoken, written, etc.
  • Question mark (?): The symbol "?" is used in tables for information about a corpus that has not yet been determined, but probably could be.
  • Social media: A live website or other online center for mass user communication, or the attempt at a near complete archive of such. If the resource is an archive with a particular focus, then it is considered a library or corpus.
  • Re-use restrictions: Unique restrictions on the distribution of the resource's contents beyond general copyright law, in particular restrictions on commercial use or to academic users only. This restriction is particularly relevant to Wiktionary, where all content must be redistributable commercially per Wiktionary's CC BY-SA 4.0 license.
  • Strikethrough ( ): Resources with their names crossed out with a strikethrough were nonfunctional or otherwise broken at the time of the entry's last update.
  • Tagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are marked by part of speech, meaning, pragmatics, or any other method.
  • Text: A continuous use of language by a particular author presented as a whole. This could be a forum post in a thread, a book, or a speech.
  • Untagged Corpus: Collection of texts gathered within a specific scope with linguistics work at least partly in mind. The contents of the texts are not marked by part of speech, meaning, pragmatics, or any other method.
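
The size columns in the tables on this page use a compact power-of-ten notation such as "10^9 * 2", alongside the "-" and "?" symbols defined above. As a minimal sketch for anyone who wants to sort or compare entries programmatically, the hypothetical helper below converts a size cell to an integer. Reading "10^9 * 2" as roughly two billion is an assumption about the notation, and the function name parse_size is illustrative only. Both "-" and "?" are treated as unknown, and footnote markers such as "[15]" are ignored.

```python
import re
from typing import Optional


def parse_size(cell: str) -> Optional[int]:
    """Convert a size cell such as "10^9 * 2" to an approximate integer.

    Returns None for "-" (cannot be readily determined) and "?" (not yet
    determined). Raises ValueError for anything else that does not fit
    the assumed "10^n * m" pattern.
    """
    # Drop footnote markers like "[15]" before interpreting the cell.
    text = re.sub(r"\[\d+\]", "", cell).strip()
    if text in {"-", "?"}:
        return None
    match = re.fullmatch(r"10\^(\d+)\s*\*\s*(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized size cell: {cell!r}")
    exponent = int(match.group(1))
    multiplier = int(match.group(2))
    return multiplier * 10 ** exponent


if __name__ == "__main__":
    for cell in ["10^10 * 2", "10^8 * 1[9]", "-[20]", "?"]:
        print(cell, "->", parse_size(cell))
```

Since the sizes are order-of-magnitude estimates, the resulting integers are best used for rough ranking rather than exact counts.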

Other lists and databases

Other lists and databases table
Name Language Language Code Size in corpora Date of entry update
Corpus Resource Database (CoRD) Translingual, esp. English mul, en 10^2 * 1 2023/02/13
Czech National Corpus KonText interface Translingual mul 10^3 * 1[91] 2023/02/13
English-Corpora.org English en 10^1 * 2 2023/02/13
Leipzig Corpora Collection Translingual mul 10^3 * 1 2023/02/13
Lextutor Web Concordance English English en 10^1 * 5 2023/02/13
Lextutor Web Concordance French French fr 10^1 * 2 2023/02/13
LINDAT/CLARIAH-CZ Corpora Translingual mul 10^2 * 7 2023/02/13
Linguistic Data Consortium (LDC) Translingual mul 10^3 * 1 2023/02/13
Martin Weisser's On-line Corpora of English Translingual, esp. English mul, en 10^1 * 2 2023/02/13
SketchEngine Translingual mul 10^1 * 2[92] 2023/02/13
University of Warwick list of free online corpora English en 10^1 * 2 2023/02/13
University of Edinburgh Scots and Scottish English corpora Scots, English sco, en 10^1 * 3 2023/02/13
SHACHI Database of Language Resources[93] Translingual mul 10^3 * 2 2024/04/22
CLARIN.SI Online Concordancers Translingual, esp. Slovene mul, sl 10^2 * 2 2023/02/26
CLARIN.SI Corpus Repository Translingual, esp. Slovene mul, sl 10^2 * 2 2023/02/26
CLARINO Corpuscle Translingual, esp. Norwegian mul, no 10^1 * 6 2023/02/26
CLARINO Corpus Repository Translingual, esp. Norwegian mul, no 10^1 * 4 2023/02/26
Online Resources for African American Language (ORAAL), external data sources English en 10^1 * 1 2023/03/15
Online Resources for African American Language (ORAAL), supplements English en 10^0 * 2 2023/03/15
Corpus Linguistics in Context (CLiC) English en 10^0 * 5 2023/03/15
The Spanish Corpus Spanish es 10^0 * 4 2023/03/24
Pennsylvania State University scripts and transcripts of popular film, TV, and sports English[94] en 10^1 * 2 2023/04/02
/r/Screenwriting Guide to Finding Scripts Online English[94] en 10^1 * 2 2023/04/02
BBC.com[95] Translingual mul 10^1 * 3 2024/04/22
Corpus4U.org[96][97] English, Chinese en, zh 10^2 * 2 2023/06/17
Beijing Foreign Studies University CQPweb[98] Translingual mul 10^2 * 2 2023/06/17
Lancaster University CQPweb Translingual, esp. English mul, en 10^2 * 1 2023/06/17
Hong Kong University of Science and Technology Resources for Chinese Linguistics Chinese, esp. Cantonese zh, yue 10^0 * 3 2023/12/10
PolyU Corpus of Spoken Chinese, links to other corpora and databases Translingual, esp. Chinese mul, zh 10^2 * 1 2024/01/13

See also Wiktionary:Searchable external archives and Wiktionary:Quotations/Resources.

Notes

  1. Specifically Australia, Bangladesh, Canada, Ghana, Great Britain, Hong Kong, India, Ireland, Jamaica, Kenya, Malaysia, New Zealand, Nigeria, Pakistan, Philippines, Singapore, South Africa, Sri Lanka, Tanzania, and the United States.
  2. Note that dialect information in internet-derived corpora tends to be somewhat inaccurate because of accidental inclusion of texts in other dialects.
  3. Specifically Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States.
  4. Particularly speeches and interviews.
  5. Most after 2005.
  6. Most before 2017.
  7. An account is required to use the site's built-in search function. Nonetheless, the forum threads can still be viewed and navigated without hindrance when logged out.
  8. Note that the "British English" and "American English" sub-corpora of Google Ngrams are sometimes very inaccurate or misleading because of the accidental inclusion of texts in other dialects. Consider color vs. colour and airplane vs. aeroplane in the "British English" corpus. In both cases, Google Ngrams shows the forms as being roughly equally common from 2000 to 2019, which is blatantly untrue.
  9. Madian Khabsa, C. Lee Giles (2014 May 9) “The Number of Scholarly Documents on the Public Web”, in PLOS ONE, volume 9, number 5.
  10. 10.0 10.1 10.2 "English as a Second Language"
  11. ^ The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
  12. 12.0 12.1 Audio files are available separately on TalkBank.org.
  13. ^ The corpus' manual can be accessed online.
  14. ^ The corpus' fair use statement says that "if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required" which is incompatible with Wiktionary's license.
  15. ^ Based on back-of-the-napkin extrapolation of data at the Internet Live Stats website.
  16. 16.0 16.1 16.2 16.3 "English as a Lingua Franca"
  17. 17.0 17.1 The website is composed of a series of search tools, including n-gram and concordance search, based on the BNC.
  18. ^ Selection of different tools can be done through the "Grams" menu in the top left of the page.
  19. ^ Full name "Brown University Standard Corpus of Present-Day American English"
  20. ^ The corpus is made of six "texts", but looking at their descriptions reveals that each one is actually a compilation of multiple texts. For example, "feen4" is described as "7 separate titles". Overall, the exact number of independent texts included is unclear.
  21. ^ Full name "International Corpus Network of Asian Learners of English, Online Version"
  22. ^ Shin Ishikawa (2022 April 12) “The ICNALE [] ”, in language.sakura.ne.jp[1], SAKURA Internet, archived from the original on 2022-08-14:The ICNALE includes [] speeches and essays produced by college students [] in ten countries/ regions in Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/ Malaysia, Taiwan, and Thailand) as well as English native speakers.
  23. 23.0 23.1 The name is based on an abbreviation of the phrase "UK web as corpus".
  24. ^ To search the collection, select either "User-Added Text (Back)" or "User-Added Text (Front)" under "Narrow by Specific Fields", then select "contains" from the drop down just to the right and then enter the search term next to that and hit enter. Note that the overall quality and style of the data presented in the collection varies considerably.
  25. ^ As of 2023/03/07, 2,574 cards have the field "Writing on Card (Yes or No)?" marked as "yes". Nonetheless, there are cards that do have handwriting on them and have the field marked "no".
  26. ^ Approximately; the site actually lists its size as "7,600,186 phrases" (emphasis added).
  27. 27.0 27.1 Though not exclusively short stories, the format dominates the library.
  28. ^ Although MASC is technically a corpus, it is only directly available through a web browser as a library. A complete copy of MASC as a corpus can be downloaded, though, and then processed with another application.
  29. ^ Many, if not most, of the LISTSERVs in the catalog do not have publicly accessible archives.
  30. ^ The catalog describes itself (as of 28 Nov 2022) as containing 58,100 public lists, each of which contains a number of messages.
  31. ^ Approximately; not all items cataloged in the library are available online. In particular, it seems none of the around 300,000 speeches cataloged are available online.
  32. ^ Most after 1945
  33. ^ Number of items which are both available online and have their language marked as "English".
  34. ^ Approximately, based on a search of the collection for the basic words "a" and "the".
  35. 35.0 35.1 35.2 The number of non-English items is small.
  36. 36.0 36.1 36.2 Approximately, based on a statistical calculation.
  37. ^ Note that this number represents the number of newspaper issues in the archive.
  38. ^ Full name "Varieties of English for Specific Purposes dAtabase"
  39. ^ The corpus' end-user license states "Grant of the Product license entitles Licensee to use the Product for non-profit educational and/or linguistic research purposes only. [...] Licensees agree not to lease, sell, or commercially exploit the results of their searches (such as texts, concordances, metadata)." which is incompatible with Wiktionary's license.
  40. ^ Per https://issuu.com/about as of 2023/01/19
  41. ^ Issuu was founded in 2006 but includes some older publications uploaded since then; most of those are from after 1990, if not 2000.
  42. ^ Registration is required in order to turn "safe mode" off/show explicit search results.
  43. ^ Unclear. The collection is organized by "projects" which sometimes correspond to individual texts (such as diaries or funeral programs) and other times correspond to a collection of short texts (such as notes or letters). There were 11,372 projects on 2023/01/22. The length of projects is reported by the number of pages they contain. Using random sampling, it was estimated that the total length of all projects was around 2 million pages on 2023/01/22.
  44. ^ Most after 1800
  45. ^ Note that some transcripts were incomplete when this number was calculated.
  46. ^ Each interview in the collection, regardless of the number of parts it has, is considered one text. According to the "Faces and Voices from the Presentation" article, 26 interviews are in the collection.
  47. ^ Most from before 1950.
  48. ^ See Appendix I: Narratives in the Slave Narrative Collection by State for numerical breakdown by state
  49. ^ On the accuracy of the writings, see The Limitations of the Slave Narrative Collection
  50. ^ Note that the COHA was updated in 2021.
  51. 51.0 51.1 Specifically "United States/Canada", "United Kingdom/Ireland", "Australia/New Zealand", and "Miscellaneous".
  52. ^ Note that the corpus is listed as going up to the "present", but as of 2023/02/16 the most recent section is the 2010s implying that no opinions from later decades are included.
  53. ^ Note that this number reflects the number of articles in the corpus, not the number of issues of TIME Magazine in the corpus.
  54. ^ Specifically Australia, Canada, New Zealand, the United Kingdom, and the United States
  55. 55.0 55.1 Note that the Queen's University page describing the corpus describes the start year as 1970 and end year as 2010 despite english-corpora.org providing a source spreadsheet which spans the years 1921 to 2011 and its corpus description page showing a time span from the 1920s to 2010s.
  56. ^ On the website, this number is associated with how many "volumes" are available and is listed alongside the number of "titles" (10^5 * 2) as well as the number of pages. The exact meaning of the terms "volumes" and "titles" in this context is unclear.
  57. ^ Note that although the corpus does explicitly mention its contents, I have not put in the effort to determine the dialect of each of the included texts.
  58. ^ The website for the corpus is now offline for unclear reasons, but it is presumably still possible to access the corpus by contacting the university.
  59. ^ The corpus' description implies that it is a continually expanding project, but in 2018 the page had not been updated in 5 years (since 2013), which may suggest the project stopped expanding around the same time.
  60. ^ An apparently genuine archived version of the corpus' confidentiality agreement does state "If I need to cite more than one paragraph (300 words) in a publication, I will obtain permission from the Philadelphia Neighborhood Corpus Committee".
  61. ^ An archive of the corpus' home page states that "only members of the research group have access".
  62. ^ Note that searches cover both metadata and transcripts for newsreels simultaneously.
  63. ^ Specifically Australia, Canada, Ireland, New Zealand, Panama, and the United Kingdom.
  64. ^ Issue counts are provided for individual publications, but not for the entire collection. 12.7 million articles in English are available, though, with each issue featuring many articles.
  65. ^ A few publications originate from regions outside Wales, in particular three from London, one from the United States, and one from Argentina. An additional publication has no region listed though its "issuing body note" states "Published in Caernarfon by Thomas Jones", with Caernarfon being in Wales.
  66. ^ Issue counts are provided for individual publications, but not for the entire collection. 363 thousand pages in English are available, though, with each issue featuring many pages.
  67. ^ The English Wikipedia article on the Court of Great Sessions in Wales stated on 2023-08-08 that "[o]f the 217 judges who sat on its benches [...], only 30 were Welshmen". Those involved in keeping the court's records likely had a similar makeup, and so the database's dialect likely reflects England rather than Wales.
  68. ^ This number represents the number of recordings available online.
  69. ^ This date represents the earliest year specified for any recording in the archive, though that recording does not have audio. It is not immediately clear what the earliest recording with audio is. The earliest audio-only recording is from 1938.
  70. 70.0 70.1 Note that this number was calculated to include the about 25% of work listings which were placeholders on 2024/02/24 but should eventually become full entries, and to exclude the about 15% of work listings which were redirects to other listings on the same date.
  71. ^ Based on the fact that the list pages for browsing works display 25 works at a time and there are 78 pages to browse as of 2024/02/26.
  72. ^ Not explicitly stated, but browsing the collection on 2024-02-26 revealed only newspapers being cited as the source of the stories provided.
  73. ^ Samantha Cole (2020 October 13) “2.1 Million of the Oldest Internet Posts Are Now Online for Anyone to Read”, in Vice[2], archived from the original on 2020-10-13:Around 2.1 million posts from between February 1981 and June 1991 from Henry Spencer's UTZOO NetNews Archive are archived at the Usenet Archive for anyone to browse.
  74. ^ There is also a mirror site, Explicit-Id.com
  75. ^ Though the site does feature a built-in search function, it is significantly limited and prone to errors. For this reason, I've classified it as a "library" rather than a "corpus". A complete copy of the original data can be downloaded (see here for details) and processed with another application, though.
  76. ^ From the number of queries multiplied by the average of 3.5 words per query mentioned in the scientific article that originally accompanied the data: Greg Pass, Abdur Chowdhury, Cayley Torgeson (2006 May) “A Picture of Search”, in Proceedings of the First International Conference on Scalable Information Systems, Hong Kong, →DOI, page 2
  77. ^ Number of queries, per the README included with the data
  78. ^ This requirement is incompatible with Wiktionary's license.
  79. 79.0 79.1 79.2 Note that multiple sub-corpora and related corpora can be searched on the site.
  80. 80.0 80.1 Note that these numbers represent the size of all the corpora on the site tallied together.
  81. ^ The corpus' terms FAQ states "All data published under [this website] are available exclusively for non-commercial use for research and educational purposes [...] they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon." This requirement is incompatible with Wiktionary's license.
  82. ^ As of 2023-02-12 the query interface was offline.
  83. ^ The corpus' about page states that it is specifically 98% written and 2% spoken.
  84. ^ The corpus' user agreement states "TUD sadece araştırma ve sunum amaçlı kullanıma açıktır ve fikri mülkiyet hakları tümüyle Sağlayıcıya aittir." (roughly, '[the corpus] is available for research and presentation purposes only and the intellectual property rights remain the sole property of the Provider.') This requirement is incompatible with Wiktionary's license.
  85. 85.0 85.1 This corpus was designed to imitate the English-language Brown Corpus.
  86. 86.0 86.1 Specifically Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, El Salvador, Spain, United States, Uruguay, Venezuela.
  87. ^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue is addressed on the website with the conclusion that the "categorization is quite good".
  88. ^ Note that dialect information in internet-derived corpora is usually somewhat inaccurate because of the accidental inclusion of texts in other dialects. This issue was addressed for the related Web/Dialects corpus with the conclusion that the "categorization is quite good" so a similar level of quality may exist for this corpus.
  89. ^ Note that the CORPES is currently undergoing continuous revision and so this information may be out of date. To be specific, the information presented is for version 0.99.
  90. 90.0 90.1 Note that the newspapers were published in the United States.
  91. ^ Approximately; it is difficult to see the full list of corpora in order to get an accurate estimate.
  92. ^ Note that this number reflects the number of corpora freely available. Including the corpora which require a subscription or special permission the number comes up to 722 as of 2023/0/13.
  93. ^ Note that the database has not been updated since 2016 and has a somewhat buggy search system.
  94. 94.0 94.1 Not confirmed to be English exclusively, but probably almost all English.
  95. ^ The BBC publishes news online in a wide variety of languages which can then be searched manually using a search engine like Google. The languages are specifically Arabic, Azeri, Bangla, Burmese, Chinese, French, Hausa, Hindi, Indonesian, Japanese, Kinyarwanda, Kirundi, Kyrgyz, Marathi, Nepali, Pashto, Persian, Portuguese, Russian, Sinhala, Somali, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, Uzbek, and Vietnamese.
  96. ^ The forum is primarily written in Chinese, though some posts are in English.
  97. ^ The section which primarily hosts links to corpuses is labeled "专题研究" (Google Translate translates this as "Special Research".)
  98. ^ Both user ID and password are "test" for freely available corpora.

Further reading
