Oxford Curated Corpora

The Oxford Curated Corpora programme delivers rich, structured corpus data curated to the highest industry standards.


It is informed by more than two decades of experience building and working with corpora for our own language research programme at Oxford Languages, where we have refined corpus exploitation solutions for our in-house team of linguists, lexicographers, and language technologists as they create our language content.


We have brought this tried-and-tested expertise in developing and applying corpora solutions to clients around the world, enabling leaders in global communication and AI technologies to enhance their own language technology applications with our authoritative and reliable corpora.




How can the Oxford Curated Corpora power your product development?


Versatile and customizable, the Oxford Curated Corpora programme delivers corpus data that supports a range of use cases, such as:

  • Training AI applications, e.g. auto-completion, sense induction, ontology inference
  • Developing domain-specific language models
  • Providing linguistic insights and supporting other corpus linguistics activities


Tell us about your unique use case to explore bespoke corpora solutions with our expert team.



What goes into the Oxford Curated Corpora?


An Oxford Curated Corpus provides clean, structured data in the languages, genres, domains, registers, and modes of your choice, all of which can be supplied with linguistic labelling and/or annotation.


We currently have licensable corpora in Arabic, Indonesian, and English, with our English corpora including numerous regional varieties – such as American, British, East Asian, and Indian – that are available for specific selection.


Our corpora encompass news publications, columns, and blogs; scholarly texts, academic journals, and textbooks; literary fiction, online articles, and more – all selected by our language data experts.


The data is structured, tagged, and annotated by genre and domain, making it simple to create corpora tailored to your use case, whether that is in finance, medicine, law, or any other industry.


We are continually expanding the Oxford Curated Corpora offering, with more languages, domains, and linguistic labels in the programme pipeline.


Get in touch for more information on our current corpora offering and forthcoming developments.




Types of corpora available with the Oxford Curated Corpora


We can provide both monolingual corpora and parallel corpora as off-the-shelf or bespoke datasets.


All of our corpora and related products are delivered as CSV files as standard. Other formats, such as XML and JSON, are available on request.
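
As a rough illustration of how a CSV delivery annotated by genre and domain might be narrowed to a single use case, here is a minimal Python sketch. The column names (`text`, `genre`, `domain`), the sample rows, and the `filter_rows` helper are all hypothetical, invented for this example; the layout of an actual delivered file may differ.

```python
import csv
import io

# Hypothetical sample data: the column names and rows are invented for
# illustration and do not reflect the actual layout of a delivered corpus.
SAMPLE_CSV = """text,genre,domain
"Shares rose after the earnings call.",news,finance
"The patient was discharged on Tuesday.",academic,medical
"The central bank held rates steady.",blog,finance
"""


def filter_rows(csv_file, **criteria):
    """Return the rows whose columns match every column=value pair given."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if all(row.get(column) == value for column, value in criteria.items())
    ]


# Keep only the rows annotated with the (hypothetical) "finance" domain.
finance_rows = filter_rows(io.StringIO(SAMPLE_CSV), domain="finance")
for row in finance_rows:
    print(row["genre"], "|", row["text"])
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and the criteria by whichever annotation columns the dataset actually provides.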


Monolingual corpora:

  • Curated and annotated corpora in a single language

A corpus of clean, structured data in a single language, tailored for genre, domain, register, and mode, with tagging and linguistic annotation. Currently available for English, Arabic, and Indonesian.


  • English Monitor corpus

A corpus of current English collated from online sources using our state-of-the-art language monitoring technology. Regularly updated to include the latest English language developments.


Parallel corpora:

We offer parallel corpora in multiple languages and language pairs, including low-resource languages, developed with high-quality, aligned data for consistent translation accuracy.


View our list of available languages and get in touch with our team to discuss how our parallel-aligned corpora can power your product development.




Other products available from the Oxford Curated Corpora


Wordlists:




From filtered glossaries to our off-the-shelf Oxford English Wordlist, we extract and curate wordlists in many languages, including low-resource languages, from our corpora programme. We also clean, structure, and customize wordlists on behalf of our industry-leading language technology partners.


Our wordlists are currently in use in gaming, eCommerce, online security, and more, with our flexible, reliable data tailored to each unique use case.


Find out more about the Oxford Wordlists


N-grams:





We offer n-grams in more than 20 languages, all customized for domain and genre on request – see our full list of licensable languages.


Use cases for our n-grams include NLP for keyboard prediction, categorizing word usage for lexicography, ontology inference, and sense induction.
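
To make the keyboard prediction use case concrete, the following Python sketch shows one simple way bigram counts could drive next-word suggestions. The counts, the `build_index` and `suggest` helpers, and the in-memory dictionary format are hypothetical and chosen only for illustration; they are not the format of the licensed n-gram data, and a production predictor would typically use longer n-grams with smoothing.

```python
from collections import defaultdict

# Hypothetical bigram counts, invented for illustration only.
BIGRAM_COUNTS = {
    ("thank", "you"): 120,
    ("thank", "goodness"): 8,
    ("you", "are"): 95,
    ("you", "can"): 70,
    ("you", "very"): 15,
}


def build_index(bigram_counts):
    """Map each first word to its followers, most frequent first."""
    index = defaultdict(list)
    for (first, second), count in bigram_counts.items():
        index[first].append((second, count))
    for candidates in index.values():
        # Sort by descending count, breaking ties alphabetically.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return index


def suggest(index, previous_word, k=3):
    """Return up to k likely next words after previous_word."""
    return [word for word, _ in index.get(previous_word.lower(), [])[:k]]


index = build_index(BIGRAM_COUNTS)
print(suggest(index, "you", k=2))  # prints ['are', 'can']
```

Ranking followers by raw frequency is the simplest possible choice; counts customized for a domain or genre would shift the suggestions towards the vocabulary of that domain.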


Our team is also developing word embeddings, bag-of-words data, sentiment analysis training data, and many other products derived from the Oxford Curated Corpora – get in touch to find out more.