Oxford Curated Corpora
The Oxford Curated Corpora programme delivers rich, structured corpus data curated to the highest industry standards.
It draws on more than two decades of experience building and working with corpora for our own language research programme at Oxford Languages, where we have refined corpus tools for our in-house team of linguists, lexicographers, and language technologists as they create our language content.
We have brought this tried-and-tested expertise in developing and applying corpus solutions to clients around the world, enabling leaders in global communication and AI technologies to enhance their own language technology applications with our authoritative and reliable corpora.
How can the Oxford Curated Corpora power your product development?
Versatile and customizable, the Oxford Curated Corpora programme delivers corpus data that supports a range of use cases, such as:
- Training AI applications, e.g. auto-completion, sense induction, and ontology inference
- Developing domain-specific language models
- Providing linguistic insights and supporting other corpus linguistics activities
What goes into the Oxford Curated Corpora?
An Oxford Curated Corpus provides clean, structured data in the languages, genres, domains, registers, and modes of your choice, all of which can be supplied with linguistic labelling and/or annotation.
We currently have licensable corpora in Arabic, Indonesian, and English. Our English corpora cover numerous regional varieties – such as American, British, East Asian, and Indian – each of which can be selected individually.
Our corpora encompass news publications, columns, and blogs; scholarly texts, academic journals, and textbooks; literary fiction, online articles, and more – all selected by our language data experts.
The data is structured, tagged, and annotated by genre and domain, making it simple to create a corpus tailored to your use case, whether you work in finance, medicine, law, or any other industry.
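As an illustration of how genre and domain labels can be used to tailor a corpus, here is a minimal Python sketch. It assumes a CSV delivery with one text per row and hypothetical column names (`text`, `genre`, `domain`); the actual schema of an Oxford Curated Corpus may differ, and the sample rows are invented purely for demonstration.

```python
import csv
import io

# Invented sample standing in for a delivered CSV file. The column names
# (text, genre, domain) are assumptions, not the actual delivery schema.
SAMPLE_CSV = """text,genre,domain
"Shares rose sharply after the earnings call.",news,finance
"The patient presented with acute symptoms.",academic,medical
"The court dismissed the appeal.",news,legal
"Bond yields fell for a third day.",blog,finance
"""


def select_rows(rows, genre=None, domain=None):
    """Keep only rows whose labels match the requested genre and/or domain."""
    selected = []
    for row in rows:
        if genre is not None and row["genre"] != genre:
            continue
        if domain is not None and row["domain"] != domain:
            continue
        selected.append(row)
    return selected


# Parse the CSV into a list of dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Build a finance-only subset, then narrow it further to finance news.
finance_rows = select_rows(rows, domain="finance")
finance_news = select_rows(rows, genre="news", domain="finance")

for row in finance_rows:
    print(row["genre"], "|", row["text"])
```

With a real delivery, the in-memory sample would be replaced by the opened CSV file, and the same selection could be combined across any of the labels supplied with the data.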
We are continually expanding the Oxford Curated Corpora offering, with more languages, domains, and linguistic labels in the programme pipeline.
Types of corpora available with the Oxford Curated Corpora
We can provide both monolingual corpora and parallel corpora as off-the-shelf or bespoke datasets.
All of our corpora and related products are delivered as CSV files as standard. Other formats, such as XML and JSON, are available on request.
- Curated and annotated corpora in a single language
A corpus of clean, structured data in a single language, tailored for genre, domain, register, and mode, with tagging and linguistic annotation. Currently available for English, Arabic, and Indonesian.
- English Monitor corpus
A corpus of current English collated from online sources using our state-of-the-art language monitoring technology. Regularly updated to include the latest English language developments.
We offer parallel corpora in multiple languages and language pairs, including low-resource languages, developed with high-quality, aligned data for consistent translation accuracy.
Other products available from the Oxford Curated Corpora