Language datasets

Flexible and curated language datasets

We are the leading provider of lexical and language datasets for artificial intelligence, natural language processing, machine learning, and a wide range of language technologies.

Oxford has been partnering with the world’s leading technology companies for over a decade, providing them with clean, structured datasets for an extensive variety of use cases.



60+ available languages

We offer flexible, curated datasets for more than 60 of the world’s major languages. The datasets include definitions, translations, examples, idioms, phonetics and phonetic transcriptions, regional varieties, and inflected forms.

Uniquely, we also offer an ever-growing portfolio of high-quality datasets in low-resource languages, as part of our ongoing commitment to ensuring that all language communities benefit from digital access and representation.

See our available languages ⟶



Children's Language Datasets

Our children's language datasets curate the right words, defined at the right level, for each age and stage of learning. They are created using our unique Oxford Children's Corpus, the world's largest children's language database.

Read more here ⟶



How could our language datasets enhance your products?

Our deep experience of working with technology companies informs how we support your projects, large or small, ensuring that language and technology integrate seamlessly to enhance your products.

A partnership with Oxford ensures that the language content and data you need meet Oxford’s quality standards, and gives you a single point of contact for multiple languages.