Oxford Languages Datasets

Oxford Languages is part of Oxford University Press, and we are the leading provider of lexical and language datasets for artificial intelligence, natural language processing, machine learning, and a wide range of language technologies.


Our high-quality data is trusted by some of the world’s leading technology companies, providing them with clean, structured datasets for an extensive variety of use cases.


As the world leader in providing language content, we offer data that is flexible, curated, and comprehensively annotated. Our team of in-house language experts works to deliver datasets that enhance your products, while our flexible data model ensures our lexical content is a good match for your products and your users. From educational technology to AI, we have the solution to fit your needs.


Building on over 150 years of experience and technological innovation, we deliver authoritative, evidence-based language content in 56 languages. Alongside our flagship English content and well-known high-resource languages, we also offer a large collection of low-resource languages including Hindi, Tamil, Telugu, and more.

Solutions our data can provide


Validation - Our wordlists are ideal for customers needing to incorporate word validation and spellcheck into interfaces simply and effectively, such as game developers building username features, or app developers working on spellcheck features.


Display - Our monolingual, bilingual, and thesaurus datasets are an effective solution in the broader educational space, where dictionary data in the interface is essential to the user experience, for example in click-to-define features, expandable synonym lists, and visual presentations of etymologies.


Translation - Our bilingual and bilingualized datasets are well suited to adult language learning companies and to interfaces that offer translation and localisation services or supporting materials.


Enhanced Reading - Our monolingual and thesaurus datasets are ideal for education platforms that support learners as they read. A tool or feature may suggest synonyms, collocates, or definitions, which help readers to develop their understanding, reading skills, and vocabulary.


Prediction - Our wordlists and n-gram datasets are ideal for interfaces that need to make entering user-generated content more efficient, such as those built by predictive text developers.


Pronunciation - Our Soundbank and transcriptions (IPA or plain text) are an excellent solution for customers requiring accurate data with low error margins, such as transliteration specialists, or those working with languages in different scripts.
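
For developers evaluating the Validation and Prediction use cases above, the following is a minimal sketch of how a wordlist and a set of n-gram counts might be used in practice. It is an illustration only: the tiny wordlist, the bigram counts, and the function names are all hypothetical placeholders, and they do not reflect the actual format or content of any Oxford Languages dataset.

```python
from collections import defaultdict

# Hypothetical stand-in for a licensed wordlist; a real dataset is far larger.
WORDLIST = {"lexicon", "language", "dictionary", "thesaurus"}

# Hypothetical bigram counts: (previous word, next word) -> frequency.
BIGRAM_COUNTS = {
    ("natural", "language"): 120,
    ("natural", "history"): 45,
    ("natural", "selection"): 30,
    ("language", "model"): 80,
}


def is_valid_word(word, wordlist=WORDLIST):
    """Return True if the trimmed, lowercased word appears in the wordlist."""
    return word.strip().lower() in wordlist


def build_next_word_index(bigram_counts):
    """Group bigrams by first word, most frequent continuation first."""
    index = defaultdict(list)
    for (first, second), count in bigram_counts.items():
        index[first].append((second, count))
    for first in index:
        # Sort by descending count, then alphabetically to break ties.
        index[first].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(index)


def predict_next(previous_word, index, k=2):
    """Return up to k of the most frequent words seen after previous_word."""
    candidates = index.get(previous_word.strip().lower(), [])
    return [word for word, _ in candidates[:k]]


if __name__ == "__main__":
    index = build_next_word_index(BIGRAM_COUNTS)
    print(is_valid_word("Lexicon"))       # True
    print(is_valid_word("lexikon"))       # False
    print(predict_next("natural", index)) # ['language', 'history']
```

A set gives constant-time membership checks for validation, and pre-sorting the continuations keeps each prediction lookup cheap. A production system would typically add normalisation rules, longer n-grams, and smoothing.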