Lexical Datasets for NLP: Domain-specific data

At Oxford Languages, we provide domain-specific data that supports language models used within specific industries. Our team of experienced linguists curated this data and organized it using taxonomies present in our dictionary data, ensuring that enterprises receive the best possible support for their language models.


Our data is developed using trustworthy resources, including our flagship English dictionaries (Oxford Dictionary of English and New Oxford American Dictionary) and our pronunciation data program. We offer two domains, medical and finance, both of which are constantly updated with the latest evidence from the world's largest language research program, including the multi-billion-word Oxford English Corpus.


The data provides both high-level and granular taxonomies, making it well suited to optimizing language models for specific industries. Each dataset includes:


  • Element (subject, DomainClass, SemanticClass)


  • Related headword count for each domain


  • Distinct related meaning count


  • Value: the category present within the domain


  • Related headwords for each domain


  • Distinct related meaning (definition)
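As a rough illustration of how the fields listed above might fit together, here is a minimal Python sketch of a single record. The class name, field names, and sample values are all assumptions made for this example; they are not the actual delivery format or real entries from the dataset.

```python
from dataclasses import dataclass, field


@dataclass
class DomainEntry:
    """One hypothetical taxonomy record; all field names are assumptions."""
    element: str  # "subject", "DomainClass", or "SemanticClass"
    value: str    # the category present within the domain
    related_headwords: list[str] = field(default_factory=list)
    related_meanings: list[str] = field(default_factory=list)

    @property
    def related_headword_count(self) -> int:
        # Number of headwords linked to this category.
        return len(self.related_headwords)

    @property
    def distinct_meaning_count(self) -> int:
        # Duplicate definitions are counted once.
        return len(set(self.related_meanings))


# Placeholder values invented for illustration; not taken from the dataset.
entry = DomainEntry(
    element="subject",
    value="cardiology",
    related_headwords=["artery", "stent"],
    related_meanings=[
        "a vessel carrying blood away from the heart",
        "a tube placed in a vessel to keep it open",
    ],
)
print(entry.related_headword_count, entry.distinct_meaning_count)
```

Deriving the two counts from the lists, rather than storing them separately, keeps them consistent with the headwords and definitions in this sketch; a real export may well supply them as plain columns.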


Medical domain-specific dataset

Contains 16 different medical-related categories with XXXX wordforms.

Finance domain-specific dataset

Contains 42 different finance-related categories with XXXX wordforms.

Get in touch to arrange a meeting with our sales team and explore how we could enhance your language models.
