We’ve created this FAQ page to support your journey with us and our English datasets. If you have questions that aren’t covered in this material, please share them with us by contacting our Customer Success Managers or sending your query to [email protected].

 

If you have any questions about the Oxford Dictionaries API specifically, visit our dedicated API FAQ page.

 

I am especially interested in pronunciation data. Can I get this separately?

 

Yes. We offer a separate English Pronunciations Asset, which provides pronunciation data for all words in English, including inflectional and derivational forms. Features in this asset include variant word spellings, variety (US/UK), parts of speech, syllabified IPA transcription, simple text respell information, and sound files.

 

 

Do you offer audio data?

 

Yes, we offer audio files for every word in English, which are referenced in respective entries of the English monolingual dictionary.

 

The English datasets (ODE* and NOAD**) include audio files for each word in the dictionary, offered in MP3 or WAV format. The size of the audio delivery is approximately 2 GB. Our pronunciation assets combine audio data with pronunciation data (including metadata such as a transcription of each word), delivered in tabular format. If you are interested in audio data for specific languages, please contact us.

 

*ODE: Oxford Dictionary of English (British English)

**NOAD: New Oxford American Dictionary (US English)

 

 

How frequently are the dictionary and thesaurus updated by OUP?

 

ODE and NOAD are updated every six months. The Oxford Thesaurus of English (OTE) is updated yearly. Access to updates is subject to license fees and business terms. Data available on the API is also updated regularly.

 

 

What do dictionary updates include?

 

Oxford Languages provides two different types of updates.

 

The more frequent type is content updates, which cover the addition of new words or phrases, revisions of entries, and changes to sensitivity, etymology and audio information. An overview of these updates can be found in the Content Release Notes, which are included with the yearly or six-monthly update deliveries.

 

The other type is data updates, which are less frequent. These are improvements to specific parts of our data structure, such as the removal, replacement, and addition of elements or attributes, which lead to some differences from previous versions of the data. An overview of these updates can be found in the Data Release Notes, which are included with the respective update deliveries.

 

 

What languages and types of dataset do you offer?

 

Oxford Languages offers data for over 60 languages. The range of datasets available per language varies from one language to another. Types of datasets include monolingual, bilingual and bilingualised dictionaries, audio data, thesauri, pronunciation datasets, morphology datasets, wordlists, and also corpora.

 

If you are looking for data in specific languages, please ask about their availability as API or one-off file deliveries, or check out this link to see what’s available on the API.

 

 

Does the dictionary data include synonyms?

 

No, but it does include sense-level linking to the Oxford Thesaurus of English within the <linkGroup> element. This seamless integration of the dictionary with the thesaurus enables the user to easily link a word's synonyms with its dictionary entry. This supports various dictionary display use cases.

 

For further information about synonym integration, please ask about our thesaurus assets, which you are able to license alongside the dictionary assets.

 

 

Does the dictionary include information for World English varieties?

 

Yes. In our XML data, there are regional labels for the different varieties of English in both ODE and NOAD. These labels are found in the <ge> tag (geographical label) within the <lg> tag (label group), and are able to specify the following range of World English varieties:

 

Australian English
Australian & New Zealand English
British English
Canadian English
East African English
Indian English
Irish English
Nigerian English
North American English
Northern England
Northern Irish English
New Zealand English
Philippine English
Scottish English
Southeast Asian English
South African English
South Asian English
US English
Welsh English
West African English
West Indian English
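
As a rough illustration of reading these labels, the sketch below collects the text of every <ge> found inside an <lg>. Only the <lg> and <ge> names come from this FAQ; the <entry> wrapper, the sample headword, and the function name are assumptions made for the example, so the real XML may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry fragment. Only <lg> and <ge> are named in the FAQ;
# the <entry> wrapper and the sample content are illustrative assumptions.
SAMPLE_ENTRY = """
<entry>
  <hw>example</hw>
  <lg><ge>Australian &amp; New Zealand English</ge></lg>
</entry>
"""


def regional_labels(entry_xml):
    """Return the text of every <ge> that sits directly inside an <lg>."""
    root = ET.fromstring(entry_xml)
    return [ge.text.strip() for ge in root.findall(".//lg/ge") if ge.text]


print(regional_labels(SAMPLE_ENTRY))
```

The returned strings can then be compared against the list of varieties above, for example to filter entries for a single region.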

 

 

What pronunciation information can I find in the dictionary?

 

Pronunciation information is nested within the <prx> element of the respective entry, which includes the following elements and attributes:

 

Dialect: for specification of dialect or variety of language (e.g. "AmE" = American English)
Type: for IPA transcription, including primary and secondary stress* (e.g. ˈkædəˌɡɔri)
Media: for reference to sound files that accompany the dictionary file

 

 

Does the dictionary offer syllabification of words?

 

The US dictionary (NOAD) contains this information in the "syllabified" attribute within <hw> (headword). This provides syllabified word forms for headwords throughout the dictionary, e.g. fol·low·er. Currently, this is only included for headwords/lemmas.
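
A minimal sketch of reading this attribute follows. It assumes that the separator in the syllabified form is the middle dot (U+00B7), as in the example above, and that <hw> can be parsed on its own; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical headword element. The "syllabified" attribute on <hw> is
# described in the FAQ; the middle dot (U+00B7) separator is an assumption.
SAMPLE_HW = '<hw syllabified="fol\u00b7low\u00b7er">follower</hw>'


def syllables(hw_xml):
    """Split the syllabified headword form into its syllables."""
    hw = ET.fromstring(hw_xml)
    form = hw.get("syllabified")
    return form.split("\u00b7") if form else []


print(syllables(SAMPLE_HW))  # ['fol', 'low', 'er']
```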

 

For further data on syllabification, please ask about our English Pronunciations Asset, which includes syllabified IPA transcriptions across all inflected forms of words. This asset also comes with other pronunciation features, such as sound files and pronunciation variations.

 

 

Does the dictionary offer line breaks for words?

 

The UK dictionary (ODE) contains this information in the linebreaks attribute within <hw> (headword). This suggests the best places to split a word for editorial purposes, e.g. gen¦er|ation. The solid line (|) marks the preferred line break, whereas the broken line (¦) marks an optional, second-choice line break.
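
The sketch below shows one possible way to interpret a linebreaks value once it has been read from <hw>. It assumes the two markers are the vertical bar (U+007C) and the broken bar (U+00A6), as in the example above; the function name and the "preferred"/"optional" labels are illustrative only.

```python
PREFERRED = "|"       # solid line: the preferred break
OPTIONAL = "\u00a6"   # broken bar: the optional, second-choice break


def break_points(linebreaks):
    """Return the plain word and a list of (offset, kind) break points.

    Each offset is the number of letters that precede the break.
    """
    letters = []
    points = []
    for ch in linebreaks:
        if ch == PREFERRED:
            points.append((len(letters), "preferred"))
        elif ch == OPTIONAL:
            points.append((len(letters), "optional"))
        else:
            letters.append(ch)
    return "".join(letters), points


print(break_points("gen\u00a6er|ation"))
# ('generation', [(3, 'optional'), (5, 'preferred')])
```

A typesetting routine could try the preferred offsets first and fall back to the optional ones only when no preferred break fits the line.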

 

 

 

Does the dictionary include domain or semantic class information?

 

Yes. You can get domain or semantic information by searching through the data using the <domClass> (domain class) or <semclass> (semantic class) element, respectively. For a better understanding of the domains available and their hierarchy, please ask for our domain hierarchy documentation.

 

 

Do you provide register information?

 

Yes. Register information describes the stylistic or social context in which a word can be used. It is included in the dictionary and thesaurus entries at sense level, using the following labels:

 

Archaic
Child language
Dated
Derogatory
Dialect
Euphemistic
Formal
Historical
Humorous
Informal
Ironic
Literary
Military slang
Nautical slang
Offensive
Rare
Rhyming slang
Technical
Trademark
Vulgar slang

 

Additional details regarding the labels can be found here.

 

 

Can I find American and British spelling variants in the same entry?

 

Yes. American and British spelling variants, and important spelling variants that are not region-specific, are shown in the <v> element. This means that in the British English entry for the word "colour", you will find "color" included as a variant form, and in the American English entry for "color", you will find "colour" included as a variant form.
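
One use of the <v> element is a lookup from any spelling to the entry that covers it. The sketch below is only an illustration: <hw> and <v> are named in this FAQ, but the <dictionary> and <entry> wrappers, the position of <v> inside the entry, and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the British English data. <hw> and <v> are named
# in the FAQ; the <dictionary>/<entry> wrappers and nesting are assumptions.
SAMPLE_ENTRIES = """
<dictionary>
  <entry><hw>colour</hw><v>color</v></entry>
</dictionary>
"""


def variant_index(xml_text):
    """Map each headword and each of its <v> variants to the headword."""
    index = {}
    for entry in ET.fromstring(xml_text).iter("entry"):
        headword = entry.findtext("hw")
        if not headword:
            continue
        index[headword] = headword
        for variant in entry.iter("v"):
            if variant.text:
                # Keep the first mapping if a variant appears more than once.
                index.setdefault(variant.text.strip(), headword)
    return index


print(variant_index(SAMPLE_ENTRIES))
# {'colour': 'colour', 'color': 'colour'}
```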

 

 

How are word inflections and derivatives treated in the dictionary?

 

Irregular inflections can be found in <infg> elements (e.g. "write" has the irregular inflections "wrote" and "written"), whereas a complete list of inflections, whether regular or irregular, is found in <morphSet>. Inflected word forms do not have their own separate entries in the dictionary. For more complex inflected forms and metadata, please ask for our separate morphology assets.

 

Derivations, on the other hand, such as "mover" or "unmoved", may have their own separate entries, in which case they are typically not included within the root word entry, "move". Often, however, derivations are included within <subEntry> elements in the root word entry. Whether a derivation is included in the root entry or as a standalone entry depends on the word's frequency as well as our editors' judgement.

 

 

How are word senses encoded in the data?

 

Within an entry, definitions are grouped by part of speech and then by sense. The element <sg> (sense group) contains one or more <se1> elements, each of which represents a part of speech, e.g. "noun", "verb", etc. Within <se1>, each <se2> element represents a separate sense that the word can have under that part of speech. Within each part-of-speech section, senses and their definitions are ordered by frequency, so that the most frequent sense of the word appears first. Each separate sense is labelled with its own unique ID.
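
The nesting described above can be walked as in the following sketch. The <sg>, <se1> and <se2> names come from this FAQ; the "pos" and "id" attribute names, the <df> definition element, and the sample content are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sense group. <sg>, <se1> and <se2> are named in the FAQ; the
# "pos" and "id" attributes, <df>, and the sample text are assumptions.
SAMPLE_SG = """
<sg>
  <se1 pos="noun">
    <se2 id="s1"><df>first noun sense</df></se2>
    <se2 id="s2"><df>second noun sense</df></se2>
  </se1>
  <se1 pos="verb">
    <se2 id="s3"><df>first verb sense</df></se2>
  </se1>
</sg>
"""


def list_senses(sg_xml):
    """Return (part of speech, rank, sense ID, definition) for each sense.

    The rank follows document order, which the FAQ describes as frequency
    order within each part of speech.
    """
    rows = []
    for se1 in ET.fromstring(sg_xml).findall("se1"):
        for rank, se2 in enumerate(se1.findall("se2"), start=1):
            rows.append((se1.get("pos"), rank, se2.get("id"), se2.findtext("df")))
    return rows


for row in list_senses(SAMPLE_SG):
    print(row)
```

Because the rank restarts for each <se1>, the first row for every part of speech corresponds to its most frequent sense.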

 

 

What is a dictionary entry?

 

The Oxford Dictionary of English (ODE) and the New Oxford American Dictionary (NOAD) are English monolingual dictionaries that offer meaning information for words. A dictionary entry consists of a headword (with homonym numbers for different words that are spelled identically but have completely unrelated meanings and histories), line breaks or syllable breaks, pronunciations, parts of speech, labels for region/register/subject, senses, definitions, example sentences, phrases/idioms, derivatives, and origins (etymology). Some entries also include notes giving usage, technical, or encyclopedic information. Other features include semantic class, domain class, sensitivity classification, and identification of the 1,000 most frequent words.

 

Additional information regarding dictionary content can be found here.

 

 

What is a thesaurus entry?

 

The Oxford Thesaurus of English (OTE) is a type of thesaurus sometimes called a "synonyms dictionary", meaning that it is arranged by headword and gives synonyms based on their relation to a specific word-sense. A thesaurus entry consists of senses containing a number of synonym groups. Some of the synonym groups contain words that are extremely close in meaning, connotation, register, etc., and others that are more distant. The Oxford Thesaurus of English is organized so that the closest synonyms for a given sense, the ones that are the best match in meaning, come first.

 

 

I have found something in the data that I am not sure is correct. Who should I contact?

 

Contact your Customer and Partner Success Manager and they will be able to support you.

 

 

What are “sk” attributes?

 

"sk" attributes are also known as "sort keys". They are used to sort the entries into the correct alphabetical order. Sequences such as "zzz" are part of the process of ensuring that the order matches the print product as closely as possible. Sometimes a "z" is also added at the start of the key to ensure that it sorts after the a-z headwords (e.g. the headword "@" has the sort key value "zzz@").

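To illustrate the effect of sort keys, the sketch below orders a few entries by their "sk" value. Only the "zzz@" key for the headword "@" comes from this FAQ; the other keys, the dictionary layout, and the function name are assumptions made for the example.

```python
# Hypothetical entries. Only the key "zzz@" for "@" is taken from the FAQ;
# the other keys assume the sort key is simply the lowercase headword.
ENTRIES = [
    {"hw": "zebra", "sk": "zebra"},
    {"hw": "@", "sk": "zzz@"},
    {"hw": "apple", "sk": "apple"},
]


def order_by_sort_key(entries):
    """Return the headwords ordered by their "sk" value."""
    return [e["hw"] for e in sorted(entries, key=lambda e: e["sk"])]


print(order_by_sort_key(ENTRIES))        # ['apple', 'zebra', '@']
print(sorted(e["hw"] for e in ENTRIES))  # ['@', 'apple', 'zebra']
```

Sorting on the raw headword would place "@" first, whereas the sort key moves it after the a-z headwords.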
 

 

In which formats does the data come?

 

Our dictionaries can be offered as a one-time delivery or on demand via the API. Dictionaries delivered as a one-time delivery are typically in XML format; other formats can be considered on special request. Dictionary data delivered via the API is in JSON format. Other datasets, such as pronunciation and morphology datasets, vary in format and may be offered in XML, JSON, or tabular format.

 

For more information about the API, including endpoints and their capabilities, please visit the Oxford Dictionaries API website.

 

 

What is the difference between the API and bulk data you offer?

 

The main difference between our API and bulk data is the format. While the API outputs JSON, our bulk datasets are typically offered in XML. The API data has a simpler structure in order to accommodate users who prefer a quicker and more intuitive entry point to the data. Although the bulk data has a more complex structure, it provides greater depth and detail in its data features, such as more intricate lexical information and convenient links to other datasets.

 

If you are looking for data in specific languages, please ask about their availability as API or bulk deliveries, or check out this link to see what's available on the API.

 

If you would like to see more specific differences between API and bulk XML data, please contact your Customer and Partner Success Manager and ask for our feature-mapping documentation.

 

 

What is the size of the dictionaries as bulk data?

 

The size of our dictionaries depends on the language and dataset type. Typically, monolingual dictionaries can be as small as 24 MB and can go up to 400 MB.

 

The English dictionaries are approximately 200 MB. Thesaurus dataset sizes can range from 30 MB to 60 MB. Pronunciation audio files for English are approximately 2 GB in size.

 

 

Where can I get branding information and assets?

 

Oxford Languages branding resources are available on our website. Please visit our branding resources page for more information.

 

 

Our company details have changed. Do we need to inform Oxford Languages?

 

Yes. Please contact your Customer and Partner Success Manager.

 

 

Our usage of Oxford Languages data has changed slightly. Do we need to inform Oxford Languages?

 

Yes, even small changes are important for us to know about. Please contact your Customer and Partner Success Manager.

 

 

I think I have chosen the wrong product. Am I able to change the product I have?

 

We are more than happy to discuss your requirements and assess whether you need an alternative product, or if we can assist in applying the data you have for your use. Please contact your Customer and Partner Success Manager to discuss your needs and possible next steps.

 

 

I am interested in one of your products, but I am not sure whether it will suit my needs. Do you offer samples?

 

Yes, we offer free samples of our datasets to allow you to test them out in your products. This can be done for bulk data, for which we would typically offer data for one letter of the alphabet, or for API access, for which we offer 500 free calls across all endpoints and languages. During this sample phase, our Customer Success team is available to support you in understanding and ingesting the sample data.