Lexical Datasets for NLP: Dialects

Increase the range and diversity of your models with localized lexical datasets covering varieties of English and Swiss German.

Our World English data goes beyond American and British varieties, giving you access to normalized frequency data from either a specified general corpus or a relevant region-specific corpus, such as British, American, Indian, or Australian English.

With our data, you can easily establish equivalences between dialects and find, or trace back to, a 'standard' variety of the language that speakers of all dialects can understand. Our data also supports language models that need to handle many variations of English, covering the vocabulary, spelling, and pronunciation that differ between varieties.

Features:

 

  • Coverage of the most important wordforms (lemmas, inflections and parts of speech) across Englishes, as found in dictionaries and corpora

  • Lexical locale classification to determine where each wordform is in use

  • A frequency bucket for each wordform and locale combination, to allow nuanced World English localization use cases
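
As a rough illustration of how a wordform, locale and frequency bucket combination might be used for localization, here is a minimal Python sketch. The field names, the locale codes, the bucket scale (higher means more frequent) and the sample rows are all assumptions made for this example; they are not taken from the actual dataset or its delivery format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleEntry:
    """One hypothetical row: a wordform as used in one locale."""
    wordform: str
    lemma: str
    pos: str
    locale: str
    frequency_bucket: int  # assumed scale: higher = more frequent


# Illustrative rows only; the bucket values are invented for this sketch.
ENTRIES = [
    LocaleEntry("colour", "colour", "noun", "en-GB", 5),
    LocaleEntry("colour", "colour", "noun", "en-US", 1),
    LocaleEntry("color", "color", "noun", "en-US", 5),
    LocaleEntry("color", "color", "noun", "en-GB", 1),
]


def preferred_wordform(candidates, locale, entries=ENTRIES):
    """Return the candidate with the highest bucket in the locale, or None."""
    best = None
    for entry in entries:
        if entry.locale == locale and entry.wordform in candidates:
            if best is None or entry.frequency_bucket > best.frequency_bucket:
                best = entry
    return best.wordform if best else None


if __name__ == "__main__":
    print(preferred_wordform({"colour", "color"}, "en-GB"))
    print(preferred_wordform({"colour", "color"}, "en-US"))
```

A predictive text system could apply the same kind of lookup to rank spelling variants for the user's locale, returning nothing when the locale has no data for any candidate.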

 

For AI applications such as:

  • Predictive text

  • Generative AI (GenAI)

This large dialectal dataset covers Standard Swiss German and the Swiss German dialects spoken in Bern, Basel, Zurich, and Luzern, for language models that work with Swiss German.

All wordforms are aligned with their equivalent forms and spellings in the four target dialects. The lexical data includes all possible inflected forms of each lemma (the base form of a word), along with their morphological information and grammatical features.

 

The dataset goes beyond extracting token counts from corpora: the lemmas were sourced from multiple resources to ensure thorough coverage of the most common words. Our team of expert linguists and native speakers manually curated and reviewed the morphological information and parallel translations to ensure a high level of accuracy.

 

We can provide the dataset in two versions: one focused on word frequency, and the other adding grammatical features and labelling, such as sensitivity labels.

 

Features:

  • Standard Swiss German wordforms with gender annotation, providing a standardized index for intra-lingual communication in Swiss German

  • A frequency-size bucket for each wordform to support the predictive use case

  • The High German equivalent wordform, lemma and part of speech (for non-cognate words) to support intra-lingual communication

  • Equivalent dialectal Swiss German wordforms, lemmas and parts of speech for the Bern, Basel, Zurich and Luzern dialects

  • Grammatical features to support advanced NLP pipelines which consider syntax

  • Sensitivity classification (e.g. offensive, sensitive) and categorization to support detection of potentially sensitive or offensive language
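
To make the feature list above more concrete, the following Python sketch shows one way an aligned entry could be modelled and used for predictive suggestions. Everything here is hypothetical: the class and field names, the dialect keys, the bucket scale (higher means more frequent) and the placeholder strings such as "STD_A" stand in for real wordforms and do not reflect the actual schema or contents of the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed keys for the four target dialects.
DIALECTS = ("bern", "basel", "zurich", "luzern")


@dataclass
class AlignedEntry:
    """Hypothetical aligned record for one Standard Swiss German wordform."""
    standard_wordform: str
    lemma: str
    pos: str
    frequency_bucket: int  # assumed scale: higher = more frequent
    dialect_forms: Dict[str, str] = field(default_factory=dict)
    sensitivity: Optional[str] = None  # e.g. "offensive", "sensitive", or None


def dialect_form(entry: AlignedEntry, dialect: str) -> str:
    """Return the dialect wordform, falling back to the standard form."""
    if dialect not in DIALECTS:
        raise ValueError(f"unknown dialect: {dialect}")
    return entry.dialect_forms.get(dialect, entry.standard_wordform)


def suggestions(entries: List[AlignedEntry], dialect: str, limit: int = 3) -> List[str]:
    """Rank unlabelled (non-sensitive) entries by bucket and localize them."""
    safe = [e for e in entries if e.sensitivity is None]
    ranked = sorted(safe, key=lambda e: e.frequency_bucket, reverse=True)
    return [dialect_form(e, dialect) for e in ranked[:limit]]


# Placeholder rows only; these are not real Swiss German wordforms.
ENTRIES = [
    AlignedEntry("STD_A", "LEMMA_A", "noun", 4, {"bern": "BERN_A", "zurich": "ZURICH_A"}),
    AlignedEntry("STD_B", "LEMMA_B", "verb", 5, {"bern": "BERN_B"}),
    AlignedEntry("STD_C", "LEMMA_C", "noun", 5, {"bern": "BERN_C"}, sensitivity="offensive"),
]

if __name__ == "__main__":
    print(suggestions(ENTRIES, "bern"))
    print(suggestions(ENTRIES, "basel"))
```

Falling back to the standard wordform when a dialect form is missing is a design choice made for this sketch; a real pipeline might prefer to skip such entries instead.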

 

Our Swiss German dialects data was designed to meet the requirements of a predictive text scenario while simultaneously satisfying the requirements for other use cases.

Learn more about the development of our Swiss German lexical data

Get in touch to arrange a meeting with our sales team and explore how we could enhance your language models.

Our Privacy Policy sets out how Oxford University Press handles your personal information, and your rights to object to your personal information being used for marketing to you or being processed as part of our business activities.