Lexical Datasets for NLP: Dialects
Increase the range and diversity of your models with localized lexical data. Our lexical datasets cover language varieties of English and of Swiss German.
Our World English data goes beyond just American and British varieties, offering you access to normalized frequency data from a specified general corpus or from a relevant region-specific corpus, for example for British, American, Indian, or Australian English.
With our data, you can easily establish equivalences between dialects and find or trace back to a 'standard' variety of the language that speakers of all dialects can understand. Our data also supports language models that need to handle many varieties of English, which differ in vocabulary, spelling, and pronunciation.
Features:
- Coverage of the most important wordforms (lemmas, inflections and parts of speech) across Englishes, as found in dictionaries and corpora
- Lexical locale classification to determine where the wordform is in use
- Each wordform and locale combination is presented with a frequency bucket to allow nuanced World English localization use cases
For AI applications such as:
- Predictive text
- Generative AI (GenAI)
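As a rough illustration of how a wordform, locale and frequency bucket combination could be used for localization, the sketch below picks the preferred spelling for a target locale. The record layout, the field names, the locale codes and the bucket values are all assumptions made for this example and do not reflect the actual dataset schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


# Hypothetical record layout: field names and the bucket scale are
# illustrative assumptions, not the dataset's real schema.
@dataclass(frozen=True)
class LocaleEntry:
    wordform: str
    lemma: str
    pos: str
    locale: str            # assumed locale code, e.g. "en-GB"
    frequency_bucket: int  # assumed scale: higher means more frequent


# Invented sample rows; the bucket values are made up for the example.
ENTRIES = [
    LocaleEntry("colour", "colour", "noun", "en-GB", 5),
    LocaleEntry("color", "color", "noun", "en-GB", 1),
    LocaleEntry("color", "color", "noun", "en-US", 5),
    LocaleEntry("colour", "colour", "noun", "en-US", 1),
]


def preferred_wordform(
    candidates: Iterable[str],
    locale: str,
    entries: Iterable[LocaleEntry] = ENTRIES,
) -> Optional[str]:
    """Return the candidate with the highest bucket in the locale, or None."""
    wanted = set(candidates)
    best = None
    for entry in entries:
        if entry.locale != locale or entry.wordform not in wanted:
            continue
        if best is None or entry.frequency_bucket > best.frequency_bucket:
            best = entry
    return best.wordform if best else None
```

A predictive text component could call `preferred_wordform({"colour", "color"}, "en-GB")` to rank spelling variants for a British English keyboard, and fall back to a general list when the function returns `None` for a locale with no data.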
This large dialectal dataset covers Standard Swiss German and the Swiss German dialects spoken in Bern, Basel, Zurich, and Luzern, and is intended for language models that work with Swiss German.
All wordforms are aligned with their equivalent forms and spellings in the four target dialects. The lexical data includes all possible inflected forms of each lemma (the base form of a word), along with their morphological information and grammatical features.
Rather than relying only on token counts extracted from corpora, we sourced the lemmas from multiple resources to ensure thorough coverage of the most common words. Our team of expert linguists and native speakers manually curates and reviews the morphological information and parallel translations to ensure a high level of accuracy.
We can provide the dataset in two versions: one focused on word frequency, and one that adds further grammatical features and labelling, such as sensitivity labels.
Features:
- Standard Swiss German wordforms and gender annotation to show a standardized index for Swiss German interlingual communication
- Each wordform is presented with a frequency-size bucket to support the predictive use case
- The High German equivalent wordform, lemma and part of speech (for non-cognate words) to support intra-lingual communication
- Equivalent dialectal Swiss German wordforms, lemmas and parts of speech for: Bern, Basel, Zurich, Luzern dialects
- Grammatical features to support advanced NLP pipelines which consider syntax
- Sensitivity classification (e.g. offensive, sensitive) and categorization to support detection of potentially sensitive or offensive language
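To make the alignment described in the features above more concrete, here is a minimal sketch of how an aligned entry might be represented and used to build dialect-specific suggestions that skip sensitive items. The class, the field names, the bucket scale and the placeholder strings are hypothetical; real wordforms are deliberately not shown, and the actual dataset format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical aligned record: one Standard Swiss German wordform with
# its equivalents in the four target dialects. All names are assumptions.
@dataclass
class AlignedEntry:
    standard_form: str
    pos: str
    frequency_bucket: int            # assumed scale: higher is more frequent
    dialect_forms: Dict[str, str]    # dialect name -> dialect wordform
    sensitivity: Optional[str] = None  # e.g. "offensive", or None


def _forms(n: int) -> Dict[str, str]:
    """Build placeholder dialect forms; real data would hold actual words."""
    return {d: f"<{d.lower()}-{n}>" for d in ("Bern", "Basel", "Zurich", "Luzern")}


ENTRIES = [
    AlignedEntry("<standard-1>", "noun", 4, _forms(1)),
    AlignedEntry("<standard-2>", "noun", 2, _forms(2), sensitivity="offensive"),
    AlignedEntry("<standard-3>", "verb", 5, _forms(3)),
]


def safe_suggestions(entries: List[AlignedEntry], dialect: str) -> List[str]:
    """Return dialect wordforms, most frequent first, without sensitive items."""
    safe = [e for e in entries if e.sensitivity is None]
    safe.sort(key=lambda e: e.frequency_bucket, reverse=True)
    return [e.dialect_forms[dialect] for e in safe]
```

Calling `safe_suggestions(ENTRIES, "Bern")` on the placeholder data returns the Bern forms of the two unlabelled entries, ordered by their assumed frequency bucket.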
Our Swiss German dialects data was designed to meet the requirements of a predictive text scenario while simultaneously satisfying the requirements for other use cases.