Lexical Datasets for NLP: Dialects

Our World English data goes beyond American and British varieties. It gives you normalized frequency data drawn either from a specified general corpus or from a region-specific corpus for British, American, Indian, or Australian English.
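As a rough illustration of what normalized frequency means, the sketch below scales a raw token count to occurrences per million tokens so that counts from corpora of different sizes can be compared. The per-million scale is a common convention assumed here; this page does not state which normalization the dataset uses. The function name and the numbers are hypothetical.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Scale a raw count to occurrences per million tokens.

    Assumption: per-million normalization; the dataset's actual
    scheme is not specified in the description above.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Invented counts for one word in two corpora of different sizes.
general = per_million(250, 5_000_000)
regional = per_million(90, 1_000_000)
print(general, regional)
```

Once normalized, the two values can be compared directly even though the underlying corpora differ in size.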

With our data, you can easily establish equivalences between dialects and trace back to a 'standard' variety of the language that speakers of all dialects can understand. Our data also supports language models that need to handle many variations of English, covering the differences in vocabulary, spelling, and pronunciation between varieties.

All wordforms are aligned with their equivalent forms and spellings in the four target dialects. The lexical data includes all inflected forms of each lemma (the base form of a word), along with their morphological information and grammatical features.
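One way to picture this alignment is as a record per inflected form that carries a spelling for each dialect. The sketch below is only an illustration: the class name, field names, dialect codes, and sample entries are assumptions made for the example and do not reflect the actual delivery format of the dataset.

```python
from dataclasses import dataclass

# Hypothetical dialect codes for the four target dialects.
DIALECTS = ("en-GB", "en-US", "en-IN", "en-AU")


@dataclass
class WordForm:
    lemma: str       # base form of the word
    features: dict   # grammatical features, e.g. {"pos": "noun"}
    spellings: dict  # dialect code -> spelling of this inflected form


# Invented sample entries: two inflected forms of one lemma.
ENTRIES = [
    WordForm("colour", {"pos": "noun", "number": "singular"},
             {"en-GB": "colour", "en-US": "color",
              "en-IN": "colour", "en-AU": "colour"}),
    WordForm("colour", {"pos": "noun", "number": "plural"},
             {"en-GB": "colours", "en-US": "colors",
              "en-IN": "colours", "en-AU": "colours"}),
]


def equivalents(entries, spelling, source, target):
    """Return target-dialect spellings aligned with a source-dialect spelling."""
    return [e.spellings[target] for e in entries
            if e.spellings.get(source) == spelling]


print(equivalents(ENTRIES, "colours", "en-GB", "en-US"))
```

Keeping one record per inflected form, rather than per lemma, lets each form carry its own grammatical features while still mapping cleanly across dialects.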

The dataset goes beyond token counts extracted from corpora: the lemmas were sourced from multiple resources to ensure thorough coverage of the most common words. Our team of expert linguists and native speakers manually curates and reviews the morphological information and parallel translations, guaranteeing the highest level of accuracy.

We can provide the dataset in two versions: one focused on word frequency, and one that adds grammatical features and labelling, such as sensitivity labels.

Dialects

Increase the range and diversity of your models with localized lexical data. Our dialect datasets cover language varieties of English and Swiss German.