Lexical Datasets for NLP: Transliteration & spelling variants
Our lexical data supports language models that work with languages written in non-Roman scripts. It helps models align each language’s original script with the Roman script, so that text can be processed and generated in either script.
The Oxford Languages data for Bangla, Gujarati, Hindi, and Tamil offers complete coverage of the most significant words in each language, each accompanied by its transliterations and spelling variants.
Our data creation and review process is meticulous, ensuring accurate representation of how these Indian languages are written and spoken today. We combine corpus and natural language processing (NLP) methodologies, data creation and review by linguists who are native speakers, and native-speaker annotation to curate data with core word forms and alternative spellings.
We offer the dataset in two versions: one that focuses on word frequency and another that includes other grammatical features and labelling, such as sensitivity labels. We are dedicated to providing accurate and reliable language data to improve language models and help bridge the gap between different scripts.
Approximately 25,000 lemmas
Between 300,000 and 500,000 wordforms
Features:
- Coverage of the most important lemmas and inflections (wordforms) in each language, as defined by dictionaries and corpora
- Each unique combination of wordform and part of speech is presented with a frequency bucket to support predictive use cases
- Grammatical features to support advanced NLP pipelines which consider syntax
- Spelling variants to support interlingual communication where orthographic variation occurs
- Sensitivity classification (e.g. offensive, sensitive) and categorization to support detection of potentially sensitive or offensive language
- Transliteration (multiple columns for variants) to support dual-script interfaces and user experiences
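To illustrate how the features above could fit together in a predictive text setting, here is a minimal Python sketch. It is not the actual dataset schema: the `Entry` field names, the convention that bucket 1 is the most frequent, and the sample Hindi rows (including their bucket values) are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One wordform/part-of-speech row. All field names are hypothetical."""
    wordform: str                  # native-script form
    lemma: str
    pos: str                       # part of speech
    frequency_bucket: int          # assumed convention: 1 = most frequent
    transliterations: tuple = ()   # Roman-script variants
    spelling_variants: tuple = ()  # native-script variants
    sensitivity: str = ""          # e.g. "offensive", "sensitive", or empty


# Illustrative rows only; bucket values are invented, not taken from the dataset.
SAMPLE = [
    Entry("पानी", "पानी", "noun", 1, ("pani", "paani")),
    Entry("काम", "काम", "noun", 1, ("kaam", "kam")),
    Entry("कम", "कम", "adjective", 2, ("kam",)),
]


def build_roman_index(entries):
    """Map each lowercased Roman transliteration to the entries it can stand for."""
    index = defaultdict(list)
    for entry in entries:
        for roman in entry.transliterations:
            index[roman.lower()].append(entry)
    return index


def suggest(index, roman_input, allow_sensitive=False):
    """Return native-script candidates for Roman input, most frequent first."""
    candidates = index.get(roman_input.lower(), [])
    if not allow_sensitive:
        # Drop anything carrying a sensitivity label.
        candidates = [e for e in candidates if not e.sensitivity]
    ranked = sorted(candidates, key=lambda e: e.frequency_bucket)
    return [e.wordform for e in ranked]


if __name__ == "__main__":
    index = build_roman_index(SAMPLE)
    print(suggest(index, "kam"))
    print(suggest(index, "paani"))
```

Keeping several transliteration variants per row lets one Roman spelling resolve to more than one native-script word, and the frequency bucket gives a simple way to order those candidates. A real integration would follow the column names and bucket definitions supplied with the data.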
Predictive text
Generative AI (GenAI)
Text-to-speech (TTS)