No language is a monolith. Languages vary richly across countries, regions, social classes, and other factors. Despite recent advances in natural language processing (NLP) technology for translating between languages, answering questions, or engaging in simple conversations, current approaches have largely focused only on “standard” varieties of languages. By ignoring other varieties, treating them essentially as statistical noise, current technologies neglect the millions of people who speak these varieties.
This project is creating ways to enable language technologies such as translation and question-answering systems both to process and to generate fine-grained language varieties. The team will develop computational methods to automatically recognize features of different language varieties and then create approaches for integrating such linguistic information into the models powering language technologies. Additionally, the team will design methods to adapt models to varieties for which minimal training data may be available. The resulting suite of general methods will benefit diverse communities and less-privileged populations that speak underserved languages and varieties.
This is a collaborative project between George Mason University, the University of Notre Dame, and the University of Washington.
2025
VarDial
Testing the Boundaries of LLMs: Dialectal and Language-Variety Tasks
Fahim Faisal, and Antonios Anastasopoulos
In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Jan 2025
This study evaluates the performance of large language models (LLMs) on benchmark datasets designed for dialect-specific NLP tasks. Dialectal NLP is a low-resource field, yet it is crucial for evaluating the robustness of language models against linguistic diversity. This work is the first to systematically compare state-of-the-art instruction-tuned LLMs—both open-weight multilingual and closed-weight generative models—with encoder-based models that rely on supervised task-specific fine-tuning for dialectal tasks. We conduct extensive empirical analyses to provide insights into the current LLM landscape for dialect-focused tasks. Our findings indicate that certain tasks, such as dialect identification, are difficult for LLMs to handle effectively, owing to the complexity of multi-class setups and the fact that these tasks lend themselves to supervised fine-tuning. Additionally, the structure of task labels—whether categorical or continuous scoring—significantly affects model performance. While LLMs excel in tasks like machine reading comprehension, their instruction-following ability declines on ostensibly simpler tasks like POS tagging whose instructions are inherently complex. Overall, subtle variations in prompt design can greatly impact performance, underscoring the need for careful prompt engineering in dialectal evaluations.
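To make the prompt-sensitivity point concrete, the sketch below compares two instruction templates on the same dialect-identification data. It is illustrative only: query_llm stands in for any chat-style LLM API, and the templates are invented for this example rather than taken from the paper.

# Hypothetical sketch: measuring how prompt wording shifts dialect-ID accuracy.
PROMPTS = {
    "plain": "Which dialect is this sentence written in? Options: {labels}. Sentence: {text}. Answer with one option only.",
    "persona": "You are a dialectologist. Classify this sentence into one of: {labels}. Sentence: {text}. Answer with one option only.",
}

def query_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to an instruction-tuned LLM."""
    raise NotImplementedError

def prompt_sensitivity(dataset: list[tuple[str, str]], labels: list[str]) -> dict:
    """Return per-template accuracy; the spread across templates exposes prompt sensitivity."""
    acc = {}
    for name, template in PROMPTS.items():
        hits = 0
        for text, gold in dataset:
            pred = query_llm(template.format(labels=", ".join(labels), text=text))
            hits += int(pred.strip().lower() == gold.lower())
        acc[name] = hits / len(dataset)
    return acc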
@inproceedings{faisal-anastasopoulos-2025-testing,
  title     = {Testing the Boundaries of {LLM}s: Dialectal and Language-Variety Tasks},
  author    = {Faisal, Fahim and Anastasopoulos, Antonios},
  editor    = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Nakov, Preslav and Tiedemann, J{\"o}rg and Zampieri, Marcos},
  booktitle = {Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects},
  month     = jan,
  year      = {2025},
  address   = {Abu Dhabi, UAE},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.vardial-1.6/},
  pages     = {68--92},
}
VarDial
Large Language Models as a Normalizer for Transliteration and Dialectal Translation
Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Jan 2025
NLP models trained on standardized language data often struggle with variation. We assess various large language models (LLMs) for transliteration and dialectal normalization, and find that tuning open-source LLMs on as few as 10,000 parallel examples with LoRA can match or surpass closed-source LLMs. We perform dialectal normalization experiments for twelve South Asian languages and dialectal translation experiments for six language continua worldwide. Dialectal normalization can also serve as a preliminary step for the downstream dialectal translation task. Among the six languages used in dialectal translation, our approach surpasses the baseline model by 21.5 BLEU points on Italian and 25.8 on Swiss German.
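A minimal sketch of the parameter-efficient tuning recipe described above, using the Hugging Face peft library. The model identifier, rank, and target-module names below are assumptions (they vary by architecture), not the paper's exact configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B"  # stand-in for any open-weight multilingual LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections; names are model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)      # only the small adapter matrices are trainable
model.print_trainable_parameters()
# ...then train on the ~10k (non-standard input, normalized output) parallel pairs
# with a standard Trainer loop.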
@inproceedings{alam-anastasopoulos-2025-large,
  title     = {Large Language Models as a Normalizer for Transliteration and Dialectal Translation},
  author    = {Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios},
  editor    = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Nakov, Preslav and Tiedemann, J{\"o}rg and Zampieri, Marcos},
  booktitle = {Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects},
  month     = jan,
  year      = {2025},
  address   = {Abu Dhabi, UAE},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.vardial-1.5/},
  pages     = {39--67},
}
2024
ACL
DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
Fahim Faisal*, Orevaoghene Ahia*, Aarohi Srivastava*, Kabir Ahuja, David Chiang, Yulia Tsvetkov, and Antonios Anastasopoulos
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024
Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with larger performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for varieties and takes one step towards advancing it further.
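The kind of disparity measurement such a benchmark enables can be illustrated in a few lines; the task names, variety labels, and scores below are invented for illustration.

# Toy sketch: per-task drop of non-standard varieties relative to the standard one.
scores = {
    "sentiment": {"standard-arabic": 88.0, "egyptian-arabic": 79.5, "gulf-arabic": 81.2},
    "ner":       {"standard-arabic": 85.3, "egyptian-arabic": 71.8, "gulf-arabic": 74.0},
}

def disparity(task_scores: dict, standard: str) -> float:
    """Mean drop relative to the standard variety (positive = varieties lag behind)."""
    others = [v for k, v in task_scores.items() if k != standard]
    return task_scores[standard] - sum(others) / len(others)

for task, s in scores.items():
    print(task, round(disparity(s, "standard-arabic"), 1))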
@inproceedings{faisal-etal-2024-dialectbench,
  title     = {{DIALECTBENCH}: An {NLP} Benchmark for Dialects, Varieties, and Closely-Related Languages},
  author    = {Faisal, Fahim and Ahia, Orevaoghene and Srivastava, Aarohi and Ahuja, Kabir and Chiang, David and Tsvetkov, Yulia and Anastasopoulos, Antonios},
  editor    = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = aug,
  year      = {2024},
  address   = {Bangkok, Thailand},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.acl-long.777/},
  doi       = {10.18653/v1/2024.acl-long.777},
  pages     = {14412--14454},
}
MRL
An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models
Fahim Faisal, and Antonios Anastasopoulos
In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), Nov 2024
The capacity and effectiveness of pre-trained multilingual models (MLMs) for zero-shot cross-lingual transfer are well established. However, phenomena of positive or negative transfer and the effect of language choice still need to be fully understood, especially in the complex setting of massively multilingual LMs. We propose an efficient method to study transfer-language influence on zero-shot performance in a target language. Unlike previous work, our approach disentangles downstream tasks from language, using dedicated adapter units. Our findings suggest that some languages have little effect on others, while some languages, especially ones unseen during pre-training, can be extremely beneficial or detrimental for different target languages. We find that no transfer language is beneficial for all target languages. Curiously, we observe that languages previously unseen by MLMs consistently benefit from transfer from almost any language. We additionally use our modular approach to efficiently quantify negative interference and categorize languages accordingly. Furthermore, we provide a list of promising transfer–target language configurations that consistently lead to target-language performance improvements.
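For readers unfamiliar with adapter-based modularity, the sketch below shows the general recipe (in the style of MAD-X) using the AdapterHub adapters library; the adapter identifiers and label count are illustrative assumptions, not the paper's exact setup.

from adapters import AutoAdapterModel
from adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("bert-base-multilingual-cased")
src = model.load_adapter("en/wiki@ukp")    # transfer-language adapter (illustrative id)
tgt = model.load_adapter("sw/wiki@ukp")    # target-language adapter (illustrative id)

model.add_adapter("ner")                   # task adapter, trained once
model.add_tagging_head("ner", num_labels=9)
model.train_adapter("ner")                 # freezes everything except the task adapter

model.active_adapters = Stack(src, "ner")  # train the task adapter under the source language
# ...fine-tune on source-language task data...
model.active_adapters = Stack(tgt, "ner")  # zero-shot: swap only the language adapter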
@inproceedings{faisal-anastasopoulos-2024-efficient,
  title     = {An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models},
  author    = {Faisal, Fahim and Anastasopoulos, Antonios},
  editor    = {S{\"a}lev{\"a}, Jonne and Owodunni, Abraham},
  booktitle = {Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)},
  month     = nov,
  year      = {2024},
  address   = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.mrl-1.4/},
  doi       = {10.18653/v1/2024.mrl-1.4},
  pages     = {45--92},
}
VarDial
Data-Augmentation-Based Dialectal Adaptation for LLMs
Fahim Faisal, and Antonios Anastasopoulos
In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), Jun 2024
This report presents GMNLP's participation in the Dialect-Copa shared task at VarDial 2024 (Chifu et al., 2024), which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well-established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings.
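A toy illustration of the augmentation idea: map standard-language training examples into a dialect and add them to the training pool. The dialectal_rewrite helper and the one-entry lexicon below are hypothetical stand-ins; the actual systems used richer rewriting.

def dialectal_rewrite(text: str, lexicon: dict[str, str]) -> str:
    """Naive word-level substitution from a standard-to-dialect lexicon."""
    return " ".join(lexicon.get(tok, tok) for tok in text.split())

# e.g., the Chakavian interrogative "ča" replacing standard "što" (illustrative)
lexicon = {"Što": "Ča", "što": "ča"}
standard_copa = [("Što se dogodilo?", "choice1")]
augmented = [(dialectal_rewrite(q, lexicon), y) for q, y in standard_copa]
train_pool = standard_copa + augmented  # fine-tune BERTić / AYA-style models on this
print(train_pool)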
@inproceedings{faisal-anastasopoulos-2024-data,
  title     = {Data-Augmentation-Based Dialectal Adaptation for {LLM}s},
  author    = {Faisal, Fahim and Anastasopoulos, Antonios},
  editor    = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Zampieri, Marcos and Nakov, Preslav and Tiedemann, J{\"o}rg},
  booktitle = {Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)},
  month     = jun,
  year      = {2024},
  address   = {Mexico City, Mexico},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.vardial-1.17/},
  doi       = {10.18653/v1/2024.vardial-1.17},
  pages     = {197--208},
}
LREC-COLING
Language and Speech Technology for Central Kurdish Varieties
Sina Ahmadi, Daban Jaff, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties. Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language, resulting in disparities for dialects and varieties for which there are few resources and tools available. In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish, creating a corpus by transcribing movies and TV series as an alternative to fieldwork. Additionally, we report the performance of machine translation, automatic speech recognition, and language identification as downstream tasks evaluated on Central Kurdish subdialects. Data and models are publicly available under an open license at https://github.com/sinaahmadi/CORDI.
@inproceedings{ahmadi-etal-2024-language,
  title     = {Language and Speech Technology for {C}entral {K}urdish Varieties},
  author    = {Ahmadi, Sina and Jaff, Daban and Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios},
  editor    = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  month     = may,
  year      = {2024},
  address   = {Torino, Italia},
  publisher = {ELRA and ICCL},
  url       = {https://aclanthology.org/2024.lrec-main.877/},
  pages     = {10034--10045},
}
Interspeech
Speech Recognition for Greek Dialects: A Challenging Benchmark
Socrates Vakirtzian, Chara Tsoukala, Stavros Bompolas, Katerina Mouzou, Vivian Stamou, Georgios Paraskevopoulos, Antonios Dimakis, Stella Markantonatou, Angela Ralli, and Antonios Anastasopoulos
In Interspeech 2024, Sep 2024
Language technologies should be judged on their usefulness in real-world use cases. Despite recent impressive progress in automatic speech recognition (ASR), an often overlooked aspect in ASR research and evaluation is language variation in the form of non-standard dialects or language varieties. To this end, this work introduces a challenging benchmark that focuses on four varieties of Greek (Aivaliot, Cretan, Griko, Messenian), encompassing challenges related to data availability, orthographic conventions, and complexities arising from language contact. Initial experiments with state-of-the-art models and established cross-lingual transfer techniques highlight the difficulty of adapting to such low-resource varieties.
@inproceedings{vakirtzian24_interspeech,
  title     = {Speech Recognition for Greek Dialects: A Challenging Benchmark},
  author    = {Vakirtzian, Socrates and Tsoukala, Chara and Bompolas, Stavros and Mouzou, Katerina and Stamou, Vivian and Paraskevopoulos, Georgios and Dimakis, Antonios and Markantonatou, Stella and Ralli, Angela and Anastasopoulos, Antonios},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {3974--3978},
  doi       = {10.21437/Interspeech.2024-2443},
  issn      = {2958-1796},
}
EACL
CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
Md Mahfuz Ibn Alam, Sina Ahmadi, and Antonios Anastasopoulos
In Findings of the Association for Computational Linguistics: EACL 2024, Mar 2024
Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the community to evaluate MT systems on this dimension is limited. To alleviate this issue, we compile and release CODET, a contrastive dialectal benchmark encompassing 891 different variations from twelve different languages. We also quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants. All the data and code have been released.
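A contrastive evaluation in this spirit can be set up with sacreBLEU: translate the standard and dialectal versions of the same source and score both against one shared reference. The translate stub is a placeholder for the MT system under test, and the example data is invented.

from sacrebleu.metrics import BLEU

bleu = BLEU()

def translate(source: str) -> str:
    # Placeholder: swap in a real MT system under evaluation.
    return source

ref = ["the children are playing outside"]       # shared English reference
hyp_std = translate("standard-variety source")   # e.g., a standard-language sentence
hyp_dia = translate("dialectal-variety source")  # the same content in a dialect

gap = (bleu.corpus_score([hyp_std], [ref]).score
       - bleu.corpus_score([hyp_dia], [ref]).score)
print(f"dialect gap: {gap:.1f} BLEU")            # positive = the dialect is penalized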
@inproceedings{alam-etal-2024-codet,
  title     = {{CODET}: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation},
  author    = {Alam, Md Mahfuz Ibn and Ahmadi, Sina and Anastasopoulos, Antonios},
  editor    = {Graham, Yvette and Purver, Matthew},
  booktitle = {Findings of the Association for Computational Linguistics: EACL 2024},
  month     = mar,
  year      = {2024},
  address   = {St. Julian{'}s, Malta},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.findings-eacl.125/},
  pages     = {1790--1859},
}
NAACL
Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
Roy Xie, Orevaoghene Ahia, Yulia Tsvetkov, and Antonios Anastasopoulos
In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Jun 2024
Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. We explore both post-hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
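The intrinsic route can be illustrated with a linear classifier whose largest weights directly name dialect-specific words. The four toy sentences below (standard vs. Chakavian-style Croatian) are invented for this sketch; the paper's experiments cover Mandarin, Italian, and Low Saxon, and also explore post-hoc interpretability.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["što radiš sada", "što se dogodilo", "ča delaš sada", "ča se zgodilo"]
labels = ["štokavian", "štokavian", "čakavian", "čakavian"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

vocab = np.array(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])            # weights correspond to clf.classes_[1]
print(f"markers of {clf.classes_[1]}:", vocab[order[-2:]])  # most positive weights
print(f"markers of {clf.classes_[0]}:", vocab[order[:2]])   # most negative weights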
@inproceedings{xie-etal-2024-extracting,
  title     = {Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers},
  author    = {Xie, Roy and Ahia, Orevaoghene and Tsvetkov, Yulia and Anastasopoulos, Antonios},
  editor    = {Duh, Kevin and Gomez, Helena and Bethard, Steven},
  booktitle = {Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)},
  month     = jun,
  year      = {2024},
  address   = {Mexico City, Mexico},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.naacl-short.5/},
  doi       = {10.18653/v1/2024.naacl-short.5},
  pages     = {54--69},
}
2022
ACL
Dataset Geography: Mapping Language Data to Language Users
Fahim Faisal, Yinkai Wang, and Antonios Anastasopoulos
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions.
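Schematically, the pipeline looks like the sketch below, where ner and link_to_country are hypothetical placeholders for the entity recognition and linking systems the paper evaluates.

from collections import Counter

def ner(text: str) -> list[str]:
    """Placeholder: an entity recognizer returning mention strings."""
    raise NotImplementedError

def link_to_country(mention: str) -> str | None:
    """Placeholder: entity linking plus a KB lookup of the entity's country."""
    raise NotImplementedError

def dataset_geography(corpus: list[str]) -> Counter:
    """Count how often each country is represented among a dataset's entities."""
    counts = Counter()
    for doc in corpus:
        for mention in ner(doc):
            country = link_to_country(mention)
            if country:
                counts[country] += 1
    return counts  # compare with where the language's speakers actually live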
@inproceedings{faisal-etal-2022-dataset,
  title     = {Dataset Geography: Mapping Language Data to Language Users},
  author    = {Faisal, Fahim and Wang, Yinkai and Anastasopoulos, Antonios},
  editor    = {Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = may,
  year      = {2022},
  address   = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.acl-long.239/},
  doi       = {10.18653/v1/2022.acl-long.239},
  pages     = {3381--3411},
}
2021
MRQA
Investigating Post-pretraining Representation Alignment for Cross-Lingual Question Answering
Fahim Faisal, and Antonios Anastasopoulos
In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, Nov 2021
Human knowledge is collectively encoded in the roughly 6500 languages spoken around the world, but it is not distributed equally across languages. Hence, for information-seeking question answering (QA) systems to adequately serve speakers of all languages, they need to operate cross-lingually. In this work we investigate the capabilities of multilingually pretrained language models on cross-lingual QA. We find that explicitly aligning the representations across languages with a post-hoc finetuning step generally leads to improved performance. We additionally investigate the effect of data size as well as the language choice in this fine-tuning step, also releasing a dataset for evaluating cross-lingual QA systems.
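A minimal sketch of such a post-hoc alignment objective, under the assumption that word alignments come from an external aligner: pull the contextual representations of aligned tokens together before (or alongside) fine-tuning on QA.

import torch
import torch.nn.functional as F

def alignment_loss(src_hidden: torch.Tensor,
                   tgt_hidden: torch.Tensor,
                   pairs: list[tuple[int, int]]) -> torch.Tensor:
    """src_hidden, tgt_hidden: (seq_len, dim) encoder states for a parallel
    sentence pair; pairs: word-alignment indices from an automatic aligner."""
    src_idx = torch.tensor([i for i, _ in pairs])
    tgt_idx = torch.tensor([j for _, j in pairs])
    return F.mse_loss(src_hidden[src_idx], tgt_hidden[tgt_idx])

# Toy usage: pull three aligned token representations together.
loss = alignment_loss(torch.randn(5, 768, requires_grad=True),
                      torch.randn(6, 768), [(0, 0), (2, 1), (4, 5)])
loss.backward()  # in practice, optimized before the QA fine-tuning step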
@inproceedings{faisal-anastasopoulos-2021-investigating,
  title     = {Investigating Post-pretraining Representation Alignment for Cross-Lingual Question Answering},
  author    = {Faisal, Fahim and Anastasopoulos, Antonios},
  editor    = {Fisch, Adam and Talmor, Alon and Chen, Danqi and Choi, Eunsol and Seo, Minjoon and Lewis, Patrick and Jia, Robin and Min, Sewon},
  booktitle = {Proceedings of the 3rd Workshop on Machine Reading for Question Answering},
  month     = nov,
  year      = {2021},
  address   = {Punta Cana, Dominican Republic},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2021.mrqa-1.14/},
  doi       = {10.18653/v1/2021.mrqa-1.14},
  pages     = {133--148},
}
EMNLP
SD-QA: Spoken Dialectal Question Answering for the Real World
Fahim Faisal, Sharlina Keshava, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021
Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users that interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address this gap, we augment an existing QA dataset to construct a multi-dialect, spoken QA benchmark on five languages (Arabic, Bengali, English, Kiswahili, Korean) with more than 68k audio prompts in 24 dialects from 255 speakers. We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance. Last, we study the fairness of the ASR and QA models with respect to the underlying user populations.
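The evaluation loop reduces to a simple cascade, sketched below with asr, answer, and metric as hypothetical placeholders for the actual models and scoring function; dialect-sensitive ASR errors enter at the first step and propagate to the QA score.

from collections import defaultdict

def evaluate_by_dialect(samples, asr, answer, metric):
    """samples: iterable of (audio, dialect, gold_answer) triples;
    returns the mean QA score per dialect."""
    scores = defaultdict(list)
    for audio, dialect, gold in samples:
        transcript = asr(audio)          # dialect-sensitive errors enter here
        prediction = answer(transcript)  # and propagate into the QA answer
        scores[dialect].append(metric(prediction, gold))
    return {d: sum(v) / len(v) for d, v in scores.items()}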
@inproceedings{faisal-etal-2021-sd-qa,
  title     = {{SD}-{QA}: Spoken Dialectal Question Answering for the Real World},
  author    = {Faisal, Fahim and Keshava, Sharlina and Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios},
  editor    = {Moens, Marie-Francine and Huang, Xuanjing and Specia, Lucia and Yih, Scott Wen-tau},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021},
  month     = nov,
  year      = {2021},
  address   = {Punta Cana, Dominican Republic},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2021.findings-emnlp.281/},
  doi       = {10.18653/v1/2021.findings-emnlp.281},
  pages     = {3296--3315},
}
EMNLP
Evaluating the Morphosyntactic Well-formedness of Generated Texts
Adithya Pratapa*, Antonios Anastasopoulos*, Shruti Rijhwani, Aditi Chaudhary, David R. Mortensen, Graham Neubig, and Yulia Tsvetkov
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov 2021
Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L'AMBRE – a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
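A minimal sketch of the rule-checking idea, assuming Stanza for dependency parsing: a single hard-coded determiner–noun gender rule stands in for the rules the paper mines automatically from treebanks.

import stanza  # assumes `stanza.download("de")` has been run once

nlp = stanza.Pipeline("de")

def feats(word) -> dict:
    """Parse a UD feature string like 'Case=Nom|Gender=Masc' into a dict."""
    return dict(f.split("=") for f in (word.feats or "").split("|") if "=" in f)

def agreement_violations(text: str, deprel: str = "det", feature: str = "Gender") -> int:
    """Count dependents that disagree with their heads on `feature`."""
    violations = 0
    for sent in nlp(text).sentences:
        for word in sent.words:
            if word.deprel == deprel and word.head > 0:
                head = sent.words[word.head - 1]  # word.head is 1-indexed
                f_dep = feats(word).get(feature)
                f_head = feats(head).get(feature)
                if f_dep and f_head and f_dep != f_head:
                    violations += 1
    return violations

# Should flag the determiner–noun gender mismatch in "Der Haus ist groß."
print(agreement_violations("Der Haus ist groß."))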
@inproceedings{pratapa-etal-2021-evaluating,
  title     = {Evaluating the Morphosyntactic Well-formedness of Generated Texts},
  author    = {Pratapa, Adithya and Anastasopoulos, Antonios and Rijhwani, Shruti and Chaudhary, Aditi and Mortensen, David R. and Neubig, Graham and Tsvetkov, Yulia},
  editor    = {Moens, Marie-Francine and Huang, Xuanjing and Specia, Lucia and Yih, Scott Wen-tau},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  month     = nov,
  year      = {2021},
  address   = {Online and Punta Cana, Dominican Republic},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2021.emnlp-main.570/},
  doi       = {10.18653/v1/2021.emnlp-main.570},
  pages     = {7131--7150},
}