NL(V)P -- Natural Language (Variety) Processing

funded by the NSF (2021-)

Project Description

No language is a monolith. Languages vary richly across countries, regions, social classes, and other factors. Despite recent advances in natural language processing (NLP) technology for translating between languages, answering questions, or engaging in simple conversations, current approaches have largely focused only on “standard” varieties of languages. By ignoring other varieties, treating them essentially as statistical noise, current technologies neglect the millions of people who speak these varieties.

This project is creating ways to enable language technologies, such as translation and question-answering systems, both to process and to generate fine-grained language varieties. The team will develop computational methods to automatically recognize features of different language varieties and then create approaches for integrating such linguistic information into the models powering language technologies. Additionally, the team will design methods to adapt models to varieties for which minimal training data may be available. The resulting suite of general methods will benefit diverse communities and less-privileged populations that speak underserved languages and varieties.

This is a collaborative project between George Mason University, the University of Notre Dame, and the University of Washington.

Participants

Faculty

Antonis Anastasopoulos, GMU

Students

Fahim Faisal, PhD CS

Publications

Acknowledgements

This project is supported by NSF grant IIS-2125466. Early work was also supported by a Google Award for Research.

References

2025

  1. VarDial
    Testing the Boundaries of LLMs: Dialectal and Language-Variety Tasks
    Fahim Faisal and Antonios Anastasopoulos
    In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Jan 2025
  2. VarDial
    Large Language Models as a Normalizer for Transliteration and Dialectal Translation
    Md Mahfuz Ibn Alam and Antonios Anastasopoulos
    In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Jan 2025

2024

  1. ACL
    DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
    Fahim Faisal*, Orevaoghene Ahia*, Aarohi Srivastava*, Kabir Ahuja, David Chiang, Yulia Tsvetkov, and Antonios Anastasopoulos
    In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024
  2. MRL
    An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models
    Fahim Faisal and Antonios Anastasopoulos
    In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), Nov 2024
  3. VarDial
    Data-Augmentation-Based Dialectal Adaptation for LLMs
    Fahim Faisal and Antonios Anastasopoulos
    In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), Jun 2024
  4. LREC-COLING
    Language and Speech Technology for Central Kurdish Varieties
    Sina Ahmadi, Daban Jaff, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
    In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
  5. Interspeech
    Speech Recognition for Greek Dialects: A Challenging Benchmark
    Socrates Vakirtzian, Chara Tsoukala, Stavros Bompolas, Katerina Mouzou, Vivian Stamou, Georgios Paraskevopoulos, Antonios Dimakis, Stella Markantonatou, Angela Ralli, and Antonios Anastasopoulos
    In Interspeech 2024, May 2024
  6. EACL
    CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
    Md Mahfuz Ibn Alam, Sina Ahmadi, and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: EACL 2024, Mar 2024
  7. NAACL
    Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers
    Roy Xie, Orevaoghene Ahia, Yulia Tsvetkov, and Antonios Anastasopoulos
    In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Jun 2024

2022

  1. ACL
    Dataset Geography: Mapping Language Data to Language Users
    Fahim Faisal, Yinkai Wang, and Antonios Anastasopoulos
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

2021

  1. MRQA
    Investigating Post-pretraining Representation Alignment for Cross-Lingual Question Answering
    Fahim Faisal and Antonios Anastasopoulos
    In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, Nov 2021
  2. EMNLP
    SD-QA: Spoken Dialectal Question Answering for the Real World
    Fahim Faisal, Sharlina Keshava, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021
  3. EMNLP
    Evaluating the Morphosyntactic Well-formedness of Generated Texts
    Adithya Pratapa*, Antonios Anastasopoulos*, Shruti Rijhwani, Aditi Chaudhary, David R. Mortensen, Graham Neubig, and Yulia Tsvetkov
    In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov 2021