Quantifying and Mitigating Disparities in Language Technologies

Funded by the NSF (2021–2024)

Project Description

Advances in natural language processing (NLP) technology now make it possible to perform many tasks through natural language or over natural language data: automatic systems can answer questions, perform web search, or carry out commands on our computers. However, “language” is not monolithic; people vary in the language they speak, the dialect they use, the relative ease with which they produce language, and the words they choose to express themselves. In the benchmarking of NLP systems, however, this linguistic variety is largely unrepresented. Most commonly, tasks are formulated in canonical American English, with little regard for whether systems will work on language of any other variety. In this work we ask a simple question: can we measure the extent to which the diversity of the language we use affects the quality of results we can expect from language technology systems? Answering it will allow for the development and deployment of fair accuracy measures for a variety of language technology tasks, encouraging advances in the state of the art that serve all users, not just a select few.

Specifically, this work focuses on four aspects of this overall research question. First, we will develop a general-purpose methodology for quantifying how well particular language technologies work across many varieties of language; measures over multiple speakers or demographic groups are combined into benchmarks that can drive progress on fair metrics for language systems, tailored to the specific needs of design teams. Second, we will move beyond simple accuracy measures and directly quantify the effect that system accuracy has on users, in terms of the relative utility they derive from using the system; these utility measures will be incorporated into our metrics for system success. Third, we will focus on the language produced by people from varying demographic groups, predicting system accuracy from demographic attributes. Finally, we will examine novel methods for robust learning of NLP systems across language and dialectal boundaries, and assess the effect these methods have on increasing accuracy for all users.
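As a rough illustration of the first two aspects (per-variety measurement and aggregation into a benchmark), the sketch below computes accuracy separately for each language variety or demographic group and then combines the per-group scores. The function names, the equal-weight default, and the idea of plugging in population- or utility-based weights are illustrative assumptions, not the project's actual metrics.

```python
from collections import defaultdict

def per_group_accuracy(predictions, references, groups):
    """Accuracy computed separately per language variety / demographic group.

    predictions, references, groups are parallel lists: system outputs,
    gold answers, and a group label (e.g. dialect) for each example.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref, grp in zip(predictions, references, groups):
        total[grp] += 1
        correct[grp] += int(pred == ref)
    return {grp: correct[grp] / total[grp] for grp in total}

def aggregate_score(group_accuracy, group_weights=None):
    """Combine per-group accuracies into one benchmark number.

    group_weights could encode speaker-population sizes or estimated
    utility; with no weights, every group counts equally (a macro-average),
    which already penalizes systems that only work on canonical varieties.
    """
    if group_weights is None:
        group_weights = {grp: 1.0 for grp in group_accuracy}
    norm = sum(group_weights.values())
    return sum(group_weights[g] * acc for g, acc in group_accuracy.items()) / norm

# Hypothetical usage: per-dialect accuracy, aggregate score, and worst-group accuracy.
acc = per_group_accuracy(["a", "b", "c"], ["a", "b", "b"], ["US", "IN", "IN"])
print(acc, aggregate_score(acc), min(acc.values()))
```

Reporting the worst-group accuracy alongside the aggregate is one simple way to make disparities visible rather than letting a high average hide poor performance on a particular variety.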

Participants

Faculty

Students and Alumni (at GMU)

Publications (from GMU authors)

Acknowledgements

This project was supported by the NSF and Amazon through FAI grant no. 2040926.

References

2023

  1. EACL
    Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
    Sachin Kumar, Vidhisha Balachandran, Lucille Njoo, Antonios Anastasopoulos, and Yulia Tsvetkov
    In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023

2022

  1. ACL
    Systematic Inequalities in Language Technology Performance across the World's Languages
    Damian Blasi, Antonios Anastasopoulos, and Graham Neubig
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022. [Code]
  2. ACL
    Dataset Geography: Mapping Language Data to Language Users
    Fahim Faisal, Yinkai Wang, and Antonios Anastasopoulos
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022. [Code]

2021

  1. ACL
    Towards more equitable question answering systems: How much more data do you need?
    Arnab Debnath*, Navid Rajabi*, Fardina Fathmiul Alam*, and Antonios Anastasopoulos
    In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Aug 2021. [Code]
  2. EMNLP
    SD-QA: Spoken Dialectal Question Answering for the Real World
    Fahim Faisal, Sharlina Keshava, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021. [Code]
  3. ACL
    Machine Translation into Low-resource Language Varieties
    Sachin Kumar, Antonios Anastasopoulos, Shuly Wintner, and Yulia Tsvetkov
    In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Aug 2021. [Code]