Language Documentation with an Artificial Intelligence (AI) Helper

Funded by the NSF (2021-)

Project Description

Documentation of languages, especially endangered languages, is crucial for conserving humanity's knowledge and cultural heritage, as well as for advancing our understanding of human language. Traditional documentation methods produce invaluable materials such as grammars, dictionaries, and annotated texts, but they take too long to keep pace with current rates of language extinction. The most constructive response to this crisis is to complement these efforts by collecting data for as many languages as possible now, and by making the data accessible and interpretable so that both linguists and members of the language communities can study them later. Digital technologies make it practical to obtain many hours of recordings in an endangered language, along with translations. This project advances technologies for analyzing those recordings at the sub-word, word, and clause level so that they become accessible for a wide variety of documentary purposes.

The project makes the information in digital recordings more interpretable for further linguistic analysis in three ways. First, the team is devising computational methods that automatically derive a basic phonological understanding of a language and produce phonetic representations for it, even if the language has no established writing system. Second, the team is developing methods to automatically analyze the internal structure of words in languages where this structure is highly complex. Third, the team uses knowledge of more widely spoken languages to analyze related endangered languages. The resulting tool, the AI-helper toolbox, will be packaged with software that linguists and language communities already use widely in the language documentation process. All tools will be accessible through a web-based interface, and the source code will be publicly available on GitHub.

This is a collaborative project between George Mason University and the University of Notre Dame.

Participants

Faculty

Antonis Anastasopoulos, GMU
Géraldine Walther, GMU Linguistics

Students

Ellie Liebl, MA/PhD Linguistics

Alumni

Publications

Acknowledgements

This project is supported by NSF DEL/DLI grant BCS-2109578.

References

2024

  1. LREC-COLING
    Language and Speech Technology for Central Kurdish Varieties
    Sina Ahmadi, Daban Jaff, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
    In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024

2023

  1. ACL
    Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities
    Sina Ahmadi and Antonios Anastasopoulos
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
  2. FieldMatters
    Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki
    Sina Ahmadi, Zahra Azin, Sara Belelli, and Antonios Anastasopoulos
    In Proceedings of the Second Workshop on NLP Applications to Field Linguistics, May 2023
  3. VarDial
    PALI: A Language Identification Benchmark for Perso-Arabic Scripts
    Sina Ahmadi, Milind Agarwal, and Antonios Anastasopoulos
    In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), May 2023
  4. ACL
    BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
    Claytone Sikasote, Eunice Mukonde, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023

2022

  1. BEA
    Educational Tools for Mapuzugun
    Cristian Ahumada, Claudio Gutierrez, and Antonios Anastasopoulos
    In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), Jul 2022
  2. LREC
    BembaSpeech: A Speech Recognition Corpus for the Bemba Language
    Claytone Sikasote and Antonios Anastasopoulos
    In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Jun 2022
  3. ACL
    Revisiting the Effects of Leakage on Dependency Parsing
    Nathaniel Krasner*, Miriam Wanner*, and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: ACL 2022, May 2022