Language Documentation with an Artificial Intelligence (AI) Helper
funded by the NSF (2021-)
Project Description
Documentation of languages, especially endangered languages, is crucial for conserving humanity's knowledge and cultural heritage, as well as for advancing an understanding of human language. Traditional documentation methods produce invaluable materials such as grammars, dictionaries, and annotated texts, but require more time than can be afforded given current language extinction rates. The most constructive response to this crisis is to complement documentation efforts by collecting data for as many languages as possible now and making them accessible and interpretable so that they can be studied later by both linguists and members of the language communities. Digital technologies make it practical to obtain many hours of recordings in an endangered language along with translations. This project advances technologies for analyzing the recordings at the sub-word, word, and clause level so that they become accessible for a wide variety of documentary purposes.
The project makes the information in digital recordings more interpretable for further linguistic analysis in three ways. First, the team is devising computational methods to automatically derive a basic phonological understanding and produce phonetic representations for languages, even if they do not have an established writing system. Second, the team is developing methods to automatically analyze the internal structure of words in languages where this structure is highly complex. Third, the team uses knowledge of more widely spoken languages to analyze related endangered languages. The resulting tool, the AI-helper toolbox, will be packaged with software that is already in wide use by linguists and language communities in the language documentation process. All tools will be accessible through a web-based interface, and the source code will be publicly available on GitHub.
This is a collaborative project between George Mason University and the University of Notre Dame.
This project is supported by NSF DEL/DLI grant BCS-2109578.
References
2024
LREC-COLING
Language and Speech Technology for Central Kurdish Varieties
Sina Ahmadi, Daban Jaff, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
Kurdish, an Indo-European language with over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties. Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language, resulting in disparities for dialects and varieties for which few resources and tools are available. In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish, creating a corpus by transcribing movies and TV series as an alternative to fieldwork. Additionally, we report the performance of machine translation, automatic speech recognition, and language identification as downstream tasks evaluated on Central Kurdish subdialects. Data and models are publicly available under an open license at https://github.com/sinaahmadi/CORDI.
@inproceedings{ahmadi-etal-2024-language,
  title     = {Language and Speech Technology for {C}entral {K}urdish Varieties},
  author    = {Ahmadi, Sina and Jaff, Daban and Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios},
  editor    = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  month     = may,
  year      = {2024},
  address   = {Torino, Italia},
  publisher = {ELRA and ICCL},
  url       = {https://aclanthology.org/2024.lrec-main.877/},
  pages     = {10034--10045},
}
2023
ACL
Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities
Sina Ahmadi, and Antonios Anastasopoulos
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remedied. We also conduct a small-scale evaluation on real data. Our experiments indicate that script normalization also helps improve the performance of downstream tasks such as machine translation and language identification.
@inproceedings{ahmadi-anastasopoulos-2023-script,
  title     = {Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities},
  author    = {Ahmadi, Sina and Anastasopoulos, Antonios},
  editor    = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = jul,
  year      = {2023},
  address   = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.acl-long.809/},
  doi       = {10.18653/v1/2023.acl-long.809},
  pages     = {14466--14487},
}
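The abstract above describes training on synthetic data with various levels of noise. As a rough illustration of the idea (this is not the paper's actual noise model; the confusion map below is a hypothetical example of Perso-Arabic look-alike characters, and the real character sets and procedures are in the paper's repository), clean text can be corrupted at a chosen noise level to produce (noisy, clean) training pairs:

```python
import random

# Hypothetical confusion map: each "conventional" character is sometimes
# replaced by a look-alike used in unconventional writing.
CONFUSIONS = {
    "ک": ["ك"],        # keheh vs. Arabic kaf
    "ی": ["ي", "ى"],   # Farsi yeh vs. Arabic yeh / alef maksura
    "ە": ["ه"],        # ae vs. heh
}

def add_noise(text: str, noise_level: float, seed: int = 0) -> str:
    """Corrupt `text` by swapping characters for confusable variants
    with probability `noise_level` (deterministic given `seed`)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < noise_level:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)

clean = "کوردی"
noisy = add_noise(clean, noise_level=1.0)  # every confusable char is swapped
```

A normalization model is then trained to map `noisy` back to `clean`; varying `noise_level` controls how hard the synthetic examples are.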
FieldMatters
Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki
Sina Ahmadi, Zahra Azin, Sara Belelli, and Antonios Anastasopoulos
In Proceedings of the Second Workshop on NLP Applications to Field Linguistics, May 2023
One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case for the Southern varieties of the Kurdish and Laki languages, for which very limited resources are available and progress on tools has been insubstantial. To tackle this, we provide a few approaches that rely on the content of local news websites, a local radio station that broadcasts content in Southern Kurdish, and fieldwork for Laki. In this paper, we describe some of the challenges of such under-represented languages, particularly in writing and standardization, and also in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and the Zaza-Gorani languages.
@inproceedings{ahmadi-etal-2023-approaches,
  title     = {Approaches to Corpus Creation for Low-Resource Language Technology: the Case of {S}outhern {K}urdish and {L}aki},
  author    = {Ahmadi, Sina and Azin, Zahra and Belelli, Sara and Anastasopoulos, Antonios},
  editor    = {Serikov, Oleg and Voloshina, Ekaterina and Postnikova, Anna and Klyachko, Elena and Vylomova, Ekaterina and Shavrina, Tatiana and Le Ferrand, Eric and Malykh, Valentin and Tyers, Francis and Arkhangelskiy, Timofey and Mikhailov, Vladislav},
  booktitle = {Proceedings of the Second Workshop on NLP Applications to Field Linguistics},
  month     = may,
  year      = {2023},
  address   = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.fieldmatters-1.7/},
  doi       = {10.18653/v1/2023.fieldmatters-1.7},
  pages     = {52--63},
}
VarDial
PALI: A Language Identification Benchmark for Perso-Arabic Scripts
Sina Ahmadi, Milind Agarwal, and Antonios Anastasopoulos
In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), May 2023
The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying languages written in such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experimental results indicate the effectiveness of our solutions.
@inproceedings{ahmadi-etal-2023-pali,
  title     = {{PALI}: A Language Identification Benchmark for {P}erso-{A}rabic Scripts},
  author    = {Ahmadi, Sina and Agarwal, Milind and Anastasopoulos, Antonios},
  editor    = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Nakov, Preslav and Tiedemann, J{\"o}rg and Zampieri, Marcos},
  booktitle = {Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)},
  month     = may,
  year      = {2023},
  address   = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.vardial-1.8/},
  doi       = {10.18653/v1/2023.vardial-1.8},
  pages     = {78--90},
}
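To give a feel for the supervised sentence-level language identification that the benchmark evaluates, here is a toy character n-gram classifier. This is only an illustrative sketch of the task setup, not PALI's actual models (the paper's supervised classifiers and the hierarchical model that disambiguates frequently confused language clusters are released with the benchmark); all names below are ours.

```python
from collections import Counter

def ngrams(text: str, n: int = 2) -> list[str]:
    """All character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramLID:
    """Toy language identifier: one character n-gram profile per language,
    sentences scored by relative n-gram frequency under each profile."""

    def __init__(self, n: int = 2):
        self.n = n
        self.profiles: dict[str, Counter] = {}

    def fit(self, labeled_sentences: list[tuple[str, str]]) -> None:
        for lang, sent in labeled_sentences:
            self.profiles.setdefault(lang, Counter()).update(ngrams(sent, self.n))

    def predict(self, sent: str) -> str:
        grams = ngrams(sent, self.n)

        def score(lang: str) -> float:
            prof = self.profiles[lang]
            total = sum(prof.values())
            return sum(prof[g] / total for g in grams)

        return max(self.profiles, key=score)

lid = CharNgramLID()
lid.fit([("en", "the cat sat on the mat"), ("fr", "le chat est sur le tapis")])
guess = lid.predict("the dog")
```

A hierarchical variant in the paper's spirit would first predict a coarse cluster of confusable languages, then run a second classifier within that cluster.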
ACL
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
Claytone Sikasote, Eunice Mukonde, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal dataset for Bemba. While Bemba is the most populous language of Zambia, it exhibits a dearth of resources, which renders the development of language technologies or language processing research almost impossible. The dataset comprises multi-turn dialogues between Bemba speakers based on images, transcribed and translated into English. There are more than 92,000 utterances/sentences, amounting to more than 180 hours of audio data with corresponding transcriptions and English translations. We also provide baselines on speech recognition (ASR), machine translation (MT) and speech translation (ST) tasks, and sketch out other potential future multimodal uses of our dataset. We hope that by making the dataset available to the research community, this work will foster research and encourage collaboration across the language, speech, and vision communities, especially for languages outside the “traditionally” used high-resourced ones. All data and code are publicly available: https://github.com/csikasote/bigc.
@inproceedings{sikasote-etal-2023-big,
  title     = {{BIG}-{C}: a Multimodal Multi-Purpose Dataset for {B}emba},
  author    = {Sikasote, Claytone and Mukonde, Eunice and Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios},
  editor    = {Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = jul,
  year      = {2023},
  address   = {Toronto, Canada},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.acl-long.115/},
  doi       = {10.18653/v1/2023.acl-long.115},
  pages     = {2062--2078},
}
2022
BEA
Educational Tools for Mapuzugun
Cristian Ahumada, Claudio Gutierrez, and Antonios Anastasopoulos
In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), Jul 2022
Mapuzugun is the language of the Mapuche people. Due to political and historical reasons, its number of speakers has decreased and the language has been excluded from the educational system in Chile and Argentina. For this reason, it is very important to support the revitalization of Mapuzugun in all spaces and media of society. In this work we present a tool supporting educational activities in Mapuzugun, tailored to the characteristics of the language. The tool consists of three parts: an orthography detector and converter; a morphological analyzer; and an informal translator. We also present a case study with Mapuzugun students showing promising results. Short abstract in Mapuzugun: Tüfachi küzaw pegelfi kiñe zugun küzawpeyüm kelluaetew pu mapuzugun chillkatufe kimal kizu tañi zugun.
@inproceedings{ahumada-etal-2022-educational,
  title     = {Educational Tools for Mapuzugun},
  author    = {Ahumada, Cristian and Gutierrez, Claudio and Anastasopoulos, Antonios},
  editor    = {Kochmar, Ekaterina and Burstein, Jill and Horbach, Andrea and Laarmann-Quante, Ronja and Madnani, Nitin and Tack, Ana{\"i}s and Yaneva, Victoria and Yuan, Zheng and Zesch, Torsten},
  booktitle = {Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)},
  month     = jul,
  year      = {2022},
  address   = {Seattle, Washington},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.bea-1.23/},
  doi       = {10.18653/v1/2022.bea-1.23},
  pages     = {183--196},
}
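Mapuzugun has several competing orthographies, which is why the tool above needs an orthography detector and converter. A minimal sketch of what rule-based orthography conversion looks like, with longest-match rules so digraphs are handled before single letters; the rule table below is hypothetical (the actual Mapuzugun correspondences are implemented in the paper's tool):

```python
# Hypothetical source→target digraph rules for illustration only.
RULES = {"tr": "x", "ng": "g", "ll": "j"}

def convert(text: str, rules: dict[str, str]) -> str:
    """Rewrite `text` by applying `rules` greedily, longest key first,
    scanning left to right; unmatched characters pass through unchanged."""
    keys = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(rules[k])
                i += len(k)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

converted = convert("trawün", RULES)
```

An orthography *detector* can then be built on top: apply each candidate rule set and score which orthography's character patterns best match the input.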
LREC
BembaSpeech: A Speech Recognition Corpus for the Bemba Language
Claytone Sikasote, and Antonios Anastasopoulos
In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Jun 2022
We present a preprocessed, ready-to-use automatic speech recognition corpus, BembaSpeech, consisting of over 24 hours of read speech in the Bemba language, a written but low-resourced language spoken by over 30% of the population in Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we explored different approaches: supervised pre-training (training from scratch) and cross-lingual transfer learning from a monolingual English pre-trained model using DeepSpeech on a portion of the dataset, as well as fine-tuning large-scale, self-supervised, Wav2Vec2.0-based multilingual pre-trained models on the complete BembaSpeech corpus. In our experiments, the 1-billion-parameter XLS-R model gives the best results, achieving a word error rate (WER) of 32.91%. These results demonstrate that model capacity significantly improves performance and that multilingual pre-trained models transfer cross-lingual acoustic representations to BembaSpeech better than a monolingual English pre-trained model for Bemba ASR. Lastly, the results also show that the corpus can be used for building ASR systems for the Bemba language.
@inproceedings{sikasote-anastasopoulos-2022-bembaspeech,
  title     = {{B}emba{S}peech: A Speech Recognition Corpus for the {B}emba Language},
  author    = {Sikasote, Claytone and Anastasopoulos, Antonios},
  editor    = {Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Odijk, Jan and Piperidis, Stelios},
  booktitle = {Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  month     = jun,
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  url       = {https://aclanthology.org/2022.lrec-1.790/},
  pages     = {7277--7283},
}
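The 32.91% figure above is a word error rate, the standard ASR metric: the word-level edit distance between the system hypothesis and the reference transcript, divided by the reference length. A minimal implementation of the standard metric (not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / #reference words,
    computed with the standard dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("uli shani mukwai", "uli shani")  # one deletion over three words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.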
ACL
Revisiting the Effects of Leakage on Dependency Parsing
Nathaniel Krasner*, Miriam Wanner*, and Antonios Anastasopoulos
In Findings of the Association for Computational Linguistics: ACL 2022, May 2022
Recent work by Søgaard (2020) showed that, treebank size aside, overlap between training and test graphs (termed "leakage") explains more of the observed variation in dependency parsing performance than other explanations. In this work we revisit this claim, testing it on more models and languages. We find that it only holds for zero-shot cross-lingual settings. We then propose a more fine-grained measure of such leakage which, unlike the original measure, not only explains but also correlates with observed performance variation. Code and data are available here: https://github.com/miriamwanner/reu-nlp-project
@inproceedings{krasner-etal-2022-revisiting,
  title     = {Revisiting the Effects of Leakage on Dependency Parsing},
  author    = {Krasner, Nathaniel and Wanner, Miriam and Anastasopoulos, Antonios},
  editor    = {Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2022},
  month     = may,
  year      = {2022},
  address   = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.findings-acl.230/},
  doi       = {10.18653/v1/2022.findings-acl.230},
  pages     = {2925--2934},
}
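To make the notion of leakage concrete: in the coarse sense of Søgaard (2020), a test sentence "leaks" if its unlabeled dependency graph also occurs in the training treebank. The sketch below illustrates that coarse overlap only, representing each tree as its sequence of head indices; it is not the paper's fine-grained measure, and the function names are ours.

```python
def tree_signature(heads: list[int]) -> tuple[int, ...]:
    """Canonical form of an unlabeled dependency tree: for each token,
    the index of its head (0 denotes the root)."""
    return tuple(heads)

def graph_leakage(train_trees: list[list[int]],
                  test_trees: list[list[int]]) -> float:
    """Fraction of test trees whose unlabeled graph also appears in
    training -- the coarse overlap notion that the paper refines."""
    seen = {tree_signature(t) for t in train_trees}
    hits = sum(tree_signature(t) in seen for t in test_trees)
    return hits / len(test_trees)

# Two 2-token training trees; one test tree is a duplicate structure.
leak = graph_leakage([[0, 1], [2, 0]], [[0, 1], [1, 0]])  # 0.5
```

The paper's finer-grained measure, unlike this binary whole-graph match, correlates with the observed variation in parsing performance rather than merely explaining it.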