Unlocking Endangered Language Resources

Funded by the National Endowment for the Humanities (NEH), 2021-2025

Project Description

We are developing modern Optical Character Recognition (OCR) and post-correction tools tailored to Indigenous Latin American languages. The deliverables are a multilingual benchmark, a software package, a web interface, and digitized data to be returned to the Archive of the Indigenous Languages of Latin America (AILLA).

This project will unlock endangered and low-resource language data that have already been collected and are stored in linguistic archives such as AILLA. To do so, we will combine machine learning with linguistic expertise to build OCR and post-correction tools suited to the intricacies of these data. The results will include a multilingual benchmark, a software package, a web interface, and digitized data that will be returned to AILLA for storage.

Participants

Faculty

PI: Antonis Anastasopoulos, Computer Science
Co-PI: Graham Neubig, Carnegie Mellon University, Language Technologies Institute

Students

Milind Agarwal, PhD CS

Alumni

Other Collaborators

Publications (from GMU authors)

Acknowledgements

This project is supported by NEH grant PR-276810-21 under the Preservation and Access: Research and Development program.

References

2024

  1. AmericasNLP
    A Concise Survey of OCR for Low-Resource Languages
    Milind Agarwal and Antonios Anastasopoulos
    In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), Jun 2024

2023

  1. ComputEL
    User-Centric Evaluation of OCR Systems for Kwak‘wala
    Shruti Rijhwani, Daisy Rosenblum, Michayla King, Antonios Anastasopoulos, and Graham Neubig
    In Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, Mar 2023
  2. EACL
    Noisy Parallel Data Alignment
    Ruoyu Xie and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: EACL 2023, May 2023
  3. EMNLP
    LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
    Milind Agarwal, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
  4. VarDial
    PALI: A Language Identification Benchmark for Perso-Arabic Scripts
    Sina Ahmadi, Milind Agarwal, and Antonios Anastasopoulos
    In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), May 2023

2021

  1. TACL
    Lexically Aware Semi-Supervised Learning for OCR Post-Correction
    Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, and Graham Neubig
    Transactions of the Association for Computational Linguistics, May 2021