This project develops modern Optical Character Recognition and post-correction tools tailored to Indigenous Latin American languages, delivering a multilingual benchmark, a software package, a web interface, and digitized data to be returned to the Archive of the Indigenous Languages of Latin America (AILLA).
This project will unlock endangered and low-resource language data that have already been collected and are now stored in linguistic archives like the Archive of the Indigenous Languages of Latin America (AILLA). To do so, we will combine modern machine learning tools with linguistic expertise to develop Optical Character Recognition and post-correction tools tailored to the intricacies of these language data. The results will include a multilingual benchmark, a software package, a web interface, and digitized data that will be returned to AILLA for storage.
Participants
Faculty
PI: Antonis Anastasopoulos, Computer Science
Co-PI: Graham Neubig, Carnegie Mellon University, Language Technologies Institute
2024
AmericasNLP
A Concise Survey of OCR for Low-Resource Languages
Milind Agarwal and Antonios Anastasopoulos
In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), Jun 2024
Modern natural language processing (NLP) techniques increasingly require substantial amounts of data to train robust algorithms. Building such technologies for low-resource languages requires focusing on data creation efforts and data-efficient algorithms. For a large number of low-resource languages, especially Indigenous languages of the Americas, these data exist in image-based, non-machine-readable documents, including scanned copies of comprehensive dictionaries, linguistic field notes, children's stories, and other textual material. Optical Character Recognition (OCR) has played a major role in digitizing these resources, but it comes with particular challenges in low-resource settings. In this paper, we present the first survey of OCR techniques specific to low-resource data creation settings and outline several open challenges, with a special focus on Indigenous languages of the Americas. Based on experiences and results from previous research, we conclude with recommendations on utilizing and improving OCR for the benefit of computational researchers, linguists, and language communities.
@inproceedings{agarwal-anastasopoulos-2024-concise,
  title     = {A Concise Survey of {OCR} for Low-Resource Languages},
  author    = {Agarwal, Milind and Anastasopoulos, Antonios},
  editor    = {Mager, Manuel and Ebrahimi, Abteen and Rijhwani, Shruti and Oncevay, Arturo and Chiruzzo, Luis and Pugh, Robert and von der Wense, Katharina},
  booktitle = {Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)},
  month     = jun,
  year      = {2024},
  address   = {Mexico City, Mexico},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.americasnlp-1.10/},
  doi       = {10.18653/v1/2024.americasnlp-1.10},
  pages     = {88--102},
}
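To make the digitization step concrete: a first OCR pass over archival scans is typically run with a general-purpose engine, and its noisy output is what post-correction models then clean up. Below is a minimal sketch using the open-source Tesseract engine through the pytesseract wrapper; the file path and the fallback language code are illustrative placeholders, not part of the project's actual pipeline.

```python
# A minimal sketch of a baseline OCR pass, assuming Tesseract is installed
# along with the pytesseract and Pillow packages.
from PIL import Image
import pytesseract

def ocr_page(image_path: str, lang: str = "spa") -> str:
    """Run general-purpose OCR on one scanned page and return raw text.

    Indigenous languages rarely have a dedicated Tesseract model, so a
    related contact language (here Spanish, "spa") is a common fallback;
    the noisy output is what post-correction is then trained to fix.
    """
    page = Image.open(image_path)
    return pytesseract.image_to_string(page, lang=lang)

if __name__ == "__main__":
    # Hypothetical path to one archival scan.
    print(ocr_page("scans/dictionary_page_001.png"))
```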
2023
ComputEL
User-Centric Evaluation of OCR Systems for Kwakʼwala
Shruti Rijhwani, Daisy Rosenblum, Michayla King, Antonios Anastasopoulos, and Graham Neubig
In Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, Mar 2023
@inproceedings{rijhwani-etal-2023-user,
  title     = {User-Centric Evaluation of {OCR} Systems for Kwakʼwala},
  author    = {Rijhwani, Shruti and Rosenblum, Daisy and King, Michayla and Anastasopoulos, Antonios and Neubig, Graham},
  editor    = {Harrigan, Atticus and Chaudhary, Aditi and Rijhwani, Shruti and Moeller, Sarah and Arppe, Antti and Palmer, Alexis and Henke, Ryan and Rosenblum, Daisy},
  booktitle = {Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages},
  month     = mar,
  year      = {2023},
  address   = {Remote},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.computel-1.4/},
  pages     = {19--29},
}
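User-centric evaluation, as the paper above argues, goes beyond a single aggregate score, but character error rate (CER) remains the baseline quantity that OCR studies report. The sketch below is a generic, self-contained CER implementation for illustration; it is not the evaluation code associated with the paper.

```python
# A minimal sketch of character error rate (CER): Levenshtein edit distance
# between hypothesis and reference, normalized by reference length.

def cer(reference: str, hypothesis: str) -> float:
    """Edit distance between the two strings, divided by reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (reference[i - 1] != hypothesis[j - 1])
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          sub)               # substitution / match
        prev = curr
    return prev[n] / max(m, 1)

# Example: one substituted character in a 10-character reference -> CER 0.1
print(cer("kwak'wala!", "kwak`wala!"))
```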
EACL
Noisy Parallel Data Alignment
Ruoyu Xie and Antonios Anastasopoulos
In Findings of the Association for Computational Linguistics: EACL 2023, May 2023
An ongoing challenge in natural language processing is that its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
@inproceedings{xie-anastasopoulos-2023-noisy,
  title     = {Noisy Parallel Data Alignment},
  author    = {Xie, Ruoyu and Anastasopoulos, Antonios},
  editor    = {Vlachos, Andreas and Augenstein, Isabelle},
  booktitle = {Findings of the Association for Computational Linguistics: EACL 2023},
  month     = may,
  year      = {2023},
  address   = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.findings-eacl.111/},
  doi       = {10.18653/v1/2023.findings-eacl.111},
  pages     = {1501--1513},
}
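As a rough illustration of the noise-simulation idea (not the paper's released implementation), the sketch below injects OCR-style character confusions, deletions, and insertions into clean text, which can then be used to stress-test a word aligner. The confusion table and noise rates are hypothetical.

```python
# A minimal sketch of OCR-style noise injection for stress-testing aligners.
import random

# Hypothetical confusion pairs mimicking common OCR errors.
CONFUSIONS = {"e": "c", "c": "e", "l": "1", "1": "l",
              "o": "0", "0": "o", "m": "rn"}

def add_ocr_noise(sentence: str, p: float = 0.05, seed: int = 0) -> str:
    """Randomly apply character confusions, deletions, and insertions."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        r = rng.random()
        if r < p and ch in CONFUSIONS:   # confusion: swap similar glyphs
            out.append(CONFUSIONS[ch])
        elif r < 1.5 * p:                # deletion: drop the character
            continue
        elif r < 2 * p:                  # insertion: duplicate the character
            out.append(ch + ch)
        else:
            out.append(ch)
    return "".join(out)

print(add_ocr_noise("the quick brown fox jumps over the lazy dog"))
```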
EMNLP
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
Milind Agarwal, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
@inproceedings{agarwal-etal-2023-limit,
  title     = {{LIMIT}: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages},
  author    = {Agarwal, Milind and Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios},
  editor    = {Bouamor, Houda and Pino, Juan and Bali, Kalika},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  month     = dec,
  year      = {2023},
  address   = {Singapore},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.emnlp-main.895/},
  doi       = {10.18653/v1/2023.emnlp-main.895},
  pages     = {14496--14519},
}
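The hierarchical idea can be illustrated with a small stand-in: a base classifier predicts over all languages, and inputs whose prediction falls in a known confusion cluster are routed to a specialist trained only on that cluster. The scikit-learn sketch below uses toy data and a hypothetical confusable pair; it is not the released LIMIT system.

```python
# A minimal sketch of misprediction-resolution routing, not the LIMIT code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_ngram_clf():
    """Character n-gram text classifier, a common LID baseline."""
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )

# Toy training data: a base model over all languages, plus one specialist
# for a hypothetical cluster of frequently confused labels.
base_texts = ["bonjour le monde", "hola mundo", "hello world", "hallo welt"]
base_labels = ["fra", "spa", "eng", "deu"]
cluster = {"spa", "por"}  # hypothetical confusable pair
cluster_texts = ["hola mundo", "olá mundo"]
cluster_labels = ["spa", "por"]

base = char_ngram_clf().fit(base_texts, base_labels)
specialist = char_ngram_clf().fit(cluster_texts, cluster_labels)

def identify(text: str) -> str:
    pred = base.predict([text])[0]
    # If the base prediction is in a confusion cluster, defer to the specialist.
    return specialist.predict([text])[0] if pred in cluster else pred

print(identify("olá mundo"))  # base may say "spa"; the specialist can resolve it
```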
VarDial
PALI: A Language Identification Benchmark for Perso-Arabic Scripts
Sina Ahmadi, Milind Agarwal, and Antonios Anastasopoulos
In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), May 2023
The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying languages written in such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experimental results indicate the effectiveness of our solutions.
@inproceedings{ahmadi-etal-2023-pali,
  title     = {{PALI}: A Language Identification Benchmark for {P}erso-{A}rabic Scripts},
  author    = {Ahmadi, Sina and Agarwal, Milind and Anastasopoulos, Antonios},
  editor    = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Nakov, Preslav and Tiedemann, J{\"o}rg and Zampieri, Marcos},
  booktitle = {Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)},
  month     = may,
  year      = {2023},
  address   = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.vardial-1.8/},
  doi       = {10.18653/v1/2023.vardial-1.8},
  pages     = {78--90},
}
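A step that typically precedes language identification in this setting is script detection: keeping only sentences written mainly in Arabic-script codepoints before classifying among Perso-Arabic languages. The sketch below is generic preprocessing based on standard Unicode blocks; it is not part of the PALI benchmark code.

```python
# A minimal sketch of a Perso-Arabic script filter using Unicode blocks.

# Core Unicode blocks for the Arabic script family.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]

def is_arabic_char(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_BLOCKS)

def mostly_perso_arabic(sentence: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the letters are Arabic-script."""
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return False
    share = sum(map(is_arabic_char, letters)) / len(letters)
    return share >= threshold

print(mostly_perso_arabic("سڵاو دنیا"))    # True: Kurdish in Arabic script
print(mostly_perso_arabic("hello world"))  # False: Latin script
```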
2021
TACL
Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, and Graham Neubig
Transactions of the Association for Computational Linguistics, May 2021
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
@article{rijhwani-etal-2021-lexically,
  title     = {Lexically Aware Semi-Supervised Learning for {OCR} Post-Correction},
  author    = {Rijhwani, Shruti and Rosenblum, Daisy and Anastasopoulos, Antonios and Neubig, Graham},
  editor    = {Roark, Brian and Nenkova, Ani},
  journal   = {Transactions of the Association for Computational Linguistics},
  volume    = {9},
  year      = {2021},
  address   = {Cambridge, MA},
  publisher = {MIT Press},
  url       = {https://aclanthology.org/2021.tacl-1.76/},
  doi       = {10.1162/tacl_a_00427},
  pages     = {1285--1302},
}
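The lexically aware component can be approximated with a much simpler stand-in than the paper's WFSA decoder: build a count-based unigram model from already-recognized text and use it to rescore a post-correction model's candidate outputs. The sketch below is that simplification; the toy tokens, scores, and interpolation weight are illustrative.

```python
# A minimal sketch of count-based lexical rescoring; a simplified stand-in
# for the paper's WFSA-based lexically aware decoding.
import math
from collections import Counter

def build_lexicon(recognized_texts):
    """Unigram counts over whitespace tokens in previously recognized pages."""
    return Counter(tok for text in recognized_texts for tok in text.split())

def lexical_logprob(hypothesis: str, lexicon: Counter, alpha: float = 1.0) -> float:
    """Add-alpha smoothed unigram log-probability of a hypothesis string."""
    total = sum(lexicon.values())
    vocab = len(lexicon) + 1
    return sum(
        math.log((lexicon[tok] + alpha) / (total + alpha * vocab))
        for tok in hypothesis.split()
    )

def rescore(hypotheses, lexicon, lam: float = 0.5):
    """Pick the hypothesis maximizing model score + weighted lexical score."""
    return max(
        hypotheses,
        key=lambda h: h[1] + lam * lexical_logprob(h[0], lexicon),
    )

# Toy example: the lexicon, built from recognized text, prefers the
# hypothesis whose words it has seen before, despite a lower model score.
lexicon = build_lexicon(["gukw dzaxwa", "gukw 'walas"])
candidates = [("gukw dzaxwa", -1.2), ("gukv dzaxwa", -1.1)]  # (text, model score)
print(rescore(candidates, lexicon))
```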