Building Language Technologies by Machine Reading Grammars

funded by the NSF (2023-)

Project Description

Recent years have seen major advances in natural language processing (NLP) technologies, which now make it possible to perform numerous tasks through, with, or on language data. However, this progress has been limited to the handful of languages for which abundant data are available, because the neural models behind the recent improvements are particularly data-hungry. This work proposes moving away from the current data-inefficient learning paradigm and instead also modeling languages through the way humans describe them: the grammar of each language. Put simply, we aim to incorporate the grammars of languages, as written by linguists and treated as symbolic knowledge bases, into the training of neural language models.

Specifically, this work will focus on the first step towards this goal: extracting the necessary information from grammar descriptions and other linguistic documents. We will explore several alternative modeling approaches, starting with retrieval-based models, and will additionally approach the problem through a machine-reading and question-answering framework. Ultimately, the success of these methods will enable the creation of linguistically informed models, which will in turn facilitate the creation of technologies for under-served language communities in particular.

Participants

Faculty

PI
Antonis Anastasopoulos, Computer Science

Students

Anjishnu Mukherjee, PhD CS

Publications

Acknowledgements

This project is supported by NEH grant PR-276810-21 under the Preservation and Access: Research and Development program.

References

2025

  1. WACV
    Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models
    Anjishnu Mukherjee, Ziwei Zhu, and Antonios Anastasopoulos
    In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Feb 2025

2024

  1. NAACL
    A Study on Scaling Up Multilingual News Framing Analysis
    Syeda Sabrina Akter and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024
  2. EMNLP
    The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead?
    Alexander Choi*, Syeda Sabrina Akter*, J.P. Singh, and Antonios Anastasopoulos
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  3. EMNLP
    Back to School: Translation Using Grammar Books
    Jonathan Hus and Antonios Anastasopoulos
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  4. ACL
    Dictionary-Aided Translation for Handling Multi-Word Expressions in Low-Resource Languages
    Antonios Dimakis, Stella Markantonatou, and Antonios Anastasopoulos
    In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024
  5. NAACL
    Global Gallery: The Fine Art of Painting Culture Portraits through Multilingual Instruction Tuning
    Anjishnu Mukherjee, Aylin Caliskan, Ziwei Zhu, and Antonios Anastasopoulos
    In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Jun 2024
  6. AIES
    Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis
    Chahat Raj, Anjishnu Mukherjee, Aylin Caliskan, Antonios Anastasopoulos, and Ziwei Zhu
    In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Oct 2024
  7. EMNLP
    BiasDora: Exploring Hidden Biased Associations in Vision-Language Models
    Chahat Raj, Anjishnu Mukherjee, Aylin Caliskan, Antonios Anastasopoulos, and Ziwei Zhu
    In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024