Lost in Translation: Chemical LMs and the Misunderstanding of Molecule Structures

AIRI, Sber AI, Skoltech, HSE University, Sber AI Lab, ISP RAS Research Center for Trusted AI
EMNLP 2024 Findings

*
Indicates Equal Contribution

Abstract

The recent integration of chemistry with natural language processing (NLP) has advanced drug discovery. Molecule representation in language models (LMs) is crucial in enhancing chemical understanding. We propose \textbf{A}ugmented \textbf{Mo}lecular \textbf{Re}trieval (\heart AMORE), a flexible zero-shot framework that assesses trustworthiness of Chemical LMs of different natures: trained solely on molecules for chemical tasks and on a combined corpus of natural language texts and string-based structures. The framework relies on molecule augmentations that preserve an underlying chemical, such as kekulization and cycle replacements. We evaluate encoder-only and generative LMs by calculating a metric based on the similarity score between distributed representations of molecules and their augmentations. Our experiments on ChEBI-20 and QM9 benchmarks show that these models exhibit significantly lower scores than graph-based molecular models trained without language modeling objectives. Augmentation of SMILES representations leads to decreased performance on chemical property prediction tasks in the MoleculeNet benchmark. Additionally, our results on the molecule captioning task for cross-domain models, MolT5 and Text+Chem T5, demonstrate that our representation-based evaluation metrics significantly correlate with the classical text generation metrics like ROUGE and METEOR.

MY ALT TEXT

Our evaluation involves generating augmented molecules for all molecules in a dataset using one of four augmentation procedures. By encoding original molecules and augmented SMILES representations and calculating distances between embeddings, the study determines model performance based on top-1 accuracy, where the correct augmented SMILES must be retrieved at the top rank.

Poster

BibTeX

@inproceedings{ganeeva-etal-2024-lost,
    title = "Lost in Translation: Chemical Language Models and the Misunderstanding of Molecule Structures",
    author = "Ganeeva, Veronika  and
      Sakhovskiy, Andrey  and
      Khrabrov, Kuzma  and
      Savchenko, Andrey  and
      Kadurin, Artur  and
      Tutubalina, Elena",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.760",
    pages = "12994--13013",
}