InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation

Pierre Colombo, Chloé Clavel, Pablo Piantanida

[AAAI-22] Main Track
Abstract: Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Moreover, human annotation campaigns are time-consuming and involve non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy for quality. Over the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and therefore do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of untrained metrics that can be viewed as string-based metrics addressing the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures, allowing InfoLM to be adapted to different evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvements and two-figure correlation gains in many configurations over existing metrics on both summarization and data2text generation tasks.
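The abstract describes comparing candidate and reference texts via information measures over distributions produced by a masked language model. The following is a minimal illustrative sketch, not the authors' implementation: the toy vocabulary distributions stand in for masked-LM outputs, and KL divergence is used as one example of an information measure (the paper considers a family of such measures).

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary.

    A small eps guards against log(0); p and q are lists of probabilities.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy distributions over a 3-word vocabulary. In InfoLM, such distributions
# would come from a pre-trained masked language model applied to the masked
# reference and candidate sentences; these numbers are purely illustrative.
reference_dist = [0.7, 0.2, 0.1]
candidate_dist = [0.6, 0.3, 0.1]

# Identical distributions give divergence 0; the score grows as the
# candidate's predicted word distribution drifts from the reference's.
score = kl_divergence(reference_dist, candidate_dist)
```

Swapping in other divergences (e.g., alpha- or Renyi-type measures) at the comparison step is what lets a metric family of this kind be tuned to different evaluation criteria, as the abstract notes.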

Sessions where this paper appears

  • Poster Session 2

    Red 5

  • Poster Session 7

    Red 5

  • Oral Session 7

    Red 5