InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation
Pierre Colombo, Chloé Clavel, Pablo Piantanida
[AAAI-22] Main Track
Abstract:
Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of untrained metrics that can be viewed as string-based metrics addressing the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures, allowing InfoLM to be adapted to different evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvements and two-figure correlation gains in many configurations compared to existing metrics, on both summarization and data2text generation tasks.
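The abstract only sketches the idea (per-position token distributions from a pre-trained masked language model, compared with an information measure), so the following is a minimal illustrative sketch rather than the authors' implementation. The choice of bert-base-uncased, the averaging of per-position distributions, and the use of a plain KL divergence are all assumptions made for the example.

# Illustrative InfoLM-style score (sketch, not the authors' code).
# Assumption: average the masked-LM distributions over positions and
# compare candidate vs. reference with a KL divergence.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def sentence_distribution(text: str) -> torch.Tensor:
    # Mask each token position in turn and average the resulting
    # vocabulary distributions (one forward pass per position).
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    dists = []
    with torch.no_grad():
        for pos in range(1, input_ids.size(0) - 1):  # skip [CLS]/[SEP]
            masked = input_ids.clone()
            masked[pos] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, pos]
            dists.append(torch.softmax(logits, dim=-1))
    return torch.stack(dists).mean(dim=0)

def infolm_kl(candidate: str, reference: str) -> float:
    # KL(reference || candidate); lower means the candidate's token
    # distribution is closer to the reference's.
    p = sentence_distribution(reference)
    q = sentence_distribution(candidate)
    eps = 1e-12
    return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps))).item()

print(infolm_kl("the cat sat on the mat", "a cat was sitting on the mat"))

Other information measures (e.g., Renyi or Fisher-Rao type divergences) could be swapped in at the comparison step, which is what lets a metric of this kind target different evaluation criteria.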
Sessions where this paper appears
- Poster Session 2 (Red 5)
- Poster Session 7 (Red 5)
- Oral Session 7 (Red 5)