Context-Based Contrastive Learning for Scene Text Recognition

Xinyun Zhang, Binwu Zhu, Xufeng Yao, Qi Sun, Ruiyu Li, Bei Yu

[AAAI-22] Main Track
Abstract: Pursuing accurate and robust recognizers has been a long-standing goal for scene text recognition (STR) researchers. Recently, attention-based methods have demonstrated their effectiveness and achieved impressive results on public benchmarks. The attention mechanism enables models to recognize scene text with severe visual distortions by leveraging contextual information. However, recent studies have revealed that an implicit over-reliance on context leads to catastrophic out-of-vocabulary performance. In contrast to their superior accuracy on seen text, models are prone to misrecognizing unseen text even when the image quality is good. We propose a novel framework, Context-based Contrastive Learning (ConCLR), to alleviate this issue. Our proposed method first generates characters with different contexts via simple image concatenation operations and then optimizes a contrastive loss on their embeddings. By pulling together clusters of identical characters within various contexts and pushing apart clusters of different characters in embedding space, ConCLR suppresses the side effect of overfitting to specific contexts and learns a more robust representation. Experiments show that ConCLR significantly improves out-of-vocabulary generalization and achieves state-of-the-art performance on public benchmarks when combined with attention-based recognizers.
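
The two steps named in the abstract (concatenating images to place characters in new contexts, then contrasting character embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss, so a generic supervised-contrastive-style loss is used here as a stand-in. The function names `concat_images` and `char_contrastive_loss`, the `temperature` parameter, and the array shapes are all assumptions made for illustration.

```python
import numpy as np

def concat_images(img_a, img_b):
    """Place two text images (H, W, C) side by side (hypothetical helper).

    Assumption: the label of the result is the two label strings joined,
    so each character now appears in a context it did not have before.
    """
    if img_a.shape[0] != img_b.shape[0] or img_a.shape[2] != img_b.shape[2]:
        raise ValueError("images must share height and channel count")
    return np.concatenate([img_a, img_b], axis=1)

def char_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised-contrastive-style loss over per-character embeddings.

    embeddings: (N, D) array, one row per decoded character position.
    labels:     (N,) integer character classes.
    Rows with the same label are pulled together; all others are pushed apart.
    This is a generic stand-in, not the loss defined in the paper.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    if n < 2 or not valid.any():
        return 0.0

    # Cosine similarity scaled by the temperature.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    sim = np.where(self_mask, -np.inf, sim)  # exclude self-pairs

    # Numerically stable log-softmax over every other sample in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_denom = np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    log_prob = sim - sim_max - log_denom

    # Average the log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())
```

In this sketch the loss is small when embeddings of the same character class lie close together regardless of which image they came from, and large when embeddings cluster by something other than character identity. In a real recognizer the embeddings would come from the decoder's per-character features for the original and concatenated images.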

Sessions where this paper appears

  • Poster Session 6

    Blue 2

  • Poster Session 12

    Blue 2