CLIP-S, an image-captioning AI model developed by researchers at Adobe and the University of North Carolina (UNC), has been open sourced. In comparisons with captions generated by other models, human judges preferred CLIP-S captions the majority of the time.
A paper describing the model and experiments was submitted to the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). CLIP-S generates captions from an input image using a Transformer model. The model uses CLIP during training to determine how well the generated caption describes the image; this score is used as a reward signal for reinforcement learning (RL).
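The RL objective described above can be illustrated with a minimal policy-gradient sketch. The function below is an illustrative assumption, not the authors' implementation: the captioner samples a caption, the CLIP score of that caption serves as the reward, and the loss weights the caption's negative log-likelihood by the reward (minus an optional baseline).

```python
def policy_gradient_loss(token_log_probs, reward, baseline=0.0):
    # REINFORCE-style loss: a caption is sampled from the captioning
    # model, scored by CLIP, and its negative log-likelihood is
    # weighted by the advantage (reward minus baseline).
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)
```

A higher reward makes the sampled caption's tokens more likely after a gradient step on this loss, steering the captioner toward descriptions CLIP judges to match the image.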
To improve the grammar of the generated captions, the team fine-tuned CLIP with negative examples created by randomly corrupting reference captions. To address the shortcomings of existing image-captioning evaluation methods, the team also created FineCapEval, a new benchmark dataset with more fine-grained captions that describe image backgrounds and object relationships.
Many image captioning models are trained on datasets containing input images and reference captions; the training objective uses metrics such as BLEU to measure the similarity of the generated caption to the reference caption. However, this frequently leads to models that generate generic captions that describe only the prominent objects in the image, ignoring fine details that distinguish the image.
To address this issue, the Adobe team chose to measure the accuracy of generated captions using OpenAI’s CLIP model. CLIP calculates the similarity between an image and a text string; the higher the similarity, the more accurately the text describes the image. The researchers used this CLIP score as a reward function, dubbed CLIP-S, for RL training of their captioning model.
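Concretely, the reward is based on the cosine similarity between CLIP's image and text embeddings. The sketch below assumes pre-computed embeddings and follows the CLIPScore convention of clipping at zero and scaling by a weight of 2.5; it is an illustration, not the authors' code.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def clip_reward(image_embedding, caption_embedding, weight=2.5):
    # CLIPScore-style reward: scaled cosine similarity between the
    # CLIP embeddings of the image and the candidate caption,
    # clipped at zero so unrelated captions earn no reward.
    return weight * max(cosine_similarity(image_embedding, caption_embedding), 0.0)

# Identical embeddings give the maximum reward:
# clip_reward([1.0, 0.0], [1.0, 0.0]) -> 2.5
```

In practice the embeddings would come from CLIP's image and text encoders; here they are treated as given vectors to keep the sketch self-contained.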
However, the team discovered that this model frequently produced grammatically incorrect captions, such as “several rows of planes parked outside a terminal window area with fog outside a terminal window motion position area motion.” Their solution was to fine-tune CLIP’s text encoder on negative examples in which tokens were randomly repeated, inserted, or shuffled. They also added a two-layer perceptron classifier head that detects whether a sentence is grammatically correct, trained jointly with the text-encoder fine-tuning.
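The corruption step for generating such negative examples can be sketched as follows; the function name and the exact sampling choices are assumptions for illustration, not the authors' code.

```python
import random

def corrupt_caption(tokens, seed=None):
    # Build a grammatically broken negative example from a reference
    # caption by applying one random perturbation: repeating a token,
    # inserting a copy of a token at a random position, or shuffling
    # the token order.
    rng = random.Random(seed)
    tokens = list(tokens)
    op = rng.choice(["repeat", "insert", "shuffle"])
    if op == "repeat":
        i = rng.randrange(len(tokens))
        tokens.insert(i + 1, tokens[i])      # duplicate a token in place
    elif op == "insert":
        i = rng.randrange(len(tokens))
        j = rng.randrange(len(tokens) + 1)
        tokens.insert(j, tokens[i])          # re-insert a token elsewhere
    else:
        rng.shuffle(tokens)                  # destroy the word order
    return tokens
```

Fine-tuning CLIP's text encoder to score such corrupted captions lower than the originals pushes the reward toward penalizing ungrammatical output.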
The team also created FineCapEval, a new benchmark dataset for evaluating fine-grained image-captioning models. The dataset contains 1k images: 500 from the MS COCO test split and 500 from the Conceptual Captions validation split. For each image, five human workers described four aspects: the image background; the objects in the image, including their shape and color; the relationships among the objects, such as spatial relationships; and an overall detailed caption covering the first three aspects. For each of those four criteria, the dataset thus provides 5k captions across the 1k images.
The team used the COCO dataset as a benchmark to compare their model’s captions to those of several baseline models. Although a baseline model outperformed CLIP-S on text-based metrics like BLEU, CLIP-S outperformed on image-text similarity and text-to-image retrieval metrics. It also outperformed baselines on the team’s new FineCapEval benchmark “significantly.” Finally, human judges “strongly” preferred CLIP-S-generated captions over those generated by baseline models.
Multimodal image-text AI models are an active area of research. InfoQ recently covered DeepMind’s Flamingo model, which exhibits state-of-the-art few-shot learning performance on several image-text tasks, including image captioning. Last year, InfoQ covered Google’s ALIGN model and Alibaba’s M6 model, both of which can perform a variety of image-text tasks.
The CLIP-S code and the FineCapEval dataset can be downloaded from GitHub.