Overview
CLIP is a neural network that learns visual concepts from natural language supervision. It is trained on a pretraining task that jointly trains a text encoder and an image encoder to match captions with their corresponding images. This design allows Contrastive Language-Image Pretraining to adapt readily to a wide range of visual classification tasks: given only the names of the visual categories to be recognized, it exhibits "zero-shot" capabilities similar to those seen in the GPT-2 and GPT-3 models.
The Contrastive Language-Image Pretraining (CLIP) architecture is an essential part of modern computer vision. It can be used to train embedding models for image and video classification, image similarity computations, retrieval augmented generation (RAG), and more.
The CLIP architecture has been used to train a number of public checkpoints on large datasets. Since OpenAI released the initial CLIP model, other companies, such as Apple and Meta AI, have trained their own CLIP models. These models, however, are trained for general applications rather than being tailored to a specific use case.
What is CLIP?
Contrastive Language-Image Pretraining (CLIP) is a multimodal vision model architecture developed by OpenAI. CLIP can be used to compute text and image embeddings. CLIP models are trained on text-image pairs; these pairs are used to train an embedding model that learns the relationship between an image's contents and its written caption.
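As a quick illustration of embedding computation, here is a minimal sketch, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the caption text and image URL are placeholders chosen for this example:

import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image from the COCO validation set (same URL as the example later in this post).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    # Image embedding: one vector per image.
    image_inputs = processor(images=image, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)

    # Text embedding: one vector per caption.
    text_inputs = processor(text=["two cats sleeping on a couch"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# Cosine similarity between the L2-normalized embeddings.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print((image_embeds @ text_embeds.T).item())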
CLIP Models
CLIP models can be useful in many enterprise applications. For example, they can help with:
- Sorting images of production line components
- Video sorting in a media collection
- Large-scale, real-time moderation of image content
- Deduplicating images before training large models
- And more
While OpenAI's checkpoints and other off-the-shelf CLIP models work well for many use cases, they are not well suited to highly specialized use cases, or to use cases involving proprietary corporate data that large general-purpose models could not have been trained on.
Contrastive Pretraining
Contrastive Language-Image Pretraining computes the dense cosine similarity matrix between every possible (image, text) candidate in a batch of image-text pairs. The core idea is to make correct pairings (shown in blue in the figure below) more similar and incorrect pairings (shown in grey) less similar. This is accomplished by optimizing a symmetric cross-entropy loss over the similarity scores. Put simply, the objective makes each image more similar to its own caption while decreasing its similarity to all other captions in the batch; the same logic applies to each caption, which is pushed closer to its own image and further from all other images.
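To make the objective concrete, here is a minimal PyTorch sketch of the symmetric cross-entropy loss. It is an illustration rather than OpenAI's training code; the batch size, embedding dimension, and temperature are placeholder values.

import torch
import torch.nn.functional as F

batch_size, embed_dim = 8, 512

# Placeholder embeddings; in real training these come from the image and text encoders.
image_embeds = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)
text_embeds = F.normalize(torch.randn(batch_size, embed_dim), dim=-1)

# Dense cosine similarity matrix between every (image, text) candidate, scaled by a temperature.
temperature = 0.07
logits = image_embeds @ text_embeds.T / temperature

# The correct pairing for row i is column i (the diagonal of the matrix).
targets = torch.arange(batch_size)

# Symmetric cross-entropy: pick the right caption for each image, and the right image for each caption.
loss_image_to_text = F.cross_entropy(logits, targets)
loss_text_to_image = F.cross_entropy(logits.T, targets)
loss = (loss_image_to_text + loss_text_to_image) / 2
print(loss.item())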
Encoders for Text and Images
CLIP's architecture decouples the text and image encoders, giving users flexibility in how each is implemented. Users can select different text encoders, or swap the usual image encoder, a Vision Transformer, for an alternative such as a ResNet, leaving room for experimentation. Naturally, changing one of the encoders changes the embedding distribution, so the model has to be retrained.
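As an example of swapping image encoders, the sketch below assumes the third-party open_clip library, which packages CLIP variants with different backbones behind one interface; the model names and pretrained tags are taken from that project and used here purely for illustration.

import open_clip

# CLIP with a ResNet-50 image encoder, using OpenAI's pretrained weights.
rn50_model, _, rn50_preprocess = open_clip.create_model_and_transforms("RN50", pretrained="openai")

# CLIP with a ViT-B/32 image encoder trained on LAION-2B.
vit_model, _, vit_preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")

# Each variant ships with a matching tokenizer; embeddings from the two models are not interchangeable.
rn50_tokenizer = open_clip.get_tokenizer("RN50")
vit_tokenizer = open_clip.get_tokenizer("ViT-B-32")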
Contrastive Language-Image Pretraining Use Cases
There are numerous uses for Contrastive Language-Image Pretraining. Here are a few notable use cases:
- Zero-shot image classification
- Similarity search (a brief sketch follows this list)
- Diffusion model conditioning
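As a minimal sketch of similarity search, assuming the transformers library, the openai/clip-vit-base-patch32 checkpoint, and a hypothetical local folder of images named photos/, images can be ranked against a text query by the cosine similarity of their CLIP embeddings:

import glob
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical local image collection.
paths = glob.glob("photos/*.jpg")
images = [Image.open(p) for p in paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["a photo of a dog playing in snow"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# Normalize, then rank images by cosine similarity to the query.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (image_embeds @ text_embeds.T).squeeze(-1)
for score, path in sorted(zip(scores.tolist(), paths), reverse=True):
    print(f"{score:.3f}  {path}")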
Usage
In real-world applications, the input often consists of an image and a set of predefined classes. The Python example below demonstrates how to run CLIP with the transformers library; in this instance, it zero-shot classifies the image below as either a cat or a dog.
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the pretrained CLIP model and its processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Download the image to classify.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the candidate class prompts and the image.
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # convert scores to probabilities
print(probs)
Running this code produces the following probabilities:
- “a photo of a cat”: 99.49%
- “a photo of a dog”: 0.51%
Limitations
Contrastive Language-Image Pretraining works well for zero-shot classification, but it is unlikely to outperform a model fine-tuned for a specialized task. Its ability to generalize is also limited, particularly on data or examples unlike anything seen during training. Using experiments on the FairFace dataset, the paper also shows how the choice of categories affects CLIP's performance and biases: accuracy differed noticeably between gender and racial classification, with the former above 96% and the latter around 93%.
Conclusion
Contrastive Language Image Pretraining (CLIP) models are embedding models that can be applied to a number of tasks, such as image-based retrieval augmented generation (RAG), zero-shot image and video categorization, semantic search applications, and dataset deduplication.
OpenAI's CLIP model has dramatically transformed the multimodal field. What sets CLIP apart is its proficiency at zero-shot learning, which enables it to classify images into categories it wasn't specifically trained on. This remarkable ability to generalize is a result of its training methodology, which teaches it to match images with text descriptions.