Image Similarity Search with Hugging Face Datasets and Transformers
In this article, you will learn how to build an image similarity search system using 🤗 Transformers. Finding the similarity between a query image and potential candidate images is an important use case for information retrieval systems, such as reverse image search (i.e., finding the source of a query image). The question such a system tries to answer is: given a query image and a set of candidate images, which candidates are most similar to the query?
We will use the 🤗 Datasets library, as its seamless support for parallel processing comes in handy when building such a system.
Although this article uses a ViT-based model (nateraw/vit-base-beans) and a specific dataset (Beans), the approach can be extended to other models that support the vision modality and to other image datasets; other well-known vision models available on the Hugging Face Hub are worth trying as well.
Furthermore, the approach presented in the article has the potential to be extended to other modalities as well.
To explore the complete image similarity system, you can refer to the accompanying Colab Notebook.
How do we define similarity?
To build this system, we first need to define how we want to compute the similarity between two images. A widely popular approach is to first compute dense representations (i.e., embeddings) of the given images, and then use the cosine similarity metric to determine how similar the two images are.
In this article, we use embeddings to represent images in a vector space. This gives us a way to meaningfully compress images from a high-dimensional pixel space (e.g., 224 × 224 × 3) to a much lower-dimensional space (e.g., 768). The main advantage of doing so is the reduced computation time in the subsequent steps.
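As a quick illustration of the metric (a toy example, not part of the actual pipeline), cosine similarity between two such 768-dimensional vectors can be computed directly with PyTorch:

import torch
import torch.nn.functional as F

# Two hypothetical 768-dimensional embeddings (random here, purely for illustration).
emb_one = torch.randn(1, 768)
emb_two = torch.randn(1, 768)

# Cosine similarity is the dot product of the L2-normalized vectors; it lies in [-1, 1].
score = F.cosine_similarity(emb_one, emb_two)
print(score.item())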
Computing embeddings
In order to compute an embedding for an image, we need to use a vision model that knows how to represent the input image in a vector space. This type of model is also often referred to as an image encoder.
We use the AutoModel class to load the model. It provides an interface to load any compatible model checkpoint from the Hugging Face Hub. In addition to the model, we also load the processor associated with it for data preprocessing.
from transformers import AutoFeatureExtractor, AutoModel

model_ckpt = "nateraw/vit-base-beans"
extractor = AutoFeatureExtractor.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
The checkpoint used in this example is a ViT model fine-tuned on the beans dataset.
Here are some questions you might ask:
Q1: Why don't we use AutoModelForImageClassification?
This is because we want dense representations of the images, whereas AutoModelForImageClassification only outputs discrete class predictions.
Q2: Why use this particular checkpoint?
As mentioned earlier, we use a specific dataset to build the system. Therefore, instead of using a generic model (such as one trained on the ImageNet-1k dataset), it is better to use a model that has been fine-tuned on the dataset at hand. This way, the model understands the input images better.
Note that you can also use checkpoints obtained through self-supervised pre-training; the checkpoint does not have to come from supervised training. In fact, when pre-trained well, self-supervised models can achieve impressive retrieval performance.
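For instance, a self-supervised checkpoint such as facebook/dino-vits16 (a DINO ViT available on the Hub, used here purely as an illustration and not in the rest of this article) can be loaded in exactly the same way:

from transformers import AutoFeatureExtractor, AutoModel

# Illustrative only: a self-supervised (DINO) ViT checkpoint from the Hub.
ssl_ckpt = "facebook/dino-vits16"
ssl_extractor = AutoFeatureExtractor.from_pretrained(ssl_ckpt)
ssl_model = AutoModel.from_pretrained(ssl_ckpt)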
Now that we have a model for computing embeddings, we need some candidate images to be queried.
Load candidate image dataset
Later, we will build hash tables that map the candidate images to hash values; these tables will be used at query time and are discussed in more detail below. For now, let's use the training split of the beans dataset to obtain a set of candidate images.
from datasets import load_dataset

dataset = load_dataset("beans")
A sample from the training set is shown below:
The three features of this dataset are as follows:
dataset ["train"].features >>> {'image_file_path': Value (dtype='string', id=None), 'image': Image (decode=True, id=None), 'labels': ClassLabel (names=['angular_leaf_spot', 'bean_rust', 'healthy'], id=None)}
To keep the overall running time of the demonstration short, we only use 100 images from the candidate image dataset.
num_samples = 100
seed = 42
candidate_subset = dataset["train"].shuffle(seed=seed).select(range(num_samples))
The process of finding similar images
The figure below shows the basic process of obtaining similar images.
Breaking the figure down, the process consists of four steps:
- Extract embeddings from the candidate images (candidate_subset) and store them in a matrix.
- Take the query image and extract its embedding.
- Iterate over the embedding matrix (obtained in step 1) and compute the similarity score between the query embedding and each candidate embedding. We usually maintain a dictionary-like mapping between the candidate image IDs and their similarity scores.
- Sort by similarity score and return the corresponding image IDs. Finally, use these IDs to fetch the candidate images.
We can write a simple utility function to compute embeddings and apply it to every image in the candidate dataset with the map() method, so that the embeddings are computed efficiently.
import torch

def extract_embeddings(model: torch.nn.Module):
    """Utility to compute embeddings."""
    device = model.device

    def pp(batch):
        images = batch["image"]
        # `transformation_chain` is a composition of preprocessing
        # transformations we apply to the input images to prepare them
        # for the model. For more details, check out the accompanying Colab Notebook.
        image_batch_transformed = torch.stack(
            [transformation_chain(image) for image in images]
        )
        new_batch = {"pixel_values": image_batch_transformed.to(device)}
        with torch.no_grad():
            embeddings = model(**new_batch).last_hidden_state[:, 0].cpu()
        return {"embeddings": embeddings}

    return pp
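The transformation_chain referenced above is defined in the accompanying Colab Notebook. As a rough sketch, such a preprocessing pipeline could be built with torchvision, assuming the 224 × 224 input resolution of this ViT checkpoint and the extractor's normalization statistics:

import torchvision.transforms as T

# Illustrative preprocessing pipeline; the exact version lives in the Colab Notebook.
transformation_chain = T.Compose(
    [
        # Resize slightly larger than the model input, then center-crop to the
        # 224x224 resolution expected by this ViT checkpoint.
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=extractor.image_mean, std=extractor.image_std),
    ]
)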
We can map extract_embeddings() like this:
device = "cuda" if torch.cuda.is_available () else "cpu" extract_fn = extract_embeddings (model.to (device)) candidate_subset_emb = candidate_subset.map (extract_fn, batched=True, batch_size=batch_size)
Next, we create a list of candidate image IDs for convenience.
from tqdm.auto import tqdm

candidate_ids = []

for id in tqdm(range(len(candidate_subset_emb))):
    label = candidate_subset_emb[id]["labels"]

    # Create a unique identifier.
    entry = str(id) + "_" + str(label)

    candidate_ids.append(entry)
We use an embedding matrix containing all candidate images to compute similarity scores against the query image. The candidate embeddings have already been computed; here we simply collect them into a single matrix.
import numpy as np

all_candidate_embeddings = np.array(candidate_subset_emb["embeddings"])
all_candidate_embeddings = torch.from_numpy(all_candidate_embeddings)
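As a quick sanity check, the resulting matrix should contain one 768-dimensional row per candidate image:

print(all_candidate_embeddings.shape)  # expected: torch.Size([100, 768])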
We will use cosine similarity to compute the similarity score between two embedding vectors, and then use it to fetch similar candidate images for a given query image.
def compute_scores(emb_one, emb_two):
    """Computes cosine similarity between two vectors."""
    scores = torch.nn.functional.cosine_similarity(emb_one, emb_two)
    return scores.numpy().tolist()


def fetch_similar(image, top_k=5):
    """Fetches the `top_k` similar images with `image` as the query."""
    # Prepare the input query image for embedding computation.
    image_transformed = transformation_chain(image).unsqueeze(0)
    new_batch = {"pixel_values": image_transformed.to(device)}

    # Compute the embedding.
    with torch.no_grad():
        query_embeddings = model(**new_batch).last_hidden_state[:, 0].cpu()

    # Compute similarity scores with all the candidate images at one go.
    # We also create a mapping between the candidate image identifiers
    # and their similarity scores with the query image.
    sim_scores = compute_scores(all_candidate_embeddings, query_embeddings)
    similarity_mapping = dict(zip(candidate_ids, sim_scores))

    # Sort the mapping dictionary and return `top_k` candidates.
    similarity_mapping_sorted = dict(
        sorted(similarity_mapping.items(), key=lambda x: x[1], reverse=True)
    )
    id_entries = list(similarity_mapping_sorted.keys())[:top_k]

    ids = list(map(lambda x: int(x.split("_")[0]), id_entries))
    labels = list(map(lambda x: int(x.split("_")[-1]), id_entries))
    return ids, labels
Executing a query
After the above preparations, we can perform a similarity search. We select a query image from the test set of the beans dataset to search:
test_idx = np.random.choice(len(dataset["test"]))
test_sample = dataset["test"][test_idx]["image"]
test_label = dataset["test"][test_idx]["labels"]

sim_ids, sim_labels = fetch_similar(test_sample)
print(f"Query label: {test_label}")
print(f"Top 5 candidate labels: {sim_labels}")
The result is:
Query label: 0
Top 5 candidate labels: [0, 0, 0, 0, 0]
It looks like our system retrieved the correct set of similar images. We can visualize the results as follows:
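(The exact plotting helper lives in the accompanying Colab Notebook; below is a minimal matplotlib sketch, assuming the variables defined above.)

import matplotlib.pyplot as plt

# Query image first, followed by the retrieved candidates.
images = [test_sample] + [candidate_subset_emb[i]["image"] for i in sim_ids]
labels = [test_label] + sim_labels

plt.figure(figsize=(20, 10))
for i, (image, label) in enumerate(zip(images, labels)):
    ax = plt.subplot(1, len(images), i + 1)
    ax.set_title("Query" if i == 0 else f"Similar {i} (label: {label})")
    plt.imshow(image)
    plt.axis("off")
plt.show()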
Further extensions and conclusions
We now have a working image similarity system. But real systems need to handle many more candidate images than this. With that in mind, our current procedure has several shortcomings:
- If we store the embeddings as-is, the memory requirements can add up quickly, especially when dealing with millions of candidate images. The embeddings are 768-dimensional in our example, which can still be relatively high in the large-scale regime.
- High-dimensional embeddings have a direct impact on subsequent computations involved in the retrieval part.
If we can somehow reduce the dimensionality of the embeddings without affecting their meaning, we can maintain a good trade-off between speed and retrieval quality. The accompanying Colab Notebook implements and demonstrates how to achieve this with random projection and locality-sensitive hashing (LSH).
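To give a flavor of the idea (the full implementation is in the Colab Notebook), random projection hashes an embedding by checking on which side of a set of random hyperplanes it falls; a rough sketch, assuming 768-dimensional embeddings and an 8-bit hash, might look like this:

import numpy as np

hash_size = 8          # number of bits in the hash
embedding_dim = 768    # dimensionality of the ViT embeddings
rng = np.random.default_rng(seed=42)

# One random hyperplane per hash bit.
random_vectors = rng.normal(size=(hash_size, embedding_dim))

def compute_hash(embedding: np.ndarray) -> str:
    # Project the embedding onto the hyperplanes and keep only the signs.
    bools = (random_vectors @ embedding) > 0
    return "".join("1" if b else "0" for b in bools)

# Candidates whose hash matches the query's hash form a much smaller search
# pool on which exact cosine-similarity scoring can then be performed.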
🤗 Datasets provides direct integration with FAISS, which further simplifies the process of building a similarity system. Suppose you have extracted the embeddings of the candidate images (the beans dataset) and stored them in a feature called embeddings. You can now easily use the dataset's add_faiss_index() method to build a dense index:
dataset_with_embeddings.add_faiss_index(column="embeddings")
Once the index is built, you can use the get_nearest_examples() method of dataset_with_embeddings to retrieve the nearest neighbors for a given query embedding:
scores, retrieved_examples = dataset_with_embeddings.get_nearest_examples(
    "embeddings", qi_embedding, k=top_k
)
This method returns the retrieval scores and the corresponding images. For more information, you can check the official documentation and this notebook.
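Putting the pieces together, here is a minimal end-to-end sketch of the FAISS route (it reuses extract_fn, transformation_chain, model, and device from earlier; the variable names are illustrative, not the exact code from the notebook):

# Build a dataset whose "embeddings" column holds the candidate embeddings.
dataset_with_embeddings = candidate_subset.map(extract_fn, batched=True, batch_size=24)
dataset_with_embeddings.add_faiss_index(column="embeddings")

# Embed the query image the same way as before.
with torch.no_grad():
    query_pixel_values = transformation_chain(test_sample).unsqueeze(0).to(device)
    qi_embedding = model(pixel_values=query_pixel_values).last_hidden_state[:, 0].cpu().numpy()[0]

# Retrieve the 5 nearest candidates and their scores.
scores, retrieved_examples = dataset_with_embeddings.get_nearest_examples(
    "embeddings", qi_embedding, k=5
)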
In this article, we quickly got up and running and built an image similarity system. If you found this post interesting, we strongly encourage you to keep building on top of the concepts discussed here so you become more familiar with the inner workings.
Still want to know more? Here are some other resources that may be useful to you:
- Faiss: Efficient Similarity Search Library
- ScaNN: Efficient Vector Similarity Search
- Integrate an image search engine in a mobile application
Original English text: https://hf.co/blog/image-simi...
Translator: Matrix Yao (Yao Weifeng), Intel Deep Learning Engineer, working on the application of transformer-family models on various modal data and the training and reasoning of large-scale models.
Proofreading and typesetting: zhongdongy (Adong)