Vision-language foundation models such as CLIP have achieved impressive results in global vision-language alignment, but they remain limited when it comes to creating representations for specific image regions.
To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the foundation model, to the query mask. The embedding token is then iteratively refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, which allows MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation.
The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as localized captioning and image generation. We evaluate the proposed method on all of these tasks on several datasets, such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7, and show its capabilities compared to other SOTA approaches.
The proposed method, coined MaskInversion, aims to learn a localized embedding, or feature vector, that encapsulates the characteristics of an object within an image region specified by a query mask. This embedding should not solely represent the object's intrinsic properties but also capture the broader context of the entire image.
To achieve this, we utilize representations provided by foundation models, such as CLIP. Our approach learns a token that captures the foundation model’s feature representation on the image region specified by the mask. Hence, the foundation model remains fixed during our process.
As shown in the figure above 👆, we start by initializing an embedding vector that serves as a localized embedding token for the mask. This vector is then refined through an iterative optimization process guided by an explainability map generated from the foundation model. The explainability map provides a visual indication of the areas within the image that are most influential on the current embedding, thereby allowing for targeted refinement. The optimization is supervised by enforcing the generated explainability map to be similar to the query mask. The derivation of the explainability map necessitates the calculation of a gradient, and, in turn, each gradient-descent iteration requires the computation of a gradient with respect to the loss function $\mathcal{L}$. Consequently, this iterative process requires the evaluation of second-order derivatives of the form $\frac{\partial \mathcal{L}}{\partial \mathrm{LET}_\mathbf{m}^{(k)}}\big(\mathrm{LET}_\mathbf{m}^{(k)}, \nabla \mathbf{A}\big)$, which can be computationally intensive and numerically unstable.
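The overall loop above can be sketched in a few lines of PyTorch. This is a minimal, self-contained toy: the frozen foundation model is replaced by random patch features, and the explainability map is stood in by a simple softmax over token/patch similarities (the actual method derives its map from the model's attention and gradients). Everything here — `patch_tokens`, `explainability_map`, the loss choice — is a hypothetical stand-in to illustrate "update only the token, keep the backbone frozen".

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# --- Stand-in for a frozen foundation model (e.g. CLIP's visual encoder). ---
# In the real method these patch features come from a pre-trained model;
# here they are random so the sketch is self-contained.
num_patches, dim = 16, 32
patch_tokens = torch.randn(num_patches, dim)   # frozen features, never updated
query_mask = torch.zeros(num_patches)
query_mask[3:7] = 1.0                          # region the token should capture

def explainability_map(token, tokens):
    """Toy relevancy proxy: softmax over token/patch similarities.

    The paper derives its map from the model's attention and gradients;
    this stand-in only illustrates the optimization loop."""
    return torch.softmax(tokens @ token, dim=0)

# --- MaskInversion-style loop: only the embedding token is optimized. ---
token = torch.randn(dim, requires_grad=True)
opt = torch.optim.Adam([token], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    emap = explainability_map(token, patch_tokens)
    # Push the explainability map toward the (normalized) query mask.
    loss = F.mse_loss(emap, query_mask / query_mask.sum())
    loss.backward()
    opt.step()

final_map = explainability_map(token.detach(), patch_tokens)
print(final_map[3:7].sum().item())  # mass of the map inside the query region
```

After optimization, most of the explainability mass falls inside the masked region, i.e. the token has "inverted" the mask into an embedding.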
To enhance the computational efficiency of this process, it is advantageous to avoid backpropagating through the foundation model to generate the explainability map at each iteration. We therefore propose a gradient decomposition strategy that simplifies the gradient computation associated with the explainability method.
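The core idea behind such a decomposition can be illustrated on a toy case: if the quantity whose gradient we need is linear in the query token and the backbone is frozen, the backbone's Jacobian is token-independent and can be precomputed once, so later iterations need no backward pass at all. The `feature` function and shapes below are hypothetical stand-ins, not the paper's actual decomposition.

```python
import torch

torch.manual_seed(0)
num_patches, dim = 16, 32

# Frozen "backbone": a fixed linear map from patch activations to a feature.
W = torch.randn(dim, num_patches)

def feature(a):
    return W @ a  # frozen and token-independent

a = torch.randn(num_patches, requires_grad=True)

# Naive: one backward pass through the backbone per candidate token.
def naive_relevance(token):
    sim = token @ feature(a)                       # scalar similarity score
    (grad,) = torch.autograd.grad(sim, a)          # backprop every call
    return grad

# Decomposed: since sim is linear in the token, grad_a sim = J^T token,
# where the Jacobian J of the frozen backbone is computed ONCE up front.
J = torch.autograd.functional.jacobian(feature, a.detach())  # (dim, num_patches)

def decomposed_relevance(token):
    return J.T @ token                             # no backprop needed

token = torch.randn(dim)
print(torch.allclose(naive_relevance(token), decomposed_relevance(token), atol=1e-5))
```

Both paths produce the same relevance vector, but the decomposed one amortizes the expensive backbone pass across all masks and iterations, which is where the speedup in the timing figure comes from.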
The above figure 👆 shows that, for a high number of masks and/or a high number of iterations, the proposed gradient decomposition strategy is significantly faster.
MaskInversion can be used for a diverse set of tasks, ranging from localized classification and captioning to localized image generation.
@article{bousselham2024maskinversion,
  author  = {Bousselham, Walid and Chaybouti, Sofian and Rupprecht, Christian and Ferrari, Vittorio and Kuehne, Hilde},
  title   = {MaskInversion: Localized Embeddings via Optimization of Explainability Maps},
  journal = {arXiv preprint arXiv:2407.xxxx},
  year    = {2024},
}