Walid Bousselham

I'm a PhD student at the Tübingen AI Center, advised by Prof. Hilde Kuehne. I'm also participating in the MIT-IBM Watson Sight and Sound Project.

My primary research area is deep learning for multimodal models. I am interested in various aspects of these models, ranging from improving their pretraining processes and understanding their internal prediction mechanisms to exploring their zero-shot adaptation capabilities.

During my PhD, I did an internship in the Willow team at INRIA Paris with Cordelia Schmid. I also had the pleasure of being a student researcher at Google DeepMind, working with Ahmet Iscen, Mathilde Caron, Arsha Nagrani, and Cordelia Schmid.

Prior to this, I finished my Master of Engineering in Applied Mathematics at ENSTA Paris in France and my Master of Science in Statistics and applied Probabilities at the National University of Singapore (NUS).

Email / Scholar / Twitter / Github

🔥 News

03.2026 Our paper VOLD was accepted at CVPR 2026!

01.2026 Our paper MaskInversion was accepted at ICLR 2026!

07.2025 Our paper LeGrad was accepted at ICCV 2025!

05.2024 I am spending the summer of 2024 at MIT CSAIL as a visiting scholar, working with Hendrik Strobelt and Angie Boggust.

05.2024 I gave a talk at "Cohere For AI - Community Talks" on our latest work, "LeGrad", a collaboration with MIT & IBM Research.

03.2024 Our paper Grounding Everything: Emerging Localization Properties in Vision-Language Transformers was accepted at CVPR 2024!

01.2024 I gave an interview to the Computer Vision News magazine, which features our recent paper "Grounding Everything". [Link to the interview]

01.2024 I will be attending the BMVA Symposium on Vision and Language, where I will give an oral presentation and present a poster on our recent paper Grounding Everything.

🔬 Featured Research

	VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation Walid Bousselham, Hilde Kuehne, Cordelia Schmid CVPR, 2026 Project Page / arXiv
	MaskInversion: Localized Embeddings via Optimization of Explainability Maps Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne ICLR, 2026 Project Page / Code / arXiv
	LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne ICCV, 2025 Project Page / Code / arXiv / Demo
	DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne arXiv, 2026 Project Page / Code / arXiv
	Grounding Everything: Emerging Localization Properties in Vision-Language Transformers Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne CVPR, 2024 Code / arXiv / Demo
	Learning Situation Hyper-Graphs for Video Question Answering Aisha Urooj, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah CVPR, 2023 Code / arXiv
	Efficient Self-Ensemble for Semantic Segmentation Walid Bousselham, Guillaume Thibault, Lucas Pagano, Archana Machireddy, Joe Gray, Young Hwan Chang, Xubo Song BMVC, 2022 Code / arXiv / Video

🛠️ Open-source Libraries

	MaskInversion A library for generating localized embeddings of CLIP-like models via optimization of explainability maps. `pip install maskinversion_torch` GitHub / PyPI
	LeGrad An explainability method for Vision Transformers that, given a text prompt, generates a heatmap highlighting the image regions that matter most for the model to recognize that prompt. `pip install legrad_torch` GitHub / PyPI
	GEM (Grounding Everything Method) A library for exploring emerging localization properties in Vision-Language Transformers. `pip install gem_torch` GitHub / PyPI
	Data Stream A Python tool for streaming data from remote servers to local compute resources, particularly useful for training models on large datasets stored remotely without requiring local storage (developed for internal use). `pip install data-streaming` GitHub / PyPI

📰 Media Coverage

	Talk at Cohere For AI - Community Talks Presented our latest work on LeGrad, discussing novel approaches to explainability in Vision Transformers. Watch Talk / LeGrad Project
	Computer Vision News Magazine Interview Featured interview discussing our paper "Grounding Everything" and its implications for Vision-Language models. Read Interview / GEM Project

Design and source code borrowed from Jon Barron's website.