
Vision through Language: Towards Open-world Recognition

October 15, 2024, 10:00 - 11:30

TU Berlin, Room EN 148, Einsteinufer 17, 10587 Berlin

Paolo Rota

BIFOLD Talk with Prof. Paolo Rota


Abstract:
The seminar explores how the integration of vision and language models is transforming recognition tasks across multiple domains. Central to this discussion is the challenge of recognizing and adapting to previously unseen or evolving categories in open-world settings, particularly without relying on pre-defined vocabularies or exhaustive training data. Leveraging pre-trained vision-language models (VLMs) such as CLIP, the presented works propose techniques for improving recognition and adaptation in various complex scenarios.

The seminar highlights the growing shift towards unsupervised and training-free methods, which address the limitations of existing models that require extensive labeled data or specialized training. For example, AutoLabel automatically generates candidate class names in Open-set Unsupervised Video Domain Adaptation, removing the need for oracle knowledge of label names. Similarly, the novel Vocabulary-free Image Classification task is tackled with a framework that classifies images in an unconstrained semantic space, bypassing the restrictions of fixed vocabularies through dynamic category search methods.

Another key area is zero-shot temporal action localization (ZS-TAL), where models must identify and localize unseen actions in videos without training on labeled data. Test-time adaptation emerges as a promising approach here, allowing models to adapt to new contexts at inference time without task-specific training. The same emphasis on flexible, real-time solutions is present in Automatic Programming of Experiments (APEx), a framework that automates the benchmarking process for large multimodal models, accelerating evaluation and hypothesis testing.

Short bio:
Paolo is an assistant professor at the Department of Information Engineering and Computer Science (DISI) and the Center for Mind and Brain (CIMeC) at the University of Trento.

He received his Ph.D. from the same university and has worked as a postdoctoral Marie Curie fellow at TU Wien and as a postdoc at the Istituto Italiano di Tecnologia in Genoa. He also worked as an ML researcher at the ProM Facility in Rovereto. He has been an assistant professor at the University of Trento since 2019 and started his tenure track in 2022. His research interests are focused on Image and Video Understanding in the context of Vision and Language.