
Evaluating and supervising vision models with multi-level similarity judgments

Lukas Muttenthaler
Frieda Born
Klaus Greff
Thomas Unterthiner
Andrew Lampinen
Klaus-Robert Müller
Mike Mozer

August 06, 2024

Vision foundation models are becoming increasingly pervasive. Despite their incredible success, it remains unclear to what degree they see the world the way humans do. A growing body of recent work investigates the alignment between human and model representations, but has not systematically characterized this alignment across levels of conceptual abstraction.

Here, we attempt to bridge this gap and collect a large human similarity judgment dataset of triplet odd-one-out choices at three levels of semantic abstraction: coarse-grained, fine-grained, and class-boundary. This multi-level behavioral dataset enables more nuanced comparisons between humans and computer vision models than were previously possible. Models and people are best aligned on class-boundary judgments and worst aligned on coarse-grained similarity judgments. Human alignment with different model types also depends on the level of abstraction: image/text models match people best for coarse-grained (superordinate) categories, whereas self-supervised image models match best for fine-grained semantic categories. Our dataset facilitates the evaluation, and potentially the improvement, of vision foundation models.
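To illustrate how a model can be scored against triplet odd-one-out choices, here is a minimal sketch: given embeddings for the three images in a triplet, the model's odd-one-out is the item excluded from the most similar pair, and alignment is the fraction of triplets where model and human choices agree. The function names and the use of cosine similarity are our illustrative assumptions, not the exact implementation from the paper.

```python
import numpy as np

def odd_one_out(embeddings):
    """Predict a model's odd-one-out for one triplet.

    embeddings: array of shape (3, d), one row per image.
    The predicted odd one out is the item NOT in the most
    similar pair under cosine similarity (an assumed choice
    of similarity measure for this sketch).
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T  # pairwise cosine similarities
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = max(pairs, key=lambda p: sims[p])  # most similar pair
    return ({0, 1, 2} - {i, j}).pop()         # the remaining item

def alignment(model_choices, human_choices):
    """Fraction of triplets on which model and human agree."""
    return float(np.mean(np.asarray(model_choices) == np.asarray(human_choices)))
```

Computed separately for each abstraction level (coarse-grained, fine-grained, class-boundary), this kind of agreement score is what allows the level-dependent comparisons described above.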