Banner Banner

ConfigILM: A general purpose configurable library for combining image and language models for visual question answering

Leonard Hackel
 Kai Norman Clasen
 Begüm Demir

April 19, 2024

ConfigILM is an open-source Python library for rapid iterative development of image-language models for visual question answering in PyTorch. It provides a convenient implementation for seamlessly combining image and language models from two popular PyTorch libraries that are timm and huggingface. These libraries allow a variety of configurations of models without additional implementation effort. The monolithic interface provided by ConfigILM simplifies the exchange of components of a considered model and offers possibilities for developing new image-language models based on recombining the selected encoders. Additionally, the library provides pre-built and throughput-optimized PyTorch dataloaders. We also provide a guideline document that contains installation instructions, tutorial examples, and a complete discussion of the monolithic interface to the library. ConfigILM is released under the MIT License, encouraging its use in academic and commercial environments. The source code and documentation of ConfigILM are available at https://github.com/lhackel-tub/ConfigILM