Making data more accessible with LLMs
LLMs and other transformer-based models have proven effective at capturing the semantics of unstructured data objects, as shown by their success in natural language understanding, image captioning, and information extraction. However, employing these models for unstructured data management still poses open challenges: the non-deterministic nature of generative architectures, the cost and scalability of large models, and the unclear programming paradigm for building complex pipelines. This talk will introduce some of the speaker's work to overcome these challenges.
The speaker will discuss Palimpzest, a declarative system for building multi-step pipelines that process large datasets with LLMs. Palimpzest allows users to specify logical pipelines and automatically finds physical implementations of these pipelines that optimize their cost and runtime. He will then introduce Caravaggio, a semantic question-answering system that can process multimodal datasets. Caravaggio overcomes some of the limitations of state-of-the-art RAG solutions by dynamically featurizing unstructured data at several abstraction levels.
The talk will conclude with an outlook on the open challenges and opportunities in making data (management) more accessible in the era of transformer-based models.