BIFOLD Ph.D. Student Receives Software Campus Funding

The next Software Campus project associated with BIFOLD is planned to kick off at the beginning of 2025.

Muaid Mughrabi is a Ph.D. student in the Data Integration and Data Preparation group at BIFOLD, Technische Universität Berlin, under the supervision of Prof. Dr. Ziawasch Abedjan. His research focuses on harnessing the capabilities of foundation models for data processing and developing pipelines for machine learning applications. In particular, he creates systems that interact with foundation models to enhance their effectiveness in data extraction and question-answering tasks.

Muaid Mughrabi, in partnership with Celonis, has been accepted into one of the recent Software Campus batches. The program will support him for about two years with BMBF funding, training, mentoring, and networking opportunities. Together with Nils Schubert, two BIFOLD students now hold an active Software Campus grant.

Abstract

In today's data-driven world, ETL (Extract, Transform, Load) processes are essential for transforming raw data into meaningful insights. These processes, however, require significant manual effort and technical knowledge, making them prone to errors. As organizations accumulate more data, there is an increasing need to automate these tasks, reducing time and costs while improving efficiency.
Our project, Leveraging Large Language Models for Pipeline Generation, aims to bridge the gap between advanced AI capabilities and practical data management needs by automating ETL pipeline creation. Large Language Models (LLMs), which excel at understanding and generating human language, provide an opportunity to revolutionize ETL tasks. LLMs are adaptable and capable of generating queries and understanding instructions, making them suitable for automating data transformations. However, existing limitations, such as capacity constraints and a lack of verification, must first be addressed.

The project's primary focus is simplifying the ETL pipeline generation process by integrating LLMs to reduce the manual effort and errors associated with pipeline creation. We aim to create a system where users can interact with the LLM to generate pipelines through natural language commands, feedback, and adjustments. This system will provide a user-friendly interface, allowing for dynamic, relatively fast pipeline generation without requiring deep technical knowledge of ETL processes.

Through collaboration between the Technical University of Berlin's D2IP research group and Celonis, a global leader in data analytics, this project will explore various approaches to enhance ETL pipelines using LLMs. We will experiment with techniques like fine-tuning models for tabular data, guided generation, and different types of optimization to improve the system's performance. The project will span 24 months, involving the development of a prototype, its enhancement, and evaluation. The final goal is to create a flexible, robust, and deployable solution that can be used in academic and industrial environments.

This innovative approach to ETL automation will significantly impact modern organizations, enabling them to leverage their data more effectively, remain competitive, and adapt quickly to evolving data challenges.

About Software Campus

The Software Campus (SWC), funded by the German Federal Ministry of Education and Research (BMBF), is an executive development program aimed at shaping tomorrow’s senior IT executives. It is directed at outstanding doctoral students in computer science who are interested in taking on leadership responsibilities in the future. In doing so, the program combines cutting-edge scientific research with hands-on management experience in a new and innovative concept. Awardees plan and lead their own research projects in collaboration with an industry partner over an approximately two-year period. Responsible for a research budget of up to €115,000 each, SWC participants can accelerate their research and gain leadership experience by hiring research assistants, acquiring specialized hardware, and promoting their work at international conferences.