Foundations and Methods

BIFOLD research groups conduct fundamental research on a wide range of topics concerning the foundations and methods of Artificial Intelligence (AI). This includes the management and processing of distributed data and Big Data. As Machine Learning (ML) is one of the main fields of modern AI and drives the new wave of AI applications, we also focus on a variety of Machine Learning methods, such as Reinforcement Learning and Bayesian Machine Learning as well as unsupervised and recurrent Deep Learning.

Research Topics
  • Database Systems and Information Management
  • Intelligent Data Analysis and Information Management
  • Deep Learning
  • Unsupervised Deep Learning
  • Recurrent Deep Learning Models
  • Reinforcement Learning
  • Inference and Bayesian Machine Learning
  • Distributed Data Processing
  • Big Data Processing

Management of Data Science Processes and Systems

The management of Data Science processes is fundamental for the development and application of Artificial Intelligence. BIFOLD research aims to drastically improve the efficiency of data preparation and management processes. In addition, a research focus lies on the development of security and visualization tools for Big Data management as well as on solutions for managing discretizations of distributed graph data streams, in particular managing the state of a graph that evolves over time.

Research Topics
  • Information Integration and Data Quality
  • Data Management for the Machine Learning Lifecycle
  • Information Visualization and Visual Analytics
  • Big Data Security
  • Graph Data Management

Architectures and Technologies

BIFOLD research groups develop Big Data infrastructures and tools for programming and data extraction. We investigate architectures and algorithms for the scalable processing of Big Data as well as advanced Machine Learning. Furthermore, we develop methods and tools for the software-engineering paradigm of Software 2.0 / Neural Network Programming and other data programming languages. Our researchers will also explore new methods for mining data from massive text and other media collections.

Research Topics
  • Big Data Architectures
  • Big Data Engineering and Benchmarking
  • Engineering Software 2.0: Neural Network Programming
  • Data Programming Languages
  • Knowledge Discovery from Massive Text Data Collections
  • Knowledge Discovery from Massive Image, Audio, and Video Collections

Responsible AI

Security and transparency of Machine Learning processes are cornerstones of Responsible AI. BIFOLD will therefore conduct research on Explainable AI as well as on secure Machine Learning to counter threats such as data poisoning, adversarial examples, or model extraction. Our research groups investigate how Big Data analysis can be conducted in a privacy-preserving way. Additionally, we aim to improve bias detection in training data, to ensure transparency, fairness, and reproducibility of algorithms, and to advance the ethical and legal frameworks that guide the responsible handling of data and algorithms.

Research Topics
  • Explainable AI
  • Secure Machine Learning
  • Technological Enablers for Informational Self-Determination
  • Technology-Aware Data Ethics and Law

Systems and Tools for Novel Applications

BIFOLD will contribute to the development of novel AI applications by conducting practice-oriented research on tools and infrastructures. Data is an integral part of a digitized society, fueling the algorithms of Machine Learning and Artificial Intelligence. Data infrastructures provide the technical foundation for offering broad access to data and processing capabilities. We will therefore investigate technical, economic, and legal aspects of information marketplaces. BIFOLD research groups also conduct interdisciplinary research on the theory and application of coupling Machine Learning and simulation methods to identify potential for novel applications.

Research Topics
  • Information Marketplaces
  • Simulation and Machine Learning
  • Big Data and Machine Learning foundations for Medicine
  • Big Data and Machine Learning foundations for the Natural Sciences

Main Application Areas


Medicine

Medicine poses a spectrum of interdisciplinary AI challenges, ranging from basic scientific questions to complex gene regulation mechanisms and networks. BIFOLD will focus on the integration of microscopic-histological image data and proteogenomic “omics” data in translational cancer research, as well as of radiological and proteomic data in cardiology. Another research focus is the integration of heterogeneous, distributed clinical data and highly noisy real-time data from intensive care medicine. Data privacy issues in the analysis of geographically distributed medical data from different data centers pose a further scientific challenge.

Digital Humanities

A central question in the Digital Humanities is how to use complex a priori knowledge efficiently to develop powerful interactive methods for dealing with highly structured, heterogeneous data. Typically, patterns and pattern groups are to be recognized and statistically analyzed in characteristics of historical sources such as layout, images in texts, text-image constellations, word combinations, or word clusters. On the one hand, this would allow the exploration of models and the simulation of their consequences; on the other hand, it would advance the generation of heuristics that emulate scientific analysis methods. Both significantly increase the predictive power of the models. Furthermore, the use of methods to interpret the models might enable the automatic formulation of hypotheses. Fundamental AI problems, such as the use of a priori knowledge and graph structures, are to be investigated in the Digital Humanities.


Communication

In the application area of communication, AI methods for coordination, compression, and caching in ultra-dense meshed networks are to be explored. In this way, reliability as well as spectral and energy efficiency are to be improved and latency reduced. In many cases, only limited, decentralized, or online data is available. Therefore, the development and application of novel methods will be necessary to enable distributed learning from minimal data sets while the system is running. The practical boundary conditions of communication applications and the statistical processes to be considered require novel algorithms to overcome these problems. Scalable, real-time language technology for user interfaces can enable the analysis of multimodal, extremely heterogeneous data for natural language understanding in dialogue-oriented assistants.

Open Source Projects

BIFOLD leverages foundational research and builds readily available open source technologies, tools, and systems jointly with BIFOLD research partners. This system orientation and open innovation mindset create a bridge between foundational research and applied research. BIFOLD partners have long-standing experience in large-scale open source software development and successful technology transfer.

Flink / Stratosphere

“Apache Flink” [1] is a stream-processing framework for distributed, high-performing, always-available, and accurate data streaming applications. It originated from the joint research project “Stratosphere” [2], funded by the Deutsche Forschungsgemeinschaft (DFG). After a successful incubator phase, Flink graduated to a top-level project of the Apache Software Foundation [3] and became one of the most important and promising projects within the Apache Big Data Stack. Flink has a large and lively community, numerous well-known users such as Zalando, Alibaba, and Netflix, and its own annual conference, “FlinkForward” [4], which takes place in Berlin and San Francisco.

[1] https://flink.apache.org

[2] http://stratosphere.eu/

[3] https://www.apache.org/

[4] https://flink-forward.org/


Emma is a quotation-based Scala DSL that enables holistic optimizations of data flow programs for scalable data analysis on Apache Flink and Spark.



A Hardware Adaptive Query Compiler

The performance of modern processors is primarily bound by a fixed energy budget. This power wall forces processor vendors to specialize their processors for certain applications in order to provide the speedups users expect.



Peel is a framework that helps you to define, execute, analyze, and share experiments for distributed systems and algorithms. A Peel bundle packages the configuration data, datasets, and workload applications required for the execution of a particular collection of experiments. Peel bundles can be largely decoupled from the underlying operational environment, so they can easily be migrated to new environments and reproduced there.



The Myriad Toolkit facilitates the specification of scalable data generation programs with complex statistical constraints via a dedicated XML-based prototyping language for data generators.

The Myriad Toolkit uses advanced PRNG algorithms to implement offset-based access to the elements of the generated domain type sequences in bounded time. This feature enables an efficient data-parallel execution mode. Data generation programs created with the Myriad Toolkit can therefore be scaled out in a massively parallel manner in order to quickly generate large synthetic datasets with complex statistical dependencies.
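
The idea of offset-based access can be illustrated with a minimal Python sketch. It is not Myriad's implementation: the text above does not name the PRNG algorithms used, so the sketch assumes a SplitMix64-style counter-based generator as a stand-in. The names `value_at`, `generate_partition`, and `generate_sequential` are hypothetical and chosen for illustration only. The point it shows is that when the i-th value can be computed directly from the seed and the offset i, independent workers can each generate their own slice of the sequence without coordination.

```python
MASK64 = (1 << 64) - 1
GAMMA = 0x9E3779B97F4A7C15  # SplitMix64 increment (assumed stand-in PRNG)


def value_at(seed: int, index: int) -> int:
    """Return the index-th 64-bit value of the stream in constant time.

    The state after index+1 steps is computed directly from the offset,
    so no preceding values have to be generated.
    """
    z = (seed + (index + 1) * GAMMA) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def generate_partition(seed: int, start: int, stop: int) -> list:
    """Hypothetical worker: generate rows [start, stop) independently."""
    # Each row is (row id, attribute drawn from 0..99 using the row's offset).
    return [(i, value_at(seed, i) % 100) for i in range(start, stop)]


def generate_sequential(seed: int, count: int) -> list:
    """Single-node reference: generate rows [0, count) in order."""
    return generate_partition(seed, 0, count)
```

Under these assumptions, splitting the row range across any number of workers and concatenating their outputs yields the same dataset as a single sequential run, which is the property that makes data-parallel generation possible.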