Earth Observation data for climate change research


AgoraEO: One platform integrates data from all over the world  
Visualization of sea surface temperature and salinity based on EO data.
(Copyright: European Space Agency)

Environmental reports on the dramatic retreat of the Arctic ice sheet, the ongoing deforestation of rain forests or the spread of forest fires are mostly based on the data analysis of satellite images. The analysis of large amounts of Earth Observation (EO) data plays a crucial role in understanding and quantifying climate change.

“The efficient use of these data makes it possible to monitor and predict the effects of climate change on a global scale with unprecedented reliability,” explains Prof. Dr. Begüm Demir, head of the Big Data Analytics for Earth Observation research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and professor of Remote Sensing Image Analysis at TU Berlin. Advances in satellite systems have massively increased the amount and variety as well as the spatial and spectral resolution of EO data. “Nowadays we possess huge EO data archives. The Sentinel satellites in the Copernicus program alone – Europe’s flagship EO satellite initiative – provide us with about 12 terabytes of satellite images per day,” says Begüm Demir.

The European Space Agency uses a multitude of satellites to create large amounts of EO data.
(Copyright: European Space Agency)

The problem: There is no single platform that intelligently connects the different datasets of interest from all over the world. All existing analysis platforms rely on heterogeneous technologies with different interfaces and data formats, which prevents cross-platform use. For example, it is nearly impossible to apply an analytics procedure developed on one platform to another. “It’s like using Word on a PC without a Windows environment – meaning you have to instruct each computing operation individually. This ‘lock-in effect’ hinders innovation and thus the efficient use of the collected data for climate protection,” explains Dr. Jorge Quiané-Ruiz, head of the Big Data Systems research group at BIFOLD.

Overcoming these limitations in the use of EO datasets is the common goal of Begüm Demir and Jorge Quiané-Ruiz. Their project, AgoraEO, is a universal Earth Observation ecosystem infrastructure for sharing, finding, assembling, and running datasets, algorithms, and other tools. While Begüm Demir brings expertise in remote sensing data processing and analysis, Jorge Quiané-Ruiz is an expert in data processing and data management. He develops the Agora infrastructure, a more general-purpose ecosystem for data science and AI innovation, on which AgoraEO is partially based.

AgoraEO’s innovative infrastructure allows all interested parties to contribute both EO data and technologies without having to upload them to a common server. “Our goal is to create an infrastructure that enables federated analysis across different platforms, making modern Earth observation technology accessible to all scientists and society, thus promoting climate change innovation worldwide,” says Jorge Quiané-Ruiz.

*This article appeared for the first time on 31.07.2021 in the supplement “Climate Research” of Der Tagesspiegel, Berlin.




BIFOLD Junior Fellow Dr. Eleni Tzirita Zacharatou presents the vision for AgoraEO in a talk at TU Twente.

For more technical insights into the research in the Agora project, visit their blog on Medium.



Authors: Arne de Wall, Björn Deiseroth, Eleni Tzirita Zacharatou, Jorge-Arnulfo Quiané-Ruiz, Begüm Demir, Volker Markl

Today, interoperability among EO exploitation platforms is almost nonexistent as most of them rely on a heterogeneous set of technologies with varying interfaces and data formats. Thus, it is crucial to enable cross-platform (federated) analytics to make EO technology easily accessible to everyone. We envision AgoraEO, an EO ecosystem for sharing, finding, composing, and executing EO assets, such as datasets, algorithms, and tools. Making AgoraEO a reality is challenging for several reasons, the main ones being that the ecosystem must provide interactive response times and operate seamlessly over multiple exploitation platforms. In this paper, we discuss the different challenges that AgoraEO poses as well as our ideas to tackle them. We believe that having an open, unified EO ecosystem would foster innovation and boost EO-data literacy for the entire population.

In proceedings of BiDS 2021

BIFOLD welcomes the first six Junior Fellows


The Berlin Institute for the Foundations of Learning and Data is very pleased to announce the first six BIFOLD Junior Fellows. They were selected for the excellence of their research and are already well-established researchers in the computer sciences. In addition, their research interests show exceptional potential for BIFOLD’s research goals, either by combining machine learning and data management or by bridging the two disciplines and other research areas. The first six Junior Fellows will cover a broad range of research topics during their collaboration with BIFOLD.

LTR: Dr. Kaustubh Beedkar, Dr. Jan Hermann, Dr. Marina Marie-Claire Höhne, Dr. Danh Le Phuoc, Dr. Kristof Schütt, Dr. Eleni Tzirita Zacharatou (© BIFOLD)

These excellent researchers will receive mentoring from a BIFOLD Research Group Leader or Fellow, as well as opportunities to mentor a graduate school student and additional funding to support their research activities. Their research in collaboration with BIFOLD will advance the sciences in the following areas:

Dr. Kaustubh Beedkar:
My research interest lies in exploring efficient and effective tools and methods for compliant geo-distributed data analytics. In contrast to traditional means, my work focuses on building data processing frameworks that enable decentralized data analytics. In particular, I research how to integrate into data processing frameworks the legal constraints arising from regulatory bodies concerning data sovereignty and data movement, as well as disparate compute infrastructures at multiple sites. For example, processing data generated by autonomous cars in three different geographies, such as Europe, North America, and Asia, may be subject to different regulations: there may be legal requirements that only aggregated, anonymized data may be shipped out of Europe and no data whatsoever may be shipped out of Asia. In my research, I explore how to specify and enforce compliance with these types of constraints while generating efficient query execution plans.
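The kind of shipment constraint described above can be made concrete with a small sketch. The rule table and function names below (`ShipmentRule`, `is_compliant`) are purely illustrative and not taken from an actual framework; a real planner would prune any candidate execution plan that contains a non-compliant data transfer.

```python
# Illustrative sketch of compliance checking for geo-distributed analytics.
# All names here are hypothetical, invented for this example.
from dataclasses import dataclass


@dataclass(frozen=True)
class ShipmentRule:
    source_region: str        # region where the data resides
    dest_region: str          # region the data would be shipped to
    allowed_forms: frozenset  # data forms that may cross this border


RULES = [
    # Only aggregated or anonymized data may leave Europe for North America.
    ShipmentRule("EU", "NA", frozenset({"aggregated", "anonymized"})),
    # No data whatsoever may be shipped out of Asia.
    ShipmentRule("Asia", "NA", frozenset()),
    ShipmentRule("Asia", "EU", frozenset()),
]


def is_compliant(source: str, dest: str, data_form: str) -> bool:
    """Return True if shipping data of the given form from source to dest is allowed."""
    for rule in RULES:
        if rule.source_region == source and rule.dest_region == dest:
            return data_form in rule.allowed_forms
    return True  # no matching rule: the transfer is unrestricted


# A planner would reject any plan that ships raw records out of the EU:
assert is_compliant("EU", "NA", "aggregated")
assert not is_compliant("EU", "NA", "raw")
assert not is_compliant("Asia", "NA", "anonymized")
```

In a query optimizer, such a check would run while enumerating candidate plans, so that only compliant plans compete on execution cost.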

“The BIFOLD junior fellowship offers excellent opportunities in the form of seminars, training, and mentorship from world-renowned researchers, to hone my research skills in order to pursue a successful career in academia.”

Dr. Jan Hermann:
Many modern technologies depend on our ability to design novel molecules and materials. This design can be greatly assisted by computer simulations of the chemical and physical properties of these novel compounds. Typical realistic simulations require a whole hierarchy of different computational methods. One central component of this hierarchy is the class of methods that can directly model the behavior of electrons in molecules and materials. Currently, such methods are limited either by computational efficiency or by accuracy. Recently, machine learning techniques have been successfully used across the physical sciences, but not to target the direct modeling of electrons. My research attempts to fill this gap through tight integration of machine learning into existing methods for the simulation of electrons in molecules and materials. The goal is to lift the restrictions on the use of these methods by broadening the scope of materials for which they can be effectively used.

“BIFOLD offers the opportunity to cooperate with other experts at the intersection of computer sciences and the physical sciences. Through mentoring opportunities in the BIFOLD junior fellowship program, I also hope to inspire the next generation of scientists to engage in machine-learning-based research in the physical sciences.”

Dr. Marina Marie-Claire Höhne:
In order to bring the positive potential of artificial intelligence (AI) into real applications, for example in cancer diagnostics, where AI machines detect cancer cells within milliseconds, we need to understand the machine’s decisions. My research, together with my junior research group UMI Lab (Understandable Machine Intelligence), aims to understand highly complex learning machines and contributes to the field of explainable AI. In particular, I develop methods that enable a holistic understanding of AI models and, with the help of Bayesian networks, additionally transfer the prediction uncertainties from the AI model to the explanation.

“For me, BIFOLD is one of the crucial building blocks that contribute to research at the level of excellence. In my opinion, the accumulation of expert knowledge from different areas of AI is the decisive criterion for attaining a higher level of knowledge, which can lead to a holistic understanding of AI and is important in order to exploit the enormous potential of AI innovations.”

Dr. Danh Le Phuoc:
The core of my research interest centers on the problem of how to autonomously fuse different types of sensory streams to build high-level perception, planning and control components for robots, drones and autonomous vehicles. My approach to this problem is to leverage common-sense and domain knowledge to guide the learning and reasoning tasks of such data fusion operations. The common-sense and domain knowledge are used to emulate the reasoning ability of a human. To this end, my research mission in BIFOLD is to develop scalable reasoning and learning algorithms and systems that combine the best of both worlds, neural networks and symbolic reasoning, into so-called neural-symbolic fusion frameworks.

“The BIFOLD network will facilitate my collaboration with BIFOLD experts in machine learning, database and large-scale processing to push the frontiers of the neural-symbolic research which is among the emerging key foundations of data processing and learning.”

Dr. Kristof Schütt:
Finding molecules and materials with specific chemical properties is important for progress in many technological applications, such as drug design or the development of efficient batteries and solar cells. In my research, I aim to accelerate this process by using machine learning for the discovery, simulation and understanding of these chemical systems. To achieve this, I study neural networks that predict the chemical properties of molecules, or even directly generate molecules that possess a desired property.

“BIFOLD brings researchers from many fields together. Working interdisciplinarily at the intersection of machine learning and quantum chemistry, I enjoy the exchange with other disciplines, as it always results in new ideas and perspectives. Machine learning methods can often be applied to similar problems in quite different applications. Therefore, you can both learn from and contribute to many fields as a machine learning researcher – for which BIFOLD provides the ideal environment.”

Dr. Eleni Tzirita Zacharatou:
Spatial data is at the heart of any human activity. Billions of mobile devices, cars, social networks, satellites, sensors, scientific simulations, and many other sources produce spatial data constantly. My research aims to provide efficient tools to store, process, and manage the wealth of spatial data available today, thereby enabling discoveries, better services, and new products. To that end, I apply a broad portfolio of techniques, from efficient use of modern hardware to approximation algorithms and workload-driven adaptation. My goal within BIFOLD is to create an efficient data ecosystem that brings the benefits of spatial big data analytics to a broad community, helps to boost data literacy, and contributes to citizen science.

“BIFOLD offers great collaboration opportunities in data management, machine learning, and earth observation that can help me advance my research. In addition, the BIFOLD junior fellowship program allows me to evolve my mentoring skills and provides further training opportunities, thereby contributing to the successful development of my academic career.”

New cutting-edge IT Infrastructure


A future-proof IT infrastructure is increasingly becoming a decisive competitive factor – this applies not only to companies, but especially to research. In recent months, BIFOLD has been able to invest around 1.8 million euros in new research hardware, thereby significantly increasing the institute’s computing capacity. This cutting-edge IT infrastructure was financed by the German Federal Ministry of Education and Research (BMBF). “If we want to continue to conduct world-class research, investments in infrastructure are an important prerequisite,” says BIFOLD Co-Director Prof. Dr. Klaus-Robert Müller.

(© Pixabay)

Current experiments in Machine Learning and Big Data management systems require hardware with very strong computing, storage and data transfer capabilities. The new systems include a specialized computer unit (node) and a full computer cluster, both designed for simultaneous processing of a very large number of parallel workloads (massively parallel processing) with large main memory capacities, as well as one cluster particularly suitable for the fast processing of sequential workloads. The central processing units (CPUs) in the latter cluster also support the so-called Intel Software Guard Extensions technology, enabling developers to create and execute code and data in a secure environment. The servers run high-performance file systems and will allow for the transfer of very large data volumes with low latency. “We expect that this cutting-edge hardware will not only enrich our own research, but also enable us to establish new collaborations with our partners,” adds BIFOLD Co-Director Prof. Dr. Volker Markl.

In Volker Markl’s group, two projects in particular benefit from the new possibilities: Agora is a novel kind of data management system. It aims to construct an innovative unified ecosystem that brings together data, algorithms, models, and computational resources and provides them to a broad audience. The goal is the easy creation and composition of data science pipelines as well as their scalable execution. In contrast to existing data management systems, Agora operates in a heavily decentralized and dynamic environment.
The NebulaStream platform is a general-purpose, end-to-end data management system for the IoT. It provides an out-of-the-box experience with rich data processing functionalities and a high ease of use. With the new IT infrastructure, both of these systems can be validated at a much larger scale and in a secure data processing environment.

High memory and parallel processing capabilities are also essential for large-scale Machine Learning simulations, e.g. solving high-dimensional linear problems, or training deep neural networks. Klaus-Robert Müller and his group will use the new hardware initially in three different projects: Specifically, it allows BIFOLD researchers to produce closed-form solutions of large dense linear systems, which are needed to describe correlations between large amounts of interacting particles in a molecule with high numerical precision. Researchers can also significantly extend the number of epigenetic profile regions that can be analyzed, thereby using significantly more information available in the data. It will also enable scientists to develop explainable AI techniques that incorporate internal explainability and feedback structures, and are significantly more complex to train than typical deep neural networks.

Specifications of the new hardware in the DIMA group

The first cluster consists of 60 servers, each with two processors (CPUs) with 16 processor cores at 2.1 GHz, 512 GB of RAM, 12 TB of HDD storage, and approximately 2 TB of additional fast SSD storage. They are thus designed for simultaneous processing of a very large number of parallel workloads (massively parallel processing). The second cluster consists of 15 identical servers. Each has a CPU with eight 3.4 GHz processor cores, 128 GB of RAM and a combination of 12 TB of HDD storage and 2 TB of SSD storage. They are particularly suitable for the fast processing of sequential workloads. Furthermore, these CPUs support the Intel Software Guard Extensions, enabling developers to develop and execute code and data in an environment secured by the CPU.

The performance of the server system is upgraded with two high-performance nodes, each with three Nvidia A100 graphics processing units (GPUs), two 64-core CPUs, 2 TB of main memory and over 22 TB of HDD storage, capable of handling data analytics and AI applications on very large data sets.

Both clusters are also managed by two head nodes, which have advanced hardware features to improve fault tolerance. All systems have 100 Gbit/s InfiniBand cards and are connected via two 100 Gbit/s InfiniBand switches, enabling very fast data exchange between cluster hosts. The servers will use the Hadoop Distributed File System (HDFS), which supports Big Data analytics and will enable high-performance access to data.

Specifications of the new hardware in the Machine Learning group

The existing cluster has been upgraded with 13 additional nodes. Twelve nodes have 786 GB of main memory each and four Nvidia A100 GPUs connected via a 200 Gbit/s InfiniBand network. This setup allows the simulation and computation of very large models. One special node runs with 6 TB of RAM and 72 processor cores, which enables massively parallel computing while holding very large models in main memory.

The distributed high-performance computing file system BeeGFS was expanded with three more file servers. 437 TB of storage capacity are distributed across six data servers, which are connected to the network with a 40 Gbit/s connection. All nodes are connected with at least 25 Gbit/s. Overall, this setup is capable of handling operations on very large amounts of data.

New BIFOLD Research Groups established


The Berlin Institute for the Foundations of Learning and Data (BIFOLD) set up two new Research Training Groups, led by Dr. Stefan Chmiela and Dr. Steffen Zeuch. The goal of these new research units at BIFOLD is to enable junior researchers to conduct independent research and prepare them for leadership positions. Initial funding includes their own position as well as two PhD students and/or research associates for three years.

One of the new Research Training Groups at BIFOLD led by Dr. Steffen Zeuch focuses on a general purpose, end-to-end data management system for the IoT.
(© Pixabay)

Steffen Zeuch is interested in how to overcome the data management challenges that the growing number of Internet of Things (IoT) devices brings: “Over the last decade, the amount of produced data has reached unseen magnitudes. Recently, the International Data Corporation estimated that by 2025, the global amount of data will reach 175 zettabytes (ZB) and that 30 percent of this data will be gathered in real-time. In particular, the number of IoT devices is increasing exponentially, such that the IoT is expected to grow to as many as 20 billion connected devices by 2025.” The explosion in the number of devices will create novel data-driven applications in the near future. These applications require low latency, location awareness, wide-spread geographical distribution, and real-time data processing on potentially millions of distributed data sources.

Dr. Steffen Zeuch
(© Steffen Zeuch)

“To enable these applications, a data management system needs to leverage the capabilities of IoT devices outside the cloud. However, today’s classical data management systems are not ready yet for these applications as they are designed for the cloud,” explains Steffen Zeuch. “The focus of my research lies in introducing the NebulaStream Platform – a general purpose, end-to-end data management system for the IoT.”

Stefan Chmiela concentrates on so-called many-body problems. This broad category of physical problems deals with systems of interacting particles, with the goal of accurately characterizing their dynamic behavior. These types of problems arise in many disciplines, including quantum mechanics, structural analysis and fluid dynamics, and generally require solving high-dimensional partial differential equations. “In my research group we will particularly focus on problems from quantum chemistry and condensed matter physics, as these fields of science rank among the most computationally intensive,” explains Stefan Chmiela. In these systems, highly complex collective behavior emerges from relatively simple physical laws for the motion of each individual particle. Because of this, the simulation of high-dimensional many-body problems demands enormous computational resources. There is a limit to how much computational efficiency can be gained through rigorous mathematical and physical approximations, yet fully empirical solutions are often too simplistic to be sufficiently predictive.

Dr. Stefan Chmiela
(© Stefan Chmiela)

The lack of simultaneously accurate and efficient approaches makes many phenomena hard to model reliably. “Reconciling these two contradictory aspects of accuracy and computational speed is our goal,” states Stefan Chmiela. “Our idea is to incorporate readily available fundamental prior knowledge into modern learning machines. We leverage conservation laws, which can be derived for many symmetries of physical systems, in order to increase the model’s ability to be accurate with less data.”

BIFOLD Graduate School launches its first cohort with 12 PhD candidates


From top row LTR: Dr. Manon Grube, Dr. Tina Schwabe, Leon Klein, Gabriel Dernbach, Dominik Scheinert, Anastasiia Kozar, Lennart Behme, Niklas Gebauer, Hannah Marienwald, Kirill Bykov, Prof. Dr. Klaus-Robert Müller, Prof. Dr. Begüm Demir, Prof. Dr. Volker Markl, Björn Deiseroth.
Not in the picture: Leila Arras, Georgii Mikriukov, Mona Rams.

In time for the summer semester 2021, the Berlin Institute for the Foundations of Learning and Data (BIFOLD) announced the launch of its Graduate School (GS): 12 PhD students from France, Russia and Germany, among them four women, make up the first cohort. The scholarship holders have obtained their master’s degrees in physics, computer science or bioinformatics; two of them are currently researching at Freie Universität Berlin, one at Universität Potsdam and nine at Technische Universität Berlin.

“The Corona Pandemic didn’t make it any easier for the Graduate School to start. In the on-going seminar series, our PhD candidates each present their research topics. This will be followed by a course in good scientific practice in May. Later on we plan to offer courses on technical writing, presentation techniques, interdisciplinary teamwork as well as seminars on advanced scientific topics,” says Dr. Tina Schwabe, who together with Dr. Manon Grube, is responsible for the program of the newly founded BIFOLD GS. The second cohort will follow in October 2021.

The goal of the GS is to educate students in critical competence areas of Big Data (BD) systems and Machine Learning (ML); to enable them to devise novel Artificial Intelligence (AI) and Data Science (DS) solutions to scientific problems in an interdisciplinary environment; and to equip them with the necessary technical skills and scientific capabilities. All students are expected to finish their PhD within four years. In the near future, the GS will offer a fast-track program that will allow outstanding Bachelor graduates to complete a newly designed, research-oriented Master’s degree en route to the PhD.

“We are very happy to finally welcome the first cohort of PhD candidates at BIFOLD,” says Prof. Dr. Volker Markl, one of BIFOLD’s two directors, welcoming the new PhD students. “We are convinced that excellent training of future Data Science and Machine Learning experts is an important investment in the future,” adds Prof. Dr. Klaus-Robert Müller, co-director of BIFOLD.

Using Math to Reduce Energy Consumption


Prof. Dr. Klaus-Robert Müller
(© Christian Kielmann)

Klaus-Robert Müller, professor of Machine Learning at TU Berlin and Co-Director of the Berlin Institute for the Foundations of Learning and Data (BIFOLD), discusses computation time as a climate killer and his predictions for science in 80 years.

Professor Müller, in our conversation prior to this interview about your vision for the future of the computer on the 80th anniversary of the invention of the Z3, you mentioned energy conservation as one of the major challenges we face. Why is this?

The world’s computer centers are major emitters of CO2. Huge amounts of fossil energy are still being used to power them. More and more calculations are performed, and the computation time required for these is increasing. It is not enough for us to go on Fridays for Future marches. We all have to try to do something in the areas where we have direct influence.

So, the work of your research group focuses directly on this topic?

Yes, but even more so our research at the Berlin Institute for the Foundations of Learning and Data, or BIFOLD for short, which was set up in 2020 as a part of the federal government’s AI strategy.

Where do you see possible solutions to significantly reduce the energy consumption of computer centers?

Solving a known image recognition problem uses about as much energy as a four-person household over a period of three months. One approach is to save computation time by using a different mathematical method. This could reduce energy consumption to the level of a four-person household for two months while achieving the same result. A greater saving would of course be better. We need to develop energy-saving methods of computing for AI. Data traffic for distributed learning requires a great deal of energy, so we are also looking to minimize this. My team has been able to demonstrate how smart mathematical solutions can reduce the requirement for data transfer from 40 terabytes to 5 or 6 gigabytes. Getting a good result is no longer the only issue; how you achieve that result is becoming increasingly important.
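One common family of techniques behind savings of this kind is gradient compression, where each worker in distributed learning transmits only the few largest components of its update instead of the full dense vector. The NumPy sketch below shows top-k sparsification as a generic illustration; it is not the specific method developed by the group.

```python
import numpy as np


def sparsify_topk(gradient: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of the gradient.
    Instead of the full dense vector, only (index, value) pairs are sent."""
    idx = np.argsort(np.abs(gradient))[-k:]
    return idx, gradient[idx]


def densify(idx: np.ndarray, values: np.ndarray, size: int) -> np.ndarray:
    """Reconstruct a dense vector from the transmitted sparse update."""
    out = np.zeros(size)
    out[idx] = values
    return out


g = np.array([0.01, -2.0, 0.003, 1.5, -0.02])
idx, vals = sparsify_topk(g, k=2)
restored = densify(idx, vals, g.size)
# Only the two dominant components (-2.0 and 1.5) are transmitted,
# shrinking the payload from five floats to two index/value pairs.
```

In practice, top-k compression is usually combined with error feedback, i.e. accumulating the dropped residual locally, so that small gradient components are not lost forever.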

What for you were the most important milestones in the development of the computer over the past 80 years?

For me, it all began with Konrad Zuse and the Z3. I am fascinated how this computer with its three arithmetic operations and a memory of just 64 words was able to give rise to the supercomputer. In the 1950s and 60s, some people were still able to perform calculations faster than computers. At the beginning of the 90s, around the time I received my doctorate, the first workstations became available. These marked the end of the time when you had to log on to a mainframe computer. In 1994, while working as a postdoc in the USA, I had the opportunity to perform calculations on such a supercomputer, the Connection Machine CM5. The most recent major step is graphics processing units, or GPUs for short. These graphics processors not only allow you to have a mini supercomputer at your daily disposal for a small cost, their architecture also makes them ideal for machine learning and training large neural network models. This has led to many new scientific developments, which today form part of our lives. It really fascinates me how we have progressed in such a short time from a situation where people could perform calculations faster than a computer to today, where I have a supercomputer under my desk. Although supercomputers aren’t everything.

How do you mean?

Three decades ago, I published a paper with another student on the properties of a neural network. There was another researcher working on the same topic who, unlike us, had access to a Cray supercomputer. We had to perform our calculations on a workstation. Well, we adapted our algorithm to this hardware and were able to achieve the same results with a simple computer as our colleague with access to the Cray X-MP. This was greeted with amazement in our field. What I am getting at is that you can sometimes achieve good results with simpler equipment if you use a little more creativity.

This year marks the 80th anniversary of the invention of the computer. Are you able to predict what may be possible in the area of machine learning, in other words your area of research, within the next 80 years?

What I would say is that machine learning will become a standard tool in industry and science, for the humanities as well as natural sciences and medicine. To make this happen, we have to now train a generation of researchers to not only use these tools but also understand their underlying principles so as to prevent improper use of machine learning and thus false scientific findings. This includes an understanding of big data, as the data volumes required in science are becoming ever larger. These two areas – machine learning and big data – will become more and more closely connected with each other as well as with their areas of application. And this brings me back to BIFOLD: We see both areas as a single entity linked to its applications and it is precisely on this basis that we have now started to train a new generation of researchers.

Interview: Sybille Nitsche

ICDE 2021 honors BIFOLD researchers with Best Paper Award


The 37th IEEE International Conference on Data Engineering (ICDE) 2021 honored the paper “Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance” by six BIFOLD researchers with the Best Paper Award. Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz and Volker Markl were honored during the award session of the conference on April 21, 2021.

In modern data analytics, companies often want to analyze large datasets. For example, a company might want to analyze its entire network of user interactions in order to better understand how its products are used. Scaling data analysis to large datasets is a widespread need in many different contexts. Modern dataflow systems, such as Apache Flink and Apache Spark, are widely used to meet that need. But the algorithms used for data analysis are getting more and more complex. Complex algorithms are often iterative in nature, meaning that they gradually refine the results by repeated execution of a computation. A well-known example is the PageRank algorithm, which is used for ranking the importance of nodes in a network, for example ranking websites in Google search results. Both Apache Flink and Apache Spark have weaknesses when implementing iterative algorithms: they are either hard to use or have suboptimal performance.
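To illustrate the iterative nature of such algorithms, here is a minimal, single-machine PageRank in NumPy; the damping factor 0.85 is the conventional choice, not specific to the paper. Each pass refines the previous ranks, and it is exactly this repeated computation over the full dataset that dataflow systems must orchestrate efficiently at scale.

```python
import numpy as np


def pagerank(adj: np.ndarray, damping: float = 0.85, tol: float = 1e-9) -> np.ndarray:
    """Iteratively refine node ranks until they converge.
    Convention: adj[i, j] = 1 if node j links to node i."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=0)
    out_degree[out_degree == 0] = 1.0        # guard against dangling nodes
    transition = adj / out_degree            # column-stochastic link matrix
    ranks = np.full(n, 1.0 / n)              # start from a uniform distribution
    while True:
        new_ranks = (1 - damping) / n + damping * (transition @ ranks)
        if np.abs(new_ranks - ranks).sum() < tol:   # converged?
            return new_ranks
        ranks = new_ranks


# Tiny 3-node cycle (0 -> 1 -> 2 -> 0): by symmetry, every node
# ends up with the same rank, 1/3.
adj = np.array([[0, 0, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
ranks = pagerank(adj)
```

Note that every iteration reads the whole dataset again; in a distributed setting, launching a fresh dataflow job per iteration is exactly the overhead that native iteration support is meant to avoid.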

This paper introduces a new system, which combines an easy-to-use language with efficient execution. It is able to keep the language simple by relying on techniques from the programming language research literature, in addition to the database and distributed systems research literature, which earlier systems relied on. The simpler language makes it easy for users to run advanced analytics on large datasets. This is important for data scientists, who can then concentrate on the analytics instead of needing to become experts on the internal workings of the systems.

The annual IEEE International Conference on Data Engineering (ICDE) is the flagship IEEE conference addressing research issues in designing, building, managing, and evaluating advanced data-intensive systems and applications. For over three decades, IEEE ICDE has been a leading forum for researchers, practitioners, developers, and users to explore cutting-edge ideas and to exchange techniques, tools, and experiences.

The paper in detail:
“Efficient Control Flow in Dataflow Systems: When Ease-of-Use Meets High Performance”

Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Lorand Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, Volker Markl

Modern data analysis tasks often involve control flow statements, such as iterations. Common examples are PageRank and K-means. To achieve scalability, developers usually implement data analysis tasks in distributed dataflow systems, such as Spark and Flink. However, for tasks with control flow statements, these systems still either suffer from poor performance or are hard to use. For example, while Flink supports iterations and Spark provides ease-of-use, Flink is hard to use and Spark has poor performance for iterative tasks. As a result, developers typically have to implement different workarounds to run their jobs with control flow statements in an easy and efficient way. We propose Mitos, a system that achieves the best of both worlds: it achieves both high performance and ease-of-use. Mitos uses an intermediate representation that abstracts away specific control flow statements and is able to represent any imperative control flow. This facilitates building the dataflow graph and coordinating the distributed execution of control flow in a way that is not tied to specific control flow constructs. Our experimental evaluation shows that the performance of Mitos is more than one order of magnitude better than systems that launch new dataflow jobs for every iteration step. Remarkably, it is also up to 10.5 times faster than Flink, which has native iteration support, while matching the ease-of-use of Spark.

To be published in the Proceedings of the 37th IEEE International Conference on Data Engineering (ICDE 2021), April 19–22, 2021

BTW 2021 Best Paper Award and Reproducibility Badge for TU Berlin Data Science Publication


The research paper “Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing” by Alexander Kumaigorodski, Clemens Lutz, and Volker Markl received the Best Paper Award of the 19th Symposium on Database Systems for Business, Technology and Web (BTW 2021). In addition, the paper received the Reproducibility Badge, awarded for the first time at BTW 2021, for the high reproducibility of its results.

TU Berlin Master’s graduate Alexander Kumaigorodski and his co-authors from Prof. Dr. Volker Markl’s Department of Database Systems and Information Management (DIMA) at TU Berlin and from the Intelligent Analytics for Massive Data (IAM) research area at the German Research Center for Artificial Intelligence (DFKI) present a new approach that speeds up the loading and processing of tabular CSV data by orders of magnitude.

CSV is a very frequently used format for the exchange of structured data. For example, the City of Berlin publishes its structured datasets in CSV format in the Berlin Open Data Portal. Such datasets can be imported into databases for analysis. Accelerating this import allows users to handle the growing amount of data and to reduce the time required for analysis. Each new generation of computer networks and storage media provides higher bandwidth and allows for faster reading times. However, current loading and processing approaches using main processors (CPUs) cannot keep up with these hardware technologies and unnecessarily throttle loading times.
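The gap between software parsing and hardware bandwidth is easy to observe. The following rough sketch benchmarks single-core CSV parsing with Python’s standard `csv` module on a synthetic in-memory buffer (the row layout and row count are made up for the example); the throughput it reports is typically far below the 100 Gbit/s (≈12.5 GB/s) that modern I/O devices sustain.

```python
# Rough single-core CSV parsing benchmark: parse an in-memory CSV
# buffer with Python's csv module and report throughput in MB/s,
# illustrating how far software parsing lags behind I/O bandwidth.
import csv
import io
import time

def parse_throughput(n_rows=100_000):
    buf = "".join(f"{i},value_{i},{i * 0.5}\n" for i in range(n_rows))
    start = time.perf_counter()
    rows = sum(1 for _ in csv.reader(io.StringIO(buf)))
    elapsed = time.perf_counter() - start
    mb_per_s = len(buf.encode()) / elapsed / 1e6
    return rows, mb_per_s

rows, mbps = parse_throughput()
print(f"parsed {rows} rows at {mbps:.0f} MB/s")
```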

© Alexander Kumaigorodski

The procedure described in the paper takes a new approach in which CSV data is read and processed by graphics processors (GPUs) instead. The advantage of graphics processors lies primarily in their massive parallel computing power and fast memory access. With this approach, new hardware technologies such as NVLink 2.0 or InfiniBand with Remote Direct Memory Access (RDMA) can be fully exploited. As a result, CSV data can be read directly from main memory or the network and processed at multiple gigabytes per second.
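What makes CSV parsing hard to parallelize is its context sensitivity: a comma is a field separator only when it appears outside quotes, so the meaning of each byte depends on all the quote characters before it. The sequential sketch below shows that quote-toggling logic in miniature; it is a simplified illustration, not the paper’s GPU algorithm, which streamlines exactly this control flow so that chunks of input can be processed in parallel.

```python
# A quote only matters relative to the quotes before it: this toggle
# of `in_quotes` is the context-sensitive control flow that a GPU
# parser must restructure to process the input in parallel.

def split_csv_line(line):
    fields, field, in_quotes = [], "", False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes   # quote toggles the parsing context
        elif ch == "," and not in_quotes:
            fields.append(field)        # separator only counts outside quotes
            field = ""
        else:
            field += ch
    fields.append(field)
    return fields

print(split_csv_line('1,"Berlin, Germany",42'))
# → ['1', 'Berlin, Germany', '42']
```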

The transparency of the tests performed and the independent confirmation of the results also led to the award of the first-ever BTW 2021 Reproducibility Badge. In the data science community, the reproducibility of research results is becoming increasingly important. It serves to verify results as well as to compare them with existing work and is thus an important aspect of scientific quality assurance. Leading international conferences have therefore already devoted special attention to this topic.

To ensure high reproducibility, the authors provided the reproducibility committee with source code, additional test data, and instructions for running the benchmarks. The execution of the tests was demonstrated in a live session and could then also be successfully replicated by a member of the committee. The Reproducibility Badge recognizes above all the good scientific practice of the authors.

The paper in detail:
“Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing”

Alexander Kumaigorodski, Clemens Lutz, Volker Markl

Comma-separated values (CSV) is a widely-used format for data exchange. Due to the format’s prevalence, virtually all industrial-strength database systems and stream processing frameworks support importing CSV input. However, loading CSV input close to the speed of I/O hardware is challenging. Modern I/O devices such as InfiniBand NICs and NVMe SSDs are capable of sustaining high transfer rates of 100 Gbit/s and higher. At the same time, CSV parsing performance is limited by the complex control flows that its semi-structured and text-based layout incurs. In this paper, we propose to speed-up loading CSV input using GPUs. We devise a new parsing approach that streamlines the control flow while correctly handling context-sensitive CSV features such as quotes. By offloading I/O and parsing to the GPU, our approach enables databases to load CSVs at high throughput from main memory with NVLink 2.0, as well as directly from the network with RDMA. In our evaluation, we show that GPUs parse real-world datasets at up to 60 GB/s, thereby saturating high-bandwidth I/O devices.

K.-U. Sattler et al. (Eds.): Datenbanksysteme für Business, Technologie und Web (BTW 2021), Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2021

Tapping into Nature’s Wisdom


Cellulose biosensors are robust in practice

Electroencephalography (EEG), electrocardiography (ECG), electromyography (EMG) – all of these non-invasive medical diagnostic methods rely on electrodes to measure and record the electrical signals or voltage fluctuations of muscle or nerve cells underneath the skin. Depending on the type of diagnostics, this can be used to measure electrical brain waves or the currents in the heart or muscles. Present methods use metal sensors attached to the skin with a special gel to ensure continuous contact. Researchers at Korea University and Technische Universität Berlin have now developed so-called biosensors made of the plant material cellulose. Not only do they offer better and more durable conductivity than conventional electrodes; they are also 100 percent natural, reusable, biodegradable, and do not cause the skin irritation that other gels can. The paper “Leaf inspired homeostatic cellulose biosensors” has now been published in the renowned journal Science Advances.

The crucial keyword for the new sensors is homeostasis. In biology, this refers to the maintenance of a state of equilibrium. This is how leaves, for example, regulate the osmotic pressure in their cells, i.e. how much water they store. This internal cell pressure depends on the water content of the neighboring cells, but also on the environment (dry or humid), and is constantly readjusted.

“Most people know the feeling of walking through a damp garden with bare feet. Leaves stick to the soles of our feet and simply don’t fall off, even when we move,” explains Professor Klaus-Robert Müller, head of the Machine Learning group at TU Berlin and director of the Berlin Institute for the Foundations of Learning and Data (BIFOLD). “The reason that leaves cling so effectively to our skin is due to the swelling properties of cellulose, the material that the cell walls of plants are made of, and is based on the principle of homeostasis.”

Sensors modeled on the structure of leaves

The image shows the structure of the biosensors, which is based on the leaf structure.
(© Korea University)

Until now, cellulose has mainly served as a material for synthesis or filtration. Because cellulose itself is not conductive, it seemed unsuitable as a potential electrode material. When cellulose fibers are placed in salty water, however, they swell and exhibit excellent electrical conduction properties.

Inspired by the structure of leaves, the researchers have developed, analyzed and tested biosensors consisting of two layers of cellulose fibers that resemble the leaf structure and can be saturated with salt water. On top of the cellulose material lies a carrier membrane which in turn docks to a metal electrode with a cable.

Recyclable, skin-friendly and biodegradable

During an electroencephalography (EEG), electrodes measure and record electrical signals or voltage fluctuations of nerve cells underneath the skin. (© Pixabay)

“These sensors showed continuously high-quality electrophysiological signals in various applications such as EEG, EMG, and ECG. They adhere excellently to different skin types – without the need for a synthetic gel. They also demonstrate good adhesion properties under stress, for example with sweating or moving test subjects,” explains Müller. Furthermore, these sensors feature a high transmission quality, low electrical resistance (impedance) and little resistance variance during long-term measurements.

The researchers have already tested the sensors in various application scenarios and on different skin types. “We were also able to demonstrate the versatility and robustness of the biosensors in combination with machine learning algorithms that were tested in challenging real-world situations. Tests have been conducted with test persons riding a bicycle or playing a computer game with a brain-computer interface, meaning the subjects were moving during the measurement, which can potentially generate artifacts,” says Müller.

Other advantages of the biosensors: They allow mass production in a simple and cost-effective process, are recyclable, skin-friendly and biodegradable. Klaus-Robert Müller is convinced: “These homeostatic cellulose biosensors are suitable for a broad range of clinical and non-clinical applications.”

The publication in detail:
“Leaf inspired homeostatic cellulose biosensors”

Ji-Yong Kim, Yong Ju Yun, Joshua Jeong, C.-Yoon Kim, Klaus-Robert Müller and Seong-Whan Lee

An incompatibility between skin homeostasis and existing biosensor interfaces inhibits long-term electrophysiological signal measurement. Inspired by the leaf homeostasis system, we developed the first homeostatic cellulose biosensor with functions of protection, sensation, self-regulation, and biosafety. Moreover, we find that a mesoporous cellulose membrane transforms into homeostatic material with properties that include high ion conductivity, excellent flexibility and stability, appropriate adhesion force, and self-healing effects when swollen in a saline solution. The proposed biosensor is found to maintain a stable skin-sensor interface through homeostasis even when challenged by various stresses, such as a dynamic environment, severe detachment, dense hair, sweat, and long-term measurement. Last, we demonstrate the high usability of our homeostatic biosensor for continuous and stable measurement of electrophysiological signals and give a showcase application in the field of brain-computer interfacing where the biosensors and machine learning together help to control real-time applications beyond the laboratory at unprecedented versatility.

Science Advances 7(16), eabe7432

Further information is available from:

Prof. Dr. Klaus-Robert Müller
TU Berlin
Machine Learning
Tel.: 030 314-78621

New workshop series “Trustworthy AI”


The AI for Good Global Summit is a year-round digital event featuring a weekly program of keynotes, workshops, interviews and Q&As. BIFOLD Fellow Dr. Wojciech Samek, head of the Artificial Intelligence department at Fraunhofer Heinrich Hertz Institute (HHI), is launching a new online workshop series, “Trustworthy AI”, on this platform.

The AI for Good series is the leading action-oriented, global and inclusive United Nations platform on Artificial Intelligence (AI). The summit is organized year-round and fully online by the Geneva-based International Telecommunication Union (ITU) – the United Nations specialized agency for information and communication technologies. The goal of the AI for Good series is to identify practical applications of AI and scale those solutions for global impact.

“AI systems have steadily grown in complexity, gaining predictivity often at the expense of interpretability, robustness and trustworthiness. Deep neural networks are a prime example of this development. While reaching ‘superhuman’ performances in various complex tasks, these models are susceptible to errors when confronted with tiny, adversarial variations of the input – variations which are either not noticeable or can be handled reliably by humans”

Dr. Wojciech Samek

The workshop series will discuss these challenges of current AI technology and will present new research aiming at overcoming these limitations and developing AI systems which can be certified to be trustworthy and robust.

The workshop series will cover the following topics:

  • Measuring Neural Network Robustness
  • Auditing AI Systems
  • Adversarial Attacks and Defences
  • Explainability & Trustworthiness
  • Poisoning Attacks on AI
  • Certified Robustness
  • Model and Data Uncertainty
  • AI Safety and Fairness
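The adversarial attacks listed above can be illustrated with a deliberately tiny example: for a fixed logistic-regression “model”, nudging each input feature against the sign of its weight (the idea behind gradient-sign attacks such as FGSM) flips the prediction even though the input changes only slightly. The weights, input, and step size below are made up for the illustration, and the step is exaggerated so the effect is visible.

```python
# Toy adversarial perturbation on a fixed logistic-regression model:
# a small gradient-sign step against the predicted class flips the
# prediction, echoing the fragility described in the quote above.
import math

w = [2.0, -3.0]   # toy model weights (assumed, not learned)
b = 0.1

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))   # probability of class 1

x = [1.0, 0.5]          # clean input, confidently classified as class 1
p_clean = predict(x)

# Gradient-sign step: for a linear model the gradient of the class-1
# score w.r.t. the input is just w, so moving each feature against
# sign(w) lowers the score as fast as possible per unit of change.
eps = 0.4
x_adv = [xi - eps * math.copysign(1.0, wi) for xi, wi in zip(x, w)]
p_adv = predict(x_adv)

print(f"clean: {p_clean:.3f}  adversarial: {p_adv:.3f}")
```

With a maximum per-feature change of 0.4, the class-1 probability drops below 0.5 and the predicted label flips.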

The first workshop will be held by Nicholas Carlini, Research Scientist at Google AI, on March 25, 2021, at 5:00 pm CET: “Trustworthy AI: Adversarially (non-)Robust Machine Learning”.
