BIFOLD student Nithish Sankaranarayanan received the “Best Systems Related Master’s Thesis Award” for his work “Efficient Operator Placement for Mutating Query Plans”. He received the award for outstanding scientific and open source contributions in the Big Data Management and Analytics (BDMA) Master’s Programme.
To use resources efficiently in an Internet-of-Things environment, common computation must be shared amongst concurrently executing queries. However, such sharing must not disrupt the execution of running queries. This thesis proposes a novel approach that identifies sharing opportunities and deploys queries in a disruption-free manner.
During the Summer School on Machine Learning for Quantum Physics and Chemistry, held in September 2021 in Warsaw, BIFOLD PhD candidate Kim A. Nicoli received the Best Poster Award. His poster was democratically selected by the participants and the scientific committee as the best among more than 80 contributions. The corresponding paper, “Estimation of Thermodynamic Observables in Lattice Field Theories with Deep Generative Models”, is a joint international effort of several BIFOLD researchers – Kim Nicoli, Christopher Anders, Pan Kessel, and Shinichi Nakajima – together with a group of researchers affiliated with DESY (Zeuthen) and other institutions. The work is published in Physical Review Letters.
“Modeling and understanding the interactions of quarks – fundamental, indivisible subatomic particles that represent the smallest known units of matter – is the main goal of ongoing research in the field of High Energy Physics. Deepening our understanding of such phenomena by leveraging modern machine learning techniques would have important implications in many related fields of applied science and research, such as quantum computing devices, drug discovery and many more.”
Preventing Image-Scaling Attacks on Machine Learning
BIFOLD Fellow Prof. Dr. Konrad Rieck, head of the Institute of System Security at TU Braunschweig, and his colleagues provide the first comprehensive analysis of image-scaling attacks on machine learning, including a root-cause analysis and effective defenses. Konrad Rieck and his team were able to show that attacks on scaling algorithms, like those used in pre-processing for machine learning (ML), can manipulate images unnoticeably, changing their content after downscaling and creating unexpected, arbitrary outputs. “These attacks are a considerable threat, because scaling as a pre-processing step is omnipresent in computer vision,” says Konrad Rieck. The work was presented at the USENIX Security Symposium 2020.
Machine learning is a rapidly advancing field. Complex ML methods not only enable increasingly powerful tools, they also open up new avenues of attack. Research into ML security usually focuses on the learning algorithms themselves, even though the first step of an ML process is the pre-processing of data. In addition to various cleaning and organizing operations on datasets, images are scaled down during pre-processing to speed up the actual learning process that follows. Konrad Rieck and his team were able to show that frequently used scaling algorithms are vulnerable to attacks: it is possible to manipulate input images in such a way that they are indistinguishable from the original to the human eye, but look completely different after downscaling.
The vulnerability is rooted in the scaling process: most scaling algorithms consider only a few highly weighted pixels of an image and ignore the rest. Therefore, only these pixels need to be manipulated to achieve drastic changes in the downscaled image. Most pixels of the input image remain untouched, making the changes invisible to the human eye. In general, scaling attacks are possible wherever downscaling takes place without low-pass filtering – even in video and audio formats. These attacks are model-independent and thus do not require knowledge of the learning model, features, or training data.
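This behaviour is easy to reproduce. The following minimal Python sketch (a simplified illustration, not the optimization-based attack analyzed in the paper) uses plain nearest-neighbour subsampling: an attacker who overwrites only the pixels the scaler actually samples can dictate the downscaled result while leaving the vast majority of the source image untouched:

```python
import numpy as np

def downscale_nearest(img, factor):
    # Nearest-neighbour downscaling keeps only every `factor`-th pixel
    # and ignores the rest -- exactly the property the attack exploits.
    return img[::factor, ::factor]

rng = np.random.default_rng(0)
src = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)  # benign source image
target = np.zeros((64, 64), dtype=np.uint8)                  # attacker-chosen content
factor = 512 // 64

attacked = src.copy()
attacked[::factor, ::factor] = target  # overwrite only the sampled pixels

print(np.array_equal(downscale_nearest(attacked, factor), target))  # True
print(f"{np.mean(attacked != src):.2%} of pixels changed")          # well under 2%
```

A robust scaler would low-pass filter (average over all pixels in each block) before sampling, which is why the defenses identified in the paper matter.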
“Image-scaling attacks can become a real threat in security-related ML applications. Imagine manipulated images of traffic signs being introduced into the learning process of an autonomous driving system! In BIFOLD we develop methods for the effective detection and prevention of modern attacks like these.”
“Attackers don’t need to know the ML training model and can even succeed with image-scaling attacks on otherwise robust neural networks,” says Konrad Rieck. “Based on our analysis, we were able to identify a few algorithms that withstand image-scaling attacks and to introduce a method to reconstruct attacked images.”
Using Machine Learning in the Fight against COVID-19
BIFOLD Fellow Prof. Dr. Frank Noé, who leads the research group AI for the Sciences, together with an international team, identified a potential drug candidate for the therapy of COVID-19. Among other methods, they used deep learning models and molecular dynamics simulations to identify the drug Otamixaban as a potential inhibitor of the human target enzyme that SARS-CoV-2 requires to enter lung cells. According to their findings, Otamixaban works in synergy with other drugs, such as Camostat and Nafamostat, and may present an effective early treatment option for COVID-19. Their work has now been published in Chemical Science.
While the availability of COVID-19 vaccines has brought some relief during the ongoing pandemic, there is still no effective therapy against the virus. One therapeutic approach is to prevent the virus from entering human cells.
In their publication, Frank Noé, who heads an interdisciplinary research unit at Freie Universität Berlin, and his colleagues at FU Berlin, the German Primate Center, the National Center for Advancing Translational Sciences (MD, USA), the Fraunhofer Institute for Toxicology and Experimental Medicine, and Universität Göttingen were able to show that the late-stage drug candidate Otamixaban works as an effective inhibitor of SARS-CoV-2 lung cell entry by suppressing the activity of an enzyme called transmembrane serine protease 2 (TMPRSS2). The SARS-CoV-2 virus uses its so-called spike protein (S-protein) to attach to an enzyme (ACE2) on the surface of a human lung cell. Subsequently, the S-protein is cleaved by the enzyme TMPRSS2, thereby enabling the virus to enter the cell. Inhibiting TMPRSS2 with Otamixaban alone prevents cell entry only weakly, but this inhibitory effect is profoundly amplified when Otamixaban is combined with other known TMPRSS2-inhibiting drugs such as Nafamostat and Camostat.
Frank Noé and his team analyzed the inhibitory effects of Otamixaban in silico, i.e. by machine learning and computer simulation. They combined deep learning methods and molecular dynamics simulations to screen a database of drug-like molecules for potential inhibitors of TMPRSS2. Otamixaban was one of the proposed candidates confirmed to be active in the experimental assay. Subsequently, the Noé group conducted extensive molecular dynamics simulations of the TMPRSS2-Otamixaban complex and applied big data analytics to understand the inhibition mechanism in detail, while in parallel the inhibitory effect of Otamixaban was confirmed in cells and lung tissue.
“The new machine learning methods that we develop at BIFOLD not only help to solve fundamental problems in molecular and quantum physics, they are also increasingly important in application-oriented biochemical research. I believe it is very likely that, if we end up with effective therapy options against COVID-19, machine learning will have played a key role in identifying them.”
Otamixaban, originally developed for other medical conditions, is particularly interesting because it had already reached the third phase of clinical trials for a different indication, potentially shortening the path toward clinical trials of the new formulation presented here. The researchers have filed an EU patent application for the active agent combination.
The publications in detail:
Synergistic inhibition of SARS-CoV-2 cell entry by otamixaban and covalent protease inhibitors: pre-clinical assessment of pharmacological and molecular properties
Authors: Tim Hempel, Katarina Elez, Nadine Krüger, Lluís Raich, Jonathan H. Shrimp, Olga Danov, Danny Jonigk, Armin Braun, Min Shen, Matthew D. Hall, Stefan Pöhlmann, Markus Hoffmann, Frank Noé
Abstract: SARS-CoV-2, the cause of the COVID-19 pandemic, exploits host cell proteins for viral entry into human lung cells. One of them, the protease TMPRSS2, is required to activate the viral spike protein (S). Even though two inhibitors, camostat and nafamostat, are known to inhibit TMPRSS2 and block cell entry of SARS-CoV-2, finding further potent therapeutic options is still an important task. In this study, we report that a late-stage drug candidate, otamixaban, inhibits SARS-CoV-2 cell entry. We show that otamixaban suppresses TMPRSS2 activity and SARS-CoV-2 infection of a human lung cell line, although with lower potency than camostat or nafamostat. In contrast, otamixaban inhibits SARS-CoV-2 infection of precision cut lung slices with the same potency as camostat. Furthermore, we report that otamixaban’s potency can be significantly enhanced by (sub-) nanomolar nafamostat or camostat supplementation. Dominant molecular TMPRSS2-otamixaban interactions are assessed by extensive 109 μs of atomistic molecular dynamics simulations. Our findings suggest that combinations of otamixaban with supplemental camostat or nafamostat are a promising option for the treatment of COVID-19.
Molecular mechanism of inhibiting the SARS-CoV-2 cell entry facilitator TMPRSS2 with camostat and nafamostat
Authors: Tim Hempel, Lluís Raich, Simon Olsson, Nurit P. Azouz, Andrea M. Klingler, Markus Hoffmann, Stefan Pöhlmann, Marc E. Rothenberg, Frank Noé
Abstract: The entry of the coronavirus SARS-CoV-2 into human lung cells can be inhibited by the approved drugs camostat and nafamostat. Here we elucidate the molecular mechanism of these drugs by combining experiments and simulations. In vitro assays confirm that both drugs inhibit the human protein TMPRSS2, a SARS-CoV-2 spike protein activator. As no experimental structure is available, we provide a model of the TMPRSS2 equilibrium structure and its fluctuations by relaxing an initial homology structure with extensive 330 microseconds of all-atom molecular dynamics (MD) and Markov modeling. Through Markov modeling, we describe the binding process of both drugs and a metabolic product of camostat (GBPA) to TMPRSS2, reaching a Michaelis complex (MC) state, which precedes the formation of a long-lived covalent inhibitory state. We find that nafamostat has a higher MC population than camostat and GBPA, suggesting that nafamostat is more readily available to form the stable covalent enzyme–substrate intermediate, effectively explaining its high potency. This model is backed by our in vitro experiments and consistent with previous virus cell entry assays. Our TMPRSS2–drug structures are made public to guide the design of more potent and specific inhibitors.
SIGCOMM 2021 Best Paper: Internet Hypergiants Expand into End-User Networks
BIFOLD Fellow Prof. Dr. Georgios Smaragdakis and his colleagues received the prestigious ACM SIGCOMM 2021 Best Paper Award for their research into the expansion of Hypergiants’ off-nets. They developed a methodology to measure how a few extremely large Internet content providers have deployed more and more servers in end-user networks over recent years. Their findings indicate changes in the structure of the Internet, potentially impacting end-user experience and net neutrality regulations.
An increasing share of the digital content delivered to Internet users originates from a few very large providers, like Google, Facebook or Netflix, the so-called Hypergiants (HGs). In 2007, thousands of autonomous systems (ASes) – e.g. the networks of Internet service providers or universities – were needed to provide 50% of all content. In 2019, only five Hypergiants originated half of the total Internet traffic. To cope with this unprecedented demand, most of these Hypergiants not only increased the capacity of their own networks, but also installed and operated servers, called off-nets, inside other networks. Such off-nets operate closer to the end user and thus accelerate content delivery as well as support applications such as video streaming and edge computing (machine learning, artificial intelligence, and 5G).
In their paper “Seven years in the life of Hypergiants’ off-nets,” Georgios Smaragdakis, Professor of Cybersecurity at TU Delft, and his colleagues from University College London, Microsoft, Columbia University and FORTH-ICS present a methodology to measure the growth of such off-net footprints by analyzing massive public datasets of active scans and server digital certificates (TLS) spanning seven years (2013-2021). By analyzing the ownership of the certificates over time, they were able to track the deployment of Hypergiants’ off-nets around the globe. These Internet analytics are important for understanding how the structure and operation of the Internet and its data flows have changed. For this work the researchers received the prestigious Best Paper Award of the 2021 ACM Special Interest Group on Data Communication (SIGCOMM 2021) conference. SIGCOMM is the flagship conference of the Association for Computing Machinery (ACM) on the topics of Internet architecture and networking.
“Internet infrastructures are the backbone of contemporary communication. Understanding developments in this sector is a key prerequisite for improving end-user experience, security, and privacy. We are very pleased that our efforts to monitor and explain changes in the Internet architectures are internationally recognized.”
“This is the first generic and scalable method to survey this development in the wild. We make publicly available the only extensive collection of data and visualizations that describe such Hypergiant off-net developments over seven years, from 2013 to 2021,” explains Georgios Smaragdakis. He and his colleagues found that large Hypergiants can serve large fractions of the world’s Internet users directly from within the users’ networks. “While the deployment of off-nets can improve end-user performance and the introduction of encryption improves user privacy, our study shows that information about these deployments is leaked and can potentially be misused by adversaries or exploited for business intelligence. In our work we suggest ways to address such issues,” says Georgios Smaragdakis. Prof. Smaragdakis and his colleagues believe that the insights from their data analysis and the release of public data can inform studies in other fields, including economics, political science, and regulation.
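The core inference step of the methodology can be illustrated in a few lines of Python. All records, domain mappings, and AS numbers below are made up for illustration; the actual study processes Internet-wide scans of port 443 and full TLS certificate chains:

```python
# Hypothetical scan records: (server IP, AS number of the hosting network,
# common name on the served TLS certificate). All values are placeholders.
SCAN_RECORDS = [
    ("203.0.113.10", 64500, "*.google.com"),
    ("203.0.113.11", 64500, "*.nflxvideo.net"),
    ("198.51.100.7", 64501, "*.facebook.com"),
    ("198.51.100.8", 64501, "*.example.org"),
]

# Certificate domains attributed to Hypergiants, and illustrative "home"
# AS numbers for each Hypergiant (not the real ASNs).
HYPERGIANT_DOMAINS = {"google.com": "Google", "nflxvideo.net": "Netflix",
                      "facebook.com": "Facebook"}
HOME_ASES = {"Google": 1111, "Netflix": 2222, "Facebook": 3333}

def offnets_by_network(records):
    """Networks serving a Hypergiant certificate outside that Hypergiant's own AS."""
    offnets = {}
    for ip, asn, common_name in records:
        domain = common_name.lstrip("*.")
        hypergiant = HYPERGIANT_DOMAINS.get(domain)
        if hypergiant and asn != HOME_ASES[hypergiant]:
            offnets.setdefault(asn, set()).add(hypergiant)
    return offnets

print(offnets_by_network(SCAN_RECORDS))
# e.g. {64500: {'Google', 'Netflix'}, 64501: {'Facebook'}} (set order may vary)
```

Because certificate ownership is observable in public scan data year after year, grouping such records by network yields the longitudinal off-net footprint the paper describes.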
The publication in detail:
Seven years in the life of Hypergiants’ off-nets
Authors: Petros Gigis, Matt Calder, Lefteris Manassakis, George Nomikos, Vasileios Kotronis, Xenofontas Dimitropoulos, Ethan Katz-Bassett, Georgios Smaragdakis
Abstract: Content Hypergiants deliver the vast majority of Internet traffic to end users. In recent years, some have invested heavily in deploying services and servers inside end-user networks. With several dozen Hypergiants and thousands of servers deployed inside networks, these off-net (meaning outside the Hypergiant networks) deployments change the structure of the Internet. Previous efforts to study them have relied on proprietary data or specialized per-Hypergiant measurement techniques that neither scale nor generalize, providing a limited view of content delivery on today’s Internet. In this paper, we develop a generic and easy to implement methodology to measure the expansion of Hypergiants’ off-nets. Our key observation is that Hypergiants increasingly encrypt their traffic to protect their customers’ privacy. Thus, we can analyze publicly available Internet-wide scans of port 443 and retrieve TLS certificates to discover which IP addresses host Hypergiant certificates in order to infer the networks hosting off-nets for the corresponding Hypergiants. Our results show that the number of networks hosting Hypergiant off-nets has tripled from 2013 to 2021, reaching 4.5k networks. The largest Hypergiants dominate these deployments, with almost all of these networks hosting an off-net for at least one — and increasingly two or more — of Google, Netflix, Facebook, or Akamai. These four Hypergiants have off-nets within networks that provide access to a significant fraction of end user population.
Publication: Petros Gigis, Matt Calder, Lefteris Manassakis, George Nomikos, Vasileios Kotronis, Xenofontas A. Dimitropoulos, Ethan Katz-Bassett, Georgios Smaragdakis: Seven years in the life of Hypergiants’ off-nets. SIGCOMM 2021: 516-533 https://doi.org/10.1145/3452296.3472928
More information is available from:
Prof. Dr. Georgios Smaragdakis
TU Delft – Cybersecurity
Van Mourik Broekmanweg 6
2628 XE Delft
The Netherlands
VLDB2021: BOSS Workshop features Open Source Big Data Systems
BIFOLD researchers will present three full research papers as well as three demo papers at the 47th International Conference on Very Large Data Bases (VLDB 2021), which will take place from August 16 – 20, 2021. In conjunction with VLDB, BIFOLD researchers also co-organize the BOSS 2021 workshop on open source big data systems.
BIFOLD researchers will contribute to VLDB, the leading international conference on the management and analysis of very large datasets, with three full research papers and three demos of their latest database systems research. The paper “Automated Feature Engineering for Algorithmic Fairness”, authored by Ricardo Salazar Diaz, Felix Neutatz, and Ziawasch Abedjan, proposes a highly accurate fairness-aware approach for machine learning. Condor, a high-performance dataflow system that integrates approximate summarizations, is presented in the second paper, “In the Land of Data Streams where Synopses are Missing, One Framework to Bring Them All” by Rudi Poepsel-Lemaitre, Martin Kiefer, Joscha von Hein, Jorge-Arnulfo Quiane-Ruiz, and Volker Markl. In their paper “Scotch: Generating FPGA-Accelerators for Sketching at Line Rate,” Martin Kiefer, Ilias Poulakis, Sebastian Bress, and Volker Markl present Scotch, a novel system for accelerating sketch maintenance using FPGAs that enables faster processing of compressed data. Additionally, two papers from the NebulaStream research program will be presented at VLDB’s VLIoT workshop.
BIFOLD Research Group Lead Dr. Quiané-Ruiz also co-organizes the Big Data Open Source Systems (BOSS) workshop, held in conjunction with VLDB. On August 16, BOSS 2021 will feature tutorials on open source big data systems such as Apache Calcite, Apache Arrow, and Apache AsterixDB, as well as a presentation on Apache Wayang by BIFOLD researcher Dr. Zoi Kaoudi. The highlight of the workshop will be the keynote “Lessons learned from building and growing Apache Spark” by Reynold Xin, co-founder of Databricks and one of the main developers of Apache Spark – one of the most important open source large-scale data analytics engines currently in use.
The publications in detail:
Full Research papers
Automated Feature Engineering for Algorithmic Fairness
Abstract: One of the fundamental problems of machine ethics is to avoid the perpetuation and amplification of discrimination through machine learning applications. In particular, it is desired to exclude the influence of attributes with sensitive information, such as gender or race, and other causally related attributes on the machine learning task. The state-of-the-art bias reduction algorithm Capuchin breaks the causality chain of such attributes by adding and removing tuples. However, this horizontal approach can be considered invasive because it changes the data distribution. A vertical approach would be to prune sensitive features entirely. While this would ensure fairness without tampering with the data, it could also hurt the machine learning accuracy. Therefore, we propose a novel multi-objective feature selection strategy that leverages feature construction to generate more features that lead to both high accuracy and fairness. On three well-known datasets, our system achieves higher accuracy than other fairness-aware approaches while maintaining similar or higher fairness.
Abstract: In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first class citizens, i.e., as pipeline operators. Synopses’ implementation is upon users. This is mainly because of the diversity of synopses, which makes a unified implementation difficult. We present Condor, a framework that supports synopses as first class citizens. Condor facilitates the specification and processing of synopsis-based streaming jobs while hiding all internal processing details. Condor’s key component is its model that represents synopses as a particular case of windowed aggregate functions. An inherent divide and conquer strategy allows Condor to efficiently distribute the computation, allowing for high-performance and linear scalability. Our evaluation shows that Condor outperforms existing approaches by up to a factor of 75x and that it scales linearly with the number of cores.
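Condor's divide-and-conquer strategy works because synopses are mergeable: partial summaries built independently over partitions of a stream can be combined into a summary of the whole. A minimal Python sketch of that build/merge contract (illustrative only, not Condor's actual API), using an exact distinct-count set in place of a real constant-size synopsis such as HyperLogLog:

```python
from functools import reduce

# A trivially mergeable synopsis: exact distinct count via a set.
# Real streaming synopses (e.g. HyperLogLog) trade exactness for constant
# size, but obey the same build/merge contract.
def build(partition):
    return set(partition)

def merge(a, b):
    return a | b

stream = [1, 2, 2, 3, 4, 4, 4, 5]
partitions = [stream[:4], stream[4:]]  # processed independently, e.g. on two workers
combined = reduce(merge, (build(p) for p in partitions))
print(len(combined))  # 5 distinct values, identical to a single-pass computation
```

Because merging is associative, the same pattern scales to any number of workers, which is what allows linear scalability with the number of cores.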
Scotch: Generating FPGA-Accelerators for Sketching at Line Rate
Authors: Martin Kiefer, Ilias Poulakis, Sebastian Bress, Volker Markl
Abstract: Sketching algorithms are a powerful tool for single-pass data summarization. Their numerous applications include approximate query processing, machine learning, and large-scale network monitoring. In the presence of high-bandwidth interconnects or in-memory data, the throughput of summary maintenance over input data becomes the bottleneck. While FPGAs have shown admirable throughput and energy-efficiency for data processing tasks, developing FPGA accelerators requires a sophisticated hardware design and expensive manual tuning by an expert. We propose Scotch, a novel system for accelerating sketch maintenance using FPGAs. Scotch provides a domain-specific language for the user-friendly, high-level definition of a broad class of sketching algorithms. A code generator performs the heavy-lifting of hardware description, while an auto-tuning algorithm optimizes the summary size. Our evaluation shows that FPGA accelerators generated by Scotch outperform CPU- and GPU-based sketching by up to two orders of magnitude in terms of throughput and up to a factor of five in terms of energy efficiency.
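A classic example of the sketching algorithms Scotch targets is the count-min sketch, which maintains approximate frequencies in a single pass over a stream. The small pure-Python version below is for illustration only; Scotch generates FPGA hardware descriptions from a domain-specific language, not Python:

```python
import hashlib

class CountMinSketch:
    """Single-pass frequency summary: may overestimate counts, never underestimates."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash function per row, derived via a per-row salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(),
                                     salt=row.to_bytes(8, "big")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # The true count is <= this estimate; collisions only inflate cells.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # >= 3; exact unless hash collisions occur
```

The memory footprint is fixed by `width * depth`, independent of stream length, which is precisely what makes such summaries attractive for line-rate maintenance in hardware.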
Abstract: In this paper we present our work on compliant geo-distributed data processing. Our work focuses on the new dimension of dataflow constraints that regulate the movement of data across geographical or institutional borders. For example, European directives may regulate transferring only certain information fields (such as non personal information) or aggregated data. Thus, it is crucial for distributed data processing frameworks to consider compliance with respect to dataflow constraints derived from these regulations. We have developed a compliance-based data processing framework, which (i) allows for the declarative specification of dataflow constraints, (ii) determines if a query can be translated into a compliant distributed query execution plan, and (iii) executes the compliant plan over distributed SQL databases. We demonstrate our framework using a geo-distributed adaptation of the TPC-H benchmark data. Our framework provides an interactive dashboard, which allows users to specify dataflow constraints, and analyze and execute compliant distributed query execution plans.
Abstract: Parameter servers ease the implementation of distributed machine learning systems, but their performance can fall behind that of single machine baselines due to communication overhead. We demonstrate LAPSE, a parameter server with dynamic parameter allocation. Previous work has shown that dynamic parameter allocation can improve parameter server performance by up to two orders of magnitude and lead to near-linear speed-ups over single machine baselines. In this demonstration, attendees learn how they can use LAPSE and how LAPSE can provide order-of-magnitude speed-ups over other parameter servers. To do so, this demonstration interactively analyzes and visualizes what dynamic parameter allocation looks like in action.
Abstract: Distributed matrix computation is common in large-scale data processing and machine learning applications. Iterative-convergent algorithms involving matrix computation share a common property: parameters converge non-uniformly. This property can be exploited to avoid redundant computation via incremental evaluation. Unfortunately, existing systems that support distributed matrix computation, like SystemML, do not employ incremental evaluation. Moreover, incremental evaluation does not always outperform classical matrix computation, which we refer to as a full evaluation. To leverage the benefit of increments, we propose a new system called HyMAC, which performs hybrid plans to balance the trade-off between full and incremental evaluation at each iteration. In this demonstration, attendees will have an opportunity to experience the effect that full, incremental, and hybrid plans have on iterative algorithms.
Abstract: The Internet of Things (IoT) enables the usage of resources at the edge of the network for various data management tasks that are traditionally executed in the cloud. However, the heterogeneity of devices and communication methods in a multitiered IoT environment (cloud/fog/edge) exacerbates the problem of deciding which nodes to use for processing and how to route data. In addition, both decisions cannot be made only statically for the entire lifetime of an application, as an IoT environment is highly dynamic and nodes in the same topology can be both stationary and mobile as well as reliable and volatile. As a result of these different characteristics, an IoT data management system that spans across all tiers of an IoT network cannot meet the same availability assumptions for all its nodes. To address the problem of choosing ad-hoc which nodes to use and include in a processing workload, we propose a networking component that uses a-priori as well as ad-hoc routing information from the network. Our approach, called Rime, relies on keeping track of nodes at the gateway level and exchanging routing information with other nodes in the network. By tracking nodes while the topology evolves in a geo-distributed manner, we enable efficient communication even in the case of frequent node failures. Our evaluation shows that Rime keeps in check communication costs and message transmissions by reducing unnecessary message exchange by up to 82.65%.
Abstract: The Internet of Things (IoT) is rapidly growing into a network of billions of interconnected physical devices that constantly stream data. To enable data-driven IoT applications, data management systems like NebulaStream have emerged that manage and process data streams, potentially in combination with data at rest, in a heterogeneous distributed environment of cloud and edge devices. To perform internal optimizations, an IoT data management system requires a monitoring component that collects system metrics of the underlying infrastructure and application metrics of the running processing tasks. In this paper, we explore the applicability of existing cloud-based monitoring solutions for stream processing engines in an IoT environment. To this end, we provide an overview of commonly used approaches, discuss their design, and outline their suitability for the IoT. Furthermore, we experimentally evaluate different monitoring scenarios in an IoT environment and highlight bottlenecks and inefficiencies of existing approaches. Based on our study, we show the need for novel monitoring solutions for the IoT and define a set of requirements.
Earth Observation data for climate change research
AgoraEO: One platform integrates data from all over the world
Environmental reports on the dramatic retreat of the Arctic ice sheet, the ongoing deforestation of rain forests or the spread of forest fires are mostly based on the data analysis of satellite images. The analysis of large amounts of Earth Observation (EO) data plays a crucial role in understanding and quantifying climate change.
“The efficient use of these data makes it possible to monitor and predict the effects of climate change on a global scale with unprecedented reliability,” explains Prof. Dr. Begüm Demir, head of the Big Data Analytics for Earth Observation research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and professor of Remote Sensing Image Analysis at TU Berlin. Advances in satellite systems have massively increased the amount and variety as well as the spatial and spectral resolution of EO data. “Nowadays we possess huge EO data archives. The Sentinel satellites in the Copernicus program alone – Europe’s flagship EO satellite initiative – provide us with about 12 terabytes of satellite images per day,” says Begüm Demir.
The problem: there is no single platform that intelligently connects the different datasets of interest from all over the world. All existing analysis platforms rely on heterogeneous technologies with different interfaces and data formats, which prevents cross-platform use. For example, it is nearly impossible to apply an analytics procedure developed on one platform to another. “It’s like using Word on a PC without a Windows environment – meaning you have to instruct each computing operation individually. This ‘lock-in effect’ hinders innovation and thus the efficient use of the collected data for climate protection,” explains Dr. Jorge Quiané-Ruiz, head of the Big Data Systems research group at BIFOLD.
Overcoming these limitations in the use of EO datasets is the common goal of Begüm Demir and Jorge Quiané-Ruiz. Their project, AgoraEO, is a universal Earth Observation ecosystem infrastructure for sharing, finding, assembling, and running datasets, algorithms, and other tools. While Begüm Demir brings expertise in remote sensing data processing and analysis, Jorge Quiané-Ruiz is an expert in data processing and data management. He develops the Agora infrastructure, a more general-purpose ecosystem for data science and AI innovation, on which AgoraEO is partially based.
AgoraEO’s innovative infrastructure allows all interested parties to contribute both EO data and technologies without having to upload them to a common server. “Our goal is to create an infrastructure that enables federated analysis across different platforms, making modern Earth observation technology accessible to all scientists and society, thus promoting climate change innovation worldwide,” says Jorge Quiané-Ruiz.
*This article appeared for the first time on 31.07.2021 in the supplement “Climate Research” of Der Tagesspiegel, Berlin.
Abstract: Today, interoperability among EO exploitation platforms is almost inexistent as most of them rely on a heterogeneous set of technologies with varying interfaces and data formats. Thus, it is crucial to enable cross-platform (federated) analytics to make EO technology easily accessible to everyone. We envision AgoraEO, an EO ecosystem for sharing, finding, composing, and executing EO assets, such as datasets, algorithms, and tools. Making AgoraEO a reality is challenging for several reasons, the main ones being that the ecosystem must provide interactive response times and operate seamlessly over multiple exploitation platforms. In this paper, we discuss the different challenges that AgoraEO poses as well as our ideas to tackle them. We believe that having an open, unified EO ecosystem would foster innovation and boost EO data literacy for the entire population.
The Berlin Institute for the Foundations of Learning and Data is very pleased to announce the first six BIFOLD Junior Fellows. They were selected for the excellence of their research and are already well-established researchers in the computer sciences. In addition, their research interests show exceptional potential for BIFOLD’s research goals, either by combining machine learning and data management or by bridging the two disciplines and other research areas. The first six Junior Fellows will cover a broad range of research topics during their collaboration with BIFOLD.
These excellent researchers will receive mentoring from a BIFOLD Research Group Leader or Fellow, as well as opportunities to mentor a graduate school student and additional funding to support their research activities. Their research in collaboration with BIFOLD will advance the sciences in the following areas:
Dr. Kaustubh Beedkar: My research interest lies in exploring efficient and effective tools and methods for compliant geo-distributed data analytics. In contrast to traditional means, my work focuses on building data processing frameworks that enable decentralized data analytics. In particular, I research how to integrate into data processing frameworks both the legal constraints arising from regulatory bodies concerning data sovereignty and data movement, and the disparate compute infrastructures at multiple sites. For example, processing data generated by autonomous cars in three different regions (Europe, North America, and Asia) may face different regulations: there may be legal requirements that only aggregated, anonymized data may be shipped out of Europe and no data whatsoever may be shipped out of Asia. In my research, I explore how to specify and enforce compliance with these types of constraints while generating efficient query execution plans.
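The example above can be pictured as a rule check applied while placing query operators. The following is a minimal, hypothetical sketch of that idea; the region names, rules, and function names are illustrative and not part of any actual framework described here.

```python
# Hypothetical sketch: enforcing data-shipment constraints during operator placement.
# Regions and rules are illustrative only (cf. the Europe/Asia example in the text).

RULES = {
    "europe":        {"raw_export": False, "aggregated_export": True},
    "asia":          {"raw_export": False, "aggregated_export": False},
    "north_america": {"raw_export": True,  "aggregated_export": True},
}

def may_ship(region: str, aggregated: bool) -> bool:
    """Return True if data originating in `region` may legally leave it."""
    rule = RULES[region]
    return rule["aggregated_export"] if aggregated else rule["raw_export"]

def compliant_plan(plan):
    """Rewrite a plan so that non-exportable data is processed locally.

    `plan` is a list of (operator, source_region, is_aggregated) tuples.
    Operators whose data may be shipped are placed remotely; the rest
    are pinned to their source region.
    """
    rewritten = []
    for op, region, aggregated in plan:
        placement = "remote" if may_ship(region, aggregated) else "local"
        rewritten.append((op, region, placement))
    return rewritten

plan = [
    ("avg_speed", "europe", True),   # aggregated, anonymized -> may leave Europe
    ("raw_gps",   "asia",   False),  # raw data -> must stay in Asia
]
print(compliant_plan(plan))
```

In a real optimizer such a check would be interleaved with cost-based placement decisions; the sketch only shows the compliance filter itself.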
“The BIFOLD junior fellowship offers excellent opportunities in the form of seminars, training, and mentorship from world-renowned researchers, to hone my research skills in order to pursue a successful career in academia.”
Dr. Jan Hermann: Many modern technologies depend on our ability to design novel molecules and materials. This design can be greatly assisted by computer simulations of the chemical and physical properties of these novel compounds. Typical realistic simulations require a whole hierarchy of different computational methods. One central component of this hierarchy is the set of methods that can directly model the behavior of electrons in molecules and materials. Currently, such methods are limited either by computational efficiency or by accuracy. Recently, machine learning techniques have been successfully used across the physical sciences, but not to target the direct modeling of electrons. My research attempts to fill this gap through tight integration of machine learning into existing methods for the simulation of electrons in molecules and materials. The goal is to lift the restrictions on the use of these methods by broadening the scope of materials for which they can be effectively used.
“BIFOLD offers the opportunity to cooperate with other experts at the intersection of computer sciences and the physical sciences. Through mentoring opportunities in the BIFOLD junior fellowship program, I also hope to inspire the next generation of scientists to engage in machine-learning-based research in the physical sciences.”
Dr. Marina Marie-Claire Höhne: In order to bring the positive potential of artificial intelligence (AI) into real applications, for example in cancer diagnostics, where AI machines detect cancer cells within milliseconds, we need to understand the machine’s decision. My research, together with my junior research group UMI lab (Understandable Machine Intelligence), aims to understand highly complex learning machines and contributes to the field of explainable AI. In particular, I develop methods that enable a holistic understanding of AI models and, with the help of Bayesian networks, additionally transfer the prediction uncertainties from the AI model to the explanation.
“For me, BIFOLD is one of the crucial building blocks that contribute to research at the level of excellence. In my opinion, the accumulation of expert knowledge from different areas of AI is the decisive criterion for attaining a higher level of knowledge, which can lead to a holistic understanding of AI and is important in order to exploit the enormous potential of AI innovations.“
Dr. Danh Le Phuoc: The core of my research interest centers on the problem of how to autonomously fuse different types of sensory streams to build high-level perception, planning, and control components for robots, drones, and autonomous vehicles. My approach to this problem is to leverage common-sense and domain knowledge to guide the learning and reasoning tasks of such data fusion operations. The common-sense and domain knowledge are used to emulate the reasoning ability of a human. To this end, my research mission in BIFOLD is to develop scalable reasoning and learning algorithms and systems that bring together the best of both worlds, neural networks and symbolic reasoning, in so-called neural-symbolic fusion frameworks.
“The BIFOLD network will facilitate my collaboration with BIFOLD experts in machine learning, databases, and large-scale processing to push the frontiers of neural-symbolic research, which is among the emerging key foundations of data processing and learning.”
Dr. Kristof Schütt: Finding molecules and materials with specific chemical properties is important for progress in many technological applications, such as drug design or the development of efficient batteries and solar cells. In my research, I aim to accelerate this process by using machine learning for the discovery, simulation, and understanding of these chemical systems. To achieve this, I study neural networks that predict the chemical properties of molecules, or even directly generate molecules that possess a desired property.
“BIFOLD brings researchers from many fields together. Working at the intersection of machine learning and quantum chemistry, I enjoy the exchange with other disciplines, as it always results in new ideas and perspectives. Machine learning methods can often be applied to similar problems in quite different applications. Therefore, you can both learn from and contribute to many fields as a machine learning researcher – for which BIFOLD provides the ideal environment.“
Dr. Eleni Tzirita Zacharatou: Spatial data is at the heart of nearly every human activity. Billions of mobile devices, cars, social networks, satellites, sensors, scientific simulations, and many other sources produce spatial data constantly. My research aims to provide efficient tools to store, process, and manage the wealth of spatial data available today, thereby enabling discoveries, better services, and new products. To that end, I apply a broad portfolio of techniques, from efficient use of modern hardware to approximation algorithms and workload-driven adaptation. My goal within BIFOLD is to create an efficient data ecosystem that brings the benefits of spatial big data analytics to a broad community, helps to boost data literacy, and contributes to citizen science.
“BIFOLD offers great collaboration opportunities in data management, machine learning, and earth observation that can help me advance my research. In addition, the BIFOLD junior fellowship program allows me to evolve my mentoring skills and provides further training opportunities, thereby contributing to the successful development of my academic career.”
A future-proof IT infrastructure is increasingly becoming a decisive competitive factor – this applies not only to companies, but especially to research. In recent months, BIFOLD has been able to invest around 1.8 million euros in new research hardware, thereby significantly increasing the institute’s computing capacity. This cutting-edge IT infrastructure was financed by the German Federal Ministry of Education and Research (BMBF). “If we want to continue to conduct world-class research, investments in infrastructure are an important prerequisite,” says BIFOLD Co-Director Prof. Dr. Klaus-Robert Müller.
Current experiments in Machine Learning and Big Data management systems require hardware with very strong computing, storage, and data transfer capabilities. The new systems include a specialized computer unit (node) and a full computer cluster, both designed for simultaneous processing of a very large number of parallel workloads (massively parallel processing) with large main memory capacities, as well as one cluster particularly suitable for the fast processing of sequential workloads. The central processing units (CPUs) in the latter cluster also support the so-called Intel Software Guard Extensions technology, enabling developers to create and execute code and data in a secure environment. The servers run high-performance file systems and will allow for the transfer of very large data volumes with low latency. “We expect that this cutting-edge hardware will not only enrich our own research, but also enable us to establish new collaborations with our partners,” adds BIFOLD Co-Director Prof. Dr. Volker Markl.
In the group of Volker Markl, mainly two different projects benefit from the new possibilities: Agora is a novel form of data management system. It aims to construct an innovative unified ecosystem that brings together data, algorithms, models, and computational resources and provides them to a broad audience. The goal is the easy creation and composition of data science pipelines as well as their scalable execution. In contrast to existing data management systems, Agora operates in a heavily decentralized and dynamic environment. The NebulaStream platform is a general-purpose, end-to-end data management system for the IoT. It provides an out-of-the-box experience with rich data processing functionalities and a high ease of use. With the new IT infrastructure, both of these systems can be validated at a much larger scale and in a secure data processing environment.
High memory and parallel processing capabilities are also essential for large-scale Machine Learning simulations, e.g. solving high-dimensional linear problems, or training deep neural networks. Klaus-Robert Müller and his group will use the new hardware initially in three different projects: Specifically, it allows BIFOLD researchers to produce closed-form solutions of large dense linear systems, which are needed to describe correlations between large amounts of interacting particles in a molecule with high numerical precision. Researchers can also significantly extend the number of epigenetic profile regions that can be analyzed, thereby using significantly more information available in the data. It will also enable scientists to develop explainable AI techniques that incorporate internal explainability and feedback structures, and are significantly more complex to train than typical deep neural networks.
Specifications of the new hardware in the DIMA group
The first cluster consists of 60 servers, each with two processors (CPUs) with 16 processor cores at 2.1 GHz, 512 GB of RAM, 12 TB of HDD storage, and approximately 2 TB of additional fast SSD storage. They are thus designed for simultaneous processing of a very large number of parallel workloads (massively parallel processing). The second cluster consists of 15 identical servers. Each has a CPU with eight 3.4 GHz processor cores, 128 GB of RAM, and a combination of 12 TB of HDD and 2 TB of SSD storage. They are particularly suitable for the fast processing of sequential workloads. Furthermore, these CPUs support the Intel Software Guard Extensions, enabling developers to develop and execute code and data in an environment secured by the CPU.
The performance of the server system is upgraded with two high-performance nodes, each with three Nvidia A100 graphics processing units (GPUs), two 64-core CPUs, 2 TB of main memory, and over 22 TB of HDD capacity, capable of handling data analytics and AI applications on very large data sets.
Both clusters are also managed by two head nodes, which have advanced hardware features to improve fault tolerance. All systems have 100 Gbit/s InfiniBand cards and are connected via two 100 Gbit/s InfiniBand switches, enabling very fast data exchange between cluster hosts. The servers will use the Hadoop Distributed File System (HDFS), which supports Big Data analytics and will enable high-performance access to data.
Specifications of the new hardware in the Machine Learning group
The existing cluster has been upgraded with 13 additional nodes. Twelve nodes have 768 GB of main memory each and four Nvidia A100 GPUs, connected via a 200 Gbit InfiniBand network. This setup allows the simulation and computation of very large models. One special node runs with 6 TB of RAM and 72 processor cores, which enables massively parallel computing while holding very large models in main memory.
The distributed high-performance computing file system BeeGFS was expanded with three more file servers. A total of 437 TB of storage capacity is distributed across six data servers, which are connected to the network with 40 Gbit links. All nodes are connected with at least 25 Gbit. Overall, this setup is capable of handling operations with very large amounts of data.
The Berlin Institute for the Foundations of Learning and Data (BIFOLD) set up two new Research Training Groups, led by Dr. Stefan Chmiela and Dr. Steffen Zeuch. The goal of these new research units at BIFOLD is to enable junior researchers to conduct independent research and prepare them for leadership positions. Initial funding includes their own positions as well as two PhD students and/or research associates each for three years.
Steffen Zeuch is interested in how to overcome the data management challenges that the growing number of Internet of Things (IoT) devices brings: “Over the last decade, the amount of produced data has reached unseen magnitudes. Recently, the International Data Corporation estimated that by 2025, the global amount of data will reach 175 Zettabytes (ZB) and that 30 percent of this data will be gathered in real-time. In particular, the number of IoT devices increases exponentially, such that the IoT is expected to grow to as many as 20 billion connected devices in 2025.” The explosion in the number of devices will create novel data-driven applications in the near future. These applications require low latency, location awareness, widespread geographical distribution, and real-time data processing on potentially millions of distributed data sources.
“To enable these applications, a data management system needs to leverage the capabilities of IoT devices outside the cloud. However, today’s classical data management systems are not ready yet for these applications as they are designed for the cloud,” explains Steffen Zeuch. “The focus of my research lies in introducing the NebulaStream Platform – a general purpose, end-to-end data management system for the IoT.”
Stefan Chmiela concentrates on so-called many-body problems. This broad category of physical problems deals with systems of interacting particles, with the goal of accurately characterizing their dynamic behavior. These types of problems arise in many disciplines, including quantum mechanics, structural analysis, and fluid dynamics, and generally require solving high-dimensional partial differential equations. “In my research group we will particularly focus on problems from quantum chemistry and condensed matter physics, as these fields of science rank among the most computationally intensive,” explains Stefan Chmiela. In these systems, highly complex collective behavior emerges from relatively simple physical laws for the motion of each individual particle. Because of this, the simulation of high-dimensional many-body problems demands enormous computational capacity. There is a limit to how much computational efficiency can be gained through rigorous mathematical and physical approximations, yet fully empirical solutions are often too simplistic to be sufficiently predictive.
The lack of simultaneously accurate and efficient approaches makes many phenomena hard to model reliably. “Reconciling these two contradicting aspects of accuracy and computational speed is our goal,” states Stefan Chmiela. “Our idea is to incorporate readily available fundamental prior knowledge into modern learning machines. We leverage conservation laws, which can be derived from the symmetries of physical systems, in order to increase the model’s ability to be accurate with less data.”
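As a schematic illustration of such a built-in conservation law (an example of the general principle, not a description of the group’s specific method): if a learning machine predicts atomic forces as the negative gradient of a learned energy surrogate $\hat{E}$,
\[
\hat{F}(x) \;=\; -\nabla_x \hat{E}(x)
\quad\Longrightarrow\quad
\oint_{\mathcal{C}} \hat{F}(x)\cdot \mathrm{d}x \;=\; 0
\;\;\text{for any closed path } \mathcal{C},
\]
then the predicted force field is conservative by construction: no energy can be gained or lost along any closed trajectory, regardless of how much training data was available. Constraints of this kind restrict the space of admissible models, which is one way prior physical knowledge can substitute for data.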