SIGMOD Reproducibility Award

Researchers in Prof. Abedjan’s group win SIGMOD Reproducibility Award

The paper “Raha: A Configuration-Free Error Detection System” by Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang won the ACM SIGMOD Most Reproducible Paper Award.

The code and datasets are available here.

Join the MICCAI 2020 CADA challenge!

This year’s MICCAI Conference on Medical Image Computing and Computer Assisted Intervention in Peru will feature Grand Challenges in biomedical image analysis. In partnership with Charité, Fraunhofer MEVIS and Helios, BIFOLD supports the CADA challenge on the automated and semi-automated analysis of image data of cerebral aneurysms.

Cerebral aneurysms are local dilations of arterial blood vessels caused by a weakness of the vessel wall. Subarachnoid hemorrhage (SAH) caused by the rupture of a cerebral aneurysm is a life-threatening condition associated with high mortality and morbidity. The mortality rate is above 40%, and even in cases of survival, cognitive impairment can affect patients for a long time.

It is therefore highly desirable to detect aneurysms early and decide on the appropriate rupture prevention strategy. Diagnosis and treatment planning are based on angiographic imaging using MRI, CT, or X-ray rotation angiography.

Major goals in image analysis are the detection and risk assessment of aneurysms. The challenge therefore consists of three separate tasks (detection, segmentation and rupture risk estimation).

For more information and to register for the challenge, please visit https://cada.grand-challenge.org/Introduction/

Three papers presented at EDBT 2020

BIFOLD systems papers presented at EDBT 2020

Researchers in TU Berlin’s Database Systems and Information Management (DIMA) Group [1] and DFKI’s Intelligent Analytics for Massive Data Group [2] presented three systems papers at EDBT 2020, the 23rd International Conference on Extending Database Technology, held from March 30 to April 2. Originally planned to take place in Copenhagen, Denmark, this year’s EDBT conference was held online instead.

DIMA PhD student Haralampos Gavriilidis presented “Scaling a Public Transport Monitoring System to Internet of Things Infrastructures” [3]. In the talk, Harry casts a public transport problem under an IoT scenario, discusses some IoT data management challenges, motivates the need for a novel platform for end-to-end data management for the IoT (NebulaStream [4]), and demonstrates how an interactive map can be used to monitor a public transport system. The paper and video of the demo are available at https://www.nebula.stream/publications/gavriilidis_demo.html.

DIMA PhD student Ankit Chaudhary presented “Governor: Operator Placement for a Unified Fog-Cloud Environment” [5]. In his talk, Ankit motivates the need for a unified fog-cloud environment in the IoT, presents the operator placement problem in light of service-level agreements, introduces Governor, a novel operator placement approach for a unified fog-cloud environment, and discusses Governor Policies (GP) to optimize operator placement in user queries and enable administrators to control the operator placement process. In addition, he offers a demonstration to highlight the impact GP have on operator placement for varying queries. The paper and video of the demo are available at https://www.nebula.stream/publications/governor.html.

Former DIMA Master’s student Lawrence Benson presented “Disco: Efficient Distributed Window Aggregation” [6], a short paper based on his Master’s thesis. Disco is a distributed complex window aggregation approach designed to process complex window types on multiple independent nodes, while efficiently aggregating incoming data streams. In his talk, Lawrence highlights the advantages Disco offers over centralized solutions, including throughput that scales linearly with the number of nodes and significantly reduced network costs. The paper and video of the talk are available at https://www.nebula.stream/publications/disco.html.

All of the research in these works was conducted under the auspices of the Berlin Institute for the Foundations of Learning and Data (BIFOLD).

References

[1] The TU Berlin Database Systems and Information Management Group, https://www.dima.tu-berlin.de/.

[2] The German Research Center for Artificial Intelligence (DFKI) Intelligent Analytics for Massive Data Group, https://www.dfki.de/en/web/research/research-departments/intelligent-analytics-for-massive-data/.

[3] “Scaling a Public Transport Monitoring System to Internet of Things Infrastructures,” Haralampos Gavriilidis, Adrian Michalke, Laura Mons, Steffen Zeuch, and Volker Markl.

[4] The NebulaStream Platform, https://www.nebula.stream/.

[5] “Governor: Operator Placement for a Unified Fog-Cloud Environment,” Ankit Chaudhary, Steffen Zeuch, and Volker Markl.

[6] “Disco: Efficient Distributed Window Aggregation,” Lawrence Benson, Philipp M. Grulich, Steffen Zeuch, Volker Markl, and Tilmann Rabl.

ELLIS Unit established at TU Berlin

European AI research network ELLIS established a new Unit at TU Berlin

In positive response to a request by Prof. Dr. Klaus-Robert Müller (head of the Machine Learning Department at TU Berlin and one of the directors of BIFOLD) and other scientists, the Technische Universität Berlin became part of the European AI research network European Laboratory for Learning and Intelligent Systems (ELLIS).

ELLIS aims to strengthen and connect AI research efforts across Europe. The new ELLIS Berlin unit will build upon Berlin’s strong AI research ecosystem, which is also represented in BIFOLD’s partner network, and will contribute to BIFOLD’s focus research areas, such as explainable artificial intelligence, scalable machine learning and data management, and deep learning. Furthermore, ELLIS Berlin will actively support the ELLIS network with research programs, events and workshops.

Find more information in TU Berlin’s official press release (in German).

Papers accepted at SIGMOD 2020

Four papers authored by TU Berlin and DFKI researchers have been accepted at SIGMOD 2020

Data management systems researchers in the Database Systems and Information Management (DIMA) Group at TU Berlin and the Intelligent Analytics for Massive Data (IAM) Group at DFKI (the German Research Center for Artificial Intelligence) were informed that their papers have been accepted at the 2020 ACM SIGMOD/PODS International Conference on Management of Data.

The “Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines,” paper authored by Del Monte et al. addresses the problem of large state migration and on-the-fly query reconfiguration, to support resource elasticity, fault-tolerance, and runtime optimization (e.g., for load balancing). A stream processing engine equipped with Rhino is capable of attaining lower latency processing and achieving continuous operation, even in the presence of failures.

The “Optimizing Machine Learning Workloads in Collaborative Environments,” paper authored by Derakhshan et al. presents a system that is capable of optimizing the execution of machine learning workloads in collaborative environments. This is achieved by exploiting an experiment graph of stored artifacts drawn from previously performed operations and results.

The “Grizzly: Efficient Stream Processing Through Adaptive Query Compilation,” paper authored by Grulich et al. presents a novel adaptive query compilation-based stream processing engine that enables highly efficient query execution on modern hardware and is able to dynamically adjust to changing data characteristics at runtime.

The “Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects,” paper authored by Lutz et al. provides an in-depth analysis of the new NVLink 2.0 interconnect technology, which enables users to overcome data transfer bottlenecks and efficiently process large datasets stored in main-memory on GPUs.

The parallel acceptance of these four publications at one of the top data management conferences is not only a great success for TU Berlin’s DIMA Group and DFKI’s IAM Group; it also shows that BIFOLD, the Berlin Institute for the Foundations of Learning and Data, continues to positively impact international artificial intelligence and data management research.

The papers in detail:

Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects

Authors: Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, Volker Markl

Abstract: GPUs have long been discussed as accelerators for database query processing because of their high processing power and memory bandwidth. However, two main challenges limit the utility of GPUs for large-scale data processing: (1) the onboard memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is insufficient for ad-hoc data transfers. As a result, GPU-based systems and algorithms run into a transfer bottleneck and do not scale to large data sets. In practice, CPUs process large-scale data faster than GPUs with current technology. In this paper, we investigate how a fast interconnect can resolve these scalability limitations using the example of NVLink 2.0. NVLink 2.0 is a new interconnect technology that links dedicated GPUs to a CPU. The high bandwidth of NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-memory on GPUs. We perform an in-depth analysis of NVLink 2.0 and show how we can scale a no-partitioning hash join beyond the limits of GPU memory. Our evaluation shows speedups of up to 18× over PCI-e 3.0 and up to 7.3× over an optimized CPU implementation. Fast GPU interconnects thus enable GPUs to efficiently accelerate query processing.

https://doi.org/10.1145/3318464.3389705

Blog post by Clemens Lutz

Presentation video by Clemens Lutz

Optimizing Machine Learning Workloads in Collaborative Environments

Authors: Behrouz Derakhshan, Alireza Rezaei Mahdiraji, Ziawasch Abedjan, Tilmann Rabl, and Volker Markl

Abstract: Effective collaboration among data scientists results in high-quality and efficient machine learning (ML) workloads. In a collaborative environment, such as Kaggle or Google Colaboratory, users typically re-execute or modify published scripts to recreate or improve the result. This introduces many redundant data processing and model training operations. Reusing the data generated by the redundant operations leads to the more efficient execution of future workloads. However, existing collaborative environments lack a data management component for storing and reusing the result of previously executed operations.
In this paper, we present a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results. We utilize a so-called Experiment Graph (EG) to store the artifacts, i.e., raw and intermediate data or ML models, as vertices and operations of ML workloads as edges. In theory, the size of EG can become unnecessarily large, while the storage budget might be limited. At the same time, for some artifacts, the overall storage and retrieval cost might outweigh the recomputation cost. To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse. Given the materialized artifacts inside EG, we devise a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads. Our reuse algorithm only incurs a negligible overhead and scales for the high number of incoming ML workloads in collaborative environments. Our experiments show that we improve the run-time by one order of magnitude for repeated execution of the workloads and 50% for the execution of modified workloads in collaborative environments.

Preprint

Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines

Authors: Bonaventura Del Monte, Steffen Zeuch, Tilmann Rabl, Volker Markl

Abstract: Scale-out stream processing engines (SPEs) are powering large big data applications on high velocity data streams. Industrial setups require SPEs to sustain outages, varying data rates, and low-latency processing. SPEs need to transparently reconfigure stateful queries during runtime. However, state-of-the-art SPEs are not ready yet to handle on-the-fly reconfigurations of queries with terabytes of state due to three problems. These are network overhead for state migration, consistency, and overhead on data processing. In this paper, we propose Rhino, a library for efficient reconfigurations of running queries in the presence of very large distributed state. Rhino provides a handover protocol and a state migration protocol to consistently and efficiently migrate stream processing among servers. Overall, our evaluation shows that Rhino scales with state sizes of up to TBs, reconfigures a running query 15 times faster than the state-of-the-art, and reduces latency by three orders of magnitude upon a reconfiguration.

Preprint

Grizzly: Efficient Stream Processing Through Adaptive Query Compilation

Authors: Philipp M. Grulich, Sebastian Breß, Steffen Zeuch, Jonas Traub, Janis von Bleichert, Zongxiong Chen, Tilmann Rabl, Volker Markl

Abstract: Stream Processing Engines (SPEs) execute long-running queries on unbounded data streams. They rely on managed runtimes, an interpretation-based processing model, and do not perform runtime optimizations. Recent research states that this limits the utilization of modern hardware and neglects changing data characteristics at runtime. In this paper, we present Grizzly, a novel adaptive query compilation-based SPE to enable highly efficient query execution on modern hardware. We extend query-compilation and task-based parallelization for the unique requirements of stream processing and apply adaptive compilation to enable runtime re-optimizations. The combination of light-weight statistic gathering with just-in-time compilation enables Grizzly to dynamically adjust to changing data-characteristics at runtime. Our experiments show that Grizzly achieves up to an order of magnitude higher throughput and lower latency compared to state-of-the-art interpretation-based SPEs.

Preprint

BIFOLD officially announced

Official announcement of BIFOLD in Berlin

Copyright: TU Berlin / Felix Noak
Prof. Dr. Christian Thomsen (President of TU Berlin), Anja Karliczek (Federal Minister of Education and Research), Prof. Dr. Klaus-Robert Müller and Prof. Dr. Volker Markl (Directors of BIFOLD) and Michael Müller (Mayor of Berlin)

On January 15, 2020, the Berlin Institute for the Foundations of Learning and Data (BIFOLD) was officially announced at Forum Digital Technologies in Berlin. Please also see the official press releases of the Federal Ministry of Education and Research and Technische Universität Berlin (both in German).

Message from the Directors

Statement of Prof. Dr. Volker Markl

As you know, data has become an enormously important production factor. Together with intelligent algorithms, it forms the cornerstone of Artificial Intelligence. It is only through the combination of Big Data and Machine Learning that the great successes of AI, which we have seen in recent years and will continue to see in the future, have become possible.

In Berlin, too, with the two competence centers BBDC and BZML, we have already achieved internationally highly regarded successes, from basic research and open source software development to very successful spin-offs.

With BIFOLD, Berlin now has a technological research beacon around which an entire ecosystem of spin-offs and application-oriented research labs can develop. This will enable us to attract top international talent to Berlin and make AI a relevant economic factor for Berlin. 

The special thing about BIFOLD is that we are avoiding the mistake that unfortunately is commonly made in science, namely, to consider partial aspects of AI in isolation. For example, the best algorithms will not help us if we do not simultaneously research and develop the underlying technologies and systems in which real data is efficiently provided and processed jointly with analysis algorithms.

With respect to data – i.e., my research area in BIFOLD – important challenges lie, for example, in the processing of widely geographically distributed data, i.e., in some cases globally distributed data, which cannot always be physically combined on an infrastructure for analysis due to data protection laws as well as for technical reasons.

Think, for example, of the globally distributed vehicle data of an automobile manufacturer or patient data that is collected across hospitals. Thus, we need new data processing architectures that on the one hand handle the growing data streams efficiently, and on the other hand reliably protect the privacy and rights of data producers.

An additional challenge is the exponential growth of sensor data, the complete capture of which would quickly exceed the capacities of our global cloud infrastructures and is neither necessary nor sensible.

We are therefore developing new approaches to preprocess data at the source, at the so-called edge, in such a way that we only transfer and store data that is relevant for a particular analysis. This is not only economically more efficient, but also ecologically more sensible and less questionable in terms of data protection.

The systems that we develop should make ideal use of the growing variety of memory and chip technologies and at the same time be so easy to operate, i.e., function in such a largely automated manner, that users do not need a five-year computer science degree in order to work with them.

Because computer scientists, as you all know, are currently a painful bottleneck in the job market.

You see, especially for the commercialization and economic success of basic research, it is extremely important to look at the entire stack of hardware, software, data, algorithms, and the broad ecosystem of applications holistically, and preferably together in a research institute of critical size.

And that’s why I am particularly pleased as a database researcher and thank the German Federal Ministry of Education and Research (BMBF) and the State of Berlin, that with BIFOLD we are now creating the conditions to be able to do exactly this in Berlin.

Statement of Prof. Dr. Klaus-Robert Müller

We would like to thank you very much for the confidence you have placed in us to establish our BIFOLD AI Center! And I would like to assure you that the money is in good hands, because Berlin has always been a stronghold of AI research and has been so for a quarter of a century. From my ranks alone, 33 professors have emerged and I am not the only one in this center who has produced successful young scientists! My esteemed namesake has already spoken about the many spin-offs.

What is it all about? The technical foundations of AI are machine learning and big data. This is exactly what we are researching here and, as you have already heard, this is a unique combination. We want to advance the basics of AI. Why? In engineering disciplines, for example, in the automotive sector, it sometimes takes a decade for a clever invention to find its way into our new car. AI is different: progress in the fundamentals translates very quickly into a new product or service, and fast can mean just a few weeks. Fortunately, for someone as well versed in mathematics as I am, this means that the saying really applies here: there is nothing more practical than a good theory.

I would like to give an example of our research. Until about 4 years ago, everyone was always complaining that machine learning methods like deep learning are black boxes and that one doesn’t know what happens in them – a real absurdity for an application (just imagine a medical diagnosis without transparency)! We were able to change that here in our center, establishing explainable AI (made in Germany/Berlin), as we were finally able to solve a complicated mathematical problem. Now everyone can use our new technique to understand and improve their AI methods and make them safe and trustworthy. Another important task of the center is the broad application of AI in the sciences of physics, chemistry, medicine and the digital humanities – something particularly new, all with very strong partners in Berlin: researchers of international top standing only a subway ride away.

Our country urgently needs AI professionals. There are still very few of them, so we have to train far more than ever before – a great challenge for our center, where we will happily include the new professorships to be created. Only 5 years ago I had about 50 students in my special lecture on machine learning, now 637 are registered. With this exponential increase, soon half of Berlin will be sitting in my class …

If we want to create many new jobs, where will all the applicants come from? From all over the world and of course from Germany and Berlin. Everyone wants to go to Berlin, that’s our incredible location advantage and everyone loves this city (me too) and this city inspires us all to create new ideas.