BIFOLD Colloquium 2022/02/28


State Management in Cloud-Native Streaming Systems

Speaker: Yingjun Wu (Singularity Data)

Venue: Virtual event

Date and time: February 28, 2022. 3:30 pm – 4:15 pm.

Registration: If you are interested in participating, please contact: coordination@bifold.berlin

Abstract:

Streaming systems are becoming increasingly essential for extracting business value from data in real time. To achieve the different SLAs demanded by customers under constantly changing workloads, it is a must to take advantage of the scalable, resilient, diversified resources in the cloud. New demand opens new opportunities and challenges for state management, which is at the core of streaming databases. Existing approaches typically use embedded key-value storage so that each worker can access it locally and benefit from its high performance. However, this approach requires an external durable file system for checkpointing, makes redistributing state during scaling and migration complicated and time-consuming, and is prone to performance throttling. Therefore, we propose shared storage based on an LSM-tree. State is stored in cloud object storage, which makes it durable without extra steps, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from the compute nodes, which makes scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated with opportunistic serverless boosting instead of relying on individual compute nodes. We design a streaming-aware compaction and caching strategy to achieve smoother and better end-to-end performance.

Speaker:

Yingjun Wu is the founder and CEO of Singularity Data (https://www.singularity-data.com/), a startup innovating next-generation database systems. Before starting his adventure, he was a software engineer on the Redshift team at Amazon Web Services and a researcher in the Database group at IBM Almaden Research Center. He received his PhD degree from the National University of Singapore, where he was affiliated with the Database Group (advisor: Kian-Lee Tan). He was also a visiting PhD student at the Database Group, Carnegie Mellon University (host advisor: Andrew Pavlo). He earned his bachelor’s degree from South China University of Technology. He is passionate about integrating research into real-world system products. During his time at AWS, he was responsible for boosting Amazon Redshift performance using advanced vectorization and compression techniques. Before that, he participated in the development of IBM Db2 Event Store’s indexing structure and transaction processing mechanism. During his PhD, he developed two main-memory DBMS prototypes, namely Peloton and Cavalia. He was also an early contributor to Stratosphere, which is now widely known as Apache Flink.

BIFOLD Colloquium 2022/01/24


“The evolution of Apache Kafka”

Speaker: Jun Rao

Venue: Virtual event

Time and date: 4:00 pm. January 24, 2022

Registration: If you are interested in participating, please contact: coordination@bifold.berlin

Abstract:

Apache Kafka is a popular event streaming platform. It has been adopted by hundreds of thousands of organizations across the globe and by more than 80% of Fortune 500 companies. In this talk, I will first describe the history of Kafka: how it was invented, what problem it was trying to solve and how it has been evolving. Then I will describe a couple of real world use cases. Finally, I will outline some of the future work on this platform.

Speaker:

Jun Rao is co-founder of Confluent, a company that provides an event streaming platform based on Apache Kafka. Before Confluent, Jun Rao was a senior staff engineer at LinkedIn, where he led the development of Kafka. Before LinkedIn, Jun Rao was a researcher at IBM’s Almaden Research Center, where he conducted research on database and distributed systems. Jun Rao is the PMC chair of Apache Kafka and a committer of Apache Cassandra. He is the co-author of more than 20 refereed research papers and the co-inventor of more than a dozen U.S. software patents.

BIFOLD Research Talk 06/01/2022


“Darwin: Scale-In Stream Processing”

Speaker: Lawrence Benson

Venue: Virtual event

Time and Date: January 06, 2022. 2 pm – 2:30 pm.

Registration: If you are interested in participating, please contact: coordination@bifold.berlin.

Abstract:

Companies increasingly rely on stream processing engines (SPEs) to quickly analyze data and monitor infrastructure. These systems enable continuous querying of data at high rates. Current production-level systems, such as Apache Flink and Spark, rely on clusters of servers to scale out processing capacity. Yet, these scale-out systems are resource inefficient and cannot fully utilize the hardware. As a solution, hardware-optimized, single-server, scale-up SPEs were developed. To get the best performance, they neglect essential features for industry adoption, such as larger-than-memory state and recovery. This requires users to choose between high performance and system availability. While some streaming workloads can afford to lose or reprocess large amounts of data, others cannot, forcing them to accept lower performance. Users also face a large performance drop once their workloads slightly exceed the capacity of a single server, forcing them to use scale-out SPEs. To acknowledge that real-world stream processing setups have drastically varying performance and availability requirements, we propose scale-in processing. Scale-in processing is a new paradigm that adapts to various application demands by achieving high hardware utilization on a wide range of single- and multi-node hardware setups, reducing overall infrastructure requirements. In contrast to scaling up or out, it focuses on fully utilizing the given hardware instead of demanding more or ever-larger servers. We present Darwin, our scale-in SPE prototype that tailors its execution towards arbitrary target environments by compiling stream processing queries while supporting recoverable larger-than-memory state management. Early results show that Darwin achieves an order of magnitude speed-up over current scale-out systems and matches processing rates of scale-up systems.


Speaker bio:

Lawrence Benson is a PhD student in the Data Engineering Systems Group at the Hasso Plattner Institute (HPI) in Potsdam under the supervision of Prof. Dr. Tilmann Rabl.

His research focus is on data management with modern hardware. He is passionate about efficiently leveraging hardware in novel system designs. Currently, he is working on persistent memory and next-gen stream processing systems, with multiple published papers at top venues (VLDB, SIGMOD, CIDR, EDBT).

Before his PhD, Lawrence completed his M.Sc. at HPI with a strong focus on databases and stream processing. He wrote his thesis in collaboration with the DIMA Group @ TU Berlin. During his studies, he did two internships at Google in California and New York, working on stream processing.

Lifting the curse of dimensionality for statistics in ML


The paper “Beyond Smoothness: Incorporating Low-Rank Analysis into Nonparametric Density Estimation” by BIFOLD researcher Dr. Robert A. Vandermeulen and his colleague Dr. Antoine Ledent, Technical University Kaiserslautern, was presented at the Conference on Neural Information Processing Systems (NeurIPS 2021). Their paper provides the first solid theoretical foundations for applying low-rank methods to nonparametric density estimation.

The more features there are in the data, the more difficult machine learning tasks become.

It is well known that problems in statistics and machine learning become more difficult as the dimensionality, or number of features in the data, grows larger. For example, when using collected data to build a predictor for the likelihood of a given type of cancer, patient information about age, gender, weight and alcohol consumption would be fairly useful (four dimensions). If 50,000 pieces of genetic data were included in addition, the information they contain would potentially be far more effective for predicting cancer than the four health markers. However, an enormous amount of patient data would be required to discern which variations of these 50,000 markers are relevant for predicting cancer. This kind of problem occurs in virtually all machine learning and statistics with a large number of dimensions and is termed “the curse of dimensionality.” One reason for the current interest in neural networks is their ability to circumvent this problem. However, researchers still have essentially no mathematical understanding as to why this is the case.

Robert A. Vandermeulen

“Lifting the curse of dimensionality for nonparametric density estimations is a breakthrough. Being able to combine a high number of features with flexible statistical methods opens up many possibilities for new machine learning applications.”

Outside of neural networks in deep learning, there are methods for obviating the curse of dimensionality that are well understood mathematically. A very popular method for this is known as “compressed sensing.” This method was invented partially by Terence Tao, a Fields Medalist and potentially the most famous living mathematician. The method has been applied with great success to linear methods; however, its underlying principles have yet to be extended to the highly flexible “nonparametric” statistical methods, which are capable of utilizing or estimating complex dependencies between features. In this paper, the researchers were able to prove that applying low-rank methods to nonparametric density estimation can completely eliminate the curse of dimensionality. This is a result that may be valid for other areas of nonparametric statistics as well.

The publication in detail:

Robert A. Vandermeulen, Antoine Ledent: Beyond Smoothness: Incorporating Low-Rank Analysis into Nonparametric Density Estimation. NeurIPS 2021

Abstract

The construction and theoretical analysis of the most popular universally consistent nonparametric density estimators hinge on one functional property: smoothness. In this paper we investigate the theoretical implications of incorporating a multi-view latent variable model, a type of low-rank model, into nonparametric density estimation. To do this we perform extensive analysis on histogram style estimators that integrate a multi-view model. Our analysis culminates in showing that there exists a universally consistent histogram style estimator that converges to any multi-view model with a finite number of Lipschitz continuous components at a rate of $\widetilde{O}(1/\sqrt[3]{n})$ in $L^1$ error, compared to the standard histogram estimator which can converge at a rate slower than $1/\sqrt[d]{n}$ on the same class of densities. Beyond this we also introduce a new type of nonparametric latent variable model based on the Tucker decomposition. A very rudimentary experimental implementation of the ideas in our paper demonstrates considerable practical improvements over the standard histogram estimator. We also provide a thorough analysis of the sample complexity of our Tucker decomposition based model. Thus, our paper provides solid first theoretical foundations for extending low-rank techniques to the nonparametric setting.
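
To make the gap between the two rates in the abstract concrete, here is a back-of-the-envelope calculation. It ignores constants and logarithmic factors, which is an illustrative simplification and not a statement taken from the paper. It asks how many samples $n$ are needed to reach an $L^1$ error of $\varepsilon$ in $d$ dimensions:

$$
n^{-1/3} \approx \varepsilon \;\Longrightarrow\; n \approx \varepsilon^{-3},
\qquad\qquad
n^{-1/d} \lesssim \varepsilon \;\Longrightarrow\; n \gtrsim \varepsilon^{-d}.
$$

Under these assumptions, with $\varepsilon = 0.1$ and $d = 10$, the low-rank estimator would need on the order of $10^{3}$ samples, while the standard histogram estimator could need $10^{10}$ or more. The first requirement does not grow with $d$, which is the sense in which the curse of dimensionality is lifted.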

Two BIFOLD Papers Ranked as ESI Highly Cited and Hot Papers


Two machine learning papers by BIFOLD researchers received the “Essential Science Indicators” (ESI) “Highly Cited” and “Hot Papers” labels for their impact in the scientific community.

The paper “A Unifying Review of Deep and Shallow Anomaly Detection,” authored by Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, and Klaus-Robert Müller, was ranked as “Highly Cited”. This means it was among the top one percent of most cited papers in the subject area of “Engineering”.

Additionally, the paper “Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data,” authored by Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek, was marked as a “Hot Paper”. Hot Papers are papers published in the last two years that are receiving citations quickly after publication. These papers have been cited enough times in the most recent bimonthly period to place them in the top 0.1% when compared to papers in the same field and added to the database in the same period.

More information is available at Fraunhofer HHI.

BIFOLD Colloquium 2022/01/05


[postponed] “Storing and Analyzing Viral Sequences through Data-driven Genomic Computing”

Speaker: Prof. Stefano Ceri (Politecnico di Milano)

Venue: Virtual event

Time and date: This event has been postponed.

Registration: If you are interested in participating, please contact: coordination@bifold.berlin.

Abstract:

Prof. Stefano Ceri will give a simple and data-inspired illustration of what a viral sequence is, what mutations are, how mutated sequences become organized to form a “variant”, and what the effects of individual mutations and of variants are. He will illustrate the process of depositing viral sequences in public repositories (GenBank, COG-UK, GISAID). In the second part of the seminar, Stefano Ceri will discuss the systems that were developed within his group. Specifically, he will illustrate (i) ViruSurf, a search system enabling free metadata-driven search over the integrated and curated databases, now covering about 3 million SARS-CoV-2 sequences and continuously updated from the above repositories; (ii) VirusViz, a data visualization tool for comparatively analyzing query results; (iii) VirusLab, a tool for exploring user-provided viral sequences; (iv) EpiSurf, a tool for intersecting viral sequences with epitopes, which are used in vaccine design. He will also hint at ongoing projects for viral surveillance and for exploring a knowledge base of viral resources.

Speaker:

Stefano Ceri is a professor of Data Management at Politecnico di Milano. His research focuses on extending data management and applying data science in numerous domains, including social analytics, fake news detection, genomics for biology and for precision medicine, and, recently, studies of the SARS-CoV-2 viral genome. He is the recipient of two ERC Advanced Grants, “Search Computing” (2008-2013) and “data-driven Genomic Computing” (2016-2021). He is an ACM Fellow and received the ACM-SIGMOD “Edward T. Codd Innovation Award” (June 2013).

Learning about Population Health from Twitter Texts


Is it possible to learn about the health status of a population and potential side effects of medications by analyzing social media conversations? BIFOLD researchers tackled the challenge of making social media posts of medical laypersons concerning diseases and medications understandable for machines. At the BioCreative VII Challenge Evaluation Workshop 2021, they recently explored how a combination of background knowledge and a language transformer model can increase the precision of medical information extraction from Twitter texts.

Social media platforms like Twitter are hubs for many discussions, including health topics. The vast amount of health-related information published there can be processed to learn about population health.

People share information about many aspects of their lives online – family, lifestyle, work, but also information about their health, medical drug intake and corresponding adverse drug reactions. Automatic extraction of health-related information on social media could potentially lead to valuable insights about population health and help to identify risk factors and unwanted side effects of commonly used medications.

The challenge for the automatic processing of social media texts lies in the casual language used. While machine learning algorithms can be very powerful, their accuracy depends on high-quality training data. In a medical context, these are often large annotated data sets of well-written sentences that use medical experts’ terminology, often in Latin. “The medical layman’s terms used on social media have to be either linked to or transformed into the corresponding precise medical expressions before we can gain insights from it with machine learning methods”, explains BIFOLD researcher Dr. Philippe Thomas.

Researchers in the group of BIFOLD Fellow Prof. Dr. Sebastian Möller, head of the Speech and Language Technology Research Department at the German Research Center for Artificial Intelligence (DFKI), do both. Dr. Philippe Thomas, Dr. Roland Roller and Lisa Raithel focus on the development of multilingual information extraction techniques for the detection of adverse drug reactions in social media and disease-specific forums. They create new annotated datasets as well as transformer models that translate natural language expressions into the corresponding medical terms.

Prof. Dr. Sebastian Möller

“Our multilingual learning approach enables Natural Language Processing applications where large corpora or manually crafted language resources are missing. Medical information extraction is just one of many applications made possible by our research at BIFOLD.”

More recently, BIFOLD researchers Roland Roller, Ammer Ayach, and Lisa Raithel tackled the “Automatic extraction of medication names in tweets” track challenge of the 2021 BioCreative VII Challenge Evaluation Workshop in their paper “Boosting Transformers using Background Knowledge, or how to detect Drug Mentions in Social Media using Limited Data”. The goal was to extract drug and medication mentions in Twitter posts by pregnant women. The provided data sets were especially challenging because the short nature of tweets leads to very little context information and the data included very few actual mentions of medications. To handle these data limitations, the researchers boosted the performance of a pre-trained language transformer model by introducing background knowledge. They re-mapped medical annotations from the given data to unlabeled texts with string matching. While string matching is very precise, it is limited to already existing labels. The transformer model, on the other hand, can detect new medical mentions by taking context information into account. The combination of both approaches significantly enhanced the information extraction performance. “You could compare this to a student who relies on a textbook for learning, but also on the advice of a tutor who has to explain complex application cases – both complement each other to gain a better understanding,” says Roland Roller.
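
The string-matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (build_lexicon, string_match), the example tweets, and the whole-word, case-insensitive matching rule are all assumptions made for illustration. The sketch collects mention strings from annotated texts and re-maps them onto an unlabeled text.

```python
import re

def build_lexicon(annotated):
    """Collect lower-cased mention strings from (text, [(start, end), ...]) pairs."""
    lexicon = set()
    for text, spans in annotated:
        for start, end in spans:
            lexicon.add(text[start:end].lower())
    return lexicon

def string_match(text, lexicon):
    """Return sorted (start, end) spans where a lexicon term occurs as a whole word."""
    spans = []
    for term in lexicon:
        # Hypothetical matching rule: case-insensitive, whole words only.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end()))
    return sorted(spans)

# Made-up annotated example: characters 5..12 cover the mention "tylenol".
annotated = [("took tylenol for my headache", [(5, 12)])]
lexicon = build_lexicon(annotated)

# Made-up unlabeled text that mentions one known and one unknown medication.
unlabeled = "Is Tylenol safe? My doctor suggested zofran instead."
weak_spans = string_match(unlabeled, lexicon)
print(weak_spans)
```

In this sketch only "Tylenol" is found, because "zofran" never appeared in the annotated data. That is the limitation the text describes: string matching is precise but restricted to known labels, which is why a context-aware model is needed to detect new mentions.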

The publication in detail:

Roland Roller, Ammer Ayach, Lisa Raithel: Boosting Transformers using Background Knowledge, or how to detect Drug Mentions in Social Media using Limited Data. BioCreative VII Challenge Evaluation Workshop 2021: 189

Abstract

To process natural language and to extract information from text, transformers are currently the model of choice for many different tasks. However, if the number of training examples is very limited, fine-tuning might not achieve the expected results, as is the case for other machine learning methods. In the past, a wide range of techniques has been presented to overcome this challenge, such as data augmentation or using distantly labelled data. In this work, we present our contribution to the drug mention detection of the BioCreative VII Challenge (Track 3), which includes a large number of negative, but only a small proportion of positive documents. In the course of this, we explore different techniques to boost performance of a pre-trained transformer model. The combination of our transformer model and usage of background knowledge achieved the best results for our use case.

Contact:

Prof. Dr. Sebastian Möller

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) GmbH
DFKI Project Office
Speech and Language Technology
Alt-Moabit 91c
D-10559 Berlin

Email: sebastian.moeller@dfki.de

BIFOLD Researchers Honored with BBAW and acatech Memberships


BIFOLD Co-Director Prof. Dr. Volker Markl and BIFOLD Group Leader Prof. Dr. Frank Noé have been named full members of the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). They are two of seven outstanding researchers named to the academy on 26 November 2021. The BBAW is one of the most important scientific institutions in the Berlin-Brandenburg area and beyond. It elects its members from all over Germany and from abroad. Members are chosen on the basis of outstanding scientific achievements. The BBAW is a learned society with a three-hundred-year-old tradition of uniting outstanding scholars and scientists across national and disciplinary boundaries. 80 Nobel Prize winners have shaped its history.

Additionally, BIFOLD Co-Director Prof. Dr. Klaus-Robert Müller became a regular member of acatech, the National Academy of Science and Engineering. acatech advises policymakers and society on issues relating to the future of the technological sciences and technology policy.

A recording of the announcement of new BBAW members at Einsteintag 2021 is available at: https://www.youtube.com/watch?v=rr-lUFsLj7I

Machine Learning Consultation


Machine learning (ML) and artificial intelligence (AI) have permeated the sciences and large parts of working life. Today, many people use machine learning techniques without being experts. Consequently, many questions and problems arise while using these techniques. The Berlin Institute for the Foundations of Learning and Data (BIFOLD) accommodates distinguished machine learning experts from different areas and offers a weekly consultation on machine learning for students, but also for companies and institutions.

BIFOLD offers a weekly ML consultation hour: Every Wednesday from 11:00 am – 12:00 pm.

While machine learning was only used by specialists a few years ago, such methods have now found application in various sciences, but also in companies: Doctors are supported in their decision making by ML models that analyze the content of tissue sections or laboratory data. Historians use ML to search for patterns and ways in which knowledge has spread around the world. In the engineering sciences, ML techniques are used, among other things, in process technology or control engineering; in chemistry, these methods support the modeling of chemical reactions. Social scientists, on the other hand, analyze the effects of applied machine learning methods on society.

In addition to students and research assistants who use such techniques in their theses or scientific research, problems and questions around ML come up in small and medium-sized enterprises (SMEs) as well as other institutions.

Weekly consultation hours

How can a concrete application need be translated into a well-posed data collection and machine learning workflow? What ML algorithm is most suitable for a given dataset? Why does an algorithm work well on current data but poorly on new data? How can you visualize or understand what a machine learning model has learned? For all questions around algorithms, deep learning, semantic speech recognition, image analysis or explainable artificial intelligence, BIFOLD offers a weekly ML consultation hour: Every Wednesday from 11:00 am – 12:00 pm, ML experts are available to support students with their specific problems in the field of ML.

Wednesdays 11:00 am – 12:00 pm
Marchstr. 23, 10587 Berlin

Room MAR 4057

Companies or other institutions with questions concerning the application of machine learning methods can also get access to the scientific expertise for a fee. Please register for an appointment.

Email: coordination@bifold.berlin

BIFOLD Colloquium “Scalable and Fast Cloud Data Management”


“Scalable and Fast Cloud Data Management”

Speakers: Norbert Ritter (University of Hamburg), Felix Gessert (Baqend), and Wolfram Wingerath (Baqend)

Venue: Virtual event

Time and date: December 06, 2021: 4 pm – 6 pm

Registration: If you are interested in participating, please contact: coordination@bifold.berlin


Abstract:
Database research at the University of Hamburg is centered around scalable technologies for cloud data management and connects the dots between traditional database systems, web caching, and continuous data analytics. In this presentation, we provide a rundown of our research topics throughout the years and explain how we turned them into practice at the Software-as-a-Service company Baqend.
We first present an overview of the system space that we are concerned with and the high-level goals we pursue in our work. We then go into detail on how the Orestes architecture combines web caching with traditional data management techniques to accelerate primary key access in globally distributed setups. Next, we cover the InvaliDB architecture that employs continuous stream processing to extend the Orestes approach to complex database queries. Finally, we explain how the cloud service Speed Kit turns our research into practice by accelerating more than 100 million users per month. We close with ongoing and future work, including the Beaconnect project that revolves around continuous analytics over real-user tracking data with Apache Flink.

Speakers:

Norbert Ritter is a full professor of computer science at the University of Hamburg, where he heads the databases and information systems group (DBIS). He received his PhD from the University of Kaiserslautern in 1997. His research interests include distributed and federated database systems, transaction processing, caching, cloud data management, information integration, and autonomous database systems. He has been teaching NoSQL topics in various database courses for several years. Seeing the many open challenges for NoSQL systems, he, Wolfram Wingerath and Felix Gessert have been organizing the annual Scalable Cloud Data Management Workshop to promote research in this area.

Felix Gessert is CEO and co-founder of the Software-as-a-Service company Baqend. During his PhD studies at the University of Hamburg, he developed the core technology behind Baqend’s web performance service. Felix is passionate about making the web faster by turning research results into real-world applications. He frequently talks at conferences about exciting technology trends in data management and web performance. As a Junior Fellow of the German Informatics Society (GI), he is working on new ideas to facilitate the research transfer of academic computer science innovation into practice.

Wolfram Wingerath is the leading data engineer at Baqend where he is responsible for data analytics and all things related to real-time query processing. Starting in 2022, he will take over the Data Science professorship at the University of Oldenburg and will therefore transition into the Head of Research position at Baqend. During his PhD studies at the University of Hamburg, he conceived the scalable design behind Baqend’s real-time query engine and thereby also developed a strong background in real-time databases and related technology such as scalable stream processing, NoSQL database systems, cloud computing, and Big Data analytics.