During the Summer School on Machine Learning for Quantum Physics and Chemistry, held in September 2021 in Warsaw, BIFOLD PhD candidate Kim A. Nicoli received the Best Poster Award. His poster was democratically selected by the participants and the scientific committee as the best among the contributions of more than 80 participants. The corresponding paper “Estimation of Thermodynamic Observables in Lattice Field Theories with Deep Generative Models” is a joint international effort of several BIFOLD researchers: Kim Nicoli, Christopher Anders, Pan Kessel, Shinichi Nakajima, as well as a group of researchers affiliated with DESY (Zeuthen) and other institutions. The work is published in Physical Review Letters.
“Modeling and understanding the interactions of quarks, fundamental and indivisible subatomic particles that represent the smallest known units of matter, is the main goal of ongoing research in the field of High Energy Physics. Deepening our understanding of such phenomena by leveraging modern machine learning techniques would have important implications for many related fields of applied science and research, such as quantum computing devices, drug discovery and many more.”
Preventing Image-Scaling Attacks on Machine Learning
BIFOLD Fellow Prof. Dr. Konrad Rieck, head of the Institute of System Security at TU Braunschweig, and his colleagues provide the first comprehensive analysis of image-scaling attacks on machine learning, including a root-cause analysis and effective defenses. Konrad Rieck and his team showed that attacks on scaling algorithms like those used in pre-processing for machine learning (ML) can manipulate images unnoticeably, change their content after downscaling and create unexpected and arbitrary image outputs. “These attacks are a considerable threat, because scaling as a pre-processing step is omnipresent in computer vision,” says Konrad Rieck. The work was presented at the USENIX Security Symposium 2020.
Machine learning is a rapidly advancing field. Complex ML methods not only enable increasingly powerful tools, they also open entry points for new forms of attacks. Research into security for ML usually focuses on the learning algorithms themselves, although the first step of an ML process is the pre-processing of data. In addition to various cleaning and organizing operations on datasets, images are scaled down during pre-processing to speed up the actual learning process that follows. Konrad Rieck and his team showed that frequently used scaling algorithms are vulnerable to attacks. It is possible to manipulate input images in such a way that they are indistinguishable from the original to the human eye, but look completely different after downscaling.
The vulnerability is rooted in the scaling process: Most scaling algorithms only consider a few highly weighted pixels of an image and ignore the rest. Therefore, only these pixels need to be manipulated to achieve drastic changes in the downscaled image. Most pixels of the input picture remain untouched – making the changes invisible to the human eye. In general, scaling attacks are possible wherever downscaling takes place without low-pass filtering – even in video and audio media formats. These attacks are model-independent and thus do not depend on knowledge of the learning model, features or training data.
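The root cause described above can be illustrated with a toy example. The following is a minimal sketch, not the attack or the defense from the paper: it assumes a naive nearest-neighbour downscaler that keeps every eighth pixel, and all function names, image sizes and pixel values are made up for illustration. Overwriting only the sampled pixels fully controls the downscaled output, while a block-averaging downscaler (a simple form of low-pass filtering) barely reacts.

```python
def downscale_nearest(img, factor):
    """Keep every `factor`-th pixel and ignore all others (no low-pass filter)."""
    return [row[::factor] for row in img[::factor]]


def downscale_area(img, factor):
    """Average each factor x factor block, so every pixel contributes equally."""
    out = []
    for r in range(0, len(img), factor):
        out_row = []
        for c in range(0, len(img[0]), factor):
            block = [img[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def embed_target(source, target, factor):
    """Overwrite only the pixels that the nearest-neighbour scaler will sample."""
    attacked = [row[:] for row in source]
    for tr, target_row in enumerate(target):
        for tc, value in enumerate(target_row):
            attacked[tr * factor][tc * factor] = value
    return attacked


# Hypothetical toy data: a uniform grey 64x64 source and an 8x8 checkerboard target.
factor = 8
source = [[128] * 64 for _ in range(64)]
target = [[255 * ((r + c) % 2) for c in range(8)] for r in range(8)]

attacked = embed_target(source, target, factor)

# Share of source pixels that had to be modified.
changed = sum(a != s for ra, rs in zip(attacked, source) for a, s in zip(ra, rs))
changed_fraction = changed / (64 * 64)

# Largest change the attack causes in the block-averaged output.
max_area_shift = max(abs(v - 128)
                     for row in downscale_area(attacked, factor)
                     for v in row)

print(downscale_nearest(attacked, factor) == target)  # scaler output equals the target
print(changed_fraction, max_area_shift)
```

In this toy setting only 1 in 64 pixels is modified, yet the nearest-neighbour output is exactly the attacker's target. Real scaling libraries use more elaborate kernels, so an actual attack has to account for the specific pixel weights of the targeted implementation.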
“Image-scaling attacks can become a real threat in security related ML applications. Imagine manipulated images of traffic signs being introduced into the learning process of an autonomous driving system! In BIFOLD we develop methods for the effective detection and prevention of modern attacks like these.”
“Attackers don’t need to know the ML training model and can even succeed with image-scaling attacks in otherwise robust neural networks,” says Konrad Rieck. “Based on our analysis, we were able to identify a few algorithms that withstand image scaling attacks and introduce a method to reconstruct attacked images.”
Artificial intelligence (AI) has found its way into many work routines – be it the development of hiring procedures, the granting of loans, or even law enforcement. However, the machine learning (ML) systems behind these procedures repeatedly attract attention by distorting results or even discriminating against people on the basis of gender or race. “Accuracy is one essential factor of machine learning models, but fairness and robustness are at least as important,” says Felix Neutatz, a BIFOLD doctoral student in the group of Prof. Dr. Ziawasch Abedjan, BIFOLD researcher and former professor at TU Berlin who recently moved to Leibniz Universität Hannover. Together with Ricardo Salazar Diaz, they published “Automated Feature Engineering for Algorithmic Fairness”, a paper on the fairness of machine learning models, in the Proceedings of the VLDB Endowment.
Algorithms might reinforce biases against groups of people that have been historically discriminated against. Examples include gender bias in machine learning applications on online advertising or recruitment procedures.
The paper presented at VLDB 2021 specifically considers algorithmic fairness. “Previous machine learning models for hiring procedures usually discriminate systematically against women,” says Felix Neutatz. “Why? Because they learn on old datasets derived from times when fewer women were employed.” Currently, there are several ways to improve the fairness of such algorithmic decisions. One is to specify that attributes such as gender, race or age are not to be considered in the decision. However, it turns out that other attributes also allow conclusions to be drawn about these sensitive characteristics.
State-of-the-art bias reduction algorithms simply drop sensitive features and create new artificial non-sensitive instances to counterbalance the loss in the dataset. In the case of recruiting procedures, this would mean simply adding lots of artificially generated data from hypothetical female employees to the training dataset. While this approach successfully removes bias, it might lead to fairness overfitting and is likely to reduce classification accuracy because of potential information loss.
“There are several important metrics that determine the quality of machine learning models,” says Felix Neutatz. “These include, for example, privacy, robustness to external attacks, interpretability, and also fairness. The goal of our research is to automatically influence and balance these metrics.”
The researchers developed a new approach that addresses the problem with a feature-wise strategy. “To achieve both high accuracy and fairness, we propose to extract as much unbiased information as possible from all features using feature construction (FC) methods that apply non-linear transformations. We use FC first to generate more possible candidate features and then drop sensitive features and optimize for fairness and accuracy,” explains Felix Neutatz. “If we stick to the example of the hiring process, each employee has different attributes depending on the dataset, such as gender, age, experience, education level, hobbies, etc. We generate many new attributes from these real attributes by a large number of transformations. For example, such a new attribute is generated by dividing age by gender or multiplying experience by education level. We show that we can extract unbiased information from biased features by applying human-understandable transformations.”
Finding a unique feature set that optimizes the trade-off between fairness and accuracy is challenging. In their paper, the researchers not only demonstrate a way to extract unbiased information from biased features. They also propose an approach in which the ML system and the user collaborate to balance the trade-off between accuracy and fairness, and they validate this approach in a series of experiments on known datasets.
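The feature construction step described in the quote can be sketched roughly as follows. This is a simplified illustration and not the authors' system: it assumes numerically encoded attributes, uses only pairwise products and ratios as transformations, and the function name, the example record and the zero-denominator convention are hypothetical.

```python
from itertools import combinations


def construct_candidates(record, sensitive):
    """Build candidate features from one record of numeric attributes.

    Pairwise products and ratios stand in for the large number of
    transformations mentioned in the text; the raw sensitive attributes
    are dropped afterwards.
    """
    names = sorted(record)
    candidates = dict(record)
    for a, b in combinations(names, 2):
        candidates[f"{a}*{b}"] = record[a] * record[b]
        # Assumed convention: a ratio with a zero denominator is set to 0.0.
        candidates[f"{a}/{b}"] = record[a] / record[b] if record[b] != 0 else 0.0
    # Drop only the raw sensitive columns. Constructed features stay in the
    # pool and would be judged by a later fairness/accuracy selection step.
    for name in sensitive:
        candidates.pop(name, None)
    return candidates


# Hypothetical applicant with numerically encoded attributes.
applicant = {"age": 30, "experience": 5, "education_level": 3, "gender": 1}
features = construct_candidates(applicant, sensitive={"gender"})

for name, value in features.items():
    print(name, value)
```

The sketch only generates the candidate pool. Which of these candidates end up in the final feature set is decided by the subsequent optimization for fairness and accuracy, which is not shown here.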
One of the fundamental problems of machine ethics is to avoid the perpetuation and amplification of discrimination through machine learning applications. In particular, it is desired to exclude the influence of attributes with sensitive information, such as gender or race, and other causally related attributes on the machine learning task. The state-of-the-art bias reduction algorithm Capuchin breaks the causality chain of such attributes by adding and removing tuples. However, this horizontal approach can be considered invasive because it changes the data distribution. A vertical approach would be to prune sensitive features entirely. While this would ensure fairness without tampering with the data, it could also hurt the machine learning accuracy. Therefore, we propose a novel multi-objective feature selection strategy that leverages feature construction to generate more features that lead to both high accuracy and fairness. On three well-known datasets, our system achieves higher accuracy than other fairness-aware approaches while maintaining similar or higher fairness.
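One way to picture the multi-objective trade-off is a Pareto filter over candidate feature sets. The sketch below is an illustration under stated assumptions, not the selection strategy from the paper: the candidate names and their accuracy and fairness scores are invented, and both scores are assumed to be "higher is better".

```python
def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    `candidates` is a list of (name, accuracy, fairness) tuples. A candidate
    is dominated if another one is at least as good in both scores and
    strictly better in at least one.
    """
    front = []
    for name, acc, fair in candidates:
        dominated = any(
            a >= acc and f >= fair and (a > acc or f > fair)
            for _, a, f in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical scores for a few feature sets (not results from the paper).
feature_sets = [
    ("all_raw_features", 0.90, 0.60),
    ("sensitive_dropped", 0.80, 0.85),
    ("constructed_a", 0.88, 0.90),
    ("constructed_b", 0.85, 0.95),
    ("constructed_c", 0.84, 0.88),
]

front = pareto_front(feature_sets)
print(front)
```

Every feature set on the front represents a different compromise between the two objectives. Choosing among them is the kind of decision that a user could make together with the system.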
Using Machine Learning in the Fight against COVID-19
BIFOLD Fellow Prof. Dr. Frank Noé, who leads the research group AI for the Sciences, together with an international team, identified a potential drug candidate for the therapy of COVID-19. Among other methods, they used deep learning models and molecular dynamics simulations to identify the drug Otamixaban as a potential inhibitor of the human target enzyme that SARS-CoV-2 requires in order to enter lung cells. According to their findings, Otamixaban works in synergy with other drugs such as Camostat and Nafamostat and may present an effective early treatment option for COVID-19. Their work has now been published in Chemical Science.
While the availability of COVID-19 vaccines created some relief during the ongoing pandemic, there is still no effective therapy against the virus. One therapeutic approach pursues the strategy to prevent the virus from entering human cells.
In their publication, Frank Noé, who heads an interdisciplinary research unit at Freie Universität Berlin, and his colleagues at FU Berlin, German Primate Center, National Center for Advancing Translational Sciences (MD, USA), Fraunhofer Institute for Toxicology and Experimental Medicine, and Universität Göttingen showed that the late-stage drug candidate Otamixaban works as an effective inhibitor of SARS-CoV-2 lung cell entry by suppressing the activity of an enzyme called “transmembrane serine protease 2” (TMPRSS2). The SARS-CoV-2 virus uses its so-called spike protein (S-protein) to connect to an enzyme (ACE2) on the surface of a human lung cell. Subsequently, the S-protein is cleaved by the enzyme TMPRSS2, thereby enabling the virus to enter the cell. Inhibiting TMPRSS2 with Otamixaban alone prevents cell entry only weakly, but this inhibitory effect is found to be profoundly amplified when Otamixaban is combined with other known TMPRSS2-inhibiting drugs such as Nafamostat and Camostat.
Frank Noé and his team analyzed the inhibitory effects of Otamixaban in silico, i.e. by machine learning and computer simulation. They combined deep learning methods and molecular dynamics simulation in order to screen a database of drug-like molecules for potential inhibitors of TMPRSS2. Otamixaban was one of the proposed candidates that was confirmed to be active in the experimental assay. Subsequently, the Noé group conducted extensive molecular dynamics simulations of the TMPRSS2-Otamixaban complex and applied big data analytics in order to understand the inhibition mechanism in detail, while in parallel the inhibitory effect of Otamixaban was confirmed in cells and lung tissue.
“The new machine learning methods that we develop at BIFOLD do not only help to solve fundamental problems in molecular and quantum physics, they are also increasingly important in application-oriented biochemical research. I believe it is very likely that if we hopefully end up with effective therapy options against COVID-19, machine learning will have played a key role in identifying them.”
Otamixaban, originally developed for other medical conditions, is particularly interesting because it had already entered the third phase of clinical trials for a different indication, potentially easing the path towards clinical trials of the new formulation presented here. The researchers filed an EU patent application for the active agent combination.
The publications in detail:
Synergistic inhibition of SARS-CoV-2 cell entry by otamixaban and covalent protease inhibitors: pre-clinical assessment of pharmacological and molecular properties
Authors: Tim Hempel, Katarina Elez, Nadine Krüger, Lluís Raich, Jonathan H. Shrimp, Olga Danov, Danny Jonigk, Armin Braun, Min Shen, Matthew D. Hall, Stefan Pöhlmann, Markus Hoffmann, Frank Noé
Abstract: SARS-CoV-2, the cause of the COVID-19 pandemic, exploits host cell proteins for viral entry into human lung cells. One of them, the protease TMPRSS2, is required to activate the viral spike protein (S). Even though two inhibitors, camostat and nafamostat, are known to inhibit TMPRSS2 and block cell entry of SARS-CoV-2, finding further potent therapeutic options is still an important task. In this study, we report that a late-stage drug candidate, otamixaban, inhibits SARS-CoV-2 cell entry. We show that otamixaban suppresses TMPRSS2 activity and SARS-CoV-2 infection of a human lung cell line, although with lower potency than camostat or nafamostat. In contrast, otamixaban inhibits SARS-CoV-2 infection of precision cut lung slices with the same potency as camostat. Furthermore, we report that otamixaban’s potency can be significantly enhanced by (sub-) nanomolar nafamostat or camostat supplementation. Dominant molecular TMPRSS2-otamixaban interactions are assessed by extensive 109 μs of atomistic molecular dynamics simulations. Our findings suggest that combinations of otamixaban with supplemental camostat or nafamostat are a promising option for the treatment of COVID-19.
Molecular mechanism of inhibiting the SARS-CoV-2 cell entry facilitator TMPRSS2 with camostat and nafamostat
Authors: Tim Hempel, Lluís Raich, Simon Olsson, Nurit P. Azouz, Andrea M. Klingler, Markus Hoffmann, Stefan Pöhlmann, Marc E. Rothenberg, Frank Noé
Abstract: The entry of the coronavirus SARS-CoV-2 into human lung cells can be inhibited by the approved drugs camostat and nafamostat. Here we elucidate the molecular mechanism of these drugs by combining experiments and simulations. In vitro assays confirm that both drugs inhibit the human protein TMPRSS2, a SARS-CoV-2 spike protein activator. As no experimental structure is available, we provide a model of the TMPRSS2 equilibrium structure and its fluctuations by relaxing an initial homology structure with extensive 330 microseconds of all-atom molecular dynamics (MD) and Markov modeling. Through Markov modeling, we describe the binding process of both drugs and a metabolic product of camostat (GBPA) to TMPRSS2, reaching a Michaelis complex (MC) state, which precedes the formation of a long-lived covalent inhibitory state. We find that nafamostat has a higher MC population than camostat and GBPA, suggesting that nafamostat is more readily available to form the stable covalent enzyme–substrate intermediate, effectively explaining its high potency. This model is backed by our in vitro experiments and consistent with previous virus cell entry assays. Our TMPRSS2–drug structures are made public to guide the design of more potent and specific inhibitors.
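To give a rough idea of what Markov modeling means in this context, the sketch below estimates a transition matrix from a discretized trajectory and derives state populations from it. It is a toy illustration and not the authors' workflow: the two states, the trajectory and the function names are hypothetical, the estimator is plain transition counting, and production analyses of MD data use more careful estimators and validation.

```python
def transition_matrix(traj, n_states, lag=1):
    """Row-normalized transition counts between discrete states at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(traj[:-lag], traj[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("every state needs at least one outgoing transition")
        matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(T, iterations=1000):
    """Equilibrium state populations via power iteration (pi = pi T)."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical two-state trajectory: 0 = drug dissociated, 1 = Michaelis complex.
traj = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]

T = transition_matrix(traj, n_states=2)
populations = stationary_distribution(T)
print(T)
print(populations)
```

In such a model, the entry of the stationary distribution that belongs to the Michaelis complex state plays the role of the "MC population" that the abstract compares across drugs.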
Modern algorithms open up new possibilities for historians
In the past, scholars used to pore over dusty tomes. Today Dr. Matteo Valleriani, group leader at the Max Planck Institute for the History of Science as well as honorary professor at TU Berlin and fellow at the Berlin Institute for the Foundations of Learning and Data (BIFOLD), uses algorithms to group and analyze digitized data from historical works. The term used to describe this process is computational history. One of the goals of Valleriani’s research is to unlock the mechanisms involved in the homogenization of cosmological knowledge in the context of studies in the history of science.
The project is co-financed by BIFOLD and researches the evolutionary path of the European scientific system as well as the establishment of a common scientific identity in Europe between the 13th and 17th centuries. Dr. Valleriani is working with fellow researchers from the Max Planck Institute for the Physics of Complex Systems to develop and implement empirical, multilayer networks to enable the analysis of huge quantities of data.
In Paris in the first half of the 13th century, Johannes de Sacrobosco compiled an elementary text on geocentric cosmology entitled Tractatus de sphaera. This manuscript is a simple, late medieval description of the geocentric cosmos based on a synthesis of Aristotelian and Ptolemaic worldviews.
“This compilation of the knowledge of its time is the result of an emerging intellectual interest in Europe. In the 13th century, a need arose for a knowledge of astronomy and cosmology on a qualitative and descriptive basis – parallel to and driven by the emergence of a network of new universities,” explains Valleriani. Over the following decades, the Tractatus de sphaera was commented on, extended, and revised many times, but continued to be a mandatory text at all European universities until the 17th century. Digitized copies of 359 printed textbooks featuring modified forms of the Tractatus de sphaera from the period 1472 until 1650 are now available to researchers. During this period of about 180 years, some 30 new universities were founded in Europe.
The universal language of scholars at that time was Latin, which contributed significantly to the high mobility of knowledge even in this period. “An introductory course in astronomy was mandatory for students in Europe at that time,” explains Valleriani. “As a committed European, I am mainly interested in how this led to the emergence of a shared scientific knowledge in Europe.”
Taken together, these 359 books contain some 74,000 pages – a quantity of text and images that it is not possible for any individual person to examine and analyze. Working with machine learning experts from BIFOLD, the research team first had to clean, sort, and standardize this colossal data corpus drawn from a wide range of digital sources to make it accessible for algorithms. The first step was to sort the data into texts, images, and tables. The texts were then broken down into recurring textual parts and organized according to a specific semantic taxonomy reflecting early modern modes of production of scientific knowledge.
Each of the more than 20,000 scientific illustrations had to be linked to the extensive metadata of the editions and their textual parts. In addition, more than 11,000 tables were identified in the Sphaera corpus. “To analyze the tables, we developed an algorithm to divide them into several groups with similar characteristics. This allows us to now use further analyses to compare these groups with each other,” explains Valleriani. This process may sound simple, but in fact involves countless technical difficulties: “Developing suitable algorithms is made more difficult by four error sources. The books from this period contain many printer errors. This and the fact that the conditions of the books vary greatly make them at times hard to digitize. Then there is the problem of the differing quality of the electronic copies. We also have to remember that at that time every printer used their own typeface, meaning that our algorithms have to be effectively trained for each printer to be able to even recognize the data.” In order to track the transformation process of the original text in the 359 books dating from this 180-year period and formalize this process of knowledge, the researchers need to understand precisely how knowledge changed, ultimately becoming more and more homogeneous.
“To achieve an understanding based upon data requires an intelligent synthesis of machine learning and the working practices of historians. The algorithms which we will now publish are the first capable of analyzing such data. We are also looking forward to developing further algorithms as part of our continuing cooperation with BIFOLD,” Valleriani explains.
Abstract: Many learning algorithms such as kernel machines, nearest neighbors, clustering, or anomaly detection, are based on distances or similarities. Before similarities are used for training an actual machine learning model, we would like to verify that they are bound to meaningful patterns in the data. In this paper, we propose to make similarities interpretable by augmenting them with an explanation. We develop BiLRP, a scalable and theoretically founded method to systematically decompose the output of an already trained deep similarity model on pairs of input features. Our method can be expressed as a composition of LRP explanations, which were shown in previous works to scale to highly nonlinear models. Through an extensive set of experiments, we demonstrate that BiLRP robustly explains complex similarity models, e.g. built on VGG-16 deep neural network features. Additionally, we apply our method to an open problem in digital humanities: detailed assessment of similarity between historical documents such as astronomical tables. Here again, BiLRP provides insight and brings verifiability into a highly engineered and problem-specific similarity model.
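The idea of decomposing a similarity score onto pairs of input features can be illustrated for the simplest possible case. The sketch below is not BiLRP itself, which targets deep networks: it assumes a purely linear feature map with a dot-product similarity, for which the decomposition is an exact algebraic identity. The weights and inputs are made-up numbers.

```python
def feature_map(W, x):
    """Linear feature map phi(x) = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot_similarity(W, x, x2):
    """Similarity y(x, x') = <W x, W x'>."""
    return sum(a * b for a, b in zip(feature_map(W, x), feature_map(W, x2)))


def pairwise_relevance(W, x, x2):
    """R[i][j] = x_i * x'_j * sum_m W[m][i] * W[m][j].

    Entry (i, j) is the share of the similarity score attributed to the
    interaction of feature i of the first input with feature j of the second.
    """
    d = len(x)
    return [
        [x[i] * x2[j] * sum(row[i] * row[j] for row in W) for j in range(d)]
        for i in range(d)
    ]


# Hypothetical weights and inputs.
W = [[1, 0, 2],
     [0, 1, -1]]
x = [1, 2, 3]
x2 = [2, 0, 1]

score = dot_similarity(W, x, x2)
R = pairwise_relevance(W, x, x2)
total_relevance = sum(sum(row) for row in R)
print(score, total_relevance)
```

The relevance entries sum to the similarity score, so the explanation accounts for the whole output. For nonlinear deep models this bookkeeping is no longer trivial, which is the problem the paper addresses.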
Abstract: We investigated the evolution and transformation of scientific knowledge in the early modern period, analyzing more than 350 different editions of textbooks used for teaching astronomy in European universities from the late fifteenth century to mid-seventeenth century. These historical sources constitute the Sphaera Corpus. By examining different semantic relations among individual parts of each edition on record, we built a multiplex network consisting of six layers, as well as the aggregated network built from the superposition of all the layers. The network analysis reveals the emergence of five different communities. The contribution of each layer in shaping the communities and the properties of each community are studied. The most influential books in the corpus are found by calculating the average age of all the out-going and in-coming links for each book. A small group of editions is identified as a transmitter of knowledge as they bridge past knowledge to the future through a long temporal interval. Our analysis, moreover, identifies the most impactful editions. These books introduce new knowledge that is then adopted by almost all the books published afterwards until the end of the whole period of study. The historical research on the content of the identified books, as an empirical test, finally corroborates the results of all our analyses.
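Two of the network operations mentioned in the abstract, superposing layers into an aggregated network and averaging the age of a book's links, can be sketched as follows. This is a toy illustration under assumptions: the editions, years and layer names are invented, an edge points from an older to a newer edition, and the "age" of a link is taken to be the difference in publication years, which may differ from the definition used in the paper.

```python
from collections import defaultdict


def aggregate_layers(layers):
    """Superpose layers: an edge's weight is the number of layers containing it."""
    weights = defaultdict(int)
    for edges in layers.values():
        for edge in set(edges):
            weights[edge] += 1
    return dict(weights)


def average_link_age(edges, year):
    """Mean year difference over out-going and over in-coming links per node."""
    out_ages, in_ages = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        age = abs(year[dst] - year[src])
        out_ages[src].append(age)
        in_ages[dst].append(age)
    avg_out = {n: sum(a) / len(a) for n, a in out_ages.items()}
    avg_in = {n: sum(a) / len(a) for n, a in in_ages.items()}
    return avg_out, avg_in


# Hypothetical editions and two hypothetical semantic layers.
year = {"A": 1478, "B": 1500, "C": 1550}
layers = {
    "text_reuse": [("A", "B"), ("B", "C")],
    "image_reuse": [("A", "B"), ("A", "C")],
}

aggregated = aggregate_layers(layers)
avg_out, avg_in = average_link_age(aggregated, year)
print(aggregated)
print(avg_out, avg_in)
```

In this reading, a book whose out-going links have a high average age is still being drawn upon long after its publication.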
BIFOLD Fellow Dr. Wojciech Samek heads newly established AI research department at Fraunhofer HHI
The Fraunhofer Heinrich Hertz Institute (HHI) has established a new research department dedicated to “Artificial Intelligence”. The AI expert and BIFOLD Fellow Dr. Wojciech Samek, previously leading the research group “Machine Learning” at Fraunhofer HHI, will head the new department. With this move Fraunhofer HHI aims at expanding the transfer of its AI research on topics such as Explainable AI and neural network compression to the industry.
Dr. Wojciech Samek: “The mission of our newly founded department is to make today’s AI truly trustable and in all aspects practicable. To achieve this, we will very closely collaborate with BIFOLD in order to overcome the limitations of current deep learning models regarding explainability, reliability and efficiency.”
“Congratulations, I look forward to a continued successful teamwork with BIFOLD fellow Wojciech Samek, who is a true AI hot shot.”
BIFOLD Director Prof. Dr. Klaus-Robert Müller
The new department further strengthens the already existing close connection between basic AI research at BIFOLD and applied research at Fraunhofer HHI and is a valuable addition to the dynamic AI ecosystem in Berlin.
“The large Berlin innovation network centered around BIFOLD is unique in Germany. This ensures that the latest research results will find their way into business, science and society.”