A future-proof IT infrastructure is increasingly becoming a decisive competitive factor – this applies not only to companies, but especially to research. In recent months, BIFOLD has been able to invest around 1.8 million euros in new research hardware, thereby significantly increasing the institute’s computing capacity. This cutting-edge IT infrastructure was financed by the German Federal Ministry of Education and Research (BMBF). “If we want to continue to conduct world-class research, investments in infrastructure are an essential prerequisite,” says BIFOLD Co-Director Prof. Dr. Klaus-Robert Müller.
Current experiments in Machine Learning and Big Data management systems require hardware with substantial computing, storage, and data transfer capabilities. The new systems include a specialized compute node and a full compute cluster, both designed for the simultaneous processing of a very large number of parallel workloads (massively parallel processing) with large main memory capacities, as well as one cluster particularly suited to the fast processing of sequential workloads. The central processing units (CPUs) in the latter cluster also support the so-called Intel Software Guard Extensions (SGX) technology, enabling developers to execute code and process data in a protected environment. The servers run high-performance file systems and will allow for the transfer of very large volumes of data with low latency. “We expect that this cutting-edge hardware will not only enrich our own research, but also enable us to establish new collaborations with our partners,” adds BIFOLD Co-Director Prof. Dr. Volker Markl.
In the group of Volker Markl, two projects in particular benefit from the new possibilities: AGORA is a novel form of data management system. It aims to construct an innovative unified ecosystem that brings together data, algorithms, models, and computational resources and provides them to a broad audience. The goal is the easy creation and composition of data science pipelines as well as their scalable execution. In contrast to existing data management systems, AGORA operates in a heavily decentralized and dynamic environment.
The NebulaStream platform is a general-purpose, end-to-end data management system for the IoT. It provides an out-of-the-box experience with rich data processing functionalities and a high ease of use. With the new IT infrastructure, both of these systems can be validated at a much larger scale and in a secure data processing environment.
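Neither project’s actual programming interface is shown here. Purely as an illustration of the kind of declarative pipeline creation and composition that AGORA and NebulaStream target, the following hypothetical Python sketch chains a source, a filter, and a transformation into one lazily evaluated pipeline; every name in it is invented for this example.

```python
# Hypothetical sketch of declarative pipeline composition; Pipeline and its
# methods are illustrative stand-ins, not the APIs of AGORA or NebulaStream.
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable

@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def source(self, items: Iterable[Any]) -> "Pipeline":
        self.stages.append(lambda _: iter(items))
        return self

    def filter(self, pred: Callable[[Any], bool]) -> "Pipeline":
        self.stages.append(lambda xs: (x for x in xs if pred(x)))
        return self

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        self.stages.append(lambda xs: (fn(x) for x in xs))
        return self

    def run(self) -> list:
        data = None
        for stage in self.stages:   # each stage lazily wraps the previous one
            data = stage(data)
        return list(data)

# Example: keep IoT sensor readings above a threshold, then convert units.
readings = [{"id": 1, "temp_f": 101.3}, {"id": 2, "temp_f": 97.0}]
result = (Pipeline()
          .source(readings)
          .filter(lambda r: r["temp_f"] > 100.0)
          .map(lambda r: {**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)})
          .run())
print(result)   # [{'id': 1, 'temp_f': 101.3, 'temp_c': 38.5}]
```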
High memory and parallel processing capabilities are also essential for large-scale Machine Learning simulations, e.g., solving high-dimensional linear problems or training deep neural networks. Klaus-Robert Müller and his group will use the new hardware initially in three different projects: Specifically, it allows BIFOLD researchers to produce closed-form solutions of large dense linear systems, which are needed to describe correlations between large numbers of interacting particles in a molecule with high numerical precision. Researchers can also significantly extend the number of epigenetic profile regions that can be analyzed, thereby using significantly more of the information available in the data. It will also enable scientists to develop explainable AI techniques that incorporate internal explainability and feedback structures and are significantly more complex to train than typical deep neural networks.
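To make the first of these workloads concrete, the following minimal NumPy sketch computes the closed-form solution of a dense regularized linear system, in the style of kernel ridge regression with a Gaussian kernel. The data, kernel width, and problem size are illustrative; the actual research problems involve far larger matrices, which is precisely where large main memory pays off.

```python
# Minimal sketch: closed-form solution of a dense linear system, as in
# kernel ridge regression. All sizes and parameters here are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d = 2000, 10                       # n training points, d input dimensions
X = rng.normal(size=(n, d))           # descriptors (e.g., molecular geometries)
y = rng.normal(size=n)                # target values (e.g., energies)

# Dense Gaussian kernel matrix K: O(n^2) memory, which is what large RAM buys.
sigma = 1.0
K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))

# Closed form: solve (K + lambda * I) alpha = y directly, no iterative solver.
lam = 1e-6
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Sanity check: residual of the regularized system is near machine precision.
print(np.abs((K + lam * np.eye(n)) @ alpha - y).max())
```

At realistic problem sizes the kernel matrix alone can occupy hundreds of gigabytes, so a direct solve of this form is only feasible on nodes with correspondingly large main memory.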
The first cluster consists of 60 servers, each with two CPUs with 16 processor cores at 2.1 GHz, 512 GB of RAM, 12 TB of HDD storage, and approximately 2 TB of additional fast SSD storage. They are thus designed for the simultaneous processing of a very large number of parallel workloads (massively parallel processing). The second cluster consists of 15 servers of identical build. Each has one CPU with eight processor cores at 3.4 GHz, 128 GB of RAM, and a combination of 12 TB of HDD and 2 TB of SSD storage. These servers are particularly suitable for the fast processing of sequential workloads. Furthermore, their CPUs support the Intel Software Guard Extensions, enabling developers to execute code and process data in an environment secured by the CPU.
The server system is complemented by two high-performance nodes, each with three Nvidia A100 graphics processing units (GPUs), two 64-core CPUs, 2 TB of main memory, and over 22 TB of HDD storage, capable of handling data analytics and AI applications on very large data sets.
Both clusters are managed by two head nodes with additional hardware features that improve fault tolerance. All systems have 100 Gbit/s InfiniBand cards and are connected via two 100 Gbit/s InfiniBand switches, enabling very fast data exchange between cluster hosts. The servers will use the Hadoop Distributed File System (HDFS), which supports Big Data analytics and enables high-performance access to data.
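As a small usage sketch, data in such an HDFS deployment can be accessed programmatically, for example through pyarrow’s HDFS bindings. This assumes a reachable namenode and a locally installed Hadoop client (libhdfs); the host, port, and paths below are placeholders, not the cluster’s actual addresses.

```python
# Sketch of programmatic HDFS access via pyarrow; host, port, and paths
# are placeholders. Requires a Hadoop client / libhdfs on the machine.
import pyarrow.fs as pafs

hdfs = pafs.HadoopFileSystem(host="namenode.example.org", port=8020)

# Write a small file into HDFS ...
with hdfs.open_output_stream("/data/example/hello.txt") as f:
    f.write(b"hello cluster\n")

# ... then list the directory and read the file back.
print(hdfs.get_file_info(pafs.FileSelector("/data/example")))
with hdfs.open_input_stream("/data/example/hello.txt") as f:
    print(f.read())
```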
The existing cluster has been upgraded with 13 additional nodes. Twelve nodes have 768 GB of main memory each and four Nvidia A100 GPUs, connected via a 200 Gbit/s InfiniBand network. This setup allows the simulation and computation of very large models. One special node runs with 6 TB of RAM and 72 processor cores, which enables massively parallel computing while holding very large models in main memory.
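As a minimal, hedged illustration of how such a multi-GPU node can be used (this is not the group’s actual training code; the model and batch size are invented), a PyTorch forward pass can be fanned out across all GPUs of a single node:

```python
# Minimal single-node multi-GPU sketch in PyTorch; model and sizes are
# illustrative. Falls back to CPU when no GPU is available.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10))
if torch.cuda.device_count() > 1:
    # Splits each input batch across all visible GPUs (e.g., four A100s).
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(512, 4096, device=device)
y = model(x)              # forward pass runs in parallel on the GPUs
print(y.shape)            # torch.Size([512, 10])
```

For models that exceed a single GPU’s memory, distributed or model-parallel training (e.g., torch.nn.parallel.DistributedDataParallel combined with sharding) would be used instead; DataParallel is shown here only for brevity.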
The distributed high-performance computing file system BeeGFS was expanded by three additional file servers. A total of 437 TB of storage capacity is distributed across six data servers, each connected to the network at 40 Gbit/s. All nodes are connected with at least 25 Gbit/s. Overall, this setup is capable of handling operations on very large amounts of data.