ML meets systems biology: Leveraging domain knowledge from differential equation-based synthetic data
During this Lunch Talk, Katharina Baum from the Data Integration in the Life Sciences (DILiS) research group at FU Berlin will dive into her research on applying Machine Learning in systems biology.
Abstract: Machine learning (ML) has also revolutionized forecasting in the biological and health domains. However, training sophisticated ML models requires large datasets, which are especially expensive to collect in the biomolecular context. Informed ML has recently been proposed as a means to include prior knowledge and reduce the amount of data required. In particular, if information about system dynamics is available, mechanistic representations such as ordinary differential equation (ODE) models can generate synthetic data to supplement real-world data or to pre-train the ML model in a transfer learning approach. However, tools are lacking for systematically using the valuable systems biology domain knowledge in these models to prime ML models. Therefore, we developed the open-source Python tool SimbaML (simulation-based ML), which unifies the generation of realistic synthetic datasets from ODE-based simulations with their direct analysis and inclusion in ML pipelines [1]. It enables data augmentation and can improve temporal forecasting in scarce-data settings, for example at the beginning of an epidemic. It can also serve to assess the required dataset size before data collection. In addition, we developed a framework to systematically scrutinize and optimize the characteristics of ODE-based synthetic datasets with respect to how well they inform ML models for time series forecasting [2]. In our experiments with different systems and datasets, the optimal synthetic dataset characteristics depend heavily on the coherence between the real-world data and the dynamics of the synthetic data.
This makes our optimization approach crucial for appropriately informing ML. Our versatile frameworks enable leveraging insights from ODEs, a rich source of domain knowledge that may help overcome data scarcity and improve ML-based predictions in various fields of biology and medicine.
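To make the idea of ODE-derived synthetic data concrete, the following is a minimal sketch using only the Python standard library. It does not use the SimbaML API and is not the authors' implementation. The SIR epidemic model, the parameter ranges, the multiplicative Gaussian noise, and all function names are assumptions chosen for illustration. It simulates several epidemic curves with randomly drawn kinetic parameters and adds measurement noise, yielding time series that could be used to pre-train a forecasting model.

```python
import random


def sir_rhs(state, beta, gamma):
    """Right-hand side of the SIR model (illustrative choice of ODE system)."""
    s, i, r = state
    n = s + i + r
    new_infections = beta * s * i / n
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)


def rk4_step(state, dt, beta, gamma):
    """Advance the state by one classical Runge-Kutta (RK4) step of size dt."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = sir_rhs(state, beta, gamma)
    k2 = sir_rhs(shifted(state, k1, dt / 2), beta, gamma)
    k3 = sir_rhs(shifted(state, k2, dt / 2), beta, gamma)
    k4 = sir_rhs(shifted(state, k3, dt), beta, gamma)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_infected(beta, gamma, n_days, population=1000.0,
                      initial_infected=1.0, steps_per_day=10):
    """Return the daily number of infected individuals (noise-free)."""
    state = (population - initial_infected, initial_infected, 0.0)
    dt = 1.0 / steps_per_day
    series = [state[1]]
    for _ in range(n_days):
        for _ in range(steps_per_day):
            state = rk4_step(state, dt, beta, gamma)
        series.append(state[1])
    return series


def make_synthetic_dataset(n_series, n_days, noise_sd=0.05, seed=0):
    """Generate noisy time series from randomly sampled kinetic parameters.

    The parameter ranges and the noise model are assumptions for
    illustration, not values taken from the talk or the manuscripts.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_series):
        beta = rng.uniform(0.2, 0.5)     # assumed infection rate range
        gamma = rng.uniform(0.05, 0.15)  # assumed recovery rate range
        clean = simulate_infected(beta, gamma, n_days)
        # Multiplicative Gaussian noise, clipped at zero.
        noisy = [max(0.0, x * (1.0 + rng.gauss(0.0, noise_sd))) for x in clean]
        dataset.append(noisy)
    return dataset


synthetic = make_synthetic_dataset(n_series=5, n_days=30)
```

In a transfer learning setup of the kind described above, a forecasting model would first be trained on series such as `synthetic` and then fine-tuned on the scarce real-world measurements. How closely the sampled parameters and the noise match the real system is the kind of dataset characteristic the abstract identifies as decisive.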
Related manuscripts:
[1] Kleissl, M., et al., SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented Data. ICLR Tiny Paper, 2023: https://openreview.net/forum?id=1wtUadpmVzu
[2] Zabbarov, J. & Witzke, S., et al., Optimizing ODE-derived Synthetic Data for Transfer Learning in Dynamical Biological Systems. bioRxiv 2024.03.25.586390; doi: 10.1101/2024.03.25.586390
The BIFOLD Lunch Talk series gives BIFOLD members, fellows, external researchers, and guests the opportunity to present their research in Machine Learning and Big Data, discuss it, and network with each other. The Lunch Talks take place at TU Berlin. For further information on the Lunch Talks and registration, contact Dr. Laura Wollenweber via email.