Computational Molecular Biology 2025, Vol.15, No.3, 122-130
http://bioscipublisher.com/index.php/cmb

Feature Review Open Access

Building an Integrated Multi-Omics Database for Rare Diseases
Huixian Li, Jingqiang Wang
Institute of Life Science, Jiyang College of Zhejiang A&F University, Zhuji, 311800, China
Corresponding author: jingqiang.wang@jicat.org
Computational Molecular Biology, 2025, Vol.15, No.3 doi: 10.5376/cmb.2025.15.0012
Received: 11 Mar., 2025 Accepted: 22 Apr., 2025 Published: 14 May, 2025
Copyright © 2025 Li and Wang. This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preferred citation for this article: Li H.X., and Wang J.Q., 2025, Building an integrated multi-omics database for rare diseases, Computational Molecular Biology, 15(3): 122-130 (doi: 10.5376/cmb.2025.15.0012)

Abstract Rare diseases are numerous, and each affects only a small number of patients, yet together they affect hundreds of millions of people worldwide. Current research on rare diseases faces challenges such as scattered data, inconsistent standards, and difficulties in data sharing. This article reviews the characteristics of the major existing rare disease databases (such as Orphanet, RD-Connect, and MONDO), discusses the progress and limitations of multi-omics data integration methods, and introduces the new trend of data-driven rare disease research in the era of precision medicine. It explores the application prospects of an integrated multi-omics database in discovering disease markers and therapeutic targets, supporting clinical decision-making and patient stratification, and integrating artificial intelligence prediction models and drug repurposing. The contributions and main findings of the study are summarized.
The potential impact of such an integrated database on rare disease research and clinical translation is emphasized, and ideas for its future expansion and sustainable development are proposed.
Keywords Rare diseases; Multi-omics; Data integration; Database architecture; Duchenne muscular dystrophy

1 Introduction
Rare diseases are those that affect very few people. The European Union defines a rare disease as one with a prevalence of less than 1 in 2,000, while the United States defines it as one affecting fewer than 200,000 people. More than 7,000 rare diseases are known. Although each has few patients, together they affect an estimated 263 to 446 million people, approximately 3.5% to 5.9% of the global population. Most rare diseases are genetic disorders: about 70% to 80% have genetic causes, and they often begin in childhood. Because the diseases are so varied and their symptoms so complex and diverse, patients often face difficult and delayed diagnosis (Casas-Alba et al., 2022). On average, it takes many years from the onset of symptoms to a confirmed diagnosis, with visits to multiple departments along the way. This process is known as the "diagnostic odyssey".

Patients with rare diseases are few and geographically scattered, so a single center often struggles to collect sufficient samples, which results in severe data fragmentation and isolation. The problem of "information silos" is prominent: the patient registries, biobanks, and databases established by different research institutions are independent of one another and lack unified standards, which makes data difficult to share. As Marsh et al. pointed out, "Data silos are hindering drug development and harming patients with rare diseases".
This fragmentation limits researchers' understanding of the full picture of each disease and hinders large-sample, multi-center studies. Meanwhile, rare disease data are diverse and highly heterogeneous, spanning omics layers such as genomic variants, transcript expression, protein abundance, metabolite profiles, and epigenetic modifications, and this heterogeneity creates technical obstacles to integrated analysis (Hesterlee et al., 2021). These data are produced by different measurement methods, come in inconsistent formats and scales, and require complex normalization and harmonization. Even within the same data category, the technical platforms and analysis pipelines adopted by different studies may vary. For instance, differences in sequencing depth, mass spectrometry instruments, and data preprocessing methods can lead to batch effects and noise. High-dimensional and high-noise