
Computational Molecular Biology 2025, Vol.15, No.3, 122-130 http://bioscipublisher.com/index.php/cmb

data pose challenges to statistical analysis and machine learning. Multi-omics data often suffer from missing values and from small sample sizes paired with tens of thousands of variables (Liu et al., 2022). If not integrated properly, combining omics layers may "increase complexity rather than improve performance". How to preserve biological information while reducing noise and filling in gaps is therefore a major difficulty in data integration.

2 Current Situation of Rare Diseases
2.1 Overview of the existing rare disease databases
Before building a new multi-omics database for rare diseases, it is essential to review existing databases to learn from their strengths and identify remaining gaps.

Orphanet is a comprehensive international portal for rare diseases, supported by the EU. It has developed the ORPHA numbering system and provides detailed clinical, genetic, and treatment information for over 6 000 rare diseases (Mitani and Haneuse, 2020). Its major strength lies in high-quality, expert-curated clinical data and standardized ontologies.

OMIM (Online Mendelian Inheritance in Man) focuses on Mendelian genetic disorders and serves as a gene-centric reference for researchers. It provides gene-disease associations, locus information, and mutation types, which are critical for identifying candidate genes for rare diseases. However, like Orphanet, OMIM does not host raw omics data and often needs to be integrated with external sequencing or expression databases.

Other emerging platforms such as RareDDB and eRAM offer integrated views by combining disease annotations with SNPs, genes, phenotypes, and even drug links, making them promising resources for precision medicine applications (Jia et al., 2018).
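The kind of cross-referencing described above, where disease-centric annotations are linked to gene-centric entries, can be illustrated with a minimal sketch. All records, identifiers (`DIS:0001`, `GENE1`), field names, and the helper functions `link_records` and `diseases_by_gene` are hypothetical placeholders introduced here for illustration; they do not reproduce the actual schema or content of Orphanet, OMIM, RareDDB, or eRAM.

```python
from collections import defaultdict

# Hypothetical, made-up records: every identifier and value below is a
# placeholder and does not correspond to any real database entry.
disease_records = [
    {"disease_id": "DIS:0001", "name": "Example disorder A", "genes": ["GENE1", "GENE2"]},
    {"disease_id": "DIS:0002", "name": "Example disorder B", "genes": ["GENE2"]},
    {"disease_id": "DIS:0003", "name": "Example disorder C", "genes": []},
]
gene_records = [
    {"gene": "GENE1", "locus": "placeholder locus 1", "variant_types": ["missense"]},
    {"gene": "GENE2", "locus": "placeholder locus 2", "variant_types": ["nonsense", "deletion"]},
]


def link_records(diseases, genes):
    """Attach gene-centric annotations to each disease-centric record."""
    gene_index = {g["gene"]: g for g in genes}
    linked = []
    for disease in diseases:
        # Keep only the gene symbols that the gene-centric source knows about.
        matches = [gene_index[s] for s in disease["genes"] if s in gene_index]
        linked.append({
            "disease_id": disease["disease_id"],
            "name": disease["name"],
            "gene_annotations": matches,
            # Flag records without any gene annotation so gaps stay visible.
            "has_gene_annotation": bool(matches),
        })
    return linked


def diseases_by_gene(linked):
    """Build the reverse index: gene symbol -> list of linked disease IDs."""
    index = defaultdict(list)
    for entry in linked:
        for gene in entry["gene_annotations"]:
            index[gene["gene"]].append(entry["disease_id"])
    return dict(index)


linked = link_records(disease_records, gene_records)
gene_to_diseases = diseases_by_gene(linked)
```

In this sketch, records without a matching gene entry are retained and flagged rather than dropped, since unannotated rare diseases are themselves one of the gaps such a review aims to identify.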
2.2 Progress and limitations of multi-omics integration methods
Multi-omics integration has emerged as a powerful approach in rare disease research but presents multiple challenges, especially due to small sample sizes and high dimensionality. Rare disease datasets often have very few samples and tens of thousands of variables, which makes traditional machine learning algorithms prone to overfitting and limits their generalizability. Batch effects arising from differences between platforms and runs (e.g., mass spectrometry batches, sequencing depth) can introduce further noise. Tools like ComBat and Harmony can help reduce this noise within a single omics layer, but a unified framework for multi-omics correction remains lacking (Olexiouk, 2023).

Interpretability is another key limitation. Although complex models like deep neural networks or multi-omics graphs can achieve high accuracy, their outputs are not easily mapped to biological pathways or mechanisms without additional analysis (Braconi et al., 2021; Zaghlool and Attallah, 2022). This hinders clinical translation and undermines trust among medical practitioners.

2.3 New trends in precision medicine and data-driven rare disease research
With the rise of precision medicine, large-scale genomic projects like the UK’s 100,000 Genomes Project have incorporated rare diseases into public health systems, significantly improving diagnosis rates (Figure 1) (Kerr et al., 2020). Countries are increasingly promoting the interconnection of clinical and research data platforms; one example is the Undiagnosed Diseases Network International (UDNI), which facilitates global case sharing. Therapeutically, the emergence of platform-based technologies like AAV gene therapy and antisense oligonucleotides supports rapid adaptation to different single-gene disorders. However, their development still relies heavily on integrative databases capable of connecting genotype to phenotype and to the underlying biological pathways (Pahelkar et al., 2024).
Ultimately, building a unified, multi-omics rare disease knowledge platform is not just a research goal; it is a critical enabler of faster diagnosis, improved therapy development, and clinical decision-making in a data-driven healthcare ecosystem.

3 Data Sources and Collection Strategies
3.1 Data types and schemas
The data scope of this integrated database covers the typical "five major" omics data types and related clinical phenotypic data. The following introduces each data type in turn and its significance in rare disease
