Computational Molecular Biology 2025, Vol.15, No.3, 122-130 http://bioscipublisher.com/index.php/cmb

construction to balance the promotion of scientific research with the protection of patients' rights and interests. For patient data obtained from our partner hospitals and institutions, informed consent forms were signed at the time of collection by the patients themselves or their legal representatives, explicitly agreeing to the use of the data for scientific research and to its anonymous sharing with the database. The informed consent form complies with local regulations and explains the purpose of the data, the potential risks, and the rights of the subjects, including the right to withdraw their data at any time (Gainotti et al., 2016). We use only patient data with explicit consent and respect patients' right to make their own decisions. This is particularly important in rare diseases: because the patient population is small, individuals are more easily identified in the event of a data breach. Full disclosure and informed consent ensure the ethical legitimacy of data use (Koromina et al., 2021).

All patient data entering the database undergo strict de-identification before storage. Direct identifiers such as name, ID number, and contact details are never stored. Information that could indirectly reveal identity (such as rare geographical locations or biometric photographs) is likewise excluded from the database. For genomic sequence data, we follow internationally accepted practice: individual data are encoded before storage, and random IDs replace real identities. When users retrieve data, we also take measures to prevent re-identification through cross-comparison. For example, we limit the amount of fine-grained data returned per query, so that malicious users cannot piece together identities through stepwise queries.
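The pseudonymization step described above can be sketched as follows. This is a minimal illustration, not the database's actual implementation; the field names (`patient_id`, `name`, `contact`) and the `pseudonymize` helper are hypothetical, and in practice the key map linking real IDs to pseudonyms would be held separately under strict access control, never inside the research database.

```python
import secrets

def pseudonymize(records, id_field="patient_id"):
    """Strip direct identifiers and replace the real ID with a random
    pseudonym. Returns (deidentified_records, key_map); the key map is
    stored separately under access control, not in the database.
    """
    # Hypothetical set of direct-identifier fields to drop entirely
    direct_identifiers = {"name", "id_number", "contact", id_field}
    key_map = {}          # real ID -> random pseudonym
    deidentified = []
    for rec in records:
        real_id = rec[id_field]
        if real_id not in key_map:
            # Random hex token: carries no information about the patient
            key_map[real_id] = secrets.token_hex(8)
        clean = {k: v for k, v in rec.items() if k not in direct_identifiers}
        clean["pseudo_id"] = key_map[real_id]
        deidentified.append(clean)
    return deidentified, key_map
```

Because the pseudonym is drawn at random rather than derived from the identifiers, re-identification from the stored record alone is not possible; the same patient still maps to the same pseudonym, so longitudinal records remain linkable.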
For extremely rare data (e.g., data from only a handful of patients worldwide), we may provide only aggregated information, or require access within a specially controlled environment, to minimize the risk of exposure (Hansson et al., 2016; Takashima et al., 2018).

4 Data Integration and Computing Methods
4.1 Multi-omics data normalization and coordination techniques
Before integrating and analyzing multi-level data such as genomic, transcriptomic, and proteomic data, the different omics layers must be standardized and harmonized so that they are comparable and compatible. Because multi-omics data often come from different batches, platforms, and even different laboratories, batch effects and systematic biases are inevitable; if left uncorrected, batch differences will be mistaken for biological differences during integrated analysis. We therefore performed batch correction on the omics data before integration. For transcript and protein quantification data, we normalized with the ComBat algorithm, which adjusts the mean and variance of different batches through a linear model to make the data distributions more consistent. In practice, ComBat significantly reduced differences in the distribution of expression values among studies, making comparisons between the patient and control groups more reliable. For metabolite data, we applied internal-standard-based normalization, proportionally adjusting each batch according to shared internal standards (Ali et al., 2025). For methylation array data, methods such as ComBat-seq were used for correction. In addition, we perform quality control after correction to ensure that correction does not unduly weaken genuine biological signal; for example, principal component analysis (PCA) is used to examine between-group clustering before and after correction.
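The core idea of this kind of batch correction can be sketched in a few lines of NumPy. This is a simplified location-scale adjustment in the spirit of ComBat, not ComBat itself: it omits the empirical-Bayes shrinkage of batch parameters and the covariate model, and `batch_adjust` is an illustrative name rather than a function from any library.

```python
import numpy as np

def batch_adjust(X, batches):
    """Simplified location-scale batch adjustment (sketch only).

    X       : (n_features, n_samples) expression matrix
    batches : length-n_samples array of batch labels
    Each batch is standardized per feature, then mapped onto the
    overall per-feature mean and standard deviation, so all batches
    share a common scale.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    grand_mean = X.mean(axis=1, keepdims=True)
    grand_sd = X.std(axis=1, keepdims=True)
    X_adj = np.empty_like(X)
    for b in np.unique(batches):
        idx = batches == b
        mu = X[:, idx].mean(axis=1, keepdims=True)
        sd = X[:, idx].std(axis=1, keepdims=True)
        sd[sd == 0] = 1.0  # guard against constant features within a batch
        # standardize within the batch, then rescale to the global distribution
        X_adj[:, idx] = (X[:, idx] - mu) / sd * grand_sd + grand_mean
    return X_adj
```

After this adjustment, every batch has the same per-feature mean and variance, which is what the PCA-based quality check above would confirm: batch-driven clusters collapse while (ideally) group-driven structure remains.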
If correction is found to erode genuine between-group differences, we retain certain batch variables as covariates or re-evaluate using a mixed-effects model. In conclusion, batch-effect correction ensures that multi-source data can be compared on the same scale, laying the foundation for subsequent integration. To reduce the complexity of the analysis, we applied appropriate dimensionality reduction to the data before integration. Common methods include principal component analysis (PCA) and automatic feature selection. For instance, for high-dimensional transcriptomic data, before integrating it into the network we first use PCA to extract the leading principal components, retaining those that capture the major variation and discarding components with a low explained-variance ratio to reduce noise.

4.2 Data integration algorithm
After completing data preprocessing and normalization, we apply multiple algorithms to conduct in-depth integrated analysis of the multi-omics data. We constructed multiple interconnected network layers: gene regulation/co-expression networks, protein-protein interaction networks, metabolite pathway networks, etc., and
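The PCA step described in Section 4.1, retaining leading components and discarding low-variance ones, can be sketched via the singular value decomposition. The 90% variance threshold below is an illustrative choice, not a value taken from the text, and `pca_reduce` is a hypothetical helper rather than part of the described pipeline.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.9):
    """PCA via SVD: keep the smallest set of leading components whose
    cumulative explained-variance ratio reaches `var_threshold`.

    X : (n_samples, n_features) data matrix
    Returns (scores, explained_ratio) for the retained components.
    """
    Xc = X - X.mean(axis=0)                     # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)             # explained-variance ratios
    # first index where the cumulative ratio reaches the threshold
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    scores = U[:, :k] * S[:k]                   # sample coordinates (PC scores)
    return scores, explained[:k]
```

The same `scores` matrix can serve both purposes mentioned in the text: as a compact input to the network integration, and for plotting samples colored by batch or group to check the correction visually.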