CMB_2025v15n3

Computational Molecular Biology 2025, Vol.15, No.3, 131-140 http://bioscipublisher.com/index.php/cmb

still effective. Sometimes, the correlation between gene expression levels or mutation frequencies and the IC50 of drugs is also calculated to identify candidate markers. Among models, linear regression is the most commonly used: it is simple, and its results are clear and easy to interpret. To prevent overfitting, researchers usually add regularization penalties, which make the results more stable. Although these basic methods are simple, they are important starting points for screening markers (Nguyen et al., 2016).

4.2 Multi-omics integrated analysis based on machine learning and artificial intelligence

In recent years, advances in algorithms have made machine learning and artificial intelligence almost standard in multi-omics analysis. Researchers are no longer content with linear relationships; they increasingly aim to capture complex nonlinear patterns. Deep learning has an advantage here: data from different omics layers can be processed separately and then merged at an intermediate layer to extract more comprehensive features (Tan et al., 2020). Others prefer ensemble learning, which combines models trained on different omics layers and often gives more stable results (Yang et al., 2022). One study integrated gene expression, mutation, and copy number data in a stacking model and reported improved accuracy in predicting drug sensitivity. Such methods are not a cure-all, but on the whole multi-omics models are more reliable than single-omics models and offer greater reference value for clinical decision-making.

4.3 Application of biological networks and systems biology methods in marker recognition

When studying drug responses, looking at just a few genes is often insufficient.
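As a brief aside on the basic statistical screening described before Section 4.2, the following minimal Python sketch illustrates the two ideas mentioned there: ranking genes by the correlation between expression and drug response, and shrinking a regression slope with a ridge penalty. The gene names, the toy values, and the helper functions `pearson` and `ridge_slope` are hypothetical and chosen only for illustration; they are not taken from any cited study or software package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ridge_slope(xs, ys, lam):
    """Slope of a one-feature ridge regression with an unpenalized intercept.

    Minimizes sum((y - a - b*x)^2) + lam * b^2, whose closed-form
    solution for the slope is b = Sxy / (Sxx + lam).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + lam)


# Hypothetical toy data: expression of two genes in five cell lines,
# and the log10-transformed IC50 of one drug in the same cell lines.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 1.0, 4.0, 3.0, 5.0],
}
log_ic50 = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Rank candidate markers by the absolute correlation with drug response.
ranked = sorted(
    expression,
    key=lambda gene: abs(pearson(expression[gene], log_ic50)),
    reverse=True,
)

# Compare the unpenalized slope (lam = 0) with a ridge-shrunk slope.
slope_ols = ridge_slope(expression["GENE_A"], log_ic50, lam=0.0)
slope_ridge = ridge_slope(expression["GENE_A"], log_ic50, lam=10.0)
```

In this sketch a larger `lam` pulls the slope toward zero, which is the stabilizing effect of the penalty; in practice the penalty strength would be chosen by cross-validation over many genes rather than fixed by hand.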
Systems biology places more emphasis on the system as a whole. It weaves the relationships among molecules such as genes, proteins, and drugs into a network and identifies the truly crucial nodes or modules from the network structure. Some researchers use protein-protein interaction networks, while others conduct gene co-expression analysis (Zogopoulos et al., 2022). The approaches differ, but the goal is similar: to identify factors that occupy "hub" positions in the network and may dominate drug responses. The results are sometimes combined with pathway enrichment analysis, which places candidate genes in known biological pathways for comparison. This clarifies their functional roles and makes the model's explanation more intuitive (Zhang et al., 2018). Overall, network methods help researchers understand the underlying logic of drug effects from a more systematic perspective.

5 Data Sources and Database Resources

5.1 Comparison and application of drug sensitivity datasets (such as CCLE, GDSC, NCI-60)

Most drug sensitivity research relies on a few commonly used databases. The best known is probably CCLE, which covers approximately a thousand cancer cell lines and contains not only genomic information but also other omics and drug response data. The GDSC, developed by the Sanger Institute in the UK, is also frequently used. Its early releases covered nearly 700 cell lines and their responses to over 100 anti-cancer drugs, and it is regarded as one of the largest drug sensitivity databases. Its focus lies in analyzing drug efficacy data together with information such as gene mutations and copy number changes, from which molecular clues that may affect drug responses are uncovered (Reinhold et al., 2015). NCI-60 is older still and can be regarded as the "senior" resource.
Although it contains only sixty cell lines, it covers over a thousand compounds and remains an important reference for the study of traditional chemotherapy drugs (Takamatsu and Matsumura, 2023).

5.2 Integration and utilization of public omics databases (such as TCGA, GEO)

Drug response research also depends heavily on large cancer databases. TCGA is representative: it contains genomic and transcriptomic data from patients' tumors together with the corresponding clinical information, which can be used to analyze which molecular features may affect therapeutic outcomes. Some researchers match these patient data with the drug sensitivity information of cell lines to infer which drugs patients might be more sensitive to. For instance, the R package pRRophetic combines the gene expression profiles of GDSC with the tumor data of TCGA, and uses the ridge regression model
