BM_2024v15n1

Bioscience Method 2024, Vol.15, No.1, 37-49 http://bioscipublisher.com/index.php/bm 42 In drug screening, machine learning models are often used to preliminarily screen a large number of candidate compounds. By extracting and encoding the features of candidate compounds, machine learning models can predict the biological activity of these compounds, thereby screening compounds with potential activity and providing a candidate list for subsequent experimental verification. It should be noted that the accuracy and reliability of machine learning models highly depend on the quality and quantity of training data. Therefore, when constructing machine learning models, it is necessary to ensure the accuracy and completeness of training data, and collect as many samples as possible to improve the model's generalization ability. 3.2 Common machine learning algorithms and models The field of machine learning provides various algorithms and models for processing and analyzing large amounts of data, and is applied in various fields such as image processing, natural language processing, predictive modeling, etc. Ray (2019) briefly reviewed the most commonly used and popular machine learning algorithms, emphasizing the advantages and disadvantages of these machine learning algorithms from an application perspective, in order to help make wise decisions about selecting appropriate learning algorithms to meet specific application needs. Raju et al. (2023) compared various popular supervised learning algorithms, such as SVM, decision tree, random forest, KNN, logistic regression, etc., and tested the efficiency of the algorithms on three different datasets in different fields, aiming to compare different algorithms used on different datasets in different fields to understand the best algorithm and overall best algorithm. Mitchell (2014) focuses on certain machine learning methods commonly used in chemical informatics and quantitative structure-activity relationships (QSAR), including artificial neural networks, random forests, support vector machines, k-nearest neighbors, and naive Bayesian classifiers. In drug screening and drug development, commonly used machine learning algorithms and models include the following: Linear regression: used to establish linear relationships between variables and predict the values of continuous variables. In drug development, linear regression can be used to predict continuous indicators such as drug efficacy or toxicity. Logistic Regression: An algorithm used to establish classification models. By mapping the output of a linear regression model to a probability value, logistic regression can achieve binary or multi classification tasks, such as predicting whether a drug has a certain activity. Decision Tree: An algorithm based on tree structure used to establish classification or regression models. The decision tree recursively divides the dataset into subsets, makes judgments based on eigenvalues, and constructs a tree like structure. In drug development, decision trees can be used to identify key features that affect drug activity. Random Forest: An ensemble learning algorithm composed of multiple decision trees. Construct multiple decision trees through random sampling and feature selection, and obtain the final prediction results through voting or averaging. Random forests have high accuracy and robustness, making them suitable for processing large-scale datasets (Lei et al., 2021). Support Vector Machine (SVM): an algorithm used to establish classification and regression models. By mapping data to a high-dimensional space and finding a hyperplane to maximize the spacing between different categories, classification tasks can be achieved. SVM has strong generalization ability and robustness, and can handle high-dimensional data. Naive Bayes: A classification algorithm based on Bayesian theorem. It assumes that features are independent of each other and classifies them by calculating a posterior probability. Naive Bayes algorithm is simple and efficient, suitable for processing large-scale datasets and high-dimensional data.

RkJQdWJsaXNoZXIy MjQ4ODYzMg==