BM_2024v15n1

Bioscience Method 2024, Vol.15, No.1, 37-49 http://bioscipublisher.com/index.php/bm 43 K-means Clustering: An unsupervised learning algorithm used to partition a dataset into K non overlapping clusters. By iteratively calculating the distance between each sample and the cluster center, the samples are assigned to the nearest cluster. K-means clustering can be used to discover potential structures or patterns in data. In addition, algorithms and models such as neural networks (Figure 3), deep learning, and ensemble learning have also been widely applied in drug development. The selection of these algorithms and models depends on factors such as specific data characteristics, task requirements, and model performance. Figure 3 AI Neural Network Model Note: Yellow: Input data; Blue: For hidden layers; Green: Process and target another hidden layer; Red: Make modifications to generate final output 3.3 Optimizing the model to meet the needs of drug screening In order to ensure the accuracy and efficiency of machine learning models in drug screening, a series of targeted optimization work is needed. High quality data is the foundation of model performance, so steps such as removing noise, filling in missing values, standardizing or normalizing features, and handling imbalanced data are essential. Feature engineering is equally crucial, and selecting features closely related to drug activity can significantly improve the predictive ability of the model. Select appropriate machine learning algorithms and models based on the specific needs of drug screening and the characteristics of the data. Different algorithms and models are suitable for different tasks and data types, so it is necessary to choose according to the actual situation. At the same time, it is also important to consider the interpretability and comprehensibility of the model in order to provide meaningful insights in the drug development process. During the model construction process, by adjusting the hyperparameters of the model, such as learning rate, regularization strength, tree depth, etc., the performance of the model can be optimized. Using grid search, random search, or Bayesian optimization methods to find the optimal combination of parameters can help improve the prediction accuracy and stability of the model (Beaurivage et al., 2019). By dividing the dataset into training and validation sets (or multiple folds), training the model on the training set, and evaluating the model's performance on the validation set, a stable estimate of the model's performance can be obtained. This can better understand the generalization ability of the model and further optimize it based on it. By combining the prediction results of multiple single models, integrated learning can improve the overall performance. In drug screening, ensemble learning methods such as random forests and gradient boosting trees can combine the advantages of multiple models to improve their accuracy and stability. Selecting features with clear biological significance or constructing interpretable models can help researchers understand the predictive results and underlying biological mechanisms of the models. 4 Candidate Drug Validation 4.1 Evaluation criteria and validation methods for candidate drugs In the drug development process, the evaluation and validation of candidate drugs provide key measurement

RkJQdWJsaXNoZXIy MjQ4ODYzMg==