Computational Molecular Biology 2024, Vol.14, No.4, 163-172 http://bioscipublisher.com/index.php/cmb 164

poses significant computational and statistical challenges. For instance, traditional regression methods may become ineffective due to the non-identifiability of the optimization problem inherent in such data (Kim et al., 2019; Vinga, 2020). The complexity is further compounded by the need to integrate diverse types of data, such as genomics, transcriptomics, proteomics, and metabolomics, each contributing to the overall dimensionality and complexity (Misra et al., 2019; Leonavicius et al., 2019). Advanced techniques like structured sparsity regularization and high-dimensional LASSO (Hi-LASSO) have been developed to address these challenges by imposing additional constraints and improving feature selection and prediction performance.

2.2 Data sparsity and multicollinearity

Data sparsity is a common characteristic of high-dimensional biological datasets, particularly those derived from single-cell RNA sequencing and other high-throughput technologies. This sparsity arises because many features (e.g., genes) may not be expressed in all samples, leading to a large number of zero values in the dataset (Amezquita et al., 2019). Multicollinearity, where features are highly correlated, further complicates the analysis. This issue is particularly problematic in omics data, where different types of measurements (e.g., gene expression, protein levels) are often interrelated (ShyamMohanJ, 2016; Alzubaidi, 2018). Techniques such as regularized optimization and the development of specialized algorithms like Hi-LASSO help mitigate these issues by refining importance scores and improving the robustness of the models.

2.3 Heterogeneity in biological datasets

Biological datasets are inherently heterogeneous, reflecting the complex and varied nature of biological systems.
This heterogeneity can be seen across different levels, from the molecular (e.g., gene expression) to the cellular (e.g., single-cell measurements) and even the organismal level. The integration of multi-modal high-throughput data, such as combining genomic, transcriptomic, and proteomic data, is essential for a comprehensive understanding of biological processes but introduces significant challenges. Single-cell technologies, for example, have highlighted the variability between individual cells, necessitating advanced analytical methods to accurately interpret this heterogeneity (Palit et al., 2019). The development of computational principles and tools for data integration is crucial to address these challenges and to enable the extraction of meaningful insights from complex biological data (Argelaguet et al., 2021; Juan and Huang, 2023).

3 Biostatistical Challenges in High-Dimensional Data Analysis

3.1 Curse of dimensionality

The curse of dimensionality refers to the exponential increase in computational complexity and data sparsity as the number of dimensions in a dataset grows. This phenomenon poses significant challenges in various domains, including feature selection, clustering, and anomaly detection. For instance, in single-cell RNA sequencing (scRNA-seq) data, the high dimensionality combined with technical noise complicates downstream analyses. The RECODE method has been proposed to address this issue by reducing noise without dimension reduction, thereby improving the accuracy of cell clustering and gene expression recovery (Imoto et al., 2022). Similarly, data augmentation techniques have been employed to mitigate sparsity in industrial data, enhancing the robustness and interpretability of data-driven models (Jiang et al., 2023).
Evolutionary algorithms like the variable-size cooperative coevolutionary particle swarm optimization (VS-CCPSO) have also shown promise in effectively selecting relevant features from high-dimensional datasets (Song et al., 2022).

3.2 Model overfitting

3.2.1 Challenges with model complexity

High-dimensional data often contain correlated and noisy predictors, making it difficult to fit empirical models without overfitting. For example, in hyperspectral remote sensing, the large number of narrow spectral bands can lead to overly complex models that fit the noise rather than the underlying biological processes (Rocha et al., 2017). Similarly, in biomedical datasets, the high dimensionality can obscure the true signal, making it challenging to develop reliable predictive models (Yan et al., 2018).
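
The overfitting behaviour described above can be illustrated with a small simulation. The sketch below is not taken from any of the cited studies; it assumes synthetic Gaussian data in which only the first 5 of 400 predictors carry signal, and it uses independent predictors for simplicity, whereas real omics or spectral features are typically correlated. The function names (`simulate`, `train_test_error`) and all parameter values are illustrative choices. Ordinary least squares is fitted on a growing number of predictors, and the training error is compared with the error on held-out samples.

```python
import numpy as np


def simulate(n, p, k, noise_sd, rng):
    """Draw n samples with p Gaussian predictors; only the first k affect y."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:k] = 1.0  # assumed sparse ground truth (illustrative)
    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y


def mse(X, y, coef):
    """Mean squared prediction error of a linear model."""
    return float(np.mean((y - X @ coef) ** 2))


def train_test_error(n_train=40, n_test=200, p_max=400, k=5,
                     sizes=(5, 20, 35, 400), seed=0):
    """Fit least squares on the first p predictors for each p in sizes.

    Returns {p: (training MSE, held-out MSE)}.
    """
    rng = np.random.default_rng(seed)
    X_tr, y_tr = simulate(n_train, p_max, k, 1.0, rng)
    X_te, y_te = simulate(n_test, p_max, k, 1.0, rng)
    results = {}
    for p in sizes:
        # lstsq returns the minimum-norm solution when p > n_train,
        # which reproduces the training responses exactly (up to rounding).
        coef, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
        results[p] = (mse(X_tr[:, :p], y_tr, coef),
                      mse(X_te[:, :p], y_te, coef))
    return results


results = train_test_error()
for p, (train_err, test_err) in results.items():
    print(f"p={p:4d}  train MSE={train_err:.3f}  held-out MSE={test_err:.3f}")
```

Because the models are nested, the training error can only decrease as predictors are added, and it reaches essentially zero once the number of predictors exceeds the number of training samples. The held-out error does not follow it down, which is the signature of a model fitting noise rather than signal. Regularized approaches of the kind mentioned earlier (e.g., LASSO variants) are the usual response, although how much they help depends on the data.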