Advanced Drug Delivery Reviews | Small Data, Big Challenges: Machine Learning and Deep Learning Strategies for Drug Discovery with Limited Data

Source: material synthesis | Date: 2026-02-26


A key bottleneck limiting the potential of machine learning (ML) and deep learning (DL) models in the drug discovery and development (DDD) process is the scarcity of high-quality experimental data. Limited data is not an anomaly but an inherent characteristic of the DDD process: enormous financial costs, time constraints, and confidentiality issues restrict the size of available datasets. Applying standard ML and DL algorithms directly to these small datasets poses significant challenges. Traditional ML models remain limited by their reliance on handcrafted features and by their limited ability to capture complex biological relationships. In contrast, DL algorithms, which assume abundant data, are prone to overfitting and poor generalization when trained on small datasets. The small data problem therefore represents a fundamental constraint that determines the practicality and reliability of AI applications in DDD. Although previous reviews have comprehensively examined the broad landscape of artificial intelligence and machine learning in drug discovery, there is a significant gap regarding the small data challenges within the DDD process. Addressing this challenge requires adapting DL methods, which typically assume abundant data, while also extending traditional ML approaches, which are suitable for small data but have limited representational capacity. This review addresses the gap by surveying key drug discovery tasks to highlight the ubiquity of limited data, and by integrating traditional ML methods with advanced DL strategies tailored for these scenarios. By combining methodological advances with task-specific applications, the review provides an overview of current approaches and identifies opportunities to advance robust, interpretable, and generalizable AI in drug discovery.

Summary

Drug development has long faced the challenge of scarce high-quality data, which severely limits the potential applications of machine learning and deep learning. In situations where data is limited, traditional machine learning methods demonstrate irreplaceable value. For example, methods such as Support Vector Machines (SVM) and Random Forests (RF), due to their robustness with small datasets, model interpretability, and computational efficiency, remain preferred tools for key tasks such as biomarker discovery, target identification, initial activity screening, and toxicity prediction. However, these models rely on manual feature engineering and struggle to automatically learn deep patterns from raw, complex biological data, presenting inherent bottlenecks when handling high-dimensional, nonlinear molecular and omics data.
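As a rough illustration of how a classical model on handcrafted features is often evaluated when only a handful of samples exist, the sketch below runs leave-one-out cross-validation. To stay dependency-free, a k-nearest-neighbour classifier stands in for SVM or RF. The two descriptors, their values, and the labels are invented toy data, not taken from the review.

```python
import math

# Hypothetical handcrafted descriptors (scaled molecular weight, logP); toy values only.
X = [(1.8, 0.5), (2.1, 0.9), (1.9, 0.7), (2.0, 0.4),
     (4.2, 3.8), (4.5, 4.1), (3.9, 3.5), (4.4, 3.9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = inactive, 1 = active (illustrative labels)


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)


def leave_one_out_accuracy(X, y, k=3):
    """Hold out each sample once, fit on the rest, and average the hits."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)


loo_acc = leave_one_out_accuracy(X, y)
print(f"Leave-one-out accuracy: {loo_acc:.2f}")
```

Leave-one-out is a common choice here because every sample is used for both training and testing, which matters when a dataset has only tens of compounds. On real data the estimate can still be optimistic if near-duplicate molecules end up on both sides of a split.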

To overcome data bottlenecks, the field of deep learning has developed a series of innovative adaptation strategies. Transfer learning leverages large-scale unlabeled data to pre-train models and then fine-tunes them for small-sample tasks, effectively transferring knowledge; self-supervised learning generates supervisory signals from the data itself through designed pretext tasks; meta-learning and few-shot learning enable models to generalize quickly from very limited samples; data augmentation and lightweight hybrid models enhance robustness and generalization under small-data conditions. These techniques have been successfully applied to virtual screening, drug-target interaction prediction, ADMET property optimization, and drug repositioning. In the future, the key to advancing the field will lie in establishing standardized small-data benchmarks, building interpretable hybrid modeling frameworks, and systematically integrating prior knowledge such as biological networks and pathways into models, ultimately promoting reliable, transparent, and compliant AI tools toward clinical translation.
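The few-shot idea can be sketched with a nearest-prototype classifier, the decision rule used by prototypical networks: each class is represented by the mean of its few labelled "support" embeddings, and a query is assigned to the closest prototype. This is a minimal sketch under the assumption that embeddings already exist (for example, from a pre-trained encoder). The vectors and class names are hypothetical, and no meta-training of the embedding is shown.

```python
import math


def prototype(vectors):
    """Mean of the support embeddings for one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest."""
    return min(prototypes,
               key=lambda label: math.dist(prototypes[label], query))


# Hypothetical 3-dimensional embeddings: a 2-way, 2-shot support set.
support = {
    "binder": [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9]],
    "non_binder": [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]],
}
prototypes = {label: prototype(vecs) for label, vecs in support.items()}

# Unlabelled query embeddings to classify.
queries = [[0.85, 0.15, 0.7], [0.15, 0.7, 0.3]]
predictions = [classify(q, prototypes) for q in queries]
print(predictions)
```

In a full few-shot pipeline the encoder producing these embeddings would itself be trained across many small tasks, so that distances in the embedding space remain meaningful for classes seen only a few times.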

References:

DOI: 10.1016/j.addr.2025.115762
