Top Data Science Interview Questions and Answers (2026)
Strong candidates explain trade-offs. These questions are framed to test that, not to reward memorisation of formulas.
Q01. What is the difference between supervised and unsupervised learning?
Supervised learning trains on labelled data to predict outcomes (regression, classification). Unsupervised learning finds structure in unlabelled data (clustering, dimensionality reduction).
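Q02. Explain the bias-variance trade-off.
Bias is error from overly simple assumptions, so high-bias models underfit. Variance is sensitivity to the particular training sample, so high-variance models overfit. Reducing one usually raises the other; a good model balances the two so that it generalises to unseen data.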
Q03. What is overfitting and how do you prevent it?
Overfitting is when a model memorises the training data, noise included, and fails on new data. Prevent it with more data, regularisation, dropout, or simpler models; use cross-validation to detect it and to tune those choices.
Q04. What is cross-validation?
Splitting the data into k folds, training k times with a different fold held out each time, and averaging the k validation scores. Compared with a single train/test split, this reduces the variance of the performance estimate.
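The fold mechanics can be sketched in plain Python. This is a minimal illustration, not a library API: `k_fold_indices` and `mean_cv_score` are hypothetical helper names, the folds are contiguous and unshuffled, and `fit` and `score` are callables the caller is assumed to supply.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def mean_cv_score(fit, score, X, y, k=5):
    """Train k times, each with a different fold held out, and average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / len(scores)
```

In practice the rows are usually shuffled first (or stratified by class) so that each fold resembles the full dataset.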
Q05. Precision vs recall?
Precision = TP / (TP + FP) — how many predicted positives were right. Recall = TP / (TP + FN) — how many actual positives were caught.
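The two formulas translate directly into code. The sketch below assumes binary labels encoded as 0 and 1; the function name and the toy labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
# TP = 2, FP = 1, FN = 2, so precision = 2/3 and recall = 2/4.
```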
Q06. When would you favour recall over precision?
When the cost of false negatives is high — fraud detection, disease screening, security alerts.
Q07. What is feature engineering?
Creating new input variables from raw data to improve model performance — encoding categoricals, scaling, deriving ratios, time-based features.
Q08. Explain the curse of dimensionality.
As feature count grows, data becomes sparse and distance measures lose meaning, degrading model performance. Reduce with feature selection or PCA.
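One way to see distances "losing meaning" is to measure how much farther the farthest point is than the nearest one as dimensions are added. The sketch below assumes uniformly random points in the unit hypercube; `relative_contrast` is an illustrative name, and the exact numbers depend on the seed.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

The printed contrast shrinks as `dim` grows: in high dimensions the nearest and farthest neighbours end up at similar distances, which is why distance-based methods such as k-NN degrade.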
Q09. What is a confusion matrix?
A table of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). It is the foundation for precision, recall, F1 and accuracy.
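As a rough sketch, the four counts and the metrics built on them can be computed as follows. It assumes binary 0/1 labels and non-empty denominators; the function names and toy data are illustrative.

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels encoded as 0/1."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if p == 1:
            counts["TP" if t == 1 else "FP"] += 1
        else:
            counts["FN" if t == 1 else "TN"] += 1
    return counts


def summary_metrics(c):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    precision = c["TP"] / (c["TP"] + c["FP"])
    recall = c["TP"] / (c["TP"] + c["FN"])
    return {
        "accuracy": (c["TP"] + c["TN"]) / sum(c.values()),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_matrix(y_true, y_pred)  # TP=3, FP=1, FN=1, TN=5
metrics = summary_metrics(counts)
```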
Q10. Linear regression vs logistic regression?
Linear regression predicts continuous values. Logistic regression predicts probability of a class via the sigmoid function.
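The difference at prediction time is one extra function call. The sketch below assumes the weights have already been fitted; the helper names are illustrative.

```python
import math


def sigmoid(z):
    """Squash a real number into the range 0 to 1, avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_linear(weights, bias, x):
    """Linear regression: the weighted sum is the prediction itself."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def predict_logistic(weights, bias, x):
    """Logistic regression: the same weighted sum, passed through the sigmoid."""
    return sigmoid(predict_linear(weights, bias, x))
```

A weighted sum of 0 maps to a probability of 0.5, which is why 0.5 is the usual default decision threshold.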
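Q11. What is regularisation?
A penalty added to the loss function to discourage overly complex models. L1 (Lasso) can drive some weights to exactly zero, which acts as feature selection; L2 (Ridge) shrinks all weights towards zero but rarely makes them exactly zero.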
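The contrast between L1 and L2 is easiest to see one weight at a time. The sketch below makes a simplifying assumption: the features are orthonormal, so each penalised weight has a closed form in terms of the unpenalised least-squares weight `w`. With penalty `lam * |w|` that form is soft-thresholding, and with penalty `(lam / 2) * w**2` it is a uniform shrink. The function names and example weights are illustrative.

```python
def lasso_update(w, lam):
    """Soft-thresholding: move w towards zero by lam, and clip to zero if |w| <= lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def ridge_update(w, lam):
    """Proportional shrinkage: every weight gets smaller, none becomes exactly zero."""
    return w / (1.0 + lam)


ols_weights = [3.0, -0.4, 0.2, -2.0]
lam = 0.5
lasso = [lasso_update(w, lam) for w in ols_weights]  # [2.5, 0.0, 0.0, -1.5]
ridge = [ridge_update(w, lam) for w in ols_weights]  # all four stay non-zero
```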
Q12. Explain random forest.
An ensemble of decision trees, each trained on a bootstrap sample of the data and restricted to a random subset of features, typically redrawn at each split. Predictions are averaged (regression) or decided by majority vote (classification).
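The three sources of randomness and aggregation can be sketched without growing any trees. This is only the scaffolding around the trees, under the assumption that a tree learner exists elsewhere; all function names are illustrative, and the square-root subset size is a common default rather than a requirement.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) feature indices without replacement."""
    k = max(1, round(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the per-tree predictions."""
    return sum(predictions) / len(predictions)
```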
Q13. What is gradient boosting?
An ensemble built sequentially, where each new tree is fitted to the errors (more precisely, the negative gradient of the loss) of the ensemble so far. XGBoost and LightGBM are popular implementations.
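A toy version makes the "correct the previous errors" loop concrete. The sketch below is not how XGBoost or LightGBM are implemented: it assumes 1-D inputs with at least two distinct values, squared-error loss (so the negative gradient is simply the residual), and single-split stumps in place of full trees. All names are illustrative.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict
```

The learning rate scales down each stump's contribution, which trades more rounds for a smoother fit.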
Q14. When would you use clustering?
Customer segmentation, anomaly detection, exploratory analysis — when you want structure but have no labels.
Q15. What is PCA?
Principal Component Analysis: a linear method that projects data onto the orthogonal axes of maximum variance, reducing dimensionality while retaining as much of the variance as possible.
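The first principal component is the leading eigenvector of the covariance matrix. The sketch below finds it with power iteration in plain Python, which is a teaching device rather than what production libraries do (they typically use an SVD). It assumes the data is not constant and that the starting vector is not orthogonal to the answer; the function names and toy data are illustrative.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (feature means, unit vector along the direction of maximum variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalise to unit length.
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return means, v


def project(row, means, component):
    """Reduce one row to a single coordinate along the component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # points on the line y = 2x
means, component = first_principal_component(rows)
# component points along (1, 2) / sqrt(5), the direction of the line
```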
Q16. What is the difference between bagging and boosting?
Bagging trains models in parallel on bootstrap samples and averages them, reducing variance. Boosting trains sequentially, each model correcting the previous, reducing bias.
Q17. What's a baseline model and why do you need one?
A trivial model (e.g. majority class, mean prediction). It sets the floor any real model must clear.
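Both baselines mentioned here fit in a few lines. The sketch below uses illustrative function names; each baseline ignores its input entirely, which is the point.

```python
from collections import Counter


def majority_class_baseline(y_train):
    """Classifier that always predicts the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return lambda x: most_common


def mean_baseline(y_train):
    """Regressor that always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(1 for xi, yi in zip(X, y) if model(xi) == yi) / len(y)
```

If a trained classifier reaches 92% accuracy on data where the majority class makes up 90%, the baseline shows that the real gain is two points, not 92.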
Q18. How do you handle imbalanced classes?
Resample (SMOTE, undersampling), apply class weights, tune the decision threshold, and evaluate with metrics such as F1 or PR-AUC instead of accuracy.
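Two of these remedies are simple enough to sketch directly. The weighting below uses the inverse-frequency formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`; the function names and the 90/10 split are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency so rare classes count more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def predict_with_threshold(probabilities, threshold=0.5):
    """Turn positive-class probabilities into labels at a chosen threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)  # class 1 is weighted 9x more than class 0
```

Lowering the threshold below 0.5 flags more positives, raising recall at the cost of precision.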
Q19. What's the difference between R² and adjusted R²?
R² measures the share of variance explained; on the training data it never decreases when features are added, even useless ones. Adjusted R² penalises extra features, making comparisons between models of different sizes fairer.
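The penalty is a single formula: adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of samples and p the number of features. A small sketch, with illustrative function names and made-up numbers:

```python
def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R² for the number of features; needs n_samples > n_features + 1."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Same R² of 0.80 on 50 samples, but different feature counts:
few = adjusted_r_squared(0.80, n_samples=50, n_features=5)    # about 0.777
many = adjusted_r_squared(0.80, n_samples=50, n_features=20)  # about 0.662
```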
Q20. How do you deploy an ML model in production?
Wrap it behind an API (FastAPI, BentoML), version model artefacts in MLflow or S3, monitor input drift and performance, and automate retraining when metrics degrade.
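The monitoring step can start very simply. The sketch below checks one numeric input feature for drift by measuring how far the live mean has moved from the training-time mean, in units of the training standard deviation. It is a deliberately crude stand-in for fuller drift tests, and `mean_shift`, `needs_retraining` and the `max_shift` threshold of 0.5 are all hypothetical choices, not a standard.

```python
import statistics


def mean_shift(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.fmean(live) - ref_mean) / ref_std


def needs_retraining(reference, live, max_shift=0.5):
    """Flag drift when the shift exceeds max_shift (a hypothetical alert threshold)."""
    return mean_shift(reference, live) > max_shift


reference = [10, 11, 9, 10, 10, 12, 8, 10]  # feature values seen at training time
live_ok = [10, 10.2, 9.9, 10.1]             # recent traffic, similar distribution
live_drifted = [13, 14, 12, 13]             # recent traffic, clearly shifted
```

Here `needs_retraining(reference, live_ok)` is false and `needs_retraining(reference, live_drifted)` is true; in a real pipeline a check like this would run on a schedule and trigger the automated retraining job.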
