Question 1

Difference between supervised and unsupervised learning?

Accepted Answer

Supervised learning trains on labelled data to predict outcomes (regression, classification). Unsupervised learning finds structure in unlabelled data (clustering, dimensionality reduction).

Question 2

Explain bias-variance trade-off.

Accepted Answer

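Bias is error from overly simple assumptions; variance is error from sensitivity to the particular training sample. High-bias models underfit; high-variance models overfit. Good models balance the two and generalise to unseen data.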

Question 3

What is overfitting and how do you prevent it?

Accepted Answer

Overfitting is when a model memorises the training data, noise included, and fails on new data. Prevent it with more data, regularisation, dropout, early stopping, or simpler models; use cross-validation to detect it.

Question 4

What is cross-validation?

Accepted Answer

Splitting data into k folds, training k times each with a different fold held out, and averaging the score. The averaged score is a less noisy performance estimate than one from a single train/test split.
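The procedure can be sketched in plain Python. This is a minimal illustration, not a library implementation: the "model" is just a mean predictor, the folds are contiguous (no shuffling), and the names k_fold_indices and cross_val_mse are made up for this example.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_val_mse(y, k):
    """Score a mean-predictor stand-in model with k-fold cross-validation."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        prediction = mean(y[i] for i in train_idx)              # "train" on k-1 folds
        mse = mean((y[i] - prediction) ** 2 for i in test_idx)  # score the held-out fold
        fold_scores.append(mse)
    return mean(fold_scores), fold_scores


y = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
avg_mse, per_fold = cross_val_mse(y, k=3)
print(avg_mse, per_fold)
```

In practice the data would usually be shuffled (or stratified) before splitting; that step is left out here to keep the fold logic visible.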

Question 5

Precision vs recall?

Accepted Answer

Precision = TP / (TP + FP) — how many predicted positives were right. Recall = TP / (TP + FN) — how many actual positives were caught.
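As a quick worked example of the two formulas, assuming binary labels where 1 is the positive class (the function name precision_recall and the toy labels are illustrative only):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard against no predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard against no actual positives
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # TP=2, FP=1, FN=2 -> precision 2/3, recall 1/2
```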

Question 6

When would you favour recall over precision?

Accepted Answer

When the cost of false negatives is high — fraud detection, disease screening, security alerts.
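One common way to favour recall is to lower the decision threshold on the model's scores. The sketch below uses invented scores and a hypothetical helper, precision_recall_at, to show recall rising and precision falling as the threshold drops.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when scores >= threshold are flagged positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 1 = fraud / disease / intrusion, scores are model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.3, 0.28]

strict = precision_recall_at(0.5, y_true, scores)    # misses two positives
lenient = precision_recall_at(0.25, y_true, scores)  # catches all, more false alarms
print(strict, lenient)
```

With the lenient threshold every actual positive is caught, at the price of more false positives to review.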

Question 7

What is feature engineering?

Accepted Answer

Creating new input variables from raw data to improve model performance — encoding categoricals, scaling, deriving ratios, time-based features.
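A small sketch of those four techniques on made-up records. The column names (city, income, debt, signup) and the engineer function are assumptions for illustration, not part of any particular dataset or library.

```python
from datetime import datetime

# Hypothetical raw records.
raw_rows = [
    {"city": "Pune", "income": 50000.0, "debt": 10000.0, "signup": "2024-03-09"},
    {"city": "Delhi", "income": 80000.0, "debt": 40000.0, "signup": "2024-03-11"},
]


def engineer(rows):
    """Turn raw records into model-ready numeric features."""
    categories = sorted({row["city"] for row in rows})
    max_income = max(row["income"] for row in rows)
    features = []
    for row in rows:
        signup = datetime.strptime(row["signup"], "%Y-%m-%d")
        feat = {f"city_{c}": int(row["city"] == c) for c in categories}  # one-hot encoding
        feat["income_scaled"] = row["income"] / max_income               # scaling to (0, 1]
        feat["debt_to_income"] = row["debt"] / row["income"]             # derived ratio
        feat["signup_weekday"] = signup.weekday()                        # time feature, 0 = Monday
        feat["signup_is_weekend"] = int(signup.weekday() >= 5)
        features.append(feat)
    return features


features = engineer(raw_rows)
print(features[0])
```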

Question 8

Explain the curse of dimensionality.

Accepted Answer

As feature count grows, data becomes sparse and distance measures lose meaning, degrading model performance. Reduce with feature selection or PCA.
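The "distance measures lose meaning" part can be seen in a small simulation. This sketch assumes points drawn uniformly from the unit hypercube and measures how much farther the farthest point is than the nearest one, relative to the nearest; relative_contrast is a name invented for the example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


low = relative_contrast(dim=2)     # nearest neighbour is much closer than the farthest
high = relative_contrast(dim=500)  # all points are roughly equally far away
print(low, high)
```

When the contrast shrinks towards zero, "nearest" neighbours are barely nearer than anything else, which hurts distance-based methods such as k-NN or k-means.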

Question 9

What is a confusion matrix?

Accepted Answer

A table of predicted versus actual classes; for binary classification its four cells are TP, FP, FN and TN. It is the foundation for precision, recall, F1 and accuracy.
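A minimal sketch of building the four cells from binary labels and deriving accuracy and F1 from them (the function name confusion_counts and the toy labels are illustrative):

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = positive)."""
    counts = Counter(zip(y_true, y_pred))  # keys are (actual, predicted) pairs
    return {
        "TP": counts[(1, 1)],
        "FP": counts[(0, 1)],
        "FN": counts[(1, 0)],
        "TN": counts[(0, 0)],
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
cm = confusion_counts(y_true, y_pred)

accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
f1 = 2 * cm["TP"] / (2 * cm["TP"] + cm["FP"] + cm["FN"])
print(cm, accuracy, f1)
```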

Question 10

Linear regression vs logistic regression?

Accepted Answer

Linear regression predicts continuous values. Logistic regression predicts probability of a class via the sigmoid function.
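The difference is easiest to see side by side. In this sketch both models share the same linear score w * x + b; the weights are arbitrary illustrative values, not fitted ones.

```python
import math


def linear_predict(x, w, b):
    """Linear regression: the raw linear score is the prediction."""
    return w * x + b


def logistic_predict(x, w, b):
    """Logistic regression: squash the linear score into (0, 1) with the sigmoid."""
    z = w * x + b
    return 1.0 / (1.0 + math.exp(-z))


print(linear_predict(10, w=2.0, b=1.0))    # unbounded continuous value
print(logistic_predict(0, w=1.0, b=0.0))   # score 0 maps to probability 0.5
print(logistic_predict(3, w=1.0, b=0.0))   # positive score maps above 0.5
```

A class label is then obtained by thresholding the probability, commonly at 0.5.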

Question 11

What is regularisation?

Accepted Answer

A penalty added to the loss function to discourage overly complex models. L1 (Lasso) drives some weights to exactly zero, giving sparse models; L2 (Ridge) shrinks all weights towards zero without eliminating them.
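The different behaviour of the two penalties shows up even for a single weight. The sketch below assumes the simplest possible setting, a one-dimensional problem where z is the unpenalised least-squares weight, so that both penalised solutions have a closed form.

```python
def ridge_shrink(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + 0.5 * lam * w**2 (L2 penalty)."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + lam * abs(w) (L1 penalty, soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small weights are set exactly to zero


unpenalised = [3.0, 0.4, -0.2]
ridge = [ridge_shrink(z, lam=0.5) for z in unpenalised]
lasso = [lasso_shrink(z, lam=0.5) for z in unpenalised]
print(ridge)  # every weight scaled down, none exactly zero
print(lasso)  # large weight reduced, small weights zeroed out
```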

Question 12

Explain random forest.

Accepted Answer

An ensemble of decision trees, each trained on a bootstrap sample and limited to a random subset of features at each split. Predictions are averaged (regression) or voted (classification).

Question 13

What is gradient boosting?

Accepted Answer

An ensemble built sequentially, where each new tree is fitted to the remaining errors (the negative gradient of the loss) of the ensemble so far. XGBoost and LightGBM are popular implementations.
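A toy version of the idea, assuming squared-error loss, one-dimensional inputs and depth-1 trees (stumps). The names fit_stump, gradient_boost, n_rounds and learning_rate are chosen for this sketch and are not the API of XGBoost or LightGBM.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides of the split non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]

mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
model = gradient_boost(x, y)
boosted_mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(baseline_mse, boosted_mse)
```

The errors compared here are training errors; a real comparison would use held-out data.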

Question 14

When would you use clustering?

Accepted Answer

Customer segmentation, anomaly detection, exploratory analysis — when you want structure but have no labels.
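For a feel of how segmentation works without labels, here is a minimal k-means sketch on a single made-up feature (monthly spend). The function k_means_1d and the starting centroids are assumptions for the example; real use would involve several features and a library implementation.

```python
def k_means_1d(values, centroids, n_iter=10):
    """Plain k-means on 1-D data: assign to nearest centroid, then recompute means."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


spend = [12.0, 15.0, 14.0, 90.0, 95.0, 88.0]
centroids, clusters = k_means_1d(spend, centroids=[0.0, 100.0])
print(centroids, clusters)  # a low-spend and a high-spend segment
```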

Question 15

What is PCA?

Accepted Answer

Principal Component Analysis: a linear method that projects data onto the orthogonal axes of maximum variance, reducing dimensionality while retaining as much of the variance as possible.
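A rough sketch of the first principal component for two-dimensional data, using power iteration on the covariance matrix instead of a full eigendecomposition. This is an illustrative shortcut that assumes the starting vector is not orthogonal to the answer; the function name and toy points are invented.

```python
import math


def first_principal_component(points, n_iter=100):
    """Return the top principal direction and the 1-D projection of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(a * a for a, _ in centred) / (n - 1)
    cxy = sum(a * b for a, b in centred) / (n - 1)
    cyy = sum(b * b for _, b in centred) / (n - 1)

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vx, vy = 1.0, 0.0
    for _ in range(n_iter):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm

    projected = [a * vx + b * vy for a, b in centred]  # 2-D points reduced to 1-D
    return (vx, vy), projected


points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
direction, projected = first_principal_component(points)
print(direction, projected)
```

Because the points lie close to the line y = x, the direction comes out close to the diagonal and the single projected coordinate keeps nearly all of the variance.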

Question 16

Difference between bagging and boosting?

Accepted Answer

Bagging trains models in parallel on bootstrap samples and averages them, reducing variance. Boosting trains sequentially, each model correcting the previous, reducing bias.

Question 17

What's a baseline model and why do you need one?

Accepted Answer

A trivial model (e.g. majority class, mean prediction). It sets the floor any real model must clear.
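A majority-class baseline takes only a few lines. The labels below are invented, and majority_class_baseline is a name made up for this sketch.

```python
from collections import Counter


def majority_class_baseline(y_train):
    """Return a 'model' that always predicts the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return lambda _features=None: majority_label


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_train = [0, 0, 0, 1, 0, 0, 1, 0]
y_test = [0, 1, 0, 0, 1]

baseline = majority_class_baseline(y_train)
baseline_accuracy = accuracy(y_test, [baseline() for _ in y_test])
print(baseline_accuracy)  # any real model should beat this number
```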

Question 18

How do you handle imbalanced classes?

Accepted Answer

Resample (SMOTE, undersampling), apply class weights, tune the decision threshold, and evaluate with metrics like F1 or PR-AUC instead of accuracy.
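Class weights are often set inversely proportional to class frequency. The sketch below uses the formula n_samples / (n_classes * class_count), which is the same heuristic scikit-learn applies for class_weight="balanced"; the function name and the 90/10 split are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = [0] * 90 + [1] * 10  # 90% negatives, 10% positives
weights = balanced_class_weights(labels)

# After weighting, each class contributes the same total weight to the loss.
total_weight = {label: weights[label] * labels.count(label) for label in weights}
print(weights, total_weight)
```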

Question 19

What's the difference between R² and adjusted R²?

Accepted Answer

R² measures the share of variance explained; on the training data it never decreases when features are added, even useless ones. Adjusted R² penalises extra features, making comparisons between models of different size fairer.
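The penalty follows the standard formula adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), with n samples and p features. A short sketch with made-up numbers shows how the same R² is marked down as the feature count grows:

```python
def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R² for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


small_model = adjusted_r_squared(0.80, n_samples=50, n_features=5)
large_model = adjusted_r_squared(0.80, n_samples=50, n_features=20)
print(small_model, large_model)  # same R², lower adjusted R² for the larger model
```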

Question 20

How do you deploy an ML model in production?

Accepted Answer

Wrap it behind an API (FastAPI, BentoML), version model artefacts in MLflow or S3, monitor input drift and performance, and automate retraining when metrics degrade.
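The monitoring step can start very simply. The sketch below is one possible drift check on a single numeric input, not a prescribed method: it measures how far the live mean has moved from the training mean, in units of the training standard deviation. The function names, the sample values and the threshold of 2.0 are all assumptions for illustration.

```python
from statistics import mean, stdev


def drift_score(training_values, live_values):
    """Shift of the live mean, in units of the training standard deviation."""
    return abs(mean(live_values) - mean(training_values)) / stdev(training_values)


def needs_retraining(training_values, live_values, threshold=2.0):
    """Flag the feature when the shift exceeds a (hypothetical) threshold."""
    return drift_score(training_values, live_values) > threshold


training = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0]
stable_live = [11.0, 10.0, 12.0, 11.0]
shifted_live = [15.0, 16.0, 14.0, 17.0]

print(needs_retraining(training, stable_live))   # inputs look like training data
print(needs_retraining(training, shifted_live))  # inputs have drifted upwards
```

A production setup would typically track many features and the model's own metrics, and use a proper statistical test rather than a single mean comparison.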

Top Data Science Interview Questions and Answers (2026)

