L1 Data Scientist Interview Questions: Multiple Choice Test

This multiple-choice test is designed to assess foundational knowledge in data science, including machine learning, statistics, programming, and database concepts. It’s suitable for entry-level or L1 data scientist roles and can be used for self-assessment or interview preparation.

Machine Learning Fundamentals
Data Processing & Feature Engineering
Statistics & Model Evaluation
SQL & Databases
Python Programming
Data Engineering & General Concepts

Machine Learning Fundamentals

1) What is the primary difference between supervised and unsupervised learning?

A) Supervised learning uses labeled data, while unsupervised learning uses unlabeled data.
B) Supervised learning is used for clustering; unsupervised is for classification.
C) Supervised requires lots of data; unsupervised needs little.
D) Supervised is always more accurate than unsupervised.

Answer: A. Supervised learning trains on labeled examples to predict outcomes; unsupervised finds structure in unlabeled data.

2) Which of the following is an example of a supervised learning algorithm?

A) K-Means Clustering
B) Principal Component Analysis (PCA)
C) Linear Regression
D) Apriori Algorithm

Answer: C. Linear Regression models a target variable from features, a classic supervised learning setup.

3) What is overfitting in a machine learning model?

A) Great performance on both train and test sets
B) Model too simple to capture patterns
C) Model learns noise in training data and fails to generalize
D) Model cannot learn from the training data at all

Answer: C. An overfit model memorizes noise and irrelevant detail in the training data, so it scores well on training data but poorly on unseen data.

4) What is a Confusion Matrix used for?

A) Visualize the relationship between two continuous variables
B) Evaluate a classification model
C) Dimensionality reduction
D) Pick number of clusters

Answer: B. It tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) to assess a classifier.

5) In a classification problem, what does ‘Recall’ measure?

A) Proportion of predicted positives that are correct
B) Proportion of actual positives correctly identified
C) Overall accuracy
D) Balance between FP and FN

Answer: B. Recall (sensitivity) = TP / (TP + FN).
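
As a small illustration of the confusion-matrix counts (Q4) and the recall formula above, the sketch below tallies TP, FP, TN, and FN by hand. The `y_true` and `y_pred` lists are made-up labels chosen for illustration, not data from this test.

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were right

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("recall:", recall, "precision:", precision)
```

Recall and precision share the numerator TP; they differ only in whether the denominator counts missed positives (FN) or false alarms (FP).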

6) Which is a common method to handle missing values?

A) Drop rows with missing values
B) Impute with mean/median/mode
C) Predict missing values with a model
D) All of the above

Answer: D. All three are used in practice; the right choice depends on how much data is missing and why it is missing.
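
The first two options can be sketched with the standard library alone. The `ages` list is a hypothetical column in which `None` marks a missing entry; a real project would more likely work with a pandas column and its own missing-value tools.

```python
from statistics import mean, median

# Hypothetical column; None marks a missing entry.
ages = [25, None, 31, 40, None, 28]

# Option A: drop the missing entries.
observed = [a for a in ages if a is not None]

# Option B: impute with a summary statistic of the observed values.
mean_filled = [a if a is not None else mean(observed) for a in ages]
median_filled = [a if a is not None else median(observed) for a in ages]

print(observed)
print(mean_filled)
print(median_filled)
```

Dropping shrinks the dataset, while imputing keeps every row at the cost of flattening the variable's spread, which is one reason the choice depends on context.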

7) Purpose of cross-validation?

A) Speed up training
B) Test on entire dataset at once
C) Prevent overfitting and estimate generalization
D) Remove outliers

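Answer: C. Cross-validation estimates how well a model generalizes by scoring it on held-out folds; strictly speaking, it helps detect overfitting rather than prevent it.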
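
To make the idea of folds concrete, here is a minimal sketch of how k-fold splitting partitions row indices. The helper `k_fold_indices` is a hypothetical name written for this example; it uses contiguous folds with no shuffling, whereas library implementations usually offer shuffling and stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times.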

8) What is the Bias–Variance Tradeoff?

A) Speed vs accuracy
B) Simplicity (bias) vs fit/complexity (variance)
C) Cost of data vs value
D) Linear vs non-linear models

Answer: B. We balance underfitting (high bias) and overfitting (high variance).

9) Which is a drawback of Linear Regression?

A) Too complex to implement
B) Only for classification
C) Assumes linear relationship between features and target
D) Extremely sensitive to missing values

Answer: C. When the true relationship is non-linear, a straight-line fit is systematically off unless the features are transformed.

10) What is the main goal of k-means clustering?

A) Classify into predefined categories
B) Fit a linear relationship
C) Partition data into k clusters by nearest mean
D) Reduce features

Answer: C. Each point is assigned to the closest centroid.
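
The assign-then-update loop behind k-means can be sketched on one-dimensional data. The function `kmeans_1d`, the sample points, and the starting centroids are all illustrative assumptions; real implementations work on multi-dimensional data and choose initial centroids more carefully.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Plain k-means on 1-D data: assign to nearest centroid, then recompute means."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for x in points:
            # Assignment step: index of the closest centroid.
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid becomes the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

With these inputs the centroids settle on the means of the two visible groups after a couple of iterations.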

Statistics & Model Evaluation

11) What is a p-value?

A) Measure of relationship between variables
B) Probability of the observed (or more extreme) result under the null
C) Dataset size
D) Model accuracy

Answer: B. A small p-value is evidence against the null hypothesis; it is not the probability that the null is true.

12) R-squared in linear regression represents…

A) Model p-value
B) Mean of target
C) Proportion of target variance explained by the features
D) Std. dev. of residuals

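Answer: C. A higher R² means more of the target variance is explained, but training R² never decreases when features are added, so a high value can mask overfitting.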

13) Type I Error (False Positive) is…

A) Predict positive when actual is negative
B) Predict negative when actual is positive
C) No prediction
D) Random prediction

Answer: A. A false alarm; in hypothesis-testing terms, rejecting a null hypothesis that is actually true.

14) Type II Error (False Negative) is…

A) Predict positive, actual negative
B) Predict negative, actual positive
C) No prediction
D) Random prediction

Answer: B. A missed detection; in hypothesis-testing terms, failing to reject a null hypothesis that is actually false.

15) Common metric to evaluate regression?

A) Accuracy
B) F1-Score
C) Mean Squared Error (MSE)
D) Precision

Answer: C. MSE is the average of the squared residuals: MSE = (1/n) Σ (yᵢ − ŷᵢ)².
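
A short worked example may help connect MSE with the R² from Q12, since both are built from the same residuals. The target and prediction values below are invented for illustration.

```python
# Hypothetical targets and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]

# Mean squared error: average of the squared residuals.
mse = sum(r ** 2 for r in residuals) / n

# R-squared: 1 minus (residual sum of squares / total sum of squares).
mean_y = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R^2:", r_squared)
```

MSE is expressed in squared units of the target, while R² is unitless, which is why the two are often reported together.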

16) L1 regularization (Lasso) does what?

A) Penalty = sum of squared coefficients
B) Penalty = sum of absolute values of coefficients
C) Penalty for number of features
D) Penalty for number of data points

Answer: B. The L1 penalty can drive some coefficients exactly to zero, which acts as built-in feature selection.
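
The contrast between L1 and L2 penalties can be sketched numerically. The helper names, the coefficient values, and `lam` are assumptions made for this example. The soft-thresholding step is the closed-form lasso update only in the idealized case of orthonormal features, and the ridge-style divisor depends on how the objective is scaled, so treat this as an illustration of the shrinkage pattern rather than a full solver.

```python
def l1_penalty(coefs, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c ** 2 for c in coefs)


def soft_threshold(c, lam):
    """Lasso-style shrinkage: move c toward zero by lam, clipping at zero."""
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0


coefs = [3.0, -0.4, 0.2, -2.0]   # hypothetical unpenalized coefficients
lam = 0.5

lasso_like = [soft_threshold(c, lam) for c in coefs]
ridge_like = [c / (1 + lam) for c in coefs]   # proportional shrinkage, never exactly zero

print(lasso_like)
print(ridge_like)
print(l1_penalty(coefs, lam), l2_penalty(coefs, lam))
```

Small coefficients fall inside the threshold and become exactly zero under the L1-style update, while the L2-style update only scales them down.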

17) Which of these is a type of Regularization?

A) Logistic Regression
B) K-Means
C) Ridge Regression
D) SVM

Answer: C. Ridge regression is linear regression with an L2 penalty, which shrinks coefficients to reduce variance.

Python & Data Handling

18) Which Python library is most used for data manipulation?

A) Scikit-learn
B) Matplotlib
C) Pandas
D) Seaborn

Answer: C. Pandas DataFrame/Series power tabular work.

19) Main difference between list and tuple?

A) List immutable, tuple mutable
B) List ordered, tuple unordered
C) List mutable, tuple immutable
D) List only strings, tuple any type

Answer: C. Lists can be modified in place; tuples cannot be changed after creation.
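
A quick way to see the difference is to try the same item assignment on both types. The variable names here are illustrative.

```python
items_list = [1, 2, 3]
items_list[0] = 99          # lists support item assignment
items_list.append(4)        # and can grow in place

items_tuple = (1, 2, 3)
try:
    items_tuple[0] = 99     # tuples reject item assignment
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# Because it is immutable (and its items are hashable), a tuple can be a dict key.
lookup = {(0, 1): "edge"}

print(items_list, tuple_mutated, lookup[(0, 1)])
```

The dictionary-key use at the end is one practical consequence of immutability that is worth mentioning in an interview answer.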

20) Purpose of `pip`?

A) Create/manage virtual envs
B) Install/manage external Python packages
C) Run Python scripts
D) Debug Python code

Answer: B. pip installs libraries from PyPI.

21) What is a DataFrame in Pandas?

A) One-dimensional labeled array
B) Two-dimensional labeled table with mixed dtypes
C) Plotting object
D) Numerical computation object

Answer: B. Primary tabular structure in Pandas.

22) Which is not a common Python viz library?

A) Matplotlib
B) Seaborn
C) Plotly
D) SciPy

Answer: D. SciPy is for scientific computing.

23) In Python, a dictionary is…

A) Ordered sequence of items
B) Collection of key–value pairs
C) Numeric-only structure
D) Immutable structure

Answer: B. Dicts map keys to values.

24) How to check dtypes in a Pandas DataFrame?

A) df.describe()
B) df.info()
C) df.head()
D) df.shape

Answer: B. df.info() shows dtypes and non-null counts; df.dtypes lists the dtypes alone.

25) What is NaN in Pandas?

A) A string
B) Zero
C) Missing/Not-a-Number value
D) Boolean

Answer: C. NaN (“Not a Number”) is the floating-point value pandas uses by default to mark missing numeric data.
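
NaN is an ordinary IEEE floating-point value, so its behavior can be explored without pandas. The sketch below uses only the standard library; the `values` list is made up for illustration.

```python
import math

nan = float("nan")

print(nan == nan)          # NaN never compares equal, not even to itself
print(math.isnan(nan))     # so a dedicated check is needed

values = [1.0, nan, 3.0]   # hypothetical column with one missing entry
clean = [v for v in values if not math.isnan(v)]
mean_clean = sum(clean) / len(clean)

print(clean, mean_clean)
```

The self-inequality is why missing values are detected with a dedicated check such as `math.isnan` (or `isna()` in pandas) rather than with `==`.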

26) Primary function of NumPy?

A) Visualization
B) Numerical computing with n-dim arrays
C) Web scraping
D) Build ML models

Answer: B. NumPy powers vectorized computation.

27) scikit-learn is primarily used for…

A) Data visualization
B) Database management
C) Machine learning
D) Web development

Answer: C. It provides ML algorithms and utilities.

SQL & Databases

28) What is a primary key?

A) Column with unique value per row
B) Column linking to a different table
C) Column that allows duplicates
D) Column that stores text

Answer: A. Uniquely identifies each record.

29) Which SQL command retrieves data?

A) UPDATE
B) DELETE
C) SELECT
D) INSERT

Answer: C. SELECT fetches results from tables.

30) What does `SELECT * FROM table_name` do?

A) Selects a random row
B) Selects all columns
C) Counts rows
D) Deletes the table

Answer: B. Asterisk selects all columns.

31) Purpose of the WHERE clause?

A) Group rows
B) Filter records by condition
C) Sort results
D) Join tables

Answer: B. Filters rows before aggregation.

32) Purpose of GROUP BY?

A) Sort results
B) Filter rows before grouping
C) Aggregate rows sharing same values
D) Combine data from two tables

Answer: C. Groups rows to compute aggregates.

33) HAVING clause does what?

A) Filter before grouping
B) Sort results
C) Filter groups after GROUP BY
D) Select columns

Answer: C. Applies conditions to aggregates.
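
The order in which WHERE, GROUP BY, and HAVING apply (Q31 to Q33) can be seen in a single query. This sketch runs it against an in-memory SQLite database from Python; the `orders` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", 120.0, "paid"),
        ("alice", 80.0, "paid"),
        ("bob", 40.0, "paid"),
        ("bob", 500.0, "cancelled"),
        ("carol", 300.0, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'          -- row filter, applied before grouping
    GROUP BY customer              -- one output row per customer
    HAVING SUM(amount) > 100       -- group filter, applied after aggregation
    ORDER BY customer
    """
).fetchall()

print(rows)
conn.close()
```

Bob's cancelled order is removed by WHERE before any sums are computed, and his remaining total of 40 is then removed by HAVING.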

34) INNER JOIN returns…

A) All rows from left + matched right
B) All rows from both with NULLs where no match
C) Rows where the join matches in both tables
D) All rows from right + matched left

Answer: C. Only matched rows.

35) JOIN (INNER JOIN) vs LEFT JOIN?

A) LEFT JOIN returns all left rows, INNER JOIN only matched rows
B) LEFT JOIN requires matches in both
C) They’re identical
D) LEFT JOIN is only for >2 tables

Answer: A. LEFT keeps unmatched left rows with NULLs on right.
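
The two join types from Q34 and Q35 can be compared side by side on a tiny in-memory SQLite database. The `customers` and `orders` tables are hypothetical; the customer named bob deliberately has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 3, 300.0);
    """
)

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
).fetchall()

# LEFT JOIN: every customer; order columns are NULL where nothing matches.
left_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

In the LEFT JOIN result, bob appears once with `None` (SQL NULL) for the amount, while the INNER JOIN result omits him entirely.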

36) UNION vs UNION ALL?

A) UNION removes duplicates; UNION ALL keeps them
B) UNION requires same dtypes; UNION ALL does not
C) UNION ALL is faster
D) A and C

Answer: D. UNION removes duplicate rows, which costs extra work; UNION ALL keeps every row and is typically faster.
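
The de-duplication behavior is easy to confirm on two small tables. The tables `a` and `b` below are made up for this sketch, which again uses an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);
    INSERT INTO b VALUES (2), (3);
    """
)

union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()

print(union_rows)
print(union_all_rows)
conn.close()
```

Note that UNION removes duplicates across the combined result, including the repeated value that appears twice within table `a` alone.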

37) What is a JOIN condition?

A) The type of join
B) The column(s) linking two tables
C) Number of rows in result
D) Output order

Answer: B. It specifies the keys to match.

38) What is a foreign key?

A) Uniquely identifies a row in a table
B) Column in one table referencing another table’s primary key
C) Column that cannot be NULL
D) Encrypted column

Answer: B. Enforces referential integrity across tables.
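
Referential integrity, together with the primary-key uniqueness from Q28, can be demonstrated by attempting two invalid inserts. The schema below is hypothetical. One SQLite-specific detail is an assumption worth flagging: SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` is issued on the connection, whereas most other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 120.0)")      # valid reference

try:
    conn.execute("INSERT INTO orders VALUES (11, 99, 50.0)")  # no customer 99
    orphan_inserted = True
except sqlite3.IntegrityError:
    orphan_inserted = False

try:
    conn.execute("INSERT INTO customers VALUES (1, 'duplicate')")  # primary key clash
    duplicate_inserted = True
except sqlite3.IntegrityError:
    duplicate_inserted = False

print(orphan_inserted, duplicate_inserted)
conn.close()
```

Both invalid inserts are rejected with an integrity error: the first by the foreign-key constraint, the second by the primary-key constraint.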

More ML Concepts

39) Feature engineering’s purpose?

A) Select the best algorithm
B) Create new features to improve models
C) Remove highly correlated features
D) Choose optimal hyperparameters

Answer: B. Using domain knowledge to craft predictive features.

40) Handle imbalanced datasets?

A) Remove majority class
B) Undersample minority
C) Oversample minority (e.g., SMOTE)
D) Use a model unaffected by imbalance

Answer: C. Oversampling (or class-weighted training) helps balance.
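
As a rough sketch of the idea, the code below balances a toy dataset by random oversampling, that is, by duplicating minority rows. This is a simpler technique than SMOTE, which synthesizes new minority points by interpolating between neighbors. The dataset and the fixed seed are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical dataset of (feature, label) pairs: 8 negatives, 2 positives.
data = [(i, 0) for i in range(8)] + [(100, 1), (101, 1)]

majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Random oversampling: draw minority rows with replacement until classes match.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra

n_neg = sum(1 for _, label in balanced if label == 0)
n_pos = sum(1 for _, label in balanced if label == 1)
print(n_neg, n_pos)
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class distribution.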

41) Machine Learning vs Deep Learning?

A) Same thing
B) Deep learning is an ML subset using multi-layer neural networks
C) ML uses neural nets; DL does not
D) DL is always more accurate

Answer: B. DL specializes in deep neural architectures.

General & Tools

42) Main purpose of git?

A) Write Python code
B) Version control source code
C) Statistical analysis
D) Create web apps

Answer: B. Git tracks changes and supports collaboration.

43) Example of an unsupervised task?

A) Predict house prices
B) Spam classification
C) Customer segmentation
D) Churn prediction

Answer: C. Clustering users into segments.

44) Handling high dimensionality?

A) Linear regression
B) Simple logistic regression
C) PCA (dimensionality reduction)
D) Min–max scaling

Answer: C. PCA projects the data onto fewer components while retaining as much variance as possible.

45) Difference between Machine Learning and Deep Learning?

A) Same
B) Deep learning ⊂ ML using multi-layer NNs
C) ML uses NNs; DL doesn’t
D) DL always more accurate

Answer: B. (Same as Q41; included for completeness.)

More SQL

46) INNER JOIN in SQL is used to…

A) Return all left rows + matched right
B) Return all rows from both with NULLs where no match
C) Return rows with at least one match in both tables
D) Return all right rows + matched left

Answer: C. Matches only.

47) Common metric to evaluate regression?

A) Accuracy
B) F1-Score
C) Mean Squared Error (MSE)
D) Precision

Answer: C. (Reinforces Q15.)

48) Purpose of Normalization/Standardization?

A) Remove outliers
B) Change dtypes
C) Scale features to common range for better learning
D) Handle missing values

Answer: C. Distance-based and gradient-based methods are sensitive to feature scale, so inputs are usually put on a comparable scale first.
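
The two most common rescalings can be written in a few lines. The `values` list is a made-up feature column; the z-score here uses the population standard deviation, which is an assumption (some tools use the sample standard deviation instead).

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical feature column

# Min-max normalization: rescale to the range [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): zero mean, unit standard deviation.
mu, sigma = mean(values), pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

print(min_max)
print(z_scores)
```

Min-max scaling bounds the values but is sensitive to extreme points, while standardization is unbounded and centers the data instead.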

49) What is a Primary Key’s purpose?

A) Create relationships between tables
B) Ensure integrity by uniquely identifying each record
C) Store long text
D) Perform complex calculations

Answer: B. Unique, non-NULL identifier per row.

50) What does SELECT do in SQL?

A) Update rows
B) Delete rows
C) Retrieve data
D) Insert rows

Answer: C. Retrieves data from one or more tables.

Tip: For each question in interviews, first state your choice, then justify it with a one-line definition or formula and, if time permits, a 5–10 second example.

Fabrum Planet Solutions Pvt. Ltd.
