Multiple Choice Test: L1 Data Scientist Interview Questions and Mockup

This multiple-choice test is designed to assess foundational knowledge in data science, including machine learning, statistics, programming, and database concepts. It’s suitable for entry-level or L1 data scientist roles and can be used for self-assessment or interview preparation.

Machine Learning Fundamentals

1) What is the primary difference between supervised and unsupervised learning?

  • A) Supervised learning uses labeled data, while unsupervised learning uses unlabeled data.
  • B) Supervised learning is used for clustering; unsupervised is for classification.
  • C) Supervised requires lots of data; unsupervised needs little.
  • D) Supervised is always more accurate than unsupervised.
Answer: A. Supervised learning trains on labeled examples to predict outcomes; unsupervised finds structure in unlabeled data.

2) Which of the following is an example of a supervised learning algorithm?

  • A) K-Means Clustering
  • B) Principal Component Analysis (PCA)
  • C) Linear Regression
  • D) Apriori Algorithm
Answer: C. Linear Regression models a target variable from features—classic supervised learning.

3) What is overfitting in a machine learning model?

  • A) Great performance on both train and test sets
  • B) Model too simple to capture patterns
  • C) Model learns noise in training data and fails to generalize
  • D) Model cannot learn from the training data at all
Answer: C. Overfitting fits noise/irrelevancies, hurting performance on unseen data.

4) What is a Confusion Matrix used for?

  • A) Visualize relation of two continuous vars
  • B) Evaluate a classification model
  • C) Dimensionality reduction
  • D) Pick number of clusters
Answer: B. It tabulates TP, FP, TN, FN to assess classifiers.

5) In a classification problem, what does ‘Recall’ measure?

  • A) Proportion of predicted positives that are correct
  • B) Proportion of actual positives correctly identified
  • C) Overall accuracy
  • D) Balance between FP and FN
Answer: B. Recall (sensitivity) = TP / (TP + FN).
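A minimal sketch for Q4 and Q5, assuming binary labels where 1 is the positive class. The label lists are made-up illustration data; the code counts the four confusion-matrix cells and derives recall and precision from them.

```python
# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]


def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were right
print(tp, fp, tn, fn, recall, precision)  # 3 1 3 1 0.75 0.75
```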

6) Which is a common method to handle missing values?

  • A) Drop rows with missing values
  • B) Impute with mean/median/mode
  • C) Predict missing values with a model
  • D) All of the above
Answer: D. Strategy depends on context and data patterns.
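To make options A and B of Q6 concrete, here is a small sketch in plain Python (in practice this is usually done with Pandas). The `ages` column is hypothetical, and `None` stands in for a missing value.

```python
from statistics import mean

ages = [25, None, 31, 40, None]  # hypothetical column with missing entries

# Option A: drop the missing entries.
dropped = [a for a in ages if a is not None]

# Option B: impute with the mean of the observed values.
fill_value = mean(dropped)
imputed = [fill_value if a is None else a for a in ages]

print(dropped)  # [25, 31, 40]
print(imputed)  # [25, 32, 31, 40, 32]
```

Dropping loses rows; imputing keeps them but shrinks the column's variance, which is one reason the right choice depends on context.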

7) Purpose of cross-validation?

  • A) Speed up training
  • B) Test on entire dataset at once
  • C) Prevent overfitting and estimate generalization
  • D) Remove outliers
Answer: C. CV estimates generalization by scoring the model on held-out folds; strictly, it detects overfitting rather than preventing it.
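A sketch of how k-fold cross-validation splits the data, assuming contiguous, unshuffled folds. The helper name `k_fold_indices` is made up for illustration; real projects would typically shuffle first or use a library splitter.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(6, 3))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample lands in the test fold exactly once, so every score is computed on data the model did not see during that round of training.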

8) What is the Bias–Variance Tradeoff?

  • A) Speed vs accuracy
  • B) Simplicity (bias) vs fit/complexity (variance)
  • C) Cost of data vs value
  • D) Linear vs non-linear models
Answer: B. We balance underfitting (high bias) and overfitting (high variance).

9) Which is a drawback of Linear Regression?

  • A) Too complex to implement
  • B) Only for classification
  • C) Assumes linear relationship between features and target
  • D) Extremely sensitive to missing values
Answer: C. Violations of linearity harm performance.

10) What is the main goal of k-means clustering?

  • A) Classify into predefined categories
  • B) Fit a linear relationship
  • C) Partition data into k clusters by nearest mean
  • D) Reduce features
Answer: C. Each point is assigned to the closest centroid.
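The assign-then-update loop behind k-means can be sketched on one-dimensional data. This is an illustrative toy, not a production implementation: the points and starting centroids are made up, and the iteration count is fixed rather than checked for convergence.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Run a fixed number of k-means iterations on 1-D points."""
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```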

Statistics & Model Evaluation

11) What is a p-value?

  • A) Measure of relationship between variables
  • B) Probability of the observed (or more extreme) result under the null
  • C) Dataset size
  • D) Model accuracy
Answer: B. Small p-value suggests evidence against the null.
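As a worked example of the definition in option B, the sketch below computes an exact one-sided p-value for a hypothetical coin-flip experiment: 9 heads in 10 flips, with the null hypothesis that the coin is fair. The function name is made up for illustration.

```python
from math import comb


def binomial_p_value_upper(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): a one-sided (upper-tail) p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of seeing 9 or more heads in 10 flips of a fair coin.
p_value = binomial_p_value_upper(n=10, k=9)
print(round(p_value, 4))  # 0.0107
```

Only 11 of the 1024 equally likely outcomes have 9 or more heads, so a result this extreme would be unusual if the coin were fair.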

12) R-squared in linear regression represents…

  • A) Model p-value
  • B) Mean of target
  • C) Proportion of variance explained by features
  • D) Std. dev. of residuals
Answer: C. Higher R² ≈ more variance explained (beware overfit).
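R² can be computed directly from its definition, 1 - SS_res / SS_tot. A minimal sketch with made-up target and prediction values:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


targets = [1.0, 2.0, 3.0, 4.0]       # hypothetical observed values
predictions = [1.0, 2.0, 3.0, 5.0]   # hypothetical model output

r2 = r_squared(targets, predictions)
baseline_r2 = r_squared(targets, [2.5] * 4)  # always predicting the mean
print(round(r2, 3), baseline_r2)  # 0.8 0.0
```

A model that always predicts the mean of the target scores exactly 0, which is the baseline R² is measured against.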

13) Type I Error (False Positive) is…

  • A) Predict positive when actual is negative
  • B) Predict negative when actual is positive
  • C) No prediction
  • D) Random prediction
Answer: A. A false alarm; in hypothesis testing, rejecting a null hypothesis that is actually true.

14) Type II Error (False Negative) is…

  • A) Predict positive, actual negative
  • B) Predict negative, actual positive
  • C) No prediction
  • D) Random prediction
Answer: B. A missed detection; in hypothesis testing, failing to reject a null hypothesis that is actually false.

15) Common metric to evaluate regression?

  • A) Accuracy
  • B) F1-Score
  • C) Mean Squared Error (MSE)
  • D) Precision
Answer: C. MSE measures average squared residuals.
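A short sketch of MSE (and its square root, RMSE) on made-up values:

```python
from math import sqrt


def mean_squared_error(actual, predicted):
    """Average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Residuals are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
rmse = sqrt(mse)  # same units as the target, often easier to interpret
print(round(mse, 4), round(rmse, 4))  # 1.6667 1.291
```

Squaring means the single residual of 2 contributes four times as much as the residual of 1, so MSE penalizes large errors heavily.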

16) L1 regularization (Lasso) does what?

  • A) Penalty = square of coefficients
  • B) Penalty = absolute value of coefficients
  • C) Penalty for number of features
  • D) Penalty for number of data points
Answer: B. Drives some coefficients to zero → feature selection.

17) Which of these is a type of Regularization?

  • A) Logistic Regression
  • B) K-Means
  • C) Ridge Regression
  • D) SVM
Answer: C. Ridge = L2 penalty to reduce variance.
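Why L1 zeroes coefficients while L2 only shrinks them can be seen in a simplified one-coefficient setting. The sketch assumes each coefficient is penalized independently around its unpenalized estimate `b` (as with orthonormal features); the coefficient values and `lam` are made up, and real Lasso and Ridge solvers handle correlated features differently.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: the minimizer of 0.5*(x - b)**2 + lam*abs(x)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b, lam):
    """The minimizer of 0.5*(x - b)**2 + 0.5*lam*x**2."""
    return b / (1 + lam)  # scaled toward zero, never exactly zero


ols_coefs = [3.0, -0.4, 0.2, -2.0]  # hypothetical unpenalized coefficients
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols_coefs]
ridge = [ridge_shrink(b, lam) for b in ols_coefs]
print(lasso)  # [2.5, 0.0, 0.0, -1.5]
print([round(r, 3) for r in ridge])  # [2.0, -0.267, 0.133, -1.333]
```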

Python & Data Handling

18) Which Python library is most used for data manipulation?

  • A) Scikit-learn
  • B) Matplotlib
  • C) Pandas
  • D) Seaborn
Answer: C. Pandas DataFrame/Series power tabular work.

19) Main difference between list and tuple?

  • A) List immutable, tuple mutable
  • B) List ordered, tuple unordered
  • C) List mutable, tuple immutable
  • D) List only strings, tuple any type
Answer: C. Lists changeable; tuples fixed.
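A quick demonstration of the difference, using made-up values:

```python
scores_list = [1, 2, 3]
scores_list[0] = 99      # item assignment works on a list
scores_list.append(4)    # and so does growing it

scores_tuple = (1, 2, 3)
try:
    scores_tuple[0] = 99  # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(scores_list)    # [99, 2, 3, 4]
print(tuple_changed)  # False
```

Because tuples are immutable, they are hashable when their elements are, so they can serve as dictionary keys where lists cannot.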

20) Purpose of pip?

  • A) Create/manage virtual envs
  • B) Install/manage external Python packages
  • C) Run Python scripts
  • D) Debug Python code
Answer: B. pip installs libraries from PyPI.

21) What is a DataFrame in Pandas?

  • A) One-dimensional labeled array
  • B) Two-dimensional labeled table with mixed dtypes
  • C) Plotting object
  • D) Numerical computation object
Answer: B. Primary tabular structure in Pandas.

22) Which is not a common Python viz library?

  • A) Matplotlib
  • B) Seaborn
  • C) Plotly
  • D) SciPy
Answer: D. SciPy is for scientific computing.

23) In Python, a dictionary is…

  • A) Ordered sequence of items
  • B) Collection of key–value pairs
  • C) Numeric-only structure
  • D) Immutable structure
Answer: B. Dicts map keys to values.

24) How to check dtypes in a Pandas DataFrame?

  • A) df.describe()
  • B) df.info()
  • C) df.head()
  • D) df.shape
Answer: B. df.info() shows dtypes and non-null counts; df.dtypes returns the dtypes alone.

25) What is NaN in Pandas?

  • A) A string
  • B) Zero
  • C) Missing/Not-a-Number value
  • D) Boolean
Answer: C. Standard representation for missing data.

26) Primary function of NumPy?

  • A) Visualization
  • B) Numerical computing with n-dim arrays
  • C) Web scraping
  • D) Build ML models
Answer: B. NumPy powers vectorized computation.

27) scikit-learn is primarily used for…

  • A) Data visualization
  • B) Database management
  • C) Machine learning
  • D) Web development
Answer: C. It provides ML algorithms and utilities.

SQL & Databases

28) What is a primary key?

  • A) Column with unique value per row
  • B) Column linking to a different table
  • C) Column that allows duplicates
  • D) Column that stores text
Answer: A. Uniquely identifies each record.

29) Which SQL command retrieves data?

  • A) UPDATE
  • B) DELETE
  • C) SELECT
  • D) INSERT
Answer: C. SELECT fetches results from tables.

30) What does SELECT * FROM table_name do?

  • A) Selects a random row
  • B) Selects all columns
  • C) Counts rows
  • D) Deletes the table
Answer: B. Asterisk selects all columns.

31) Purpose of the WHERE clause?

  • A) Group rows
  • B) Filter records by condition
  • C) Sort results
  • D) Join tables
Answer: B. Filters rows before aggregation.

32) Purpose of GROUP BY?

  • A) Sort results
  • B) Filter rows before grouping
  • C) Aggregate rows sharing same values
  • D) Combine data from two tables
Answer: C. Groups rows to compute aggregates.

33) HAVING clause does what?

  • A) Filter before grouping
  • B) Sort results
  • C) Filter groups after GROUP BY
  • D) Select columns
Answer: C. Applies conditions to aggregates.
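Q31 to Q33 fit together in a single query: WHERE filters rows, GROUP BY forms groups, HAVING filters those groups. The sketch below runs such a query against an in-memory SQLite database through Python's `sqlite3` module; the `orders` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", 50, "paid"),
        ("ann", 70, "paid"),
        ("bob", 20, "paid"),
        ("bob", 500, "cancelled"),
        ("cy", 200, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'        -- row filter, applied before grouping
    GROUP BY customer            -- one output row per customer
    HAVING SUM(amount) > 100     -- group filter, applied after aggregation
    ORDER BY customer
    """
).fetchall()
conn.close()

print(rows)  # [('ann', 120.0), ('cy', 200.0)]
```

Bob's cancelled 500 order is removed by WHERE before any sum is taken, so his paid total of 20 fails the HAVING condition.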

34) INNER JOIN returns…

  • A) All rows from left + matched right
  • B) All rows from both with NULLs where no match
  • C) Rows where the join matches in both tables
  • D) All rows from right + matched left
Answer: C. Only matched rows.

35) JOIN vs LEFT JOIN?

  • A) LEFT JOIN returns all left rows, INNER JOIN only matched rows
  • B) LEFT JOIN requires matches in both
  • C) They’re identical
  • D) LEFT JOIN is only for >2 tables
Answer: A. LEFT keeps unmatched left rows with NULLs on right.
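The difference is easiest to see when one left-table row has no match. A sketch using an in-memory SQLite database via `sqlite3`, with hypothetical `customers` and `orders` tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 50);  -- bob has no orders
    """
)

inner_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

left_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
conn.close()

print(inner_rows)  # [('ann', 50)]
print(left_rows)   # [('ann', 50), ('bob', None)]
```

SQL NULL comes back as Python `None`, which is how the unmatched right-side column shows up for bob.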

36) UNION vs UNION ALL?

  • A) UNION removes dups; UNION ALL keeps them
  • B) UNION requires same dtypes; UNION ALL does not
  • C) UNION ALL is faster
  • D) A and C
Answer: D. UNION de-duplicates (slower); UNION ALL doesn’t (faster).
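A minimal check of the de-duplication behavior, run through `sqlite3` on literal values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

union_rows = conn.execute(
    "SELECT 1 AS x UNION SELECT 1 UNION SELECT 2 ORDER BY x"
).fetchall()

union_all_rows = conn.execute(
    "SELECT 1 AS x UNION ALL SELECT 1 UNION ALL SELECT 2 ORDER BY x"
).fetchall()
conn.close()

print(union_rows)      # [(1,), (2,)]
print(union_all_rows)  # [(1,), (1,), (2,)]
```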

37) What is a JOIN condition?

  • A) The type of join
  • B) The column(s) linking two tables
  • C) Number of rows in result
  • D) Output order
Answer: B. It specifies the keys to match.

38) What is a foreign key?

  • A) Uniquely identifies a row in a table
  • B) Column in one table referencing another table’s primary key
  • C) Column that cannot be NULL
  • D) Encrypted column
Answer: B. Enforces referential integrity across tables.
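Referential integrity can be observed directly: an insert that points at a non-existent parent row is rejected. The sketch uses `sqlite3` with hypothetical `customers` and `orders` tables. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection; other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
    INSERT INTO customers VALUES (1, 'ann');
    INSERT INTO orders VALUES (10, 1);  -- valid: customer 1 exists
    """
)

try:
    # Customer 999 does not exist, so this violates the foreign key.
    conn.execute("INSERT INTO orders VALUES (11, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()

print(rejected, order_count)  # True 1
```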

More ML Concepts

39) Feature engineering’s purpose?

  • A) Select the best algorithm
  • B) Create new features to improve models
  • C) Remove highly correlated features
  • D) Choose optimal hyperparameters
Answer: B. Using domain knowledge to craft predictive features.

40) Handle imbalanced datasets?

  • A) Remove majority class
  • B) Undersample minority
  • C) Oversample minority (e.g., SMOTE)
  • D) Use a model unaffected by imbalance
Answer: C. Oversampling (or class-weighted training) helps balance.
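The sketch below shows plain random oversampling, which duplicates existing minority rows until the classes are balanced. It is not SMOTE: SMOTE synthesizes new points by interpolating between minority neighbors. The helper name, the data, and the binary-label assumption are all for illustration only.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal counts.

    Assumes a binary problem; should be applied to the training split only.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    n_extra = max(0, majority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return rows + extra, labels + [minority_label] * n_extra


rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]  # hypothetical feature rows
labels = [0, 0, 0, 0, 1]                    # class 1 is the minority

new_rows, new_labels = random_oversample(rows, labels, minority_label=1)
print(new_labels.count(0), new_labels.count(1))  # 4 4
```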

41) Machine Learning vs Deep Learning?

  • A) Same thing
  • B) Deep learning is an ML subset using multi-layer neural networks
  • C) ML uses neural nets; DL does not
  • D) DL is always more accurate
Answer: B. DL specializes in deep neural architectures.

General & Tools

42) Main purpose of git?

  • A) Write Python code
  • B) Version control source code
  • C) Statistical analysis
  • D) Create web apps
Answer: B. Git tracks changes and supports collaboration.

43) Example of an unsupervised task?

  • A) Predict house prices
  • B) Spam classification
  • C) Customer segmentation
  • D) Churn prediction
Answer: C. Clustering users into segments.

44) Handling high dimensionality?

  • A) Linear regression
  • B) Simple logistic regression
  • C) PCA (dimensionality reduction)
  • D) Min–max scaling
Answer: C. PCA projects the data onto fewer components while retaining most of the variance.

45) Difference between Machine Learning and Deep Learning?

  • A) Same
  • B) Deep learning ⊂ ML using multi-layer NNs
  • C) ML uses NNs; DL doesn’t
  • D) DL always more accurate
Answer: B. (Same as Q41; included for completeness.)

More SQL & Mixed Review

46) INNER JOIN in SQL is used to…

  • A) Return all left rows + matched right
  • B) Return all rows from both with NULLs where no match
  • C) Return rows with at least one match in both tables
  • D) Return all right rows + matched left
Answer: C. Matches only.

47) Common metric to evaluate regression?

  • A) Accuracy
  • B) F1-Score
  • C) Mean Squared Error (MSE)
  • D) Precision
Answer: C. (Reinforces Q15.)

48) Purpose of Normalization/Standardization?

  • A) Remove outliers
  • B) Change dtypes
  • C) Scale features to common range for better learning
  • D) Handle missing values
Answer: C. Distance-based and gradient-based methods (e.g., k-NN, k-means, regularized regression) behave better when features share a similar scale.
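The two most common rescalings can be sketched in a few lines: min-max normalization maps values to [0, 1], and standardization gives mean 0 and standard deviation 1. The `incomes` values are made up, and the population standard deviation is used here as an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


incomes = [20.0, 40.0, 60.0, 80.0, 100.0]  # hypothetical feature column

scaled = min_max_scale(incomes)
z_scores = standardize(incomes)
print(scaled)                           # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in z_scores])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

In a real pipeline the scaling parameters (min, max, mean, standard deviation) are computed on the training split and reused on the test split.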

49) What is a Primary Key’s purpose?

  • A) Create relationships between tables
  • B) Ensure integrity by uniquely identifying each record
  • C) Store long text
  • D) Perform complex calculations
Answer: B. Unique, non-NULL identifier per row.

50) What does SELECT do in SQL?

  • A) Update rows
  • B) Delete rows
  • C) Retrieve data
  • D) Insert rows
Answer: C. Retrieves data from one or more tables.

Tip: In interviews, state your choice first, then justify it with a one-line definition or formula and, if time permits, a 5–10 second example.
