Multiple Choice Test: L1 Data Scientist Interview Questions and Mockup

by Ashwani Singh

This multiple-choice test is designed to assess foundational knowledge in data science, including machine learning, statistics, programming, and database concepts. It’s suitable for entry-level or L1 data scientist roles and can be used for self-assessment or interview preparation.

Machine Learning Fundamentals

1) What is the primary difference between supervised and unsupervised learning?

  • A) Supervised learning uses labeled data, while unsupervised learning uses unlabeled data.
  • B) Supervised learning is used for clustering; unsupervised is for classification.
  • C) Supervised requires lots of data; unsupervised needs little.
  • D) Supervised is always more accurate than unsupervised.
Answer: A. Supervised learning trains on labeled examples to predict outcomes; unsupervised finds structure in unlabeled data.

2) Which of the following is an example of a supervised learning algorithm?

  • A) K-Means Clustering
  • B) Principal Component Analysis (PCA)
  • C) Linear Regression
  • D) Apriori Algorithm
Answer: C. Linear Regression models a target variable from features—classic supervised learning.

3) What is overfitting in a machine learning model?

  • A) Great performance on both train and test sets
  • B) Model too simple to capture patterns
  • C) Model learns noise in training data and fails to generalize
  • D) Model cannot learn from the training data at all
Answer: C. An overfit model fits noise and irrelevant detail in the training data, so it performs worse on unseen data.

4) What is a Confusion Matrix used for?

  • A) Visualize relation of two continuous vars
  • B) Evaluate a classification model
  • C) Dimensionality reduction
  • D) Pick number of clusters
Answer: B. It tabulates TP, FP, TN, FN to assess classifiers.

5) In a classification problem, what does ‘Recall’ measure?

  • A) Proportion of predicted positives that are correct
  • B) Proportion of actual positives correctly identified
  • C) Overall accuracy
  • D) Balance between FP and FN
Answer: B. Recall (sensitivity) = TP / (TP + FN).
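As a rough sketch, the confusion-matrix counts and recall (with precision for contrast) can be computed by hand. The labels below are made-up illustration data, not output from any real model.

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

recall = tp / (tp + fn)     # share of actual positives that were found
precision = tp / (tp + fp)  # share of predicted positives that were right

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("recall:", recall, "precision:", precision)
```

Here recall and precision differ because the model raises two false alarms but misses only one positive.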

6) Which is a common method to handle missing values?

  • A) Drop rows with missing values
  • B) Impute with mean/median/mode
  • C) Predict missing values with a model
  • D) All of the above
Answer: D. Strategy depends on context and data patterns.
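A minimal sketch of options A and B, using plain Python lists with None marking the gaps. The `ages` column is hypothetical; model-based imputation (option C) is not shown here.

```python
from statistics import mean, median

# Hypothetical column with missing entries marked as None.
ages = [25, None, 31, 40, None, 28]

# Option A: drop the missing entries.
observed = [a for a in ages if a is not None]

# Option B: impute with a summary statistic of the observed values.
mean_age = mean(observed)
median_age = median(observed)
mean_imputed = [mean_age if a is None else a for a in ages]
median_imputed = [median_age if a is None else a for a in ages]

print(observed)
print(mean_imputed)
print(median_imputed)
```

Dropping shrinks the sample, while imputing keeps every row but pulls the column toward its centre, which is one reason the right choice depends on context.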

7) Purpose of cross-validation?

  • A) Speed up training
  • B) Test on entire dataset at once
  • C) Estimate generalization and help detect overfitting
  • D) Remove outliers
Answer: C. CV trains and evaluates on different folds, so every score comes from data the model did not see during fitting.
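To make the fold idea concrete, here is a small sketch of how k-fold splitting can partition sample indices. The helper name `k_fold_indices` is made up for this example, and it splits contiguously without shuffling, which real workflows usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in exactly one test fold, so averaging the k scores uses every row for evaluation once.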

8) What is the Bias–Variance Tradeoff?

  • A) Speed vs accuracy
  • B) Simplicity (bias) vs fit/complexity (variance)
  • C) Cost of data vs value
  • D) Linear vs non-linear models
Answer: B. We balance underfitting (high bias) and overfitting (high variance).

9) Which is a drawback of Linear Regression?

  • A) Too complex to implement
  • B) Only for classification
  • C) Assumes linear relationship between features and target
  • D) Extremely sensitive to missing values
Answer: C. When the true relationship is non-linear, a linear fit underperforms.

10) What is the main goal of k-means clustering?

  • A) Classify into predefined categories
  • B) Fit a linear relationship
  • C) Partition data into k clusters by nearest mean
  • D) Reduce features
Answer: C. Each point is assigned to the closest centroid.
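The assign-then-update loop behind k-means can be sketched on one-dimensional data. The function name `kmeans_1d`, the points, and the starting centroids are all assumptions chosen for illustration; library implementations add smarter initialisation and convergence checks.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Run a fixed number of assign/update rounds on 1-D data."""
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 2.0])
print(centroids)
print(clusters)
```

Even with a poor start (both centroids on the left), the two rounds of reassignment separate the low and high groups.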

Statistics & Model Evaluation

11) What is a p-value?

  • A) Measure of relationship between variables
  • B) Probability of the observed (or a more extreme) result, assuming the null hypothesis is true
  • C) Dataset size
  • D) Model accuracy
Answer: B. A small p-value is evidence against the null hypothesis; it is not the probability that the null is true.

12) R-squared in linear regression represents…

  • A) Model p-value
  • B) Mean of target
  • C) Proportion of variance explained by features
  • D) Std. dev. of residuals
Answer: C. A higher R² means more variance explained, but on training data it never decreases as features are added, so it can hide overfitting.
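For reference, R² can be computed as 1 minus the ratio of residual to total sum of squares. The numbers below are invented for the sketch.

```python
# Hypothetical targets and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A model that always predicts the mean gets R² = 0, and a perfect fit gets R² = 1.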

13) Type I Error (False Positive) is…

  • A) Predict positive when actual is negative
  • B) Predict negative when actual is positive
  • C) No prediction
  • D) Random prediction
Answer: A. False alarm.

14) Type II Error (False Negative) is…

  • A) Predict positive, actual negative
  • B) Predict negative, actual positive
  • C) No prediction
  • D) Random prediction
Answer: B. Missed detection.

15) Common metric to evaluate regression?

  • A) Accuracy
  • B) F1-Score
  • C) Mean Squared Error (MSE)
  • D) Precision
Answer: C. MSE measures average squared residuals.
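A short sketch of MSE on made-up numbers, with RMSE added because it puts the error back in the target's units.

```python
import math

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
```

Squaring means the single miss of 2 contributes four times as much as each miss of 1, which is why MSE is sensitive to large errors.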

16) L1 regularization (Lasso) does what?

  • A) Penalty = square of coefficients
  • B) Penalty = absolute value of coefficients
  • C) Penalty for number of features
  • D) Penalty for number of data points
Answer: B. Drives some coefficients to zero → feature selection.
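One way to see why L1 zeroes coefficients is a simplified one-coefficient problem, where both penalties have closed-form solutions. This is a sketch under that simplification (each coefficient shrunk independently from an unpenalized estimate `z`), not a full Lasso solver; the function names and values are made up.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b), i.e. soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2."""
    return z / (1 + lam)


unpenalized = [3.0, -0.4, 0.2, -2.0]  # hypothetical unpenalized coefficients
lam = 0.5

lasso_coefs = [lasso_shrink(z, lam) for z in unpenalized]
ridge_coefs = [ridge_shrink(z, lam) for z in unpenalized]

print("L1:", lasso_coefs)
print("L2:", ridge_coefs)
```

The L1 version sets the two small coefficients exactly to zero, while the L2 version only scales every coefficient toward zero.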

17) Which of these is a type of Regularization?

  • A) Logistic Regression
  • B) K-Means
  • C) Ridge Regression
  • D) SVM
Answer: C. Ridge = L2 penalty to reduce variance.

Python & Data Handling

18) Which Python library is most used for data manipulation?

  • A) Scikit-learn
  • B) Matplotlib
  • C) Pandas
  • D) Seaborn
Answer: C. Pandas DataFrame/Series power tabular work.

19) Main difference between list and tuple?

  • A) List immutable, tuple mutable
  • B) List ordered, tuple unordered
  • C) List mutable, tuple immutable
  • D) List only strings, tuple any type
Answer: C. Lists changeable; tuples fixed.
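A quick illustration of the difference, using made-up scores. The last lines show a practical consequence: a tuple of immutable values is hashable, so it can serve as a dictionary key.

```python
scores_list = [70, 85, 90]
scores_list[0] = 75      # lists support item assignment
scores_list.append(60)   # and can grow in place

scores_tuple = (70, 85, 90)
try:
    scores_tuple[0] = 75  # tuples do not support item assignment
    tuple_error = None
except TypeError as exc:
    tuple_error = exc

# A tuple of immutable values is hashable, so it can be a dict key.
city_by_coords = {(12.97, 77.59): "Bengaluru"}

print(scores_list)
print(type(tuple_error).__name__)
print(city_by_coords[(12.97, 77.59)])
```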

20) Purpose of pip?

  • A) Create/manage virtual envs
  • B) Install/manage external Python packages
  • C) Run Python scripts
  • D) Debug Python code
Answer: B. pip installs libraries from PyPI.

21) What is a DataFrame in Pandas?

  • A) One-dimensional labeled array
  • B) Two-dimensional labeled table with mixed dtypes
  • C) Plotting object
  • D) Numerical computation object
Answer: B. Primary tabular structure in Pandas.

22) Which is not a common Python viz library?

  • A) Matplotlib
  • B) Seaborn
  • C) Plotly
  • D) SciPy
Answer: D. SciPy is for scientific computing.

23) In Python, a dictionary is…

  • A) Ordered sequence of items
  • B) Collection of key–value pairs
  • C) Numeric-only structure
  • D) Immutable structure
Answer: B. Dicts map keys to values.

24) How to check dtypes in a Pandas DataFrame?

  • A) df.describe()
  • B) df.info()
  • C) df.head()
  • D) df.shape
Answer: B. df.info() shows dtypes and non-null counts; the df.dtypes attribute lists the dtypes alone.

25) What is NaN in Pandas?

  • A) A string
  • B) Zero
  • C) Missing/Not-a-Number value
  • D) Boolean
Answer: C. Standard representation for missing data.
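Pandas stores missing floating-point values as the IEEE 754 NaN, which behaves like Python's own float("nan"). The sketch below uses only the standard library to show the two properties that most often surprise people; the `values` list is made up.

```python
import math

nan = float("nan")
values = [4.0, nan, 6.0]

nan_equals_itself = (nan == nan)   # NaN compares unequal to everything
total = sum(values)                # NaN propagates through arithmetic

# Test for NaN with math.isnan, never with ==.
clean = [v for v in values if not math.isnan(v)]
mean_clean = sum(clean) / len(clean)

print(nan_equals_itself)
print(math.isnan(total))
print(mean_clean)
```

Pandas reductions such as mean skip NaN by default, which corresponds to the filtered calculation here.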

26) Primary function of NumPy?

  • A) Visualization
  • B) Numerical computing with n-dim arrays
  • C) Web scraping
  • D) Build ML models
Answer: B. NumPy powers vectorized computation.

27) scikit-learn is primarily used for…

  • A) Data visualization
  • B) Database management
  • C) Machine learning
  • D) Web development
Answer: C. It provides ML algorithms and utilities.

SQL & Databases

28) What is a primary key?

  • A) Column with unique value per row
  • B) Column linking to a different table
  • C) Column that allows duplicates
  • D) Column that stores text
Answer: A. Uniquely identifies each record.

29) Which SQL command retrieves data?

  • A) UPDATE
  • B) DELETE
  • C) SELECT
  • D) INSERT
Answer: C. SELECT fetches results from tables.

30) What does SELECT * FROM table_name do?

  • A) Selects a random row
  • B) Selects all columns
  • C) Counts rows
  • D) Deletes the table
Answer: B. Asterisk selects all columns.

31) Purpose of the WHERE clause?

  • A) Group rows
  • B) Filter records by condition
  • C) Sort results
  • D) Join tables
Answer: B. Filters rows before aggregation.

32) Purpose of GROUP BY?

  • A) Sort results
  • B) Filter rows before grouping
  • C) Aggregate rows sharing same values
  • D) Combine data from two tables
Answer: C. Groups rows to compute aggregates.

33) HAVING clause does what?

  • A) Filter before grouping
  • B) Sort results
  • C) Filter groups after GROUP BY
  • D) Select columns
Answer: C. Applies conditions to aggregates.
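WHERE, GROUP BY, and HAVING can be seen working together in one query. This sketch runs it against an in-memory SQLite database through Python's sqlite3 module; the `orders` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", 120.0, "paid"),
        ("alice", 80.0, "paid"),
        ("bob", 50.0, "paid"),
        ("bob", 500.0, "cancelled"),
        ("carol", 300.0, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'        -- filters individual rows first
    GROUP BY customer            -- then groups the surviving rows
    HAVING SUM(amount) >= 100    -- then filters whole groups
    ORDER BY customer
    """
).fetchall()
conn.close()

print(rows)
```

Bob's cancelled order is removed by WHERE before grouping, and his remaining total of 50 is then removed by HAVING.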

34) INNER JOIN returns…

  • A) All rows from left + matched right
  • B) All rows from both with NULLs where no match
  • C) Rows where the join matches in both tables
  • D) All rows from right + matched left
Answer: C. Only matched rows.

35) JOIN vs LEFT JOIN?

  • A) LEFT JOIN returns all left rows, INNER JOIN only matched rows
  • B) LEFT JOIN requires matches in both
  • C) They’re identical
  • D) LEFT JOIN is only for >2 tables
Answer: A. LEFT keeps unmatched left rows with NULLs on right.
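The difference shows up as soon as one left-hand row has no match. A sketch with an in-memory SQLite database follows; the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 3, 300.0);
    """
)

inner_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```

Bob has no orders, so he is absent from the INNER JOIN result and appears once in the LEFT JOIN result with None (SQL NULL) for the amount.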

36) UNION vs UNION ALL?

  • A) UNION removes dups; UNION ALL keeps them
  • B) UNION requires same dtypes; UNION ALL does not
  • C) UNION ALL is faster
  • D) A and C
Answer: D. UNION de-duplicates (slower); UNION ALL doesn’t (faster).
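A small sketch of the row-count difference, again with in-memory SQLite. Both tables and the email addresses are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE online_customers (email TEXT);
    CREATE TABLE store_customers (email TEXT);
    INSERT INTO online_customers VALUES ('a@example.com'), ('b@example.com');
    INSERT INTO store_customers VALUES ('b@example.com'), ('c@example.com');
    """
)

union_rows = conn.execute(
    """
    SELECT email FROM online_customers
    UNION
    SELECT email FROM store_customers
    ORDER BY email
    """
).fetchall()

union_all_rows = conn.execute(
    """
    SELECT email FROM online_customers
    UNION ALL
    SELECT email FROM store_customers
    ORDER BY email
    """
).fetchall()
conn.close()

print(len(union_rows), union_rows)
print(len(union_all_rows), union_all_rows)
```

The shared address appears once under UNION and twice under UNION ALL; skipping the de-duplication step is what usually makes UNION ALL cheaper.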

37) What is a JOIN condition?

  • A) The type of join
  • B) The column(s) linking two tables
  • C) Number of rows in result
  • D) Output order
Answer: B. It specifies the keys to match.

38) What is a foreign key?

  • A) Uniquely identifies a row in a table
  • B) Column in one table referencing another table’s primary key
  • C) Column that cannot be NULL
  • D) Encrypted column
Answer: B. Enforces referential integrity across tables.
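Referential integrity can be observed by trying to insert a child row that points at a missing parent. The sketch below uses in-memory SQLite with hypothetical `customers` and `orders` tables; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas most other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign-key enforcement is off unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 120.0);
    """
)

try:
    # Customer 99 does not exist, so this row would be an orphan.
    conn.execute("INSERT INTO orders VALUES (11, 99, 50.0)")
    fk_error = None
except sqlite3.IntegrityError as exc:
    fk_error = exc

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()

print(type(fk_error).__name__, fk_error)
print(order_count)
```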

More ML Concepts

39) Feature engineering’s purpose?

  • A) Select the best algorithm
  • B) Create new features to improve models
  • C) Remove highly correlated features
  • D) Choose optimal hyperparameters
Answer: B. Using domain knowledge to craft predictive features.

40) Handle imbalanced datasets?

  • A) Remove majority class
  • B) Undersample minority
  • C) Oversample minority (e.g., SMOTE)
  • D) Use a model unaffected by imbalance
Answer: C. Oversampling (or class-weighted training) helps balance.
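The sketch below shows plain random oversampling, which duplicates minority rows by sampling with replacement. It is deliberately not SMOTE, which instead synthesizes new minority points by interpolating between neighbours. The rows and the fixed seed are assumptions for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical training rows as (label, row_id) pairs: 8 majority, 2 minority.
majority = [("majority", i) for i in range(8)]
minority = [("minority", i) for i in range(2)]

# Draw minority rows with replacement until both classes are the same size.
n_extra = len(majority) - len(minority)
extra = [rng.choice(minority) for _ in range(n_extra)]
balanced = majority + minority + extra

counts = {
    label: sum(1 for row_label, _ in balanced if row_label == label)
    for label in ("majority", "minority")
}
print(counts)
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.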

41) Machine Learning vs Deep Learning?

  • A) Same thing
  • B) Deep learning is an ML subset using multi-layer neural networks
  • C) ML uses neural nets; DL does not
  • D) DL is always more accurate
Answer: B. DL specializes in deep neural architectures.

General & Tools

42) Main purpose of git?

  • A) Write Python code
  • B) Version control source code
  • C) Statistical analysis
  • D) Create web apps
Answer: B. Git tracks changes and supports collaboration.

43) Example of an unsupervised task?

  • A) Predict house prices
  • B) Spam classification
  • C) Customer segmentation
  • D) Churn prediction
Answer: C. Clustering users into segments.

44) Handling high dimensionality?

  • A) Linear regression
  • B) Simple logistic regression
  • C) PCA (dimensionality reduction)
  • D) Min–max scaling
Answer: C. PCA projects the data onto fewer components while retaining most of the variance.

45) Difference between Machine Learning and Deep Learning?

  • A) Same
  • B) Deep learning ⊂ ML using multi-layer NNs
  • C) ML uses NNs; DL doesn’t
  • D) DL always more accurate
Answer: B. (Same as Q41; included for completeness.)

More SQL

46) INNER JOIN in SQL is used to…

  • A) Return all left rows + matched right
  • B) Return all rows from both with NULLs where no match
  • C) Return rows with at least one match in both tables
  • D) Return all right rows + matched left
Answer: C. Matches only.

47) Common metric to evaluate regression?

  • A) Accuracy
  • B) F1-Score
  • C) Mean Squared Error (MSE)
  • D) Precision
Answer: C. (Reinforces Q15.)

48) Purpose of Normalization/Standardization?

  • A) Remove outliers
  • B) Change dtypes
  • C) Scale features to common range for better learning
  • D) Handle missing values
Answer: C. Distance-based and gradient-based methods are sensitive to feature scale, so features are rescaled to comparable ranges.
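The two terms refer to different rescalings: min-max normalization maps values into [0, 1], while standardization (z-scoring) gives mean 0 and standard deviation 1. A sketch on a made-up `incomes` feature, using the population standard deviation:

```python
from statistics import mean, pstdev

incomes = [20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical feature values

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1].
lo, hi = min(incomes), max(incomes)
min_max_scaled = [(x - lo) / (hi - lo) for x in incomes]

# Standardization: (x - mean) / std, result has mean 0 and std 1.
mu = mean(incomes)
sigma = pstdev(incomes)
standardized = [(x - mu) / sigma for x in incomes]

print(min_max_scaled)
print(standardized)
```

In practice the min, max, mean, and standard deviation are computed on the training data and then reused to transform test data.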

49) What is a Primary Key’s purpose?

  • A) Create relationships between tables
  • B) Ensure integrity by uniquely identifying each record
  • C) Store long text
  • D) Perform complex calculations
Answer: B. Unique, non-NULL identifier per row.

50) What does SELECT do in SQL?

  • A) Update rows
  • B) Delete rows
  • C) Retrieve data
  • D) Insert rows
Answer: C. Retrieves data from one or more tables.

Tip: In interviews, state your choice first, then justify it with a one-line definition or formula, and, if time permits, give a 5–10 second example.
