This multiple-choice test is designed to assess foundational knowledge in data science, including machine learning, statistics, programming, and database concepts. It’s suitable for entry-level or L1 data scientist roles and can be used for self-assessment or interview preparation.
Table of Contents
- Machine Learning Fundamentals
- Statistics & Model Evaluation
- Python & Data Handling
- SQL & Databases
- More ML Concepts
- General & Tools
- More SQL
Machine Learning Fundamentals
1) What is the primary difference between supervised and unsupervised learning?
- A) Supervised learning uses labeled data, while unsupervised learning uses unlabeled data.
- B) Supervised learning is used for clustering; unsupervised is for classification.
- C) Supervised requires lots of data; unsupervised needs little.
- D) Supervised is always more accurate than unsupervised.
2) Which of the following is an example of a supervised learning algorithm?
- A) K-Means Clustering
- B) Principal Component Analysis (PCA)
- C) Linear Regression
- D) Apriori Algorithm
3) What is overfitting in a machine learning model?
- A) Great performance on both train and test sets
- B) Model too simple to capture patterns
- C) Model learns noise in training data and fails to generalize
- D) Model cannot learn from the training data at all
4) What is a Confusion Matrix used for?
- A) Visualize relation of two continuous vars
- B) Evaluate a classification model
- C) Dimensionality reduction
- D) Pick number of clusters
5) In a classification problem, what does ‘Recall’ measure?
- A) Proportion of predicted positives that are correct
- B) Proportion of actual positives correctly identified
- C) Overall accuracy
- D) Balance between FP and FN
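As a study aid for questions 4 and 5, the sketch below tallies the four cells of a binary confusion matrix and derives precision and recall from them. The labels in `y_true` and `y_pred` are made-up illustrative data, and `confusion_counts` is a hypothetical helper, not a library function.

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Precision: of everything predicted positive, how much was right?
precision = tp / (tp + fp)
# Recall: of everything actually positive, how much was found?
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Precision and recall share the same numerator (true positives) and differ only in the denominator, which is a handy one-line justification in an interview.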
6) Which is a common method to handle missing values?
- A) Drop rows with missing values
- B) Impute with mean/median/mode
- C) Predict missing values with a model
- D) All of the above
7) Purpose of cross-validation?
- A) Speed up training
- B) Test on entire dataset at once
- C) Estimate generalization performance and help detect overfitting
- D) Remove outliers
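To make the mechanics behind question 7 concrete, here is a minimal sketch of how k-fold cross-validation partitions sample indices. `k_fold_indices` is a hypothetical helper written for illustration; it uses contiguous folds without shuffling, which real workflows often add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Simplifying assumption: no shuffling, so the data should not be
    ordered in a way that correlates with the target.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train_idx} test={test_idx}")
```

Each sample lands in the held-out fold exactly once, so averaging the k held-out scores uses every row for evaluation without ever scoring a model on data it was trained on.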
8) What is the Bias–Variance Tradeoff?
- A) Speed vs accuracy
- B) Error from overly simple models (bias) vs error from overly flexible models (variance)
- C) Cost of data vs value
- D) Linear vs non-linear models
9) Which is a drawback of Linear Regression?
- A) Too complex to implement
- B) Only for classification
- C) Assumes linear relationship between features and target
- D) Extremely sensitive to missing values
10) What is the main goal of k-means clustering?
- A) Classify into predefined categories
- B) Fit a linear relationship
- C) Partition data into k clusters by nearest mean
- D) Reduce features
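The loop behind question 10 can be sketched in a few lines. This is a simplified one-dimensional version with hand-picked starting centroids (both assumptions made for readability); `kmeans_1d` is a hypothetical helper, not a library call.

```python
def kmeans_1d(points, centroids, n_iters=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are involved at any point: the grouping comes purely from distances between the points and the current means.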
Statistics & Model Evaluation
11) What is a p-value?
- A) Measure of relationship between variables
- B) Probability of a result at least as extreme as the one observed, assuming the null hypothesis is true
- C) Dataset size
- D) Model accuracy
12) R-squared in linear regression represents…
- A) Model p-value
- B) Mean of target
- C) Proportion of variance in the target explained by the model
- D) Std. dev. of residuals
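For question 12, the definition R² = 1 − SS_res / SS_tot can be checked by hand on a tiny example. The observed and predicted values below are invented for illustration, and `r_squared` is a hypothetical helper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: variation the model failed to explain.
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    # Total sum of squares: variation around the mean of the target.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

r2 = r_squared(y_true, y_pred)
mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)
print(f"R^2 = {r2:.3f}, MSE = {mse:.3f}")
```

Here SS_res is 0.5 and SS_tot is 20, so the model leaves 2.5% of the target's variance unexplained.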
13) Type I Error (False Positive) is…
- A) Predict positive when actual is negative
- B) Predict negative when actual is positive
- C) No prediction
- D) Random prediction
14) Type II Error (False Negative) is…
- A) Predict positive, actual negative
- B) Predict negative, actual positive
- C) No prediction
- D) Random prediction
15) Common metric to evaluate regression?
- A) Accuracy
- B) F1-Score
- C) Mean Squared Error (MSE)
- D) Precision
16) L1 regularization (Lasso) does what?
- A) Penalty = square of coefficients
- B) Penalty = absolute value of coefficients
- C) Penalty for number of features
- D) Penalty for number of data points
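To illustrate question 16, the sketch below computes the L1 (Lasso) and L2 (Ridge) penalty terms for the same coefficient vector. The weights and the strength `lam` are arbitrary illustrative values. The `soft_threshold` function shows the shrinkage step used by some Lasso solvers; it is included only to suggest why an absolute-value penalty can push coefficients to exactly zero.

```python
weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical model coefficients
lam = 0.1                         # hypothetical regularization strength

# L1 (Lasso): penalty grows with the absolute value of each coefficient.
l1_penalty = lam * sum(abs(w) for w in weights)
# L2 (Ridge): penalty grows with the square of each coefficient.
l2_penalty = lam * sum(w * w for w in weights)


def soft_threshold(w, threshold):
    """Shrink w toward zero by `threshold`, clipping small values to exactly 0."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


# With a threshold of 1.0, the small coefficient 0.5 is zeroed out entirely.
shrunk = [soft_threshold(w, 1.0) for w in weights]

print(f"L1 penalty = {l1_penalty:.3f}, L2 penalty = {l2_penalty:.3f}")
print(shrunk)
```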
17) Which of these is a type of Regularization?
- A) Logistic Regression
- B) K-Means
- C) Ridge Regression
- D) SVM
Python & Data Handling
18) Which Python library is most used for data manipulation?
- A) Scikit-learn
- B) Matplotlib
- C) Pandas
- D) Seaborn
19) Main difference between list and tuple?
- A) List immutable, tuple mutable
- B) List ordered, tuple unordered
- C) List mutable, tuple immutable
- D) List only strings, tuple any type
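A quick way to verify the behavior asked about in question 19 is to try item assignment on both types. The variable names are illustrative only.

```python
point_list = [1, 2, 3]
point_tuple = (1, 2, 3)

# Lists support in-place modification.
point_list[0] = 10
point_list.append(4)

# Tuples raise TypeError on item assignment.
try:
    point_tuple[0] = 10
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# Because this tuple is immutable and holds only hashable items,
# it can serve as a dictionary key; a list cannot.
labels = {point_tuple: "sample point"}

print(point_list)
print(tuple_was_modified)
print(labels[(1, 2, 3)])
```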
20) Purpose of pip?
- A) Create/manage virtual envs
- B) Install/manage external Python packages
- C) Run Python scripts
- D) Debug Python code
21) What is a DataFrame in Pandas?
- A) One-dimensional labeled array
- B) Two-dimensional labeled table with mixed dtypes
- C) Plotting object
- D) Numerical computation object
22) Which is not a common Python viz library?
- A) Matplotlib
- B) Seaborn
- C) Plotly
- D) SciPy
23) In Python, a dictionary is…
- A) Ordered sequence of items
- B) Collection of key–value pairs
- C) Numeric-only structure
- D) Immutable structure
24) How to check dtypes in a Pandas DataFrame?
- A) df.describe()
- B) df.info()
- C) df.head()
- D) df.shape
25) What is NaN in Pandas?
- A) A string
- B) Zero
- C) Missing/Not-a-Number value
- D) Boolean
26) Primary function of NumPy?
- A) Visualization
- B) Numerical computing with n-dim arrays
- C) Web scraping
- D) Build ML models
27) scikit-learn is primarily used for…
- A) Data visualization
- B) Database management
- C) Machine learning
- D) Web development
SQL & Databases
28) What is a primary key?
- A) Column with unique value per row
- B) Column linking to a different table
- C) Column that allows duplicates
- D) Column that stores text
29) Which SQL command retrieves data?
- A) UPDATE
- B) DELETE
- C) SELECT
- D) INSERT
30) What does SELECT * FROM table_name do?
- A) Selects a random row
- B) Selects all columns
- C) Counts rows
- D) Deletes the table
31) Purpose of the WHERE clause?
- A) Group rows
- B) Filter records by condition
- C) Sort results
- D) Join tables
32) Purpose of GROUP BY?
- A) Sort results
- B) Filter rows before grouping
- C) Aggregate rows sharing same values
- D) Combine data from two tables
33) HAVING clause does what?
- A) Filter before grouping
- B) Sort results
- C) Filter groups after GROUP BY
- D) Select columns
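Questions 31 to 33 can be tried together in one query. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and its rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 20.0), ("ann", 35.0), ("bob", 5.0), ("bob", 8.0), ("cy", 50.0)],
)

# WHERE filters individual rows before grouping.
# GROUP BY collapses rows that share a customer into one group.
# HAVING filters the groups after the aggregate has been computed.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 8
    GROUP BY customer
    HAVING SUM(amount) > 40
    ORDER BY customer
    """
).fetchall()
conn.close()

print(rows)
```

Bob's 5.0 order is dropped by WHERE, and his remaining group total of 8.0 is then dropped by HAVING, which shows the two filters acting at different stages.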
34) INNER JOIN returns…
- A) All rows from left + matched right
- B) All rows from both with NULLs where no match
- C) Rows where the join matches in both tables
- D) All rows from right + matched left
35) JOIN vs LEFT JOIN?
- A) LEFT JOIN returns all left rows, INNER JOIN only matched rows
- B) LEFT JOIN requires matches in both
- C) They’re identical
- D) LEFT JOIN is only for >2 tables
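The difference probed by questions 34 and 35 shows up clearly when one customer has no orders. This sketch again uses `sqlite3` in memory; the `customers` and `orders` tables are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "book"), (1, "pen"), (3, "lamp")])

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.item
    """
).fetchall()

# LEFT JOIN keeps every customer; unmatched ones get NULL (None in Python).
left_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.item
    """
).fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```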
36) UNION vs UNION ALL?
- A) UNION removes dups; UNION ALL keeps them
- B) UNION requires same dtypes; UNION ALL does not
- C) UNION ALL is faster
- D) A and C
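For question 36, the row counts make the difference visible. The two single-column tables below are hypothetical sample data, queried through `sqlite3` in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE office_a (city TEXT)")
conn.execute("CREATE TABLE office_b (city TEXT)")
conn.executemany("INSERT INTO office_a VALUES (?)", [("Oslo",), ("Rome",)])
conn.executemany("INSERT INTO office_b VALUES (?)", [("Rome",), ("Lima",)])

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT city FROM office_a UNION SELECT city FROM office_b ORDER BY city"
).fetchall()

# UNION ALL keeps every row, so it can skip the de-duplication work.
union_all_rows = conn.execute(
    "SELECT city FROM office_a UNION ALL SELECT city FROM office_b ORDER BY city"
).fetchall()
conn.close()

print(union_rows)
print(union_all_rows)
```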
37) What is a JOIN condition?
- A) The type of join
- B) The column(s) linking two tables
- C) Number of rows in result
- D) Output order
38) What is a foreign key?
- A) Uniquely identifies a row in a table
- B) Column in one table referencing another table’s primary key
- C) Column that cannot be NULL
- D) Encrypted column
More ML Concepts
39) Feature engineering’s purpose?
- A) Select the best algorithm
- B) Create new features to improve models
- C) Remove highly correlated features
- D) Choose optimal hyperparameters
40) Handle imbalanced datasets?
- A) Remove majority class
- B) Undersample minority
- C) Oversample minority (e.g., SMOTE)
- D) Use a model unaffected by imbalance
41) Machine Learning vs Deep Learning?
- A) Same thing
- B) Deep learning is an ML subset using multi-layer neural networks
- C) ML uses neural nets; DL does not
- D) DL is always more accurate
General & Tools
42) Main purpose of git?
- A) Write Python code
- B) Version control source code
- C) Statistical analysis
- D) Create web apps
43) Example of an unsupervised task?
- A) Predict house prices
- B) Spam classification
- C) Customer segmentation
- D) Churn prediction
44) Handling high dimensionality?
- A) Linear regression
- B) Simple logistic regression
- C) PCA (dimensionality reduction)
- D) Min–max scaling
45) Difference between Machine Learning and Deep Learning?
- A) Same
- B) Deep learning ⊂ ML using multi-layer NNs
- C) ML uses NNs; DL doesn’t
- D) DL always more accurate
More SQL
46) INNER JOIN in SQL is used to…
- A) Return all left rows + matched right
- B) Return all rows from both with NULLs where no match
- C) Return rows with at least one match in both tables
- D) Return all right rows + matched left
47) Common metric to evaluate regression?
- A) Accuracy
- B) F1-Score
- C) Mean Squared Error (MSE)
- D) Precision
48) Purpose of Normalization/Standardization?
- A) Remove outliers
- B) Change dtypes
- C) Scale features to common range for better learning
- D) Handle missing values
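The two rescaling schemes named in question 48 can be compared on one small feature column. The values are illustrative, and `min_max_scale` and `standardize` are hypothetical helpers; both assume the column is not constant.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature column


def min_max_scale(xs):
    """Normalization: map values linearly onto the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Standardization: shift to mean 0 and scale to unit standard deviation."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)  # population standard deviation
    return [(x - mean) / std for x in xs]


normalized = min_max_scale(values)
standardized = standardize(values)

print(normalized)
print([round(z, 3) for z in standardized])
```

Min-max scaling bounds the output but is sensitive to extreme values, while standardization leaves the range unbounded and centers the data instead.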
49) What is a Primary Key’s purpose?
- A) Create relationships between tables
- B) Ensure integrity by uniquely identifying each record
- C) Store long text
- D) Perform complex calculations
50) What does SELECT do in SQL?
- A) Update rows
- B) Delete rows
- C) Retrieve data
- D) Insert rows
Tip: In interviews, state your choice first, then justify it with a one-line definition or formula, and (if time permits) give a 5–10 second example.