10 most-asked MLOps interview questions covering ML pipelines, CI/CD for ML, model deployment, monitoring, data drift, feature stores, experiment tracking, model registry, automated retraining, and A/B testing.
Introduction
This is Part 1 of our MLOps Interview QA series, covering the 10 most frequently asked MLOps interview questions. MLOps (Machine Learning Operations) bridges the gap between data science and production engineering — ensuring ML models are developed, deployed, monitored, and maintained reliably at scale.
Q1: What Is MLOps and How Does It Differ from DevOps?
Answer:
MLOps is a set of practices that combines Machine Learning, DevOps, and Data Engineering to deploy and maintain ML systems in production reliably and efficiently. While DevOps focuses on software application lifecycle, MLOps addresses the unique challenges of ML systems — data dependencies, experiment tracking, model decay, and continuous training.
An ML pipeline is a sequence of automated steps that takes raw data and produces a deployed, monitored model. Each step is reproducible, versioned, and can be triggered independently.
Monitoring: track predictions, detect drift, alert on degradation (tools: Evidently AI, WhyLabs, Prometheus + Grafana)
Pipeline Orchestration
| Tool | Type | Best For |
| --- | --- | --- |
| Apache Airflow | DAG-based workflow | General-purpose data + ML pipelines |
| Kubeflow Pipelines | Kubernetes-native | ML-specific pipelines on K8s |
| Prefect | Modern orchestrator | Python-native, dynamic workflows |
| Vertex AI Pipelines | Managed (GCP) | Teams on GCP wanting minimal infra |
| ZenML | ML-specific framework | Portable pipelines across infra |
| Dagster | Asset-based orchestrator | Data-aware pipelines with lineage |
Pipeline Design Principles
1. Idempotent steps: Re-running any step with the same inputs produces the same result
2. Versioned artifacts: Every output (data, features, model) is versioned
3. Parameterized: Hyperparameters and thresholds are passed as config (not hardcoded)
4. Cached: Steps whose inputs haven't changed are skipped
5. Testable: Each step has unit tests
6. Observable: Metrics and logs are emitted at every stage
7. Triggerable: Can be triggered by schedule, by event, or manually
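The idempotency, parameterization, and caching principles above can be illustrated with a minimal sketch. It assumes steps are pure functions of their inputs and parameters; `CachedPipeline`, `fingerprint`, and `scale` are hypothetical names for this example, not part of any orchestrator listed earlier.

```python
import hashlib
import json


def fingerprint(step_name, params, inputs):
    """Stable hash of everything that determines a step's output."""
    payload = json.dumps(
        {"step": step_name, "params": params, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedPipeline:
    """Runs steps on demand and skips any step whose fingerprint is cached."""

    def __init__(self):
        self.cache = {}     # fingerprint -> step output
        self.executed = []  # names of steps that actually ran

    def run_step(self, name, func, params, inputs):
        key = fingerprint(name, params, inputs)
        if key not in self.cache:
            # Cache miss: inputs or params changed, so the step must run.
            self.cache[key] = func(inputs, **params)
            self.executed.append(name)
        return self.cache[key]


def scale(values, factor):
    """Toy step: a pure function, so re-running it is idempotent."""
    return [v * factor for v in values]


pipeline = CachedPipeline()
first = pipeline.run_step("scale", scale, {"factor": 2}, [1, 2, 3])
second = pipeline.run_step("scale", scale, {"factor": 2}, [1, 2, 3])  # cache hit
third = pipeline.run_step("scale", scale, {"factor": 3}, [1, 2, 3])   # params changed
```

Real orchestrators typically hash artifact contents or versions rather than in-memory values, but the decision rule is the same: same step, same parameters, same inputs means the cached output can be reused.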
Q3: How Do You Implement CI/CD for Machine Learning?
Answer:
CI/CD for ML extends traditional software CI/CD with additional pipelines for data validation, model training, model evaluation, and model deployment. It ensures that changes to code, data, or configuration automatically result in tested, validated model deployments.
graph TD
subgraph CI["Continuous Integration"]
CODE["Code Push<br/>(Git)"]
CODE --> LINT["Lint & Unit Tests"]
LINT --> DATA_TEST["Data Validation<br/>Tests"]
DATA_TEST --> TRAIN_TEST["Training Pipeline<br/>Tests (small data)"]
TRAIN_TEST --> BUILD["Build Container<br/>Image"]
end
subgraph CT["Continuous Training"]
TRIGGER["Trigger:<br/>schedule / new data / drift"]
TRIGGER --> PIPELINE["Full Training Pipeline<br/>(production data)"]
PIPELINE --> EVAL_GATE["Evaluation Gate<br/>(metrics threshold)"]
EVAL_GATE -->|"Pass"| REGISTER["Register Model<br/>(Model Registry)"]
EVAL_GATE -->|"Fail"| ALERT["Alert Team"]
end
subgraph CD["Continuous Deployment"]
REGISTER --> STAGE["Deploy to Staging"]
STAGE --> SHADOW["Shadow / Canary<br/>Evaluation"]
SHADOW -->|"Approved"| PROD["Deploy to Production"]
end
style CI fill:#6cc3d5,stroke:#333,color:#fff
style CT fill:#56cc9d,stroke:#333,color:#fff
style CD fill:#ffce67,stroke:#333
CI/CD Pipeline Stages for ML
| Stage | What It Does | Trigger |
| --- | --- | --- |
| Code CI | Lint, unit tests, integration tests for pipeline code | Git push / PR |
| Data validation | Schema checks, distribution tests on new data | New data arrives |
| Training (CT) | Full training pipeline with production data | Schedule / drift alert / manual |
| Model evaluation | Compare new model vs current production model | After training completes |
| Model registration | Version and tag model in registry | Evaluation passes threshold |
| Staging deployment | Deploy model to staging for integration tests | Model registered |
| Production deployment | Canary/shadow → full rollout | Manual approval or auto |
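As an illustration of the data validation stage, here is a minimal sketch of a schema and null-rate check on an incoming batch. The schema, column names, and the 5% null threshold are assumptions made up for this example; in practice a dedicated validation library would usually replace hand-written checks like these.

```python
# Hypothetical schema for an incoming batch: column name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": int, "age": int, "country": str}


def validate_batch(rows, schema, max_null_fraction=0.05):
    """Return a list of human-readable validation errors (empty if the batch is clean)."""
    errors = []
    for column, expected_type in schema.items():
        values = [row.get(column) for row in rows]

        # Null-rate check: too many missing values usually signals an upstream issue.
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > max_null_fraction:
            errors.append(
                f"{column}: null fraction {nulls / len(rows):.2f} "
                f"exceeds {max_null_fraction}"
            )

        # Type check on the non-null values.
        bad = [v for v in values if v is not None and not isinstance(v, expected_type)]
        if bad:
            errors.append(f"{column}: {len(bad)} value(s) are not {expected_type.__name__}")
    return errors


good_batch = [
    {"user_id": 1, "age": 30, "country": "DE"},
    {"user_id": 2, "age": 41, "country": "US"},
]
bad_batch = [
    {"user_id": 1, "age": "30", "country": "DE"},  # age arrives as a string
    {"user_id": 2, "age": None, "country": "US"},  # age is missing
]

good_errors = validate_batch(good_batch, EXPECTED_SCHEMA)
bad_errors = validate_batch(bad_batch, EXPECTED_SCHEMA)
```

A CI job could fail the pipeline whenever the returned list is non-empty, which keeps bad data from reaching the training stage.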
Evaluation Gate (Quality Gates)
# Example: automated quality gate in a training pipeline
def evaluate_model_for_promotion(new_model_metrics, production_metrics, thresholds):
    """
    Decide whether to promote the new model to production.

    Returns (all_passed, checks): all_passed is True only if the new model
    passes every quality gate; checks maps each gate name to its result.
    """
    checks = {
        # Accuracy may not drop by more than the allowed margin.
        "accuracy_improvement": (
            new_model_metrics["accuracy"]
            >= production_metrics["accuracy"] - thresholds["max_accuracy_drop"]
        ),
        "latency_acceptable": (
            new_model_metrics["p99_latency_ms"] <= thresholds["max_p99_latency_ms"]
        ),
        "bias_check": (
            new_model_metrics["demographic_parity_diff"] <= thresholds["max_bias_score"]
        ),
        "data_coverage": (
            new_model_metrics["test_coverage"] >= thresholds["min_test_coverage"]
        ),
    }
    all_passed = all(checks.values())
    return all_passed, checks
CI/CD Tools for ML
| Tool | Purpose |
| --- | --- |
| GitHub Actions / GitLab CI | Code CI, trigger training pipelines |
| DVC (Data Version Control) | Version datasets alongside code in Git |
| MLflow | Experiment tracking, model registry |
| CML (Continuous Machine Learning) | Auto-generate model reports in PRs |
| Seldon Core / KServe | Model serving on Kubernetes |
| ArgoCD | GitOps deployment of model services |
Q4: How Do You Deploy ML Models to Production?
Answer:
Model deployment is the process of making a trained model available to serve predictions in a production environment. The deployment strategy depends on latency requirements, scale, and how the model is consumed.
graph LR
subgraph Shadow["Shadow Deployment"]
PROD_M["Production Model<br/>(serves traffic)"]
SHADOW_M["New Model<br/>(receives traffic copy,<br/>predictions discarded)"]
end
subgraph Canary["Canary Deployment"]
OLD["Old Model<br/>(95% traffic)"]
NEW["New Model<br/>(5% traffic)"]
end
subgraph AB["A/B Testing"]
MODEL_A["Model A<br/>(Control group)"]
MODEL_B["Model B<br/>(Treatment group)"]
end
style Shadow fill:#6cc3d5,stroke:#333,color:#fff
style Canary fill:#56cc9d,stroke:#333,color:#fff
style AB fill:#ffce67,stroke:#333
| Strategy | How It Works | Risk Level | When to Use |
| --- | --- | --- | --- |
| Direct replacement | Swap old model with new immediately | High | Low-risk models, strong offline eval |
| Shadow (dark launch) | New model runs alongside, predictions logged but not served | Zero | High-risk models needing production validation |
| Canary | Route small % of traffic to new model | Low | Gradual rollout with monitoring |
| A/B test | Split users into groups, measure business metrics | Low | Need statistical proof of improvement |
| Multi-armed bandit | Dynamically allocate traffic based on performance | Low | Continuous optimization |
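One common way to implement the canary split is deterministic hashing of a stable identifier, sketched below. The function name `assign_model` and the use of a user ID as the routing key are assumptions for illustration; serving platforms usually provide traffic splitting of their own.

```python
import hashlib


def assign_model(user_id, canary_percent):
    """Deterministically map a user to the 'new' or 'old' model via a hash bucket."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "new" if bucket < canary_percent else "old"


# Share of 10,000 synthetic users routed to the new model at a 5% canary.
users = range(10_000)
canary_share = sum(assign_model(u, 5) == "new" for u in users) / len(users)

# Users already on the canary at 5% stay on it when the rollout widens to 25%.
sticky = all(
    assign_model(u, 25) == "new" for u in users if assign_model(u, 5) == "new"
)
```

Hashing instead of random sampling gives each user a consistent experience across requests, and widening the percentage only adds users to the new model rather than reshuffling everyone.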
Q5: What Is Model Monitoring and How Do You Detect Drift?
Answer:
Model monitoring is the continuous observation of a deployed model’s performance, input data, and predictions to detect degradation before it impacts business outcomes. Drift is when the statistical properties of data or the relationship between features and target change over time.
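A minimal sketch of one widely used drift score, the Population Stability Index (PSI), is shown here for a single numeric feature. The binning scheme, the `1e-6` floor, and the synthetic data are assumptions of this example; a commonly quoted rule of thumb treats PSI above roughly 0.2 as a notable shift, but thresholds should be tuned per feature.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(1000)]  # training-time feature values
shifted = [v + 3 for v in reference]        # live values after a mean shift

no_drift_score = psi(reference, reference)
drift_score = psi(reference, shifted)
```

This covers data drift on inputs only. Detecting concept drift (a change in the relationship between features and target) additionally requires ground-truth labels, which often arrive with a delay.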
Q6: What Is a Feature Store and Why Is It Important?
Answer:
A feature store is a centralized repository for storing, managing, and serving features used in ML models. It ensures consistency between training and serving (avoiding training-serving skew), enables feature reuse across teams, and provides point-in-time correct features for training.
| Challenge | Without a Feature Store | With a Feature Store |
| --- | --- | --- |
| Training-serving skew | Features computed differently in training vs serving → silent bugs | Same feature definitions used everywhere |
| Feature duplication | Teams reimplement same features independently | Central catalog, reuse across models |
| Point-in-time correctness | Training with future data (data leakage) | Time-travel queries ensure no leakage |
| Feature freshness | Stale features in production | Online store updated in real-time |
| Discovery | “Does anyone already compute user_age_days?” | Searchable feature catalog |
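Point-in-time correctness can be sketched as a lookup that returns only the feature value known at or before the label's timestamp. The feature history and helper name below are hypothetical; feature stores implement the same idea as a point-in-time join over the offline store.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of a user's total_purchases_30d, keyed by day number.
purchases_30d = [(1, 0), (5, 2), (9, 7)]

value_on_day_6 = point_in_time_value(purchases_30d, 6)  # the day-9 update is not visible yet
value_on_day_0 = point_in_time_value(purchases_30d, 0)  # nothing was known yet
```

Joining a training label from day 6 with the day-9 value would leak future information into training, which is the failure mode the lookup prevents.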
Feature Store Architecture
| Component | Purpose | Technology |
| --- | --- | --- |
| Feature definitions | Code that transforms raw data → features | Python/SQL transformations |
| Offline store | Historical feature values for training | Data warehouse (BigQuery, Snowflake, Parquet files) |
| Online store | Latest feature values for low-latency serving | Redis, DynamoDB, Bigtable |
| Feature registry | Metadata catalog (names, types, owners, lineage) | Built-in registry UI |
| Materialization | Process that computes and loads features into stores | Batch (Spark) + Stream (Flink/Kafka) |
Feature Store Tools
| Tool | Type | Best For |
| --- | --- | --- |
| Feast | Open-source | Teams wanting flexibility and self-managed |
| Tecton | Managed | Production-grade, real-time features |
| Databricks Feature Store | Integrated | Teams already on Databricks |
| Vertex AI Feature Store | Managed (GCP) | GCP-native workflows |
| Amazon SageMaker Feature Store | Managed (AWS) | AWS-native workflows |
| Hopsworks | Open-source + managed | Real-time features with Kafka |
Example: Feature Definition (Feast)
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Note: written against the Field/schema API of recent Feast releases;
# older releases used Feature(...) with ValueType instead.

# Entity (the primary key for feature lookup)
user = Entity(name="user_id", join_keys=["user_id"])

# Feature view definition
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases_30d", dtype=Int64),
        Field(name="avg_order_value_30d", dtype=Float32),
        Field(name="days_since_last_login", dtype=Int64),
        Field(name="account_age_days", dtype=Int64),
    ],
    online=True,  # Materialize to online store
    source=FileSource(
        path="data/user_features.parquet",
        timestamp_field="event_timestamp",
    ),
)

# Load the feature repository (assumes feature_store.yaml is in the current directory)
store = FeatureStore(repo_path=".")

# Training: get historical features (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=entity_df,  # DataFrame with user_id + event_timestamp (assumed to exist)
    features=[
        "user_features:total_purchases_30d",
        "user_features:avg_order_value_30d",
    ],
).to_df()

# Serving: get latest features for online inference
feature_vector = store.get_online_features(
    features=[
        "user_features:total_purchases_30d",
        "user_features:avg_order_value_30d",
    ],
    entity_rows=[{"user_id": 12345}],
).to_dict()
Q7: How Do You Track Experiments and Manage Model Versions?
Answer:
Experiment tracking records all parameters, metrics, code versions, and artifacts from every training run, enabling comparison, reproducibility, and auditability. A model registry provides lifecycle management for models from development to production.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# X_train, y_train, X_test, y_test are assumed to be defined earlier (binary labels).

# Set experiment
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name="rf-baseline") as run:
    # Log parameters
    params = {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5}
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Evaluate and log metrics
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log model artifact
    mlflow.sklearn.log_model(model, "model")

    # Log additional artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")

# Register model if it meets threshold.
# The run ID comes from the captured run object, because
# mlflow.active_run() returns None once the with-block has exited.
if accuracy > 0.90:
    mlflow.register_model(
        f"runs:/{run.info.run_id}/model",
        "churn-prediction-model",
    )
Model Registry Lifecycle
| Stage | Description | Action |
| --- | --- | --- |
| None | Model just registered | Auto after training |
| Staging | Candidate for production, undergoing validation | Manual or automated promotion |
| Production | Actively serving traffic | After passing evaluation gate |
| Archived | Replaced by newer version, kept for rollback | After new model promoted |
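The lifecycle above can be viewed as a small state machine. The sketch below is a plain-Python illustration, not a registry API: the allowed transitions (including Archived → Production for rollback) and the `promote` helper are assumptions chosen to mirror the stage table.

```python
# Assumed transition rules mirroring the stage table; real registries may differ.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Production"},  # rollback to a previously archived version
}


def promote(stages, version, target):
    """Move one model version to a new stage, archiving the current Production version.

    stages: dict mapping version number -> stage name (mutated in place).
    """
    current = stages[version]
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move version {version} from {current} to {target}")
    if target == "Production":
        # Only one version serves traffic at a time in this sketch.
        for other, stage in stages.items():
            if stage == "Production":
                stages[other] = "Archived"
    stages[version] = target
    return stages


registry = {1: "Production", 2: "None"}
promote(registry, 2, "Staging")
promote(registry, 2, "Production")  # version 1 is archived automatically
```

Rolling back is then `promote(registry, 1, "Production")`, which archives version 2 in the same step.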
Q8: How Do You Implement Automated Model Retraining?
Answer:
Automated retraining ensures models stay fresh as data evolves. A retraining pipeline is triggered by schedules, data changes, or drift detection — then validates the new model before promotion.
Before deploying retrained model:
1. Performance check: new_model.accuracy >= prod_model.accuracy - 0.02
2. Bias check: fairness metrics within acceptable bounds
3. Latency check: inference time within SLA
4. Data quality: training data passed validation (no corruption)
5. Sanity check: predictions on known inputs match expectations
6. Shadow deployment: run alongside production for N hours
7. Gradual rollout: canary (5%) → 25% → 50% → 100%
8. Automatic rollback: if error rate spikes after deployment
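The triggers and the rollout steps in the checklist can be sketched as three small decision functions. All thresholds here (30 days, a drift score of 0.2, 10,000 new rows, a 1% error rate) are placeholder assumptions, and the function names are hypothetical.

```python
ROLLOUT_STAGES = [5, 25, 50, 100]  # percent of traffic, as in the checklist above


def should_retrain(days_since_training, drift_score, new_labeled_rows,
                   max_age_days=30, drift_threshold=0.2, min_new_rows=10_000):
    """Return the list of triggers that fired (an empty list means no retraining)."""
    reasons = []
    if days_since_training >= max_age_days:
        reasons.append("schedule")
    if drift_score >= drift_threshold:
        reasons.append("drift")
    if new_labeled_rows >= min_new_rows:
        reasons.append("new_data")
    return reasons


def passes_performance_check(new_accuracy, prod_accuracy, max_drop=0.02):
    """Check 1 from the list: the new model may be at most max_drop worse."""
    return new_accuracy >= prod_accuracy - max_drop


def next_traffic_percent(current_percent, error_rate, max_error_rate=0.01):
    """Advance the gradual rollout, or return 0 to signal an automatic rollback."""
    if error_rate > max_error_rate:
        return 0
    later = [p for p in ROLLOUT_STAGES if p > current_percent]
    return later[0] if later else current_percent


triggers = should_retrain(days_since_training=12, drift_score=0.31, new_labeled_rows=500)
```

Returning the list of fired triggers, rather than a bare boolean, makes it easy to log why a retraining run started.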
Q9: How Do You Perform A/B Testing for ML Models?
Answer:
A/B testing for ML models is the process of comparing two (or more) models in production by splitting traffic between them and measuring the impact on business metrics with statistical rigor.
| Design Choice | Guidance |
| --- | --- |
| Randomization unit | User ID (not request), which ensures a consistent experience |
| Sample size | Calculate required sample based on expected effect size and power |
| Duration | Run for full business cycle (e.g., 1-2 weeks minimum) |
| Guardrail metrics | Revenue, latency, error rate: these must not degrade |
| Primary metric | The one metric that decides winner |
| Novelty effect | Wait for initial excitement to wear off |
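The sample size consideration can be made concrete with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch: the 10% baseline and 12% target conversion rates are made-up inputs, and the formula assumes equal group sizes and a two-sided test.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect baseline_rate -> expected_rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = (
        baseline_rate * (1 - baseline_rate)
        + expected_rate * (1 - expected_rate)
    )
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Detecting a lift from 10% to 12% conversion needs a few thousand users per group.
n_large_effect = sample_size_per_group(0.10, 0.12)
# Halving the expected lift roughly quadruples the required sample.
n_small_effect = sample_size_per_group(0.10, 0.11)
```

Because the required sample grows with the inverse square of the effect size, small expected improvements can make an A/B test impractically long.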
Statistical Testing
| Method | When to Use |
| --- | --- |
| t-test / z-test | Continuous metrics (revenue, time on site) |
| Chi-squared test | Proportions (conversion rate, CTR) |
| Mann-Whitney U test | Non-normally distributed metrics |
| Bayesian analysis | Want probability of being better (not just p-value) |
| Sequential testing | Want to peek at results early without inflating error |
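For a conversion-rate comparison, a two-proportion z-test is a minimal sketch of the analysis; for a 2x2 table its squared statistic equals the chi-squared statistic (without continuity correction), so it matches the proportions row above. The conversion counts are invented for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference in conversion rate between groups A and B."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical result: model A converts 500 of 5000 users, model B 600 of 5000.
z_stat, p_value = two_proportion_z_test(500, 5000, 600, 5000)
significant = p_value < 0.05
```

Checking the p-value repeatedly while the test is still running inflates the false positive rate, which is the problem sequential testing methods are designed to handle.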
Beyond A/B: Advanced Deployment Testing
| Method | Description | Advantage |
| --- | --- | --- |
| Shadow mode | New model runs in parallel, predictions logged not served | Zero risk, validate in production |
| Interleaving | Mix recommendations from both models in one list | Needs fewer samples than A/B |
| Multi-armed bandit | Dynamically shift traffic to better-performing model | Faster convergence, less regret |
| Backtest | Evaluate on historical data before live test | Pre-validate offline |
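The multi-armed bandit row can be illustrated with Thompson sampling, one common bandit algorithm (chosen here as an example; other strategies such as epsilon-greedy exist). The simulation below assumes binary rewards and uses made-up true conversion rates for two models.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return how often each arm was chosen."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample a plausible conversion rate for each model from its posterior.
        samples = [
            rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)
        ]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated user feedback for the chosen model.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Model 0 converts at 5%, model 1 at 20% (simulated ground truth).
pulls = thompson_sampling([0.05, 0.20], rounds=2000)
```

Over time most traffic flows to the better model, which is the "less regret" advantage, at the cost of weaker statistical guarantees than a fixed-split A/B test.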
Q10: How Do You Ensure Reproducibility in ML Systems?
Answer:
Reproducibility means that given the same inputs, code, and configuration, you can produce the same model and predictions. It’s critical for debugging, auditing, regulatory compliance, and collaboration.
Pattern 1: Containerized Training
- Dockerfile pins EVERY dependency (including CUDA, cuDNN)
- docker build → immutable training environment
- docker run → exact same environment everywhere
Pattern 2: DVC for Data Versioning
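- dvc add data/training_set.parquet (hashes the file and caches it; dvc push uploads it to remote storage)
- git add data/training_set.parquet.dvc (version the pointer in Git)
- Reproduce: git checkout <commit> → dvc checkout (run dvc pull first if the data is not in the local cache) → exact same data + code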
Pattern 3: Experiment Tracking
- Every training run logs: git_commit, docker_image, data_hash, params
- To reproduce: pull that commit, build that image, fetch that data, run with those params
Pattern 4: Pipeline as Code
- Pipeline defined in code (not manually run notebooks)
- Each step: deterministic inputs → deterministic outputs
- Cached: re-run only steps whose inputs changed
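Pattern 3 can be sketched as a run manifest that captures everything needed to reproduce a training run. The field names follow the list above; the commit hash, image tag, and data bytes are placeholder values, and `build_run_manifest` and `manifest_id` are hypothetical helpers.

```python
import hashlib
import json


def sha256_hex(data):
    """Hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_run_manifest(git_commit, docker_image, data_bytes, params):
    """Collect the identifiers needed to reproduce one training run."""
    return {
        "git_commit": git_commit,
        "docker_image": docker_image,
        "data_hash": sha256_hex(data_bytes),
        "params": params,
    }


def manifest_id(manifest):
    """Short, stable ID: identical inputs always produce the same ID."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return sha256_hex(canonical)[:12]


# Placeholder values for illustration only.
run_a = build_run_manifest("abc1234", "trainer:1.0", b"row1\nrow2\n", {"lr": 0.01, "seed": 42})
run_b = build_run_manifest("abc1234", "trainer:1.0", b"row1\nrow2\n", {"seed": 42, "lr": 0.01})
run_c = build_run_manifest("abc1234", "trainer:1.0", b"row1\nrow2\nrow3\n", {"lr": 0.01, "seed": 42})

same_run = manifest_id(run_a) == manifest_id(run_b)      # parameter order does not matter
changed_data = manifest_id(run_a) != manifest_id(run_c)  # new data yields a new ID
```

Logging this manifest with every run means two runs with the same ID should produce the same model, and any difference between IDs points to what changed.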
Common Reproducibility Pitfalls
| Pitfall | Problem | Solution |
| --- | --- | --- |
| Non-deterministic GPU operations | cuDNN auto-tuning selects different algorithms | Set torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False |
| Floating-point ordering | Multi-threaded reduction → different sum order | Set fixed number of workers, use deterministic ops |