graph TD
subgraph MLflow["MLflow Platform"]
TRACKING["MLflow Tracking<br/>(experiments, metrics, params)"]
PROJECTS["MLflow Projects<br/>(reproducible packaging)"]
MODELS["MLflow Models<br/>(multi-flavor packaging)"]
REGISTRY["MLflow Model Registry<br/>(versioning, staging)"]
end
subgraph Backends["Backend Options"]
LOCAL["Local filesystem"]
DB["Database<br/>(PostgreSQL, MySQL)"]
S3_ART["Artifact Store<br/>(S3, GCS, ADLS, HDFS)"]
MANAGED["Managed<br/>(Databricks, AWS, Azure)"]
end
subgraph Serve["Serving"]
REST["MLflow serve<br/>(REST API)"]
DOCKER["Docker container"]
CLOUD["Cloud deploy<br/>(SageMaker, AzureML)"]
SPARK_S["Spark UDF"]
end
TRACKING --> DB
TRACKING --> S3_ART
MODELS --> REST
MODELS --> DOCKER
MODELS --> CLOUD
MODELS --> SPARK_S
REGISTRY --> MODELS
style MLflow fill:#6cc3d5,stroke:#333,color:#fff
style Backends fill:#56cc9d,stroke:#333,color:#fff
MLOps Interview QA - 5
Introduction
This is Part 5 of our MLOps Interview QA series, focused on cloud-agnostic and third-party MLOps tools. While cloud providers offer integrated platforms (Azure ML, Vertex AI, SageMaker), many teams prefer open-source or vendor-neutral tools to avoid lock-in, support multi-cloud strategies, or leverage best-of-breed capabilities. This article covers the most widely adopted tools across the MLOps lifecycle — experiment tracking, pipeline orchestration, data/model versioning, feature stores, model serving, data validation, and infrastructure as code.
For cloud-specific MLOps, see MLOps Interview QA - 2 (Azure), MLOps Interview QA - 3 (GCP), MLOps Interview QA - 4 (AWS). For general MLOps concepts, see MLOps Interview QA - 1.
Q1: How Does MLflow Provide End-to-End Experiment Tracking and Model Management?
Answer:
MLflow is one of the most widely adopted open-source ML lifecycle platforms. Its four original components are Tracking (logging experiments), Projects (reproducible runs), Models (a packaging standard), and the Model Registry (versioning and staging); later releases added capabilities such as Evaluate. It runs anywhere (locally, on-prem, or on any cloud) and integrates with the major ML frameworks.
MLflow Components
| Component | Purpose | Key Features |
|---|---|---|
| Tracking | Log parameters, metrics, artifacts per run | UI comparison, search API, autolog |
| Projects | Package ML code for reproducibility | MLproject file, conda/docker envs |
| Models | Standard model packaging format | Multi-flavor (sklearn, pytorch, tf, custom) |
| Model Registry | Centralized model versioning & lifecycle | Stages (None → Staging → Production → Archived; deprecated since MLflow 2.9 in favor of aliases and tags) |
| Evaluate | Automated model evaluation | Built-in metrics, LLM evaluation |
| Recipes | Opinionated ML workflow templates | Regression, classification pipelines |
Experiment Tracking Example
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# X_train, X_test, y_train, y_test are assumed to be prepared earlier.

# Configure tracking server (self-hosted or managed)
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {"n_estimators": 200, "max_depth": 10, "min_samples_split": 5}
    mlflow.log_params(params)
    mlflow.log_param("feature_version", "v3")

    # Train
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log model with signature
    signature = infer_signature(X_test, y_pred)
    mlflow.sklearn.log_model(model, "model", signature=signature)

    # Log artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")

Model Registry Workflow
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Register model from a run.
# run_id is assumed to be the ID of the tracking run that logged the model.
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, "churn-classifier")

# Transition to staging
# Note: stage transitions are deprecated since MLflow 2.9 in favor of model version aliases.
client.transition_model_version_stage(
    name="churn-classifier",
    version=mv.version,
    stage="Staging",
)

# After validation, promote to production
client.transition_model_version_stage(
    name="churn-classifier",
    version=mv.version,
    stage="Production",
    archive_existing_versions=True,  # Archive previous production version
)

# Load production model for serving (new_data is assumed to match the model signature)
model = mlflow.pyfunc.load_model("models:/churn-classifier/Production")
predictions = model.predict(new_data)

MLflow Deployment Options
| Deployment | Command | Use Case |
|---|---|---|
| Local REST API | mlflow models serve -m models:/model/Production -p 5001 | Development/testing |
| Docker | mlflow models build-docker -m models:/model/1 -n my-model | Container orchestration |
| SageMaker | mlflow deployments create -t sagemaker | AWS production |
| Azure ML | mlflow deployments create -t azureml | Azure production |
| Spark UDF | mlflow.pyfunc.spark_udf(spark, model_uri) | Batch inference on Spark |
| Kubernetes | Seldon/KServe with MLflow format | K8s-native serving |
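To make the local REST option more concrete, here is a minimal sketch of how a client might build a request body for the scoring endpoint. It assumes an MLflow 2.x server, which accepts a JSON body in the `dataframe_split` layout on `/invocations`. The helper name `build_invocation_payload` and the feature names are hypothetical, not part of MLflow.

```python
import json


def build_invocation_payload(columns, rows):
    """Build a JSON body in the `dataframe_split` layout (assumes an MLflow 2.x scoring server)."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("each row must have one value per column")
    body = {"dataframe_split": {"columns": list(columns), "data": [list(r) for r in rows]}}
    return json.dumps(body)


# Hypothetical feature names and values, for illustration only.
payload = build_invocation_payload(
    ["tenure_months", "monthly_spend"],
    [[12, 79.5], [48, 20.0]],
)
print(payload)
```

The resulting string would be sent as a POST to the served model (for example `http://localhost:5001/invocations` for the command in the table) with the header `Content-Type: application/json`.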
Q2: How Does Kubeflow Enable ML Pipelines on Kubernetes?
Answer:
Kubeflow is a Kubernetes-native ML platform that provides pipeline orchestration, distributed training, model serving, and notebook environments. Its pipeline system (Kubeflow Pipelines / KFP) defines ML workflows as DAGs of containerized steps, running on any Kubernetes cluster (on-prem, GKE, EKS, AKS).
graph TD
subgraph Kubeflow["Kubeflow Platform"]
KFP["Kubeflow Pipelines<br/>(DAG orchestration)"]
NOTEBOOKS["Jupyter Notebooks<br/>(multi-user)"]
KATIB["Katib<br/>(hyperparameter tuning)"]
TRAINING_OP["Training Operators<br/>(TF, PyTorch, MPI)"]
KSERVE["KServe<br/>(model serving)"]
end
subgraph K8S["Kubernetes"]
PODS["Pods<br/>(pipeline steps)"]
PV["Persistent Volumes<br/>(data)"]
GPU["GPU Nodes<br/>(training)"]
ISTIO["Istio<br/>(networking, auth)"]
end
subgraph Storage["External Storage"]
MINIO["MinIO / S3<br/>(artifacts)"]
MYSQL["MySQL<br/>(metadata)"]
REG["Container Registry<br/>(images)"]
end
KFP --> PODS
TRAINING_OP --> GPU
KSERVE --> PODS
KFP --> MINIO
KFP --> MYSQL
style Kubeflow fill:#6cc3d5,stroke:#333,color:#fff
style K8S fill:#56cc9d,stroke:#333,color:#fff
Kubeflow Components
| Component | Purpose | Key Feature |
|---|---|---|
| Kubeflow Pipelines (KFP) | ML workflow orchestration as DAGs | Caching, lineage, UI, versioning |
| Katib | Hyperparameter optimization | Bayesian, grid, random, NAS |
| Training Operators | Distributed training on K8s | TFJob, PyTorchJob, MPIJob, XGBoostJob |
| KServe | Serverless model serving on K8s | Autoscale-to-zero, canary, A/B |
| Notebooks | Multi-user Jupyter environments | GPU support, custom images |
| Central Dashboard | Unified access to all components | Multi-tenancy support |
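The caching feature listed for Kubeflow Pipelines can be pictured as follows: a step is skipped when the same component definition has already run with the same arguments. The sketch below illustrates that idea only. It is not KFP's actual implementation, and the names `cache_key` and `StepCache` are hypothetical.

```python
import hashlib
import json


def cache_key(component_spec, arguments):
    """Fingerprint a step from its definition and inputs (illustrative, not KFP's algorithm)."""
    canonical = json.dumps(
        {"component": component_spec, "arguments": arguments}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class StepCache:
    """In-memory stand-in for a pipeline backend that reuses step outputs."""

    def __init__(self):
        self._outputs = {}

    def run(self, component_spec, arguments, fn):
        key = cache_key(component_spec, arguments)
        if key in self._outputs:
            return self._outputs[key], True  # cache hit: fn is not executed again
        result = fn(**arguments)
        self._outputs[key] = result
        return result, False


# Hypothetical component spec and step function.
spec = {"image": "python:3.10", "command": ["python", "train.py"]}
cache = StepCache()
first = cache.run(spec, {"n_estimators": 200}, lambda n_estimators: n_estimators * 2)
second = cache.run(spec, {"n_estimators": 200}, lambda n_estimators: n_estimators * 2)
third = cache.run(spec, {"n_estimators": 300}, lambda n_estimators: n_estimators * 2)
```

Changing either the component definition (for example the image) or any argument produces a new key, so the step runs again.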
KFP v2 Pipeline Example
from kfp import dsl, compiler
from kfp.dsl import Input, Output, Dataset, Model, Metrics


@dsl.component(base_image="python:3.10", packages_to_install=["pandas", "scikit-learn"])
def preprocess_data(
    raw_data: Input[Dataset],
    processed_data: Output[Dataset],
):
    import pandas as pd

    df = pd.read_csv(raw_data.path)
    # ... preprocessing logic ...
    df_processed = df  # placeholder so the sketch runs; replace with real preprocessing
    df_processed.to_csv(processed_data.path, index=False)


@dsl.component(base_image="python:3.10", packages_to_install=["pandas", "scikit-learn", "joblib"])
def train_model(
    training_data: Input[Dataset],
    model_output: Output[Model],
    metrics_output: Output[Metrics],
    n_estimators: int = 100,
    test_split: float = 0.2,
):
    import joblib
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Assumes the processed CSV has a label column named "target"
    df = pd.read_csv(training_data.path)
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_split, random_state=42
    )

    # Train model
    clf = GradientBoostingClassifier(n_estimators=n_estimators)
    clf.fit(X_train, y_train)

    # Log metrics
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    metrics_output.log_metric("accuracy", accuracy)

    # Save model
    joblib.dump(clf, model_output.path)


@dsl.component(base_image="python:3.10")
def deploy_model(model: Input[Model], endpoint_name: str):
    # Deploy to KServe or other serving infrastructure
    pass


@dsl.pipeline(name="churn-training-pipeline")
def training_pipeline(data_path: str, n_estimators: int = 200):
    # A plain string cannot be passed to an Input[Dataset]; import the URI as an artifact first
    raw_data = dsl.importer(artifact_uri=data_path, artifact_class=Dataset, reimport=False)
    preprocess_task = preprocess_data(raw_data=raw_data.output)
    train_task = train_model(
        training_data=preprocess_task.outputs["processed_data"],
        n_estimators=n_estimators,
    )
    deploy_task = deploy_model(
        model=train_task.outputs["model_output"],
        endpoint_name="churn-model",
    )


# Compile pipeline
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")

# Submit to KFP cluster
from kfp.client import Client

client = Client(host="https://kubeflow.example.com/pipeline")
client.create_run_from_pipeline_package("pipeline.yaml", arguments={"data_path": "s3://bucket/data/"})

Kubeflow vs Managed Platforms
| Aspect | Kubeflow | SageMaker Pipelines | Vertex AI Pipelines |
|---|---|---|---|
| Infrastructure | Self-managed K8s | Fully managed | Fully managed |
| Lock-in | None (portable) | AWS | GCP |
| Setup complexity | High (K8s expertise needed) | Low | Low |
| Customization | Full (custom operators) | Limited to step types | Moderate (KFP-based) |
| Cost | K8s cluster + ops | Pay per job | Pay per job |
| Multi-cloud | Yes (any K8s) | No | No |
| Best for | Teams with K8s expertise, multi-cloud | AWS-native teams | GCP-native teams |
Q3: How Does DVC Handle Data and Model Versioning?
Answer:
DVC (Data Version Control) extends Git to handle large files, datasets, and ML models. It tracks data/model versions using lightweight .dvc metafiles in Git while storing actual data in remote storage (S3, GCS, Azure Blob, NFS). Combined with DVC Pipelines, it enables reproducible ML experiments tracked alongside code.
graph TD
subgraph Git["Git Repository"]
CODE["Source Code"]
DVC_FILES[".dvc files<br/>(pointers to data)"]
DVC_YAML["dvc.yaml<br/>(pipeline definition)"]
DVC_LOCK["dvc.lock<br/>(exact versions)"]
end
subgraph Remote["DVC Remote Storage"]
S3_R["S3 / GCS / Azure Blob"]
NFS_R["NFS / SSH / HDFS"]
LOCAL_R["Local cache"]
end
subgraph Workflow["DVC Workflow"]
ADD["dvc add<br/>(track data)"]
PUSH["dvc push<br/>(upload to remote)"]
PULL["dvc pull<br/>(download data)"]
REPRO["dvc repro<br/>(reproduce pipeline)"]
METRICS["dvc metrics<br/>(compare experiments)"]
end
DVC_FILES --> Remote
CODE --> Git
ADD --> DVC_FILES
PUSH --> Remote
PULL --> Remote
DVC_YAML --> REPRO
REPRO --> DVC_LOCK
style Git fill:#6cc3d5,stroke:#333,color:#fff
style Remote fill:#56cc9d,stroke:#333,color:#fff
DVC Core Features
| Feature | Description | Command |
|---|---|---|
| Data tracking | Version large files without storing in Git | dvc add data/training.csv |
| Remote storage | Push/pull data to cloud or shared storage | dvc push / dvc pull |
| Pipelines | Define reproducible ML workflows (DAG) | dvc repro |
| Experiments | Branch/compare experiments efficiently | dvc exp run / dvc exp diff |
| Metrics | Track & compare metrics across experiments | dvc metrics show / dvc metrics diff |
| Plots | Visualize metrics (ROC, loss curves) | dvc plots show |
| Data registry | Share datasets across projects | dvc import / dvc get |
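The data tracking row relies on content-addressable storage: a file is identified by a hash of its bytes, and Git only stores a small pointer holding that hash. The sketch below illustrates the idea with the standard library. It is not DVC's code, the cache layout is only similar in spirit to DVC's, and the function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_in_cache(path, cache_dir):
    """Copy a file into the cache under a hash-derived name and return pointer metadata."""
    path = Path(path)
    file_hash = content_hash(path)
    target = Path(cache_dir) / file_hash[:2] / file_hash[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(path.read_bytes())
    return {"md5": file_hash, "size": path.stat().st_size, "path": path.name}


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "training.csv"
    data_file.write_bytes(b"hello")
    pointer = store_in_cache(data_file, Path(tmp) / "cache")
    print(pointer)
```

The returned dictionary plays the role of the small metafile committed to Git, while the cache directory stands in for local or remote storage.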
DVC Pipeline Definition
# dvc.yaml - ML pipeline stages
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.split_ratio
      - prepare.seed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.csv
    params:
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
    plots:
      - plots/loss_curve.csv:
          x: epoch
          y: loss
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          template: confusion
          x: predicted
          y: actual

DVC Experiment Workflow
# Initialize DVC in a Git repo
git init && dvc init
# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store
# Track a large dataset
dvc add data/training_data.parquet
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Add training data v1"
dvc push
# Define parameters (params.yaml)
cat > params.yaml << EOF
prepare:
  split_ratio: 0.2
  seed: 42
train:
  n_estimators: 200
  max_depth: 10
  learning_rate: 0.05
EOF
# Run pipeline (only re-runs changed stages)
dvc repro
# Run experiment with modified params
dvc exp run --set-param train.n_estimators=300 --set-param train.max_depth=12
# Compare experiments
dvc exp diff
dvc metrics diff
# Apply best experiment to workspace
dvc exp apply exp-abc123
git add . && git commit -m "Best model: 300 estimators"
dvc push

DVC vs Git LFS vs Lakehouse
| Aspect | DVC | Git LFS | Delta Lake / Lakehouse |
|---|---|---|---|
| Versioning | Content-addressable (hash) | Pointer files in Git | Table versioning (time travel) |
| Storage | Any remote (S3, GCS, NFS) | Git server (GitHub LFS) | Cloud storage (S3, ADLS) |
| Pipeline support | Yes (dvc.yaml) | No | No (needs orchestrator) |
| Experiment tracking | Built-in (dvc exp) | No | No |
| File types | Any (data, models, artifacts) | Any (but no dedup) | Tabular data (Parquet) |
| Deduplication | Yes (content-addressable cache) | No | Partial (file-level) |
| Best for | ML data/model versioning + pipelines | Large files in Git | Data lake versioning |
Q4: How Does Weights & Biases (W&B) Support ML Experiment Management?
Answer:
Weights & Biases (W&B) is a developer-focused ML platform providing experiment tracking, dataset versioning, hyperparameter sweeps, model evaluation, and collaboration. It’s known for its rich visualizations, real-time dashboards, and seamless framework integration. W&B can run as SaaS or self-hosted (on-prem / private cloud).
graph TD
subgraph WandB["Weights & Biases"]
EXPERIMENTS["Experiments<br/>(runs, groups, projects)"]
SWEEPS["Sweeps<br/>(hyperparameter optimization)"]
ARTIFACTS["Artifacts<br/>(data & model versioning)"]
TABLES["Tables<br/>(dataset visualization)"]
REPORTS["Reports<br/>(collaborative docs)"]
LAUNCH["Launch<br/>(job scheduling)"]
end
subgraph Integrations["Framework Integrations"]
PYTORCH["PyTorch / Lightning"]
TF_INT["TensorFlow / Keras"]
HF["HuggingFace Transformers"]
SKLEARN_INT["scikit-learn"]
LANGCHAIN["LangChain / LLMs"]
end
subgraph Deploy_Options["Deployment"]
SAAS["W&B Cloud (SaaS)"]
SELF["Self-Hosted (Docker)"]
DEDICATED["Dedicated Cloud"]
end
Integrations --> WandB
WandB --> SAAS
WandB --> SELF
style WandB fill:#6cc3d5,stroke:#333,color:#fff
style Integrations fill:#56cc9d,stroke:#333,color:#fff
W&B Core Products
| Product | Purpose | Key Feature |
|---|---|---|
| Experiments | Track & compare ML runs | Real-time dashboards, custom charts |
| Sweeps | Automated hyperparameter search | Bayesian, grid, random; early stopping |
| Artifacts | Version datasets, models, results | Lineage graph, deduplication |
| Tables | Interactive data exploration | Filter, group, visualize predictions |
| Reports | Collaborative experiment documentation | Embed charts, share findings |
| Launch | Job scheduling on any compute | Queue jobs to K8s, Slurm, cloud |
| Weave | LLM observability and evaluation | Trace chains, evaluate outputs |
| Models | Model registry with lineage | Link artifacts to model versions |
W&B Experiment Tracking Example
import wandb

# model, train_loader, val_loader, train_one_epoch, evaluate, epochs, y_test, y_pred,
# accuracy, f1, and auc are assumed to be defined by the surrounding training code.

# Initialize W&B run
wandb.init(
    project="churn-prediction",
    name="gbm-v3-velocity-features",
    config={
        "model": "GradientBoosting",
        "n_estimators": 200,
        "max_depth": 8,
        "learning_rate": 0.05,
        "feature_set": "v3-velocity",
    },
    tags=["production-candidate", "velocity-features"],
)

# Train with live metric logging
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
    })

# Log evaluation results
wandb.log({
    "test/accuracy": accuracy,
    "test/f1": f1,
    "test/auc_roc": auc,
    "confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test, preds=y_pred, class_names=["retain", "churn"]
    ),
})

# Log model as artifact
artifact = wandb.Artifact("churn-model", type="model")
artifact.add_file("model.pkl")
wandb.log_artifact(artifact)

wandb.finish()

W&B Sweeps (Hyperparameter Optimization)
import wandb
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# X_train, y_train, X_val, y_val are assumed to be prepared earlier.

# Define sweep configuration
sweep_config = {
    "method": "bayes",  # Bayesian optimization
    "metric": {"name": "val/f1", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"min": 50, "max": 500},
        "max_depth": {"values": [4, 6, 8, 10, 12]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 0.001, "max": 0.3},
        "subsample": {"min": 0.6, "max": 1.0},
    },
    # Early termination is most useful when the metric is logged once per iteration
    "early_terminate": {"type": "hyperband", "min_iter": 10},
}

# Create sweep
sweep_id = wandb.sweep(sweep_config, project="churn-prediction")


# Define training function
def train():
    wandb.init()
    config = wandb.config
    model = GradientBoostingClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        learning_rate=config.learning_rate,
        subsample=config.subsample,
    )
    model.fit(X_train, y_train)
    wandb.log({"val/f1": f1_score(y_val, model.predict(X_val))})


# Run sweep (distributed across agents)
wandb.agent(sweep_id, function=train, count=50)

W&B vs MLflow Comparison
| Feature | W&B | MLflow |
|---|---|---|
| Hosting | SaaS (default) + self-hosted | Self-hosted (default) + managed |
| UI/Visualization | Rich, interactive dashboards | Basic comparison UI |
| Hyperparameter sweeps | Built-in (Bayesian, early stop) | Not built-in (use Optuna etc.) |
| Collaboration | Reports, team dashboards | Basic sharing |
| Dataset versioning | Artifacts with lineage | Basic artifact logging |
| Cost | Free tier → paid per user | Free (open-source) |
| LLM support | Weave (tracing, eval) | MLflow Evaluate |
| Model serving | No (registry only) | Yes (mlflow models serve) |
| Best for | Teams wanting rich UI + managed service | Teams wanting open-source + flexibility |
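The sweep configuration earlier in this answer uses a `log_uniform_values` distribution for the learning rate. As a rough illustration of what that means, the sketch below draws values whose logarithm is uniform between the two bounds, so small learning rates are sampled as often per decade as large ones. This is a standalone approximation of the idea, not W&B's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(min_value, max_value, rng):
    """Draw a value whose logarithm is uniform on [log(min_value), log(max_value)]."""
    if not 0 < min_value < max_value:
        raise ValueError("bounds must satisfy 0 < min_value < max_value")
    return math.exp(rng.uniform(math.log(min_value), math.log(max_value)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(0.001, 0.3, rng) for _ in range(10_000)]

# Share of samples in the lowest decade [0.001, 0.01)
below_0_01 = sum(s < 0.01 for s in samples) / len(samples)
print(f"share below 0.01: {below_0_01:.2f}")
```

With these bounds, roughly 40% of draws fall below 0.01, since that decade covers ln(10)/ln(300) of the log-range. A plain uniform distribution would put only about 3% there.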
Q5: How Does Feast Provide a Cloud-Agnostic Feature Store?
Answer:
Feast (Feature Store) is an open-source feature store that manages ML features from ingestion to serving. It provides a consistent interface for feature retrieval across training (offline: batch) and inference (online: low-latency), with support for multiple backends (Redis, DynamoDB, BigQuery, PostgreSQL, Snowflake). Feast prevents training-serving skew and enables feature reuse across teams.
graph TD
subgraph FeastCore["Feast"]
REGISTRY_F["Feature Registry<br/>(definitions in code)"]
OFFLINE_F["Offline Store<br/>(historical features)"]
ONLINE_F["Online Store<br/>(low-latency serving)"]
MATERIALIZE["Materialization<br/>(offline → online)"]
end
subgraph OfflineBackends["Offline Backends"]
BQ["BigQuery"]
SNOWFLAKE["Snowflake"]
REDSHIFT["Redshift"]
SPARK_OFF["Spark / Parquet"]
PG_OFF["PostgreSQL"]
end
subgraph OnlineBackends["Online Backends"]
REDIS["Redis"]
DYNAMO["DynamoDB"]
PG_ON["PostgreSQL"]
SQLITE["SQLite"]
DATASTORE["Datastore"]
end
subgraph Consumers_F["Consumers"]
TRAIN_F["Training<br/>(get_historical_features)"]
SERVE_F["Inference<br/>(get_online_features)"]
end
REGISTRY_F --> OFFLINE_F
REGISTRY_F --> ONLINE_F
MATERIALIZE --> ONLINE_F
OFFLINE_F --> OfflineBackends
ONLINE_F --> OnlineBackends
OFFLINE_F --> TRAIN_F
ONLINE_F --> SERVE_F
style FeastCore fill:#6cc3d5,stroke:#333,color:#fff
Feast Architecture
| Component | Role | Example Backends |
|---|---|---|
| Feature Repository | Git repo with feature definitions (Python) | Any Git provider |
| Registry | Metadata about features, entities, data sources | File (S3/GCS), SQL, Snowflake |
| Offline Store | Historical feature retrieval for training | BigQuery, Snowflake, Redshift, Spark, file |
| Online Store | Low-latency feature retrieval for serving | Redis, DynamoDB, PostgreSQL, SQLite |
| Materialization | Sync latest feature values to online store | feast materialize (scheduled) |
| Feature Server | REST/gRPC API for online feature serving | feast serve (Go or Python) |
Feast Feature Definitions
# feature_repo/features.py
from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Float32, Int64, String
from datetime import timedelta
# Entity (primary key)
customer = Entity(
name="customer_id",
join_keys=["customer_id"],
description="Unique customer identifier",
)
# Offline data source (batch)
customer_spending_source = FileSource(
path="s3://bucket/features/customer_spending.parquet",
timestamp_field="event_timestamp",
created_timestamp_column="created_timestamp",
)
# Feature view (defines features + source + TTL)
customer_spending_fv = FeatureView(
name="customer_spending_features",
entities=[customer],
ttl=timedelta(days=90),
schema=[
Field(name="avg_spend_30d", dtype=Float32),
Field(name="transaction_count_7d", dtype=Int64),
Field(name="days_since_last_purchase", dtype=Int64),
Field(name="preferred_category", dtype=String),
],
source=customer_spending_source,
online=True, # Materialize to online store
tags={"team": "data-science", "version": "v3"},
)
# Push source for real-time features
realtime_source = PushSource(
name="realtime_spending_push",
batch_source=customer_spending_source,
)
realtime_spending_fv = FeatureView(
name="realtime_spending",
entities=[customer],
ttl=timedelta(hours=1),
schema=[
Field(name="current_session_spend", dtype=Float32),
Field(name="items_in_cart", dtype=Int64),
],
source=realtime_source,
online=True,
)

Feast Usage (Training & Serving)
from datetime import datetime

from feast import FeatureStore
import pandas as pd
store = FeatureStore(repo_path="feature_repo/")
# Training: Get historical features (point-in-time join)
entity_df = pd.DataFrame({
"customer_id": ["c001", "c002", "c003"],
"event_timestamp": pd.to_datetime(["2026-01-15", "2026-01-16", "2026-01-17"]),
})
training_df = store.get_historical_features(
entity_df=entity_df,
features=[
"customer_spending_features:avg_spend_30d",
"customer_spending_features:transaction_count_7d",
"customer_spending_features:days_since_last_purchase",
],
).to_df()
# Serving: Get online features (latest values, low latency)
online_features = store.get_online_features(
features=[
"customer_spending_features:avg_spend_30d",
"customer_spending_features:transaction_count_7d",
"realtime_spending:current_session_spend",
],
entity_rows=[{"customer_id": "c001"}, {"customer_id": "c002"}],
).to_dict()
# Materialize offline → online (run on schedule)
# feast materialize 2026-01-01T00:00:00 2026-05-21T00:00:00
store.materialize(
start_date=datetime(2026, 1, 1),
end_date=datetime(2026, 5, 21),
)

Feast vs Managed Feature Stores
| Aspect | Feast | SageMaker Feature Store | Vertex AI Feature Store |
|---|---|---|---|
| Open-source | Yes | No | No |
| Cloud lock-in | None | AWS | GCP |
| Online backends | Redis, DynamoDB, PG, etc. | DynamoDB (managed) | Bigtable (managed) |
| Offline backends | BigQuery, Snowflake, Spark, etc. | S3 + Athena | BigQuery |
| Setup | Self-managed | Fully managed | Fully managed |
| Point-in-time joins | Yes | Yes (via Athena) | Yes |
| Real-time ingestion | Push source API | PutRecord API | Streaming import |
| Best for | Multi-cloud, custom infra | AWS-native teams | GCP-native teams |
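The point-in-time join row is what prevents label leakage during training: for each entity row, only feature values recorded at or before the event timestamp, and no older than the feature view's TTL, may be used. The sketch below shows that lookup rule for a single entity using the standard library. It is a simplified illustration, not Feast's implementation, and the function name and sample values are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, event_time, ttl):
    """Return the latest value with timestamp <= event_time and age <= ttl, else None.

    feature_rows: list of (timestamp, value) tuples sorted by timestamp ascending.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_time) - 1
    if idx < 0:
        return None  # no feature value existed yet at event_time
    ts, value = feature_rows[idx]
    if event_time - ts > ttl:
        return None  # the most recent value is too old to be trusted
    return value


# Hypothetical history of avg_spend_30d for one customer.
history = [
    (datetime(2026, 1, 1), 10.0),
    (datetime(2026, 1, 10), 12.5),
    (datetime(2026, 1, 20), 15.0),
]
value = point_in_time_lookup(history, datetime(2026, 1, 15), timedelta(days=90))
print(value)  # 12.5: the 2026-01-20 value is in the future relative to the event
```

A naive join on the latest value would return 15.0 here, which leaks information that was not available when the event happened.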
Q6: How Do Seldon Core and KServe Serve Models on Kubernetes?
Answer:
Seldon Core and KServe (formerly KFServing) are Kubernetes-native model serving frameworks. They provide inference graphs, canary deployments, autoscaling (including scale-to-zero), A/B testing, multi-model serving, and model explainability — running on any K8s cluster with support for all major ML frameworks.
graph TD
subgraph Serving["K8s Model Serving"]
SELDON["Seldon Core<br/>(inference graphs)"]
KSERVE["KServe<br/>(serverless inference)"]
end
subgraph Features["Capabilities"]
CANARY["Canary / A/B<br/>Deployments"]
AUTOSCALE["Autoscaling<br/>(HPA + scale-to-zero)"]
GRAPH["Inference Graphs<br/>(pre/post processing)"]
MULTI["Multi-Model Serving<br/>(1000s of models)"]
EXPLAIN["Explainability<br/>(SHAP, Anchors)"]
MONITOR_S["Monitoring<br/>(Prometheus + Grafana)"]
end
subgraph Frameworks["Supported Frameworks"]
SKLEARN_S["scikit-learn"]
TF_S["TensorFlow"]
PYTORCH_S["PyTorch (TorchServe)"]
XGBOOST_S["XGBoost / LightGBM"]
TRITON["NVIDIA Triton"]
CUSTOM_S["Custom (any language)"]
MLFLOW_S["MLflow format"]
end
Serving --> Features
Frameworks --> Serving
style Serving fill:#6cc3d5,stroke:#333,color:#fff
style Features fill:#56cc9d,stroke:#333,color:#fff
Seldon Core vs KServe
| Feature | Seldon Core | KServe |
|---|---|---|
| Architecture | Custom CRD (SeldonDeployment) | Knative-based (InferenceService) |
| Scale-to-zero | With KEDA addon | Native (Knative serverless) |
| Inference graph | Rich (router, combiner, transformer) | Basic (transformer + predictor) |
| Multi-model | Yes (Triton integration) | Yes (ModelMesh) |
| Protocol | REST + gRPC (v2 protocol) | REST + gRPC (v2 protocol) |
| Canary | Traffic splitting in CRD | Canary via revision routing |
| Explainability | Built-in (Alibi Explain) | Explainer component |
| Monitoring | Prometheus metrics + drift (Alibi Detect) | Prometheus metrics |
| Best for | Complex inference pipelines | Serverless, simple deployments |
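Both frameworks implement canary rollouts by splitting traffic between model versions according to configured weights. The sketch below simulates such a weighted split to show what a 90/10 configuration means in practice. It is only an illustration of the routing idea. In a real cluster the split is performed by the ingress or service mesh, and `pick_predictor` is a hypothetical helper.

```python
import random


def pick_predictor(weights, rng):
    """Choose a predictor name with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)  # fixed seed so the simulation is repeatable
weights = {"default": 90, "canary": 10}
counts = {name: 0 for name in weights}
for _ in range(10_000):
    counts[pick_predictor(weights, rng)] += 1

canary_share = counts["canary"] / sum(counts.values())
print(f"canary share: {canary_share:.3f}")
```

Because routing is per request and random, the observed canary share only approaches 10% over many requests. Low-traffic services may need a longer observation window before a candidate can be judged.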
Seldon Core Deployment
# seldon-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier
  namespace: ml-serving
spec:
  predictors:
    - name: default
      replicas: 2
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: s3://models/churn/v3
        envSecretRefName: s3-credentials
        children: []
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                resources:
                  requests: { cpu: "500m", memory: "1Gi" }
                  limits: { cpu: "2", memory: "4Gi" }
      traffic: 90
      labels:
        version: v3
    - name: canary
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: s3://models/churn/v4-candidate
      traffic: 10
      labels:
        version: v4-candidate

KServe InferenceService
# kserve-inference.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-classifier
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/churn/v3
      resources:
        requests: { cpu: "500m", memory: "1Gi" }
        limits: { cpu: "2", memory: "4Gi" }
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 10  # Requests per pod before scaling
  transformer:
    containers:
      - name: feature-transformer
        image: myregistry/feature-transformer:v1
        resources:
          requests: { cpu: "200m", memory: "512Mi" }
  explainer:
    containers:
      - name: shap-explainer
        image: myregistry/shap-explainer:v1

BentoML Alternative
BentoML is a simpler model serving framework focused on developer experience — package models as “Bentos” (containers) and deploy anywhere:
import bentoml

# Save model to BentoML model store (run once, after training; `model` is the fitted estimator)
bentoml.sklearn.save_model("churn_classifier", model)


# Define service
@bentoml.service(resources={"cpu": "2", "memory": "4Gi"})
class ChurnClassifier:
    model_ref = bentoml.models.get("churn_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api
    def predict(self, input_data: dict) -> dict:
        features = preprocess(input_data)  # preprocess() is assumed to be defined elsewhere
        prediction = self.model.predict([features])[0]
        probability = self.model.predict_proba([features])[0]
        return {"prediction": int(prediction), "probability": float(probability[1])}

# Build & containerize: bentoml build && bentoml containerize churn_classifier:latest
# Deploy: docker run -p 3000:3000 churn_classifier:latest

Q7: How Does Great Expectations Validate ML Data Quality?
Answer:
Great Expectations (GX) is an open-source data validation framework that defines, tests, and documents data quality expectations. In MLOps, it validates training data, feature pipelines, and inference inputs — catching data issues before they degrade model performance. Expectations are defined as code and integrated into CI/CD and pipeline steps.
graph TD
subgraph GX["Great Expectations"]
SUITE["Expectation Suite<br/>(set of validation rules)"]
CHECKPOINT["Checkpoint<br/>(run validations)"]
DATASOURCE["Data Source<br/>(Pandas, Spark, SQL)"]
DOCS["Data Docs<br/>(HTML reports)"]
PROFILER["Profiler<br/>(auto-generate expectations)"]
end
subgraph Pipeline_GX["ML Pipeline Integration"]
TRAIN_DATA["Training Data<br/>(validate before training)"]
FEATURE_DATA["Feature Pipeline<br/>(validate transforms)"]
INFERENCE_DATA["Inference Input<br/>(validate at serving)"]
end
subgraph Actions["On Failure"]
BLOCK["Block Pipeline<br/>(fail step)"]
ALERT["Alert Team<br/>(Slack, email)"]
LOG_GX["Log to Monitoring"]
end
DATASOURCE --> SUITE
SUITE --> CHECKPOINT
CHECKPOINT --> DOCS
CHECKPOINT -->|"Fail"| Actions
TRAIN_DATA --> CHECKPOINT
FEATURE_DATA --> CHECKPOINT
INFERENCE_DATA --> CHECKPOINT
style GX fill:#6cc3d5,stroke:#333,color:#fff
style Actions fill:#ff6b6b,stroke:#333,color:#fff
Great Expectations Core Concepts
| Concept | Description | Example |
|---|---|---|
| Expectation | Single data assertion (like a unit test for data) | expect_column_values_to_not_be_null("age") |
| Expectation Suite | Collection of expectations for a dataset | “training_data_suite” with 50 rules |
| Validator | Applies expectations to a batch of data | Runs suite against DataFrame |
| Checkpoint | Orchestrates validation + actions on results | Run suite, generate docs, alert on failure |
| Data Source | Connection to data (Pandas, Spark, SQL, file) | PostgreSQL, S3 Parquet, BigQuery |
| Data Docs | Auto-generated HTML documentation of results | Hosted on S3/GCS for team access |
| Profiler | Auto-generates expectations from sample data | Bootstraps initial suite |
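To illustrate the "unit test for data" idea behind expectations and suites, here is a minimal plain-Python sketch. It does not use the Great Expectations API. The function names, result fields, and sample rows are hypothetical and only mirror the general shape of a validation result.

```python
def expect_not_null(rows, column):
    """Check that `column` is present and not None in every row."""
    unexpected = [row for row in rows if row.get(column) is None]
    return {
        "expectation": "not_null",
        "column": column,
        "success": not unexpected,
        "unexpected_count": len(unexpected),
    }


def expect_values_between(rows, column, min_value, max_value):
    """Check that every non-null value of `column` lies in [min_value, max_value]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "expectation": "values_between",
        "column": column,
        "success": not unexpected,
        "unexpected_count": len(unexpected),
    }


def run_suite(rows, expectations):
    """Apply each expectation to the batch; the suite passes only if all pass."""
    results = [check(rows) for check in expectations]
    return {"success": all(r["success"] for r in results), "results": results}


rows = [
    {"customer_id": "c001", "age": 34},
    {"customer_id": "c002", "age": 17},    # violates the age range
    {"customer_id": "c003", "age": None},  # null values are skipped by the range check
]
report = run_suite(rows, [
    lambda batch: expect_not_null(batch, "customer_id"),
    lambda batch: expect_values_between(batch, "age", 18, 120),
])
print(report["success"])
```

A pipeline step would typically fail when the report's `success` flag is false, which corresponds to the "Block Pipeline" action in the diagram.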
Defining Expectations
import great_expectations as gx
# Connect to data context
context = gx.get_context()
# Add data source
datasource = context.data_sources.add_pandas_filesystem(
name="training_data",
base_directory="data/processed/",
)
data_asset = datasource.add_csv_asset(name="train_csv", batching_regex=r"train_(?P<year>\d{4}).csv")
batch = data_asset.get_batch()
# Create expectation suite
suite = context.suites.add(gx.ExpectationSuite(name="training_data_quality"))
# Define expectations
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="target"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
column="age", min_value=18, max_value=120
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
column="monthly_spend", min_value=0, max_value=50000
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeInSet(
column="contract_type", value_set=["month-to-month", "one_year", "two_year"]
))
suite.add_expectation(gx.expectations.ExpectColumnMeanToBeBetween(
column="tenure_months", min_value=10, max_value=40
))
suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(
min_value=10000, max_value=1000000
))
suite.add_expectation(gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
column="customer_id", min_value=0.99, max_value=1.0
))
# Save suite
suite.save()

Running Validations in Pipelines
# Run checkpoint (in pipeline step)
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name="training_data_checkpoint",
validation_definitions=[
gx.ValidationDefinition(
name="validate_training",
data=batch,
suite=suite,
)
],
actions=[
gx.checkpoint.UpdateDataDocsAction(name="update_docs"),
],
)
)
result = checkpoint.run()
# Check result in pipeline
if not result.success:
    failed_expectations = [
        expectation_result.expectation_config.type
        for validation_result in result.run_results.values()
        for expectation_result in validation_result.results
        if not expectation_result.success
    ]
    raise ValueError(f"Data validation failed: {failed_expectations}")

Common ML Data Expectations
| Category | Expectations | Purpose |
|---|---|---|
| Completeness | No nulls in critical columns | Prevent training on missing data |
| Range validity | Values within expected bounds | Catch data pipeline errors |
| Schema | Column types, names, count match | Detect schema drift |
| Distribution | Mean, stddev, quantiles within range | Detect distribution shift |
| Uniqueness | ID columns are unique | Prevent duplicate records |
| Freshness | Max timestamp within expected window | Ensure data is recent |
| Referential | Foreign keys exist in reference table | Data integrity |
| Volume | Row count within expected range | Detect data loss/explosion |
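The suite shown earlier covers completeness, range, distribution, and uniqueness. The freshness, referential, and volume rows of the table can be sketched in plain Python to show what each check actually computes. This is an illustrative sketch only, not Great Expectations code: the function names, sample keys, and the six-hour window are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(timestamps, now, max_age):
    """Pass if the newest record is no older than max_age."""
    return bool(timestamps) and (now - max(timestamps)) <= max_age


def check_referential(foreign_keys, reference_keys):
    """Return the foreign keys that are missing from the reference set."""
    return sorted(set(foreign_keys) - set(reference_keys))


def check_volume(row_count, min_rows, max_rows):
    """Pass if the row count falls inside the expected window."""
    return min_rows <= row_count <= max_rows


# Hypothetical sample data: three events, the newest one three hours old.
now = datetime(2026, 1, 8, tzinfo=timezone.utc)
event_times = [now - timedelta(hours=h) for h in (30, 12, 3)]

fresh = check_freshness(event_times, now, max_age=timedelta(hours=6))
orphans = check_referential(["c1", "c2", "c9"], {"c1", "c2", "c3"})
volume_ok = check_volume(row_count=25_000, min_rows=10_000, max_rows=1_000_000)
print(fresh, orphans, volume_ok)
```

In a real pipeline these would normally be expressed as expectations in the suite so that failures show up in Data Docs alongside the other checks.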
Q8: How Does Apache Airflow Orchestrate ML Workflows?
Answer:
Apache Airflow is one of the most widely used open-source workflow orchestrators. Although it is not ML-specific, it is commonly used to orchestrate ML pipelines: data ingestion, feature engineering, model training, evaluation, and deployment are scheduled as tasks in a DAG (Directed Acyclic Graph). Airflow provides rich scheduling, retry logic, SLA monitoring, and provider packages for most major ML tools and cloud services.
graph TD
subgraph Airflow["Apache Airflow"]
SCHEDULER["Scheduler<br/>(trigger DAGs on schedule)"]
WEBSERVER["Web UI<br/>(monitor, trigger, debug)"]
EXECUTOR["Executor<br/>(run tasks)"]
META["Metadata DB<br/>(PostgreSQL)"]
end
subgraph Executors["Executor Types"]
LOCAL["Local Executor<br/>(single machine)"]
CELERY["Celery Executor<br/>(distributed workers)"]
K8S_EX["Kubernetes Executor<br/>(pod per task)"]
end
subgraph ML_DAG["ML DAG"]
INGEST["Ingest Data"]
VALIDATE["Validate (GX)"]
FEATURE["Feature Engineering"]
TRAIN_AF["Train Model"]
EVAL_AF["Evaluate"]
DEPLOY_AF["Deploy"]
end
SCHEDULER --> EXECUTOR --> ML_DAG
EXECUTOR --> Executors
style Airflow fill:#6cc3d5,stroke:#333,color:#fff
style ML_DAG fill:#56cc9d,stroke:#333,color:#fff
Airflow for ML — Key Concepts
| Concept | Description | ML Use |
|---|---|---|
| DAG | Directed Acyclic Graph of tasks | ML pipeline (train → eval → deploy) |
| Operator | Template for a task type | BashOperator, PythonOperator, KubernetesPodOperator |
| Sensor | Wait for external condition | S3KeySensor (new data arrival) |
| XCom | Pass data between tasks | Model metrics, S3 paths |
| Connections | Store external service credentials | AWS, GCP, database connections |
| Variables | Store configuration values | Model thresholds, feature versions |
| Pools | Limit concurrent tasks | GPU pool (max 4 concurrent training) |
| TaskGroup | Organize related tasks visually | Group all feature engineering tasks |
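To make the DAG row concrete, the sketch below resolves task dependencies into an execution order with the standard library's `graphlib`. It illustrates the concept only and is not how the Airflow scheduler is implemented; the task names and the `is_valid_dag` helper are assumptions for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of upstream tasks it depends on.
dependencies = {
    "validate": {"ingest"},
    "build_features": {"validate"},
    "train": {"build_features"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# A valid DAG always has at least one order that respects every dependency.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)


def is_valid_dag(deps):
    """Return False if the dependency graph contains a cycle."""
    try:
        TopologicalSorter(deps).prepare()
        return True
    except CycleError:
        return False
```

The "acyclic" requirement is what makes scheduling possible: if `train` depended on `deploy` and `deploy` on `train`, neither task could ever start.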
ML Training DAG Example
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.providers.amazon.aws.operators.sagemaker import (
    SageMakerEndpointOperator,
    SageMakerTrainingOperator,
)
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# evaluate_model and send_slack_alert are assumed to be defined or imported elsewhere in the project.
# evaluate_model must return a dict containing "f1_score"; the return value is pushed to XCom.

default_args = {
    "owner": "ml-team",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,
    "email": ["ml-team@company.com"],
}

with DAG(
    dag_id="churn_model_training",
    default_args=default_args,
    schedule="@weekly",  # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2026, 1, 1),
    catchup=False,
    tags=["ml", "churn", "production"],
) as dag:
    # Wait for new data
    wait_for_data = S3KeySensor(
        task_id="wait_for_data",
        bucket_name="data-lake",
        bucket_key="churn/weekly/{{ ds }}/_SUCCESS",
        timeout=3600,
    )
    # Validate data quality
    validate_data = KubernetesPodOperator(
        task_id="validate_data",
        image="myregistry/data-validator:v2",
        cmds=["python", "validate.py"],
        arguments=["--date={{ ds }}", "--suite=training_quality"],
        namespace="ml-pipelines",
        get_logs=True,
    )
    # Feature engineering
    build_features = KubernetesPodOperator(
        task_id="build_features",
        image="myregistry/feature-builder:v3",
        cmds=["python", "build_features.py"],
        arguments=["--date={{ ds }}", "--output=s3://features/churn/{{ ds }}/"],
        namespace="ml-pipelines",
        # Recent provider versions take container_resources instead of the old resources dict
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "8Gi", "cpu": "4"},
        ),
    )
    # Train model
    train_model = SageMakerTrainingOperator(
        task_id="train_model",
        config={
            "TrainingJobName": "churn-{{ ds_nodash }}",
            "AlgorithmSpecification": {"TrainingImage": "...", "TrainingInputMode": "File"},
            "InputDataConfig": [{"ChannelName": "train", "DataSource": {...}}],
            "OutputDataConfig": {"S3OutputPath": "s3://models/churn/"},
            "ResourceConfig": {"InstanceType": "ml.p3.2xlarge", "InstanceCount": 1},
        },
    )
    # Evaluate model
    evaluate = PythonOperator(
        task_id="evaluate_model",
        python_callable=evaluate_model,
        op_kwargs={"model_path": "s3://models/churn/{{ ds_nodash }}/"},
    )

    # Branch: deploy or alert
    def check_metrics(**context):
        metrics = context["ti"].xcom_pull(task_ids="evaluate_model")
        if metrics["f1_score"] >= 0.85:
            return "deploy_model"
        return "alert_team"

    branch = BranchPythonOperator(
        task_id="check_metrics",
        python_callable=check_metrics,
    )
    deploy = SageMakerEndpointOperator(
        task_id="deploy_model",
        operation="update",
        config={...},
    )
    alert = PythonOperator(
        task_id="alert_team",
        python_callable=send_slack_alert,
    )
    # DAG dependencies
    wait_for_data >> validate_data >> build_features >> train_model >> evaluate >> branch
    branch >> [deploy, alert]
Airflow ML Provider Packages
| Provider | Operators | Use Case |
|---|---|---|
| amazon | SageMaker (Training, Endpoint, Transform) | AWS ML jobs |
| google | Vertex AI (Training, Prediction, AutoML) | GCP ML jobs |
| microsoft.azure | AzureML (Run, Endpoint) | Azure ML jobs |
| cncf.kubernetes | KubernetesPodOperator | Any containerized task |
| databricks | DatabricksRunNow, DatabricksSubmitRun | Spark/ML on Databricks |
| dbt | DbtCloudRunJob, DbtRunOperator | Data transformation |
Airflow vs ML-Specific Orchestrators
| Feature | Apache Airflow | Kubeflow Pipelines | SageMaker Pipelines |
|---|---|---|---|
| Scope | General workflow orchestration | ML-specific on K8s | ML-specific on AWS |
| Scheduling | Rich (cron, sensors, data-aware) | Basic (cron, manual) | EventBridge, API |
| ML integration | Via operators/providers | Native (KFP components) | Native (step types) |
| Caching | Manual (check before run) | Built-in (step-level) | Built-in (step-level) |
| Data lineage | Via plugins (OpenLineage) | Built-in artifacts | Built-in |
| Learning curve | Moderate | High (K8s + KFP) | Low (SDK) |
| Best for | Mixed workloads (data + ML) | K8s-native ML teams | AWS-native ML teams |
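The "Manual (check before run)" caching entry for Airflow can be sketched as follows: hash a step's parameters and input files, and skip the step when a marker for that hash already exists. This is a minimal sketch under stated assumptions, not an Airflow feature. The helper names, the `.done` marker convention, and the local cache directory are all hypothetical; a production setup would more likely keep markers in object storage.

```python
import hashlib
import json
from pathlib import Path


def cache_key(params, input_files):
    """Hash the step parameters and input file contents into a stable key."""
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode())
    for path in sorted(str(p) for p in input_files):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def run_if_changed(step, params, input_files, cache_dir):
    """Run step(params) unless a marker for the same key already exists."""
    marker = Path(cache_dir) / f"{cache_key(params, input_files)}.done"
    if marker.exists():
        return "skipped"
    step(params)
    # Only mark the step as done after it finished without raising.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return "ran"
```

Inside Airflow, a function like this could be called from a `PythonOperator` callable, or its check could drive a `ShortCircuitOperator` so that downstream tasks are skipped when nothing changed.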
Q9: How Do You Use Terraform/Pulumi for ML Infrastructure as Code?
Answer:
Infrastructure as Code (IaC) for ML ensures reproducible, version-controlled environments across development, staging, and production. Terraform (HCL) and Pulumi (Python/TypeScript) define ML infrastructure — compute clusters, model endpoints, feature stores, networking, IAM — as code that’s reviewed, tested, and deployed through CI/CD.
graph TD
subgraph IaC["Infrastructure as Code"]
TF["Terraform<br/>(HCL, declarative)"]
PULUMI["Pulumi<br/>(Python/TS, declarative via general-purpose code)"]
CDK["AWS CDK / Bicep<br/>(cloud-specific)"]
end
subgraph MLInfra["ML Infrastructure"]
COMPUTE["Compute<br/>(GPU clusters, K8s, VMs)"]
STORAGE_I["Storage<br/>(S3, GCS, ADLS)"]
NETWORK["Networking<br/>(VPC, subnets, endpoints)"]
SERVE_I["Serving<br/>(endpoints, load balancers)"]
MONITOR_I["Monitoring<br/>(CloudWatch, Prometheus)"]
IAM_I["IAM<br/>(roles, policies)"]
end
subgraph Workflow_IaC["IaC Workflow"]
CODE_I["Write Code<br/>(tf / pulumi)"]
PLAN["Plan<br/>(preview changes)"]
REVIEW["Code Review<br/>(PR approval)"]
APPLY["Apply<br/>(provision infra)"]
end
IaC --> MLInfra
CODE_I --> PLAN --> REVIEW --> APPLY
style IaC fill:#6cc3d5,stroke:#333,color:#fff
style MLInfra fill:#56cc9d,stroke:#333,color:#fff
Why IaC for ML?
| Challenge | Without IaC | With IaC |
|---|---|---|
| Environment drift | Manual setup differs between envs | Identical infrastructure everywhere |
| Audit trail | Who changed what? | Git history tracks all changes |
| Disaster recovery | Manual rebuild | terraform apply recreates everything |
| Team onboarding | Undocumented setup steps | Self-documenting code |
| Cost control | Forgotten resources running | Destroy unused envs: terraform destroy |
| Compliance | Manual security checks | Policy-as-code (Sentinel, OPA) |
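The policy-as-code row is usually implemented with OPA (Rego) or Sentinel. The Python sketch below shows the same idea against the JSON produced by `terraform show -json` on a saved plan: walk `resource_changes` and flag SageMaker endpoint configurations that set no `kms_key_arn`. It is a sketch, not a replacement for a policy engine. The function name and the sample plan (including the `fraud` resource) are hypothetical, and a real policy would also need to consult `after_unknown`, because values computed at apply time do not appear in `after`.

```python
def find_unencrypted_endpoint_configs(plan):
    """Return addresses of planned endpoint configs that set no KMS key."""
    violations = []
    for resource in plan.get("resource_changes", []):
        if resource.get("type") != "aws_sagemaker_endpoint_configuration":
            continue
        change = resource.get("change") or {}
        if change.get("actions") == ["delete"]:
            continue  # resources being destroyed cannot violate the policy
        after = change.get("after") or {}
        if not after.get("kms_key_arn"):
            violations.append(resource["address"])
    return violations


# Hypothetical, heavily trimmed plan document for illustration.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_sagemaker_endpoint_configuration.churn",
            "type": "aws_sagemaker_endpoint_configuration",
            "change": {"actions": ["create"], "after": {"kms_key_arn": None}},
        },
        {
            "address": "aws_sagemaker_endpoint_configuration.fraud",
            "type": "aws_sagemaker_endpoint_configuration",
            "change": {
                "actions": ["create"],
                "after": {"kms_key_arn": "arn:aws:kms:us-east-1:111122223333:key/example"},
            },
        },
        {
            "address": "aws_s3_bucket.ml_artifacts",
            "type": "aws_s3_bucket",
            "change": {"actions": ["create"], "after": {}},
        },
    ]
}

violations = find_unencrypted_endpoint_configs(sample_plan)
print(violations)
```

A CI job would typically fail the pull request when the returned list is non-empty.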
Terraform for SageMaker
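# main.tf - SageMaker ML infrastructure
# Assumes the var.* inputs, aws_kms_key.ml_key, and aws_security_group.sagemaker are declared elsewhere in the module.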
terraform {
required_providers {
aws = { source = "hashicorp/aws", version = "~> 5.0" }
}
backend "s3" {
bucket = "terraform-state-ml"
key = "ml-platform/terraform.tfstate"
region = "us-east-1"
}
}
# IAM Role for SageMaker
resource "aws_iam_role" "sagemaker_execution" {
name = "sagemaker-execution-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "sagemaker.amazonaws.com" }
}]
})
}
resource "aws_iam_role_policy_attachment" "sagemaker_full" {
role = aws_iam_role.sagemaker_execution.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
# S3 bucket for ML artifacts
resource "aws_s3_bucket" "ml_artifacts" {
bucket = "ml-artifacts-${var.environment}"
tags = { Environment = var.environment, Team = "ml-platform" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "ml_artifacts" {
bucket = aws_s3_bucket.ml_artifacts.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.ml_key.arn
}
}
}
# SageMaker Domain (Studio)
resource "aws_sagemaker_domain" "ml_studio" {
domain_name = "ml-studio-${var.environment}"
auth_mode = "IAM"
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
default_user_settings {
execution_role = aws_iam_role.sagemaker_execution.arn
security_groups = [aws_security_group.sagemaker.id]
}
}
# SageMaker Model (for endpoint)
resource "aws_sagemaker_model" "churn" {
name = "churn-model-${var.model_version}"
execution_role_arn = aws_iam_role.sagemaker_execution.arn
primary_container {
image = var.inference_image
model_data_url = "s3://${aws_s3_bucket.ml_artifacts.id}/models/churn/${var.model_version}/model.tar.gz"
}
vpc_config {
subnets = var.private_subnet_ids
security_group_ids = [aws_security_group.sagemaker.id]
}
}
# SageMaker Endpoint
resource "aws_sagemaker_endpoint_configuration" "churn" {
name = "churn-endpoint-config-${var.model_version}"
production_variants {
variant_name = "primary"
model_name = aws_sagemaker_model.churn.name
initial_instance_count = var.endpoint_instance_count
instance_type = var.endpoint_instance_type
}
data_capture_config {
enable_capture = true
initial_sampling_percentage = 20
destination_s3_uri = "s3://${aws_s3_bucket.ml_artifacts.id}/data-capture/"
capture_options { capture_mode = "Input" }
capture_options { capture_mode = "Output" }
}
}
resource "aws_sagemaker_endpoint" "churn" {
name = "churn-prediction-${var.environment}"
endpoint_config_name = aws_sagemaker_endpoint_configuration.churn.name
tags = { Environment = var.environment }
}
# Auto-scaling
resource "aws_appautoscaling_target" "endpoint" {
max_capacity = 10
min_capacity = var.environment == "production" ? 2 : 1
resource_id = "endpoint/${aws_sagemaker_endpoint.churn.name}/variant/primary"
scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
service_namespace = "sagemaker"
}
Pulumi for ML (Python)
import pulumi
import pulumi_aws as aws
import pulumi_kubernetes as k8s
# Configuration
config = pulumi.Config()
environment = config.require("environment")
# S3 bucket for ML artifacts
ml_bucket = aws.s3.Bucket(
f"ml-artifacts-{environment}",
bucket=f"ml-artifacts-{environment}",
server_side_encryption_configuration={
"rule": {"apply_server_side_encryption_by_default": {"sse_algorithm": "aws:kms"}}
},
tags={"Environment": environment, "Team": "ml-platform"},
)
# Kubernetes namespace for ML workloads
ml_namespace = k8s.core.v1.Namespace(
"ml-serving",
metadata={"name": f"ml-serving-{environment}"},
)
# Deploy KServe InferenceService via Pulumi K8s
inference_service = k8s.apiextensions.CustomResource(
"churn-model",
api_version="serving.kserve.io/v1beta1",
kind="InferenceService",
metadata={"name": "churn-classifier", "namespace": ml_namespace.metadata.name},
spec={
"predictor": {
"model": {
"modelFormat": {"name": "sklearn"},
"storageUri": pulumi.Output.concat("s3://", ml_bucket.id, "/models/churn/latest"),
},
"minReplicas": 1 if environment == "dev" else 2,
"maxReplicas": 10,
}
},
)
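# Export the InferenceService name (the serving URL itself is assigned by KServe after deployment)
pulumi.export("inference_service_name", inference_service.metadata.name)
pulumi.export("bucket_name", ml_bucket.id)
IaC Best Practices for ML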
| Practice | Implementation |
|---|---|
| Module per component | modules/sagemaker-endpoint/, modules/feature-store/ |
| Environment separation | envs/dev/, envs/staging/, envs/prod/ (different tfvars) |
| State locking | S3 + DynamoDB (Terraform) or Pulumi Cloud |
| Policy-as-code | OPA/Sentinel to enforce security (no public endpoints, encryption required) |
| Cost estimation | infracost in CI to preview cost changes |
| Drift detection | Scheduled terraform plan to detect manual changes |
| Secrets management | AWS Secrets Manager / Vault (never in state) |
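The drift-detection row relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. A scheduled job could wrap it roughly as below. This is a hedged sketch: the status labels and function names are assumptions, and it presumes the `terraform` binary is on the PATH and the working directory has already been initialized.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift_or_pending_changes"}


def interpret_plan_exit_code(code):
    """Translate a terraform plan exit code into a status label."""
    return PLAN_STATUS.get(code, "error")


def detect_drift(workdir):
    """Run a plan in workdir and report whether real infra matches the code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # exit code 2 is an expected outcome, not a failure
    )
    return interpret_plan_exit_code(result.returncode)
```

Note that exit code 2 does not distinguish manual changes from code changes that have not been applied yet, so the job is most useful when run against the branch that was last applied.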
Q10: How Do You Build a Best-of-Breed Cloud-Agnostic MLOps Stack?
Answer:
A cloud-agnostic MLOps stack combines best-of-breed open-source tools to cover the full ML lifecycle — avoiding vendor lock-in while maintaining production-grade capabilities. The key is choosing tools that integrate well, have active communities, and support your deployment targets (cloud, on-prem, hybrid).
graph TD
subgraph Stack["Cloud-Agnostic MLOps Stack"]
subgraph Versioning["Versioning & Tracking"]
DVC_S["DVC<br/>(data/model versioning)"]
MLFLOW_S["MLflow / W&B<br/>(experiment tracking)"]
end
subgraph Orchestration["Orchestration"]
AIRFLOW_S["Airflow / Prefect<br/>(workflow scheduling)"]
KFP_S["Kubeflow Pipelines<br/>(ML DAGs on K8s)"]
end
subgraph Data["Data Quality & Features"]
GX_S["Great Expectations<br/>(data validation)"]
FEAST_S["Feast<br/>(feature store)"]
end
subgraph Serving_S["Model Serving"]
SELDON_S["Seldon / KServe<br/>(K8s inference)"]
BENTO_S["BentoML<br/>(model packaging)"]
end
subgraph Monitoring_S["Monitoring"]
EVIDENTLY["Evidently AI<br/>(drift detection)"]
PROM["Prometheus + Grafana<br/>(metrics & dashboards)"]
end
subgraph Infra_S["Infrastructure"]
TF_S["Terraform / Pulumi<br/>(IaC)"]
K8S_S["Kubernetes<br/>(runtime platform)"]
end
end
style Stack fill:#f8f9fa,stroke:#333
style Versioning fill:#6cc3d5,stroke:#333,color:#fff
style Orchestration fill:#56cc9d,stroke:#333,color:#fff
style Data fill:#ffce67,stroke:#333
style Serving_S fill:#ff6b6b,stroke:#333,color:#fff
style Monitoring_S fill:#c3aed6,stroke:#333
style Infra_S fill:#78c2ad,stroke:#333,color:#fff
Reference Stack by MLOps Stage
| Stage | Open-Source Tool | Alternative | Purpose |
|---|---|---|---|
| Data versioning | DVC | LakeFS, Delta Lake | Track data/model versions alongside code |
| Experiment tracking | MLflow | W&B, Neptune, CometML | Log metrics, params, compare runs |
| Pipeline orchestration | Apache Airflow | Prefect, Dagster, Flyte | Schedule and orchestrate workflows |
| ML pipelines | Kubeflow Pipelines | Metaflow, ZenML | ML-specific DAGs with caching |
| Data validation | Great Expectations | Pandera, Deequ, TFDV | Validate data quality |
| Feature store | Feast | Hopsworks, Tecton | Consistent feature serving |
| Model serving | Seldon Core / KServe | BentoML, Ray Serve, Triton | Low-latency inference on K8s |
| Monitoring | Evidently AI | NannyML, Whylabs, Arize | Drift detection, model quality |
| Infrastructure | Terraform | Pulumi, Crossplane | Provision and manage infra |
| Container runtime | Kubernetes | Nomad, Docker Swarm | Run all workloads |
| CI/CD | GitHub Actions | GitLab CI, Jenkins, Argo CD | Build, test, deploy |
| Secrets | HashiCorp Vault | Sealed Secrets, SOPS | Manage credentials |
Example Integration Architecture
# docker-compose.yml - Local MLOps development stack
version: "3.8"
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:2.15.0
    ports: ["5000:5000"]
    command: >
      mlflow server --host 0.0.0.0
      --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
  feast:
    image: feastdev/feature-server:0.38.0
    ports: ["6566:6566"]
    volumes: ["./feature_repo:/feature_repo"]
    command: feast serve --host 0.0.0.0
  great-expectations:
    image: greatexpectations/great_expectations:latest
    volumes: ["./gx:/gx"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes: ["./grafana/dashboards:/var/lib/grafana/dashboards"]
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
  redis:
    image: redis:7
    ports: ["6379:6379"]
Decision Framework: When to Use What
| Scenario | Recommended Stack | Why |
|---|---|---|
| Startup, AWS-only | SageMaker (managed) | Fastest to production, no ops overhead |
| Enterprise, multi-cloud | Kubeflow + MLflow + Feast + Seldon | Portable, no lock-in |
| Small team, quick iteration | MLflow + DVC + BentoML + GitHub Actions | Simple, low overhead |
| Regulated industry | Cloud-managed + Terraform + OPA | Compliance, audit trail |
| On-prem/hybrid | Kubeflow + Feast + Airflow + Terraform | Full control, any environment |
| Large org, many teams | W&B + Feast + Airflow + KServe + Terraform | Collaboration, governance |
Migration Strategy (Cloud → Agnostic)
| Step | Action | Risk Mitigation |
|---|---|---|
| 1 | Abstract model packaging — use MLflow model format | Standard format works everywhere |
| 2 | Adopt Feast — decouple feature serving from cloud feature store | Dual-write during transition |
| 3 | Containerize training — Docker + KFP components | Runs on any K8s cluster |
| 4 | IaC everything — Terraform modules per provider | Swap providers by changing modules |
| 5 | Portable CI/CD — GitHub Actions with provider-agnostic steps | Same workflow, different targets |
| 6 | Monitoring abstraction — Evidently + Prometheus (cloud-agnostic) | Consistent metrics everywhere |
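The "dual-write during transition" mitigation in step 2 can be sketched as a thin wrapper that writes every feature update to both stores, keeps serving reads from the current primary, and records disagreements so the new store can be verified before cutover. This is an illustrative sketch only. The class name is hypothetical and plain dicts stand in for the cloud feature store and Feast, whose real client APIs are not shown here.

```python
class DualWriteFeatureStore:
    """Write to both stores; read from the primary and track mismatches."""

    def __init__(self, primary, secondary):
        self.primary = primary      # store currently serving production reads
        self.secondary = secondary  # store being migrated to
        self.mismatches = []

    def write(self, entity_id, features):
        # Copy the payload so later mutation by the caller affects neither store.
        self.primary[entity_id] = dict(features)
        self.secondary[entity_id] = dict(features)

    def read(self, entity_id):
        value = self.primary.get(entity_id)
        if value != self.secondary.get(entity_id):
            self.mismatches.append(entity_id)
        return value


# Hypothetical state: c1 predates the migration and exists only in the old store.
legacy_store = {"c1": {"tenure_months": 14}}
new_store = {}
store = DualWriteFeatureStore(primary=legacy_store, secondary=new_store)

store.write("c2", {"tenure_months": 3})
c1 = store.read("c1")  # served from the primary, flagged as not yet backfilled
c2 = store.read("c2")  # present in both stores
print(c1, c2, store.mismatches)
```

Once the mismatch list stays empty after a backfill, the primary and secondary roles can be swapped and the old store retired.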
Summary Table
| # | Topic | Key Tools |
|---|---|---|
| 1 | Experiment Tracking | MLflow (Tracking, Registry, Projects, Models) |
| 2 | ML Pipelines on K8s | Kubeflow Pipelines (KFP), Katib, KServe |
| 3 | Data & Model Versioning | DVC (dvc add, dvc repro, dvc exp) |
| 4 | Experiment Management | Weights & Biases (Experiments, Sweeps, Artifacts) |
| 5 | Feature Store | Feast (online/offline stores, point-in-time joins) |
| 6 | Model Serving on K8s | Seldon Core, KServe, BentoML |
| 7 | Data Validation | Great Expectations (suites, checkpoints, Data Docs) |
| 8 | Workflow Orchestration | Apache Airflow (DAGs, operators, sensors) |
| 9 | ML Infrastructure as Code | Terraform, Pulumi (multi-cloud IaC) |
| 10 | Best-of-Breed Stack | Reference architecture combining all tools |
What’s Next?
This article covered cloud-agnostic MLOps tools. For related content:
- General MLOps concepts: MLOps Interview QA - 1
- Azure MLOps: MLOps Interview QA - 2
- GCP MLOps: MLOps Interview QA - 3
- AWS MLOps: MLOps Interview QA - 4
- LLMOps: LLMOps Interview QA - 1
- DevOps foundations: DevOps Interview QA - 1