MLOps Interview QA - 5

10 cloud-agnostic MLOps interview questions covering open-source and third-party tools — MLflow, Kubeflow, DVC, Weights & Biases, Feast, Seldon Core, BentoML, Great Expectations, Apache Airflow, and Terraform/Pulumi for ML infrastructure.

Author

Vectoring AI

Published

21 May 2026

Keywords

MLOps open source, MLflow, Kubeflow, DVC, Weights and Biases, Feast feature store, Seldon Core, BentoML, Great Expectations, Apache Airflow ML, Terraform ML, cloud-agnostic MLOps

Introduction

This is Part 5 of our MLOps Interview QA series, focused on cloud-agnostic and third-party MLOps tools. While cloud providers offer integrated platforms (Azure ML, Vertex AI, SageMaker), many teams prefer open-source or vendor-neutral tools to avoid lock-in, support multi-cloud strategies, or leverage best-of-breed capabilities. This article covers the most widely adopted tools across the MLOps lifecycle — experiment tracking, pipeline orchestration, data/model versioning, feature stores, model serving, data validation, and infrastructure as code.

For cloud-specific MLOps, see MLOps Interview QA - 2 (Azure), MLOps Interview QA - 3 (GCP), MLOps Interview QA - 4 (AWS). For general MLOps concepts, see MLOps Interview QA - 1.

Q1: How Does MLflow Provide End-to-End Experiment Tracking and Model Management?

Answer:

MLflow is one of the most widely adopted open-source ML lifecycle platforms. It provides four core components: Tracking (log experiments), Projects (reproducible runs), Models (packaging standard), and Model Registry (versioning + staging). It runs anywhere (locally, on-prem, or on any cloud) and ships built-in model flavors for the major ML frameworks, including scikit-learn, PyTorch, and TensorFlow.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph MLflow["MLflow Platform"]
        TRACKING["MLflow Tracking<br/>(experiments, metrics, params)"]
        PROJECTS["MLflow Projects<br/>(reproducible packaging)"]
        MODELS["MLflow Models<br/>(multi-flavor packaging)"]
        REGISTRY["MLflow Model Registry<br/>(versioning, staging)"]
    end

    subgraph Backends["Backend Options"]
        LOCAL["Local filesystem"]
        DB["Database<br/>(PostgreSQL, MySQL)"]
        S3_ART["Artifact Store<br/>(S3, GCS, ADLS, HDFS)"]
        MANAGED["Managed<br/>(Databricks, AWS, Azure)"]
    end

    subgraph Serve["Serving"]
        REST["MLflow serve<br/>(REST API)"]
        DOCKER["Docker container"]
        CLOUD["Cloud deploy<br/>(SageMaker, AzureML)"]
        SPARK_S["Spark UDF"]
    end

    TRACKING --> DB
    TRACKING --> S3_ART
    MODELS --> REST
    MODELS --> DOCKER
    MODELS --> CLOUD
    MODELS --> SPARK_S
    REGISTRY --> MODELS

    style MLflow fill:#6cc3d5,stroke:#333,color:#fff
    style Backends fill:#56cc9d,stroke:#333,color:#fff
    style Serve fill:#fff

MLflow Components

Component	Purpose	Key Features
Tracking	Log parameters, metrics, artifacts per run	UI comparison, search API, autolog
Projects	Package ML code for reproducibility	`MLproject` file, conda/docker envs
Models	Standard model packaging format	Multi-flavor (sklearn, pytorch, tf, custom)
Model Registry	Centralized model versioning & lifecycle	Stages (None → Staging → Production → Archived); deprecated since MLflow 2.9 in favor of aliases and tags
Evaluate	Automated model evaluation	Built-in metrics, LLM evaluation
Recipes	Opinionated ML workflow templates	Regression, classification pipelines (deprecated in recent releases)

Experiment Tracking Example

import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# X_train, X_test, y_train, y_test are assumed to be prepared upstream

# Configure tracking server (self-hosted or managed)
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {"n_estimators": 200, "max_depth": 10, "min_samples_split": 5}
    mlflow.log_params(params)
    mlflow.log_param("feature_version", "v3")

    # Train
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log model with an input/output signature inferred from the test data
    signature = infer_signature(X_test, y_pred)
    mlflow.sklearn.log_model(model, "model", signature=signature)

    # Log artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")

Model Registry Workflow

import mlflow
from mlflow import MlflowClient

client = MlflowClient()

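# Register model from a run (run_id is assumed to come from the tracking run, e.g. run.info.run_id)
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, "churn-classifier")

# Transition to staging (stage-based API; deprecated since MLflow 2.9 in favor of aliases)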
client.transition_model_version_stage(
    name="churn-classifier",
    version=mv.version,
    stage="Staging",
)

# After validation, promote to production
client.transition_model_version_stage(
    name="churn-classifier",
    version=mv.version,
    stage="Production",
    archive_existing_versions=True,  # Archive previous production version
)

# Load production model for serving
model = mlflow.pyfunc.load_model("models:/churn-classifier/Production")
predictions = model.predict(new_data)
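
To make the promotion semantics concrete, here is a minimal in-memory sketch of the stage lifecycle. It is not MLflow code: `ToyRegistry` and its methods are hypothetical names, and the only behavior it imitates is that promoting with `archive_existing_versions=True` moves whichever version previously held the target stage to Archived.

```python
from dataclasses import dataclass, field

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ToyRegistry:
    """In-memory stand-in for a model registry (hypothetical, not the MLflow API)."""

    versions: dict = field(default_factory=dict)  # version number -> current stage

    def register(self) -> int:
        # New versions start in the "None" stage
        version = len(self.versions) + 1
        self.versions[version] = "None"
        return version

    def transition(self, version: int, stage: str, archive_existing: bool = False) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if archive_existing:
            # Archive every other version that currently holds the target stage
            for other, current in self.versions.items():
                if other != version and current == stage:
                    self.versions[other] = "Archived"
        self.versions[version] = stage


registry = ToyRegistry()
v1 = registry.register()
registry.transition(v1, "Production")

v2 = registry.register()
registry.transition(v2, "Staging")
registry.transition(v2, "Production", archive_existing=True)

print(registry.versions)  # {1: 'Archived', 2: 'Production'}
```

The point of the sketch is that a consumer loading `models:/churn-classifier/Production` never needs to know the version number; the registry resolves the stage to whichever version currently holds it.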

MLflow Deployment Options

Deployment	Command	Use Case
Local REST API	`mlflow models serve -m models:/model/Production -p 5001`	Development/testing
Docker	`mlflow models build-docker -m models:/model/1 -n my-model`	Container orchestration
SageMaker	`mlflow deployments create -t sagemaker`	AWS production
Azure ML	`mlflow deployments create -t azureml`	Azure production
Spark UDF	`mlflow.pyfunc.spark_udf(spark, model_uri)`	Batch inference on Spark
Kubernetes	Seldon/KServe with MLflow format	K8s-native serving

Q2: How Does Kubeflow Enable ML Pipelines on Kubernetes?

Answer:

Kubeflow is a Kubernetes-native ML platform that provides pipeline orchestration, distributed training, model serving, and notebook environments. Its pipeline system (Kubeflow Pipelines / KFP) defines ML workflows as DAGs of containerized steps, running on any Kubernetes cluster (on-prem, GKE, EKS, AKS).
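
The "DAG of steps" idea can be sketched without Kubernetes at all. The example below is a toy, standard-library illustration: the step functions and the `run_pipeline` helper are hypothetical, and each function stands in for what KFP would run as a separate container. It only shows the scheduling principle, namely that a step runs once all of its upstream outputs exist.

```python
from graphlib import TopologicalSorter


def preprocess(_inputs):
    # Stand-in for a data preparation container
    return [1.0, 2.0, 3.0]


def train(inputs):
    data = inputs["preprocess"]
    return {"mean": sum(data) / len(data)}


def deploy(inputs):
    return f"deployed model with mean={inputs['train']['mean']}"


STEPS = {"preprocess": preprocess, "train": train, "deploy": deploy}
# step name -> set of upstream steps it depends on
DEPENDS_ON = {"preprocess": set(), "train": {"preprocess"}, "deploy": {"train"}}


def run_pipeline(steps, depends_on):
    """Execute steps in dependency order, passing upstream outputs downstream."""
    outputs = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        upstream = {dep: outputs[dep] for dep in depends_on[name]}
        outputs[name] = steps[name](upstream)
    return order, outputs


order, outputs = run_pipeline(STEPS, DEPENDS_ON)
print(order)              # ['preprocess', 'train', 'deploy']
print(outputs["deploy"])  # deployed model with mean=2.0
```

A real orchestrator adds what this sketch leaves out: container isolation per step, artifact storage between steps, retries, and caching of unchanged steps.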

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Kubeflow["Kubeflow Platform"]
        KFP["Kubeflow Pipelines<br/>(DAG orchestration)"]
        NOTEBOOKS["Jupyter Notebooks<br/>(multi-user)"]
        KATIB["Katib<br/>(hyperparameter tuning)"]
        TRAINING_OP["Training Operators<br/>(TF, PyTorch, MPI)"]
        KSERVE["KServe<br/>(model serving)"]
    end

    subgraph K8S["Kubernetes"]
        PODS["Pods<br/>(pipeline steps)"]
        PV["Persistent Volumes<br/>(data)"]
        GPU["GPU Nodes<br/>(training)"]
        ISTIO["Istio<br/>(networking, auth)"]
    end

    subgraph Storage["External Storage"]
        MINIO["MinIO / S3<br/>(artifacts)"]
        MYSQL["MySQL<br/>(metadata)"]
        REG["Container Registry<br/>(images)"]
    end

    KFP --> PODS
    TRAINING_OP --> GPU
    KSERVE --> PODS
    KFP --> MINIO
    KFP --> MYSQL

    style Kubeflow fill:#6cc3d5,stroke:#333,color:#fff
    style K8S fill:#56cc9d,stroke:#333,color:#fff
    style Storage fill:#fff

Kubeflow Components

Component	Purpose	Key Feature
Kubeflow Pipelines (KFP)	ML workflow orchestration as DAGs	Caching, lineage, UI, versioning
Katib	Hyperparameter optimization	Bayesian, grid, random, NAS
Training Operators	Distributed training on K8s	TFJob, PyTorchJob, MPIJob, XGBoostJob
KServe	Serverless model serving on K8s	Autoscale-to-zero, canary, A/B
Notebooks	Multi-user Jupyter environments	GPU support, custom images
Central Dashboard	Unified access to all components	Multi-tenancy support

KFP v2 Pipeline Example

from kfp import dsl, compiler
from kfp.client import Client
from kfp.dsl import Input, Output, Dataset, Model, Metrics

@dsl.component(base_image="python:3.10", packages_to_install=["pandas", "scikit-learn"])
def preprocess_data(
    raw_data: Input[Dataset],
    processed_data: Output[Dataset],
):
    import pandas as pd

    df = pd.read_csv(raw_data.path)
    # ... preprocessing logic ...
    df_processed = df.dropna()  # placeholder for the real preprocessing
    df_processed.to_csv(processed_data.path, index=False)


@dsl.component(
    base_image="python:3.10",
    packages_to_install=["pandas", "scikit-learn", "joblib"],
)
def train_model(
    training_data: Input[Dataset],
    model_output: Output[Model],
    metrics_output: Output[Metrics],
    n_estimators: int = 100,
    test_split: float = 0.2,
):
    import joblib
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Load data; assumes a label column named "target" (adjust to your schema)
    df = pd.read_csv(training_data.path)
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_split)

    # Train model
    clf = GradientBoostingClassifier(n_estimators=n_estimators)
    clf.fit(X_train, y_train)

    # Log metrics
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    metrics_output.log_metric("accuracy", accuracy)

    # Save model
    joblib.dump(clf, model_output.path)


@dsl.component(base_image="python:3.10")
def deploy_model(model: Input[Model], endpoint_name: str):
    # Deploy to KServe or other serving infrastructure
    pass


@dsl.pipeline(name="churn-training-pipeline")
def training_pipeline(data_path: str, n_estimators: int = 200):
    # A plain string cannot feed an Input[Dataset]; import the URI as an artifact first
    import_task = dsl.importer(
        artifact_uri=data_path, artifact_class=Dataset, reimport=False
    )
    preprocess_task = preprocess_data(raw_data=import_task.output)
    train_task = train_model(
        training_data=preprocess_task.outputs["processed_data"],
        n_estimators=n_estimators,
    )
    deploy_task = deploy_model(
        model=train_task.outputs["model_output"],
        endpoint_name="churn-model",
    )

# Compile pipeline
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")

# Submit to KFP cluster (data_path should resolve to a CSV object)
client = Client(host="https://kubeflow.example.com/pipeline")
client.create_run_from_pipeline_package("pipeline.yaml", arguments={"data_path": "s3://bucket/data/"})

Kubeflow vs Managed Platforms

Aspect	Kubeflow	SageMaker Pipelines	Vertex AI Pipelines
Infrastructure	Self-managed K8s	Fully managed	Fully managed
Lock-in	None (portable)	AWS	GCP
Setup complexity	High (K8s expertise needed)	Low	Low
Customization	Full (custom operators)	Limited to step types	Moderate (KFP-based)
Cost	K8s cluster + ops	Pay per job	Pay per job
Multi-cloud	Yes (any K8s)	No	No
Best for	Teams with K8s expertise, multi-cloud	AWS-native teams	GCP-native teams

Q3: How Does DVC Handle Data and Model Versioning?

Answer:

DVC (Data Version Control) extends Git to handle large files, datasets, and ML models. It tracks data/model versions using lightweight .dvc metafiles in Git while storing actual data in remote storage (S3, GCS, Azure Blob, NFS). Combined with DVC Pipelines, it enables reproducible ML experiments tracked alongside code.
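
The pointer-file mechanism is easy to sketch with the standard library. The example below is a simplified illustration, not DVC's implementation: the `track` and `restore` helpers, the `.pointer.json` suffix, and the cache layout are assumptions made for the sketch. What it does share with DVC is the idea that the data lives in a content-addressed cache keyed by its MD5 hash, while only a small metafile would be committed to Git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy the file into a content-addressed cache and write a small pointer file."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> bytes:
    """Resolve a pointer file back to the cached content."""
    digest = json.loads(pointer.read_text())["md5"]
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    data = root / "train.csv"
    data.write_text("id,label\n1,0\n2,1\n")

    pointer = track(data, cache)
    same_content = restore(pointer, cache) == data.read_bytes()
    pointer_size = pointer.stat().st_size

print(same_content)   # True
print(pointer_size)   # a few dozen bytes, regardless of the dataset size
```

Because the pointer is tiny and deterministic, checking out an older Git commit gives you the older hash, and the matching data can then be fetched from the cache or remote.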

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Git["Git Repository"]
        CODE["Source Code"]
        DVC_FILES[".dvc files<br/>(pointers to data)"]
        DVC_YAML["dvc.yaml<br/>(pipeline definition)"]
        DVC_LOCK["dvc.lock<br/>(exact versions)"]
    end

    subgraph Remote["DVC Remote Storage"]
        S3_R["S3 / GCS / Azure Blob"]
        NFS_R["NFS / SSH / HDFS"]
        LOCAL_R["Local cache"]
    end

    subgraph Workflow["DVC Workflow"]
        ADD["dvc add<br/>(track data)"]
        PUSH["dvc push<br/>(upload to remote)"]
        PULL["dvc pull<br/>(download data)"]
        REPRO["dvc repro<br/>(reproduce pipeline)"]
        METRICS["dvc metrics<br/>(compare experiments)"]
    end

    DVC_FILES --> Remote
    CODE --> Git
    ADD --> DVC_FILES
    PUSH --> Remote
    PULL --> Remote
    DVC_YAML --> REPRO
    REPRO --> DVC_LOCK

    style Git fill:#6cc3d5,stroke:#333,color:#fff
    style Remote fill:#56cc9d,stroke:#333,color:#fff
    style Workflow fill:#fff

DVC Core Features

Feature	Description	Command
Data tracking	Version large files without storing in Git	`dvc add data/training.csv`
Remote storage	Push/pull data to cloud or shared storage	`dvc push` / `dvc pull`
Pipelines	Define reproducible ML workflows (DAG)	`dvc repro`
Experiments	Branch/compare experiments efficiently	`dvc exp run` / `dvc exp diff`
Metrics	Track & compare metrics across experiments	`dvc metrics show` / `dvc metrics diff`
Plots	Visualize metrics (ROC, loss curves)	`dvc plots show`
Data registry	Share datasets across projects	`dvc import` / `dvc get`

DVC Pipeline Definition

# dvc.yaml - ML pipeline stages
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    params:
      - prepare.split_ratio
      - prepare.seed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.csv
    params:
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
    plots:
      - plots/loss_curve.csv:
          x: epoch
          y: loss

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          template: confusion
          x: predicted
          y: actual
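
Why does `dvc repro` skip stages whose inputs have not changed? The sketch below imitates the idea under simplifying assumptions: file contents are plain strings, the lock is a dict, and `fingerprint` and `repro` are hypothetical helpers rather than DVC internals. A stage reruns only when the fingerprint of its dependencies and parameters differs from the one recorded in the lock, and a rerun changes the stage's outputs, which in turn invalidates downstream stages.

```python
import hashlib
import json


def fingerprint(deps: dict, params: dict) -> str:
    """Hash a stage's dependency contents and parameters into one value."""
    payload = json.dumps({"deps": deps, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def repro(stages: dict, files: dict, lock: dict) -> list:
    """Run stages in order, skipping those whose fingerprint matches the lock."""
    executed = []
    for name, stage in stages.items():
        state = fingerprint({p: files[p] for p in stage["deps"]}, stage["params"])
        if lock.get(name) == state:
            continue  # inputs unchanged: reuse cached outputs
        for out in stage["outs"]:
            files[out] = f"built-from-{state[:8]}"  # stand-in for real outputs
        lock[name] = state
        executed.append(name)
    return executed


stages = {
    "prepare": {"deps": ["raw"], "params": {"seed": 42}, "outs": ["train.csv", "test.csv"]},
    "train": {"deps": ["train.csv"], "params": {"n_estimators": 200}, "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "test.csv"], "params": {}, "outs": ["metrics.json"]},
}
files = {"raw": "raw-data-v1"}
lock = {}

first = repro(stages, files, lock)
second = repro(stages, files, lock)
stages["train"]["params"]["n_estimators"] = 300
third = repro(stages, files, lock)

print(first)   # ['prepare', 'train', 'evaluate']
print(second)  # []
print(third)   # ['train', 'evaluate']
```

Changing only a training parameter leaves `prepare` untouched, which is the behavior that makes iterating on expensive pipelines practical.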

DVC Experiment Workflow

# Initialize DVC in a Git repo
git init && dvc init

# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store

# Track a large dataset
dvc add data/training_data.parquet
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Add training data v1"
dvc push

# Define parameters (params.yaml)
cat > params.yaml << EOF
prepare:
  split_ratio: 0.2
  seed: 42
train:
  n_estimators: 200
  max_depth: 10
  learning_rate: 0.05
EOF

# Run pipeline (only re-runs changed stages)
dvc repro

# Run experiment with modified params
dvc exp run --set-param train.n_estimators=300 --set-param train.max_depth=12

# Compare experiments
dvc exp diff
dvc metrics diff

# Apply best experiment to workspace
dvc exp apply exp-abc123
git add . && git commit -m "Best model: 300 estimators"
dvc push

DVC vs Git LFS vs Lakehouse

Aspect	DVC	Git LFS	Delta Lake / Lakehouse
Versioning	Content-addressable (hash)	Pointer files in Git	Table versioning (time travel)
Storage	Any remote (S3, GCS, NFS)	Git server (GitHub LFS)	Cloud storage (S3, ADLS)
Pipeline support	Yes (dvc.yaml)	No	No (needs orchestrator)
Experiment tracking	Built-in (dvc exp)	No	No
File types	Any (data, models, artifacts)	Any	Tabular data (Parquet)
Deduplication	Yes (file-level, content-addressable cache)	File-level (objects keyed by SHA-256)	Partial (file-level)
Best for	ML data/model versioning + pipelines	Large files in Git	Data lake versioning

Q4: How Does Weights & Biases (W&B) Support ML Experiment Management?

Answer:

Weights & Biases (W&B) is a developer-focused ML platform providing experiment tracking, dataset versioning, hyperparameter sweeps, model evaluation, and collaboration. It’s known for its rich visualizations, real-time dashboards, and seamless framework integration. W&B can run as SaaS or self-hosted (on-prem / private cloud).

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph WandB["Weights & Biases"]
        EXPERIMENTS["Experiments<br/>(runs, groups, projects)"]
        SWEEPS["Sweeps<br/>(hyperparameter optimization)"]
        ARTIFACTS["Artifacts<br/>(data & model versioning)"]
        TABLES["Tables<br/>(dataset visualization)"]
        REPORTS["Reports<br/>(collaborative docs)"]
        LAUNCH["Launch<br/>(job scheduling)"]
    end

    subgraph Integrations["Framework Integrations"]
        PYTORCH["PyTorch / Lightning"]
        TF_INT["TensorFlow / Keras"]
        HF["HuggingFace Transformers"]
        SKLEARN_INT["scikit-learn"]
        LANGCHAIN["LangChain / LLMs"]
    end

    subgraph Deploy_Options["Deployment"]
        SAAS["W&B Cloud (SaaS)"]
        SELF["Self-Hosted (Docker)"]
        DEDICATED["Dedicated Cloud"]
    end

    Integrations --> WandB
    WandB --> SAAS
    WandB --> SELF

    style WandB fill:#6cc3d5,stroke:#333,color:#fff
    style Integrations fill:#56cc9d,stroke:#333,color:#fff
    style Deploy_Options fill:#fff

W&B Core Products

Product	Purpose	Key Feature
Experiments	Track & compare ML runs	Real-time dashboards, custom charts
Sweeps	Automated hyperparameter search	Bayesian, grid, random; early stopping
Artifacts	Version datasets, models, results	Lineage graph, deduplication
Tables	Interactive data exploration	Filter, group, visualize predictions
Reports	Collaborative experiment documentation	Embed charts, share findings
Launch	Job scheduling on any compute	Queue jobs to K8s, Slurm, cloud
Weave	LLM observability and evaluation	Trace chains, evaluate outputs
Models	Model registry with lineage	Link artifacts to model versions

W&B Experiment Tracking Example

import wandb

# model, loaders, epochs, and the test metrics below are assumed to be defined by your training code

# Initialize W&B run
wandb.init(
    project="churn-prediction",
    name="gbm-v3-velocity-features",
    config={
        "model": "GradientBoosting",
        "n_estimators": 200,
        "max_depth": 8,
        "learning_rate": 0.05,
        "feature_set": "v3-velocity",
    },
    tags=["production-candidate", "velocity-features"],
)

# Train with live metric logging
for epoch in range(epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
    })

# Log evaluation results
wandb.log({
    "test/accuracy": accuracy,
    "test/f1": f1,
    "test/auc_roc": auc,
    "confusion_matrix": wandb.plot.confusion_matrix(
        y_true=y_test, preds=y_pred, class_names=["retain", "churn"]
    ),
})

# Log model as artifact
artifact = wandb.Artifact("churn-model", type="model")
artifact.add_file("model.pkl")
wandb.log_artifact(artifact)

wandb.finish()

W&B Sweeps (Hyperparameter Optimization)

import wandb
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Define sweep configuration
sweep_config = {
    "method": "bayes",  # bayesian optimization
    "metric": {"name": "val/f1", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"min": 50, "max": 500},
        "max_depth": {"values": [4, 6, 8, 10, 12]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 0.001, "max": 0.3},
        "subsample": {"min": 0.6, "max": 1.0},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 10},
}

# Create sweep
sweep_id = wandb.sweep(sweep_config, project="churn-prediction")

# Define training function
def train():
    wandb.init()
    config = wandb.config
    model = GradientBoostingClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        learning_rate=config.learning_rate,
        subsample=config.subsample,  # swept above, so pass it to the model
    )
    model.fit(X_train, y_train)
    wandb.log({"val/f1": f1_score(y_val, model.predict(X_val))})

# Run one agent for up to 50 trials; start more agents with the same sweep_id to distribute the search
wandb.agent(sweep_id, function=train, count=50)
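
To see what the search space in `sweep_config` means, the sketch below draws configurations from it using plain random search. This is deliberately not the Bayesian method the sweep uses, and `sample_log_uniform` and `sample_config` are hypothetical helpers, not W&B functions. The part worth noticing is the log-uniform learning rate: sampling uniformly in log space spends far more trials on small values than a plain uniform draw over 0.001 to 0.3 would.

```python
import math
import random


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the same ranges as the sweep definition."""
    return {
        "n_estimators": rng.randint(50, 500),
        "max_depth": rng.choice([4, 6, 8, 10, 12]),
        "learning_rate": sample_log_uniform(rng, 0.001, 0.3),
        "subsample": rng.uniform(0.6, 1.0),
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
configs = [sample_config(rng) for _ in range(1000)]

share_small = sum(c["learning_rate"] < 0.01 for c in configs) / len(configs)
print(f"share of learning rates below 0.01: {share_small:.2f}")
```

With a log-uniform prior, roughly 40% of the sampled learning rates fall below 0.01, because that interval covers about 40% of the range in log space; a plain uniform draw would put only about 3% there.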

W&B vs MLflow Comparison

Feature	W&B	MLflow
Hosting	SaaS (default) + self-hosted	Self-hosted (default) + managed
UI/Visualization	Rich, interactive dashboards	Basic comparison UI
Hyperparameter sweeps	Built-in (Bayesian, early stop)	Not built-in (use Optuna etc.)
Collaboration	Reports, team dashboards	Basic sharing
Dataset versioning	Artifacts with lineage	Basic artifact logging
Cost	Free tier → paid per user	Free (open-source)
LLM support	Weave (tracing, eval)	MLflow Evaluate
Model serving	No (registry only)	Yes (`mlflow models serve`)
Best for	Teams wanting rich UI + managed service	Teams wanting open-source + flexibility

Q5: How Does Feast Provide a Cloud-Agnostic Feature Store?

Answer:

Feast (Feature Store) is an open-source feature store that manages ML features from ingestion to serving. It provides a consistent interface for feature retrieval across training (offline: batch) and inference (online: low-latency), with support for multiple backends (Redis, DynamoDB, BigQuery, PostgreSQL, Snowflake). Feast prevents training-serving skew and enables feature reuse across teams.
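
The mechanism that prevents leakage during training is the point-in-time join: for each training example, use the latest feature value known at that example's timestamp, never a later one. The sketch below shows the lookup with the standard library only. The `FEATURE_LOG` data and the `point_in_time_lookup` helper are hypothetical and much simpler than what an offline store does at scale, but the rule is the same, including the TTL cutoff for stale values.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Feature history per customer, sorted by event timestamp: (timestamp, avg_spend_30d)
FEATURE_LOG = {
    "c001": [
        (datetime(2026, 1, 1), 40.0),
        (datetime(2026, 1, 10), 55.0),
        (datetime(2026, 1, 20), 70.0),
    ],
}


def point_in_time_lookup(customer_id: str, as_of: datetime, ttl: timedelta):
    """Return the latest feature value at or before `as_of`, or None if missing or stale."""
    rows = FEATURE_LOG.get(customer_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of) - 1  # last row with timestamp <= as_of
    if idx < 0:
        return None  # no feature value existed yet
    ts, value = rows[idx]
    if as_of - ts > ttl:
        return None  # value is older than the TTL allows
    return value


print(point_in_time_lookup("c001", datetime(2026, 1, 15), timedelta(days=90)))  # 55.0
print(point_in_time_lookup("c001", datetime(2026, 1, 15), timedelta(days=2)))   # None
print(point_in_time_lookup("c001", datetime(2025, 12, 1), timedelta(days=90)))  # None
```

The value recorded on 20 January is never returned for a label dated 15 January, which is exactly the leakage a naive "join on latest" would introduce.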

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph FeastCore["Feast"]
        REGISTRY_F["Feature Registry<br/>(definitions in code)"]
        OFFLINE_F["Offline Store<br/>(historical features)"]
        ONLINE_F["Online Store<br/>(low-latency serving)"]
        MATERIALIZE["Materialization<br/>(offline → online)"]
    end

    subgraph OfflineBackends["Offline Backends"]
        BQ["BigQuery"]
        SNOWFLAKE["Snowflake"]
        REDSHIFT["Redshift"]
        SPARK_OFF["Spark / Parquet"]
        PG_OFF["PostgreSQL"]
    end

    subgraph OnlineBackends["Online Backends"]
        REDIS["Redis"]
        DYNAMO["DynamoDB"]
        PG_ON["PostgreSQL"]
        SQLITE["SQLite"]
        DATASTORE["Datastore"]
    end

    subgraph Consumers_F["Consumers"]
        TRAIN_F["Training<br/>(get_historical_features)"]
        SERVE_F["Inference<br/>(get_online_features)"]
    end

    REGISTRY_F --> OFFLINE_F
    REGISTRY_F --> ONLINE_F
    MATERIALIZE --> ONLINE_F
    OFFLINE_F --> OfflineBackends
    ONLINE_F --> OnlineBackends
    OFFLINE_F --> TRAIN_F
    ONLINE_F --> SERVE_F

    style FeastCore fill:#6cc3d5,stroke:#333,color:#fff
    style OfflineBackends fill:#fff
    style OnlineBackends fill:#fff
    style Consumers_F fill:#fff

Feast Architecture

Component	Role	Example Backends
Feature Repository	Git repo with feature definitions (Python)	Any Git provider
Registry	Metadata about features, entities, data sources	File (S3/GCS), SQL, Snowflake
Offline Store	Historical feature retrieval for training	BigQuery, Snowflake, Redshift, Spark, file
Online Store	Low-latency feature retrieval for serving	Redis, DynamoDB, PostgreSQL, SQLite
Materialization	Sync latest feature values to online store	`feast materialize` (scheduled)
Feature Server	REST/gRPC API for online feature serving	`feast serve` (Go or Python)

Feast Feature Definitions

# feature_repo/features.py
from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Float32, Int64, String
from datetime import timedelta

# Entity (primary key)
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Unique customer identifier",
)

# Offline data source (batch)
customer_spending_source = FileSource(
    path="s3://bucket/features/customer_spending.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# Feature view (defines features + source + TTL)
customer_spending_fv = FeatureView(
    name="customer_spending_features",
    entities=[customer],
    ttl=timedelta(days=90),
    schema=[
        Field(name="avg_spend_30d", dtype=Float32),
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    source=customer_spending_source,
    online=True,  # Materialize to online store
    tags={"team": "data-science", "version": "v3"},
)

# Push source for real-time features
realtime_source = PushSource(
    name="realtime_spending_push",
    batch_source=customer_spending_source,
)

realtime_spending_fv = FeatureView(
    name="realtime_spending",
    entities=[customer],
    ttl=timedelta(hours=1),
    schema=[
        Field(name="current_session_spend", dtype=Float32),
        Field(name="items_in_cart", dtype=Int64),
    ],
    source=realtime_source,
    online=True,
)

Feast Usage (Training & Serving)

from datetime import datetime

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path="feature_repo/")

# Training: Get historical features (point-in-time join)
entity_df = pd.DataFrame({
    "customer_id": ["c001", "c002", "c003"],
    "event_timestamp": pd.to_datetime(["2026-01-15", "2026-01-16", "2026-01-17"]),
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_spending_features:avg_spend_30d",
        "customer_spending_features:transaction_count_7d",
        "customer_spending_features:days_since_last_purchase",
    ],
).to_df()

# Serving: Get online features (latest values, low latency)
online_features = store.get_online_features(
    features=[
        "customer_spending_features:avg_spend_30d",
        "customer_spending_features:transaction_count_7d",
        "realtime_spending:current_session_spend",
    ],
    entity_rows=[{"customer_id": "c001"}, {"customer_id": "c002"}],
).to_dict()

# Materialize offline → online (run on schedule)
# feast materialize 2026-01-01T00:00:00 2026-05-21T00:00:00
store.materialize(
    start_date=datetime(2026, 1, 1),
    end_date=datetime(2026, 5, 21),
)
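
Conceptually, materialization copies the most recent feature row per entity within a time window from the offline store into the online store. The following sketch shows that reduction with plain Python dictionaries. `OFFLINE_ROWS` and this `materialize` function are illustrative stand-ins and say nothing about how Feast implements the job internally.

```python
from datetime import datetime

OFFLINE_ROWS = [
    {"customer_id": "c001", "event_timestamp": datetime(2026, 1, 5), "avg_spend_30d": 40.0},
    {"customer_id": "c001", "event_timestamp": datetime(2026, 3, 1), "avg_spend_30d": 62.5},
    {"customer_id": "c002", "event_timestamp": datetime(2026, 2, 10), "avg_spend_30d": 18.0},
    # Falls outside the materialization window used below
    {"customer_id": "c002", "event_timestamp": datetime(2026, 6, 1), "avg_spend_30d": 99.0},
]


def materialize(rows, start: datetime, end: datetime) -> dict:
    """Keep only the latest row per entity whose timestamp lies within [start, end]."""
    online = {}
    for row in rows:
        ts = row["event_timestamp"]
        if not (start <= ts <= end):
            continue
        current = online.get(row["customer_id"])
        if current is None or ts > current["event_timestamp"]:
            online[row["customer_id"]] = row
    return online


online_store = materialize(OFFLINE_ROWS, datetime(2026, 1, 1), datetime(2026, 5, 21))
print(online_store["c001"]["avg_spend_30d"])  # 62.5
print(online_store["c002"]["avg_spend_30d"])  # 18.0
```

This is why the online store answers with a single key lookup: all the history has already been collapsed to one row per entity.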

Feast vs Managed Feature Stores

Aspect	Feast	SageMaker Feature Store	Vertex AI Feature Store
Open-source	Yes	No	No
Cloud lock-in	None	AWS	GCP
Online backends	Redis, DynamoDB, PG, etc.	Managed online store (backend not user-configurable)	Bigtable (managed)
Offline backends	BigQuery, Snowflake, Spark, etc.	S3 + Athena	BigQuery
Setup	Self-managed	Fully managed	Fully managed
Point-in-time joins	Yes	Yes (via Athena)	Yes
Real-time ingestion	Push source API	PutRecord API	Streaming import
Best for	Multi-cloud, custom infra	AWS-native teams	GCP-native teams

Q6: How Do Seldon Core and KServe Serve Models on Kubernetes?

Answer:

Seldon Core and KServe (formerly KFServing) are Kubernetes-native model serving frameworks. They provide inference graphs, canary deployments, autoscaling (including scale-to-zero), A/B testing, multi-model serving, and model explainability — running on any K8s cluster with support for all major ML frameworks.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Serving["K8s Model Serving"]
        SELDON["Seldon Core<br/>(inference graphs)"]
        KSERVE["KServe<br/>(serverless inference)"]
    end

    subgraph Features["Capabilities"]
        CANARY["Canary / A/B<br/>Deployments"]
        AUTOSCALE["Autoscaling<br/>(HPA + scale-to-zero)"]
        GRAPH["Inference Graphs<br/>(pre/post processing)"]
        MULTI["Multi-Model Serving<br/>(1000s of models)"]
        EXPLAIN["Explainability<br/>(SHAP, Anchors)"]
        MONITOR_S["Monitoring<br/>(Prometheus + Grafana)"]
    end

    subgraph Frameworks["Supported Frameworks"]
        SKLEARN_S["scikit-learn"]
        TF_S["TensorFlow"]
        PYTORCH_S["PyTorch (TorchServe)"]
        XGBOOST_S["XGBoost / LightGBM"]
        TRITON["NVIDIA Triton"]
        CUSTOM_S["Custom (any language)"]
        MLFLOW_S["MLflow format"]
    end

    Serving --> Features
    Frameworks --> Serving

    style Serving fill:#6cc3d5,stroke:#333,color:#fff
    style Features fill:#56cc9d,stroke:#333,color:#fff
    style Frameworks fill:#fff

Seldon Core vs KServe

Feature	Seldon Core	KServe
Architecture	Custom CRD (SeldonDeployment)	Knative-based (InferenceService)
Scale-to-zero	With KEDA addon	Native (Knative serverless)
Inference graph	Rich (router, combiner, transformer)	Basic (transformer + predictor)
Multi-model	Yes (Triton integration)	Yes (ModelMesh)
Protocol	REST + gRPC (v2 protocol)	REST + gRPC (v2 protocol)
Canary	Traffic splitting in CRD	Canary via revision routing
Explainability	Built-in (Alibi Explain)	Explainer component
Monitoring	Prometheus metrics + drift (Alibi Detect)	Prometheus metrics
Best for	Complex inference pipelines	Serverless, simple deployments

Seldon Core Deployment

# seldon-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier
  namespace: ml-serving
spec:
  predictors:
    - name: default
      replicas: 2
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: s3://models/churn/v3
        envSecretRefName: s3-credentials
        children: []
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                resources:
                  requests: { cpu: "500m", memory: "1Gi" }
                  limits: { cpu: "2", memory: "4Gi" }
      traffic: 90
      labels:
        version: v3

    - name: canary
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: s3://models/churn/v4-candidate
      traffic: 10
      labels:
        version: v4-candidate
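
The `traffic: 90` and `traffic: 10` fields describe a weighted split between predictors. The sketch below simulates that routing decision in plain Python. In a real cluster the split is enforced by the ingress or service mesh rather than application code, so `pick_predictor` is a hypothetical helper meant only to show what a 90/10 split looks like over many requests.

```python
import random
from collections import Counter


def pick_predictor(weights: dict, rng: random.Random) -> str:
    """Choose a predictor name with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)  # fixed seed so the simulation is repeatable
weights = {"default": 90, "canary": 10}

total_requests = 10_000
counts = Counter(pick_predictor(weights, rng) for _ in range(total_requests))
canary_share = counts["canary"] / total_requests

print(counts)
print(f"canary share: {canary_share:.3f}")
```

The observed canary share lands close to 10%, not exactly on it. That matters when reading canary metrics: with little traffic, the canary sees few requests, so error-rate comparisons need enough volume before a promotion decision.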

KServe InferenceService

# kserve-inference.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-classifier
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/churn/v3
      resources:
        requests: { cpu: "500m", memory: "1Gi" }
        limits: { cpu: "2", memory: "4Gi" }
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 10  # Target value per pod for the scale metric (concurrency by default)
  transformer:
    containers:
      - name: feature-transformer
        image: myregistry/feature-transformer:v1
        resources:
          requests: { cpu: "200m", memory: "512Mi" }
  explainer:
    containers:
      - name: shap-explainer
        image: myregistry/shap-explainer:v1
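
As a rough mental model of how `scaleTarget`, `minReplicas`, and `maxReplicas` interact, the replica count can be approximated as the in-flight load divided by the per-pod target, clamped to the configured bounds. The function below is a simplification made for illustration: the real autoscaler averages metrics over time windows and reacts to bursts differently, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(in_flight_requests: int, scale_target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count: load divided by per-pod target, clamped to bounds."""
    wanted = math.ceil(in_flight_requests / scale_target)
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 35, 95, 250):
    replicas = desired_replicas(load, scale_target=10, min_replicas=1, max_replicas=10)
    print(f"{load:>3} in-flight requests -> {replicas} replicas")
```

With the values from the manifest, 35 concurrent requests call for 4 pods, while 250 are capped at the maximum of 10. Setting `minReplicas` to 0 instead of 1 is what enables scale-to-zero in this model.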

BentoML Alternative

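BentoML is a simpler model serving framework focused on developer experience: package a model and its service code as a “Bento” (a versioned, deployable archive), containerize it, and deploy it anywhere:

import bentoml

# model is a trained scikit-learn estimator; preprocess() is assumed to be your own feature function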

# Save model to BentoML model store
bentoml.sklearn.save_model("churn_classifier", model)

# Define service
@bentoml.service(resources={"cpu": "2", "memory": "4Gi"})
class ChurnClassifier:
    model_ref = bentoml.models.get("churn_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api
    def predict(self, input_data: dict) -> dict:
        features = preprocess(input_data)
        prediction = self.model.predict([features])[0]
        probability = self.model.predict_proba([features])[0]
        return {"prediction": int(prediction), "probability": float(probability[1])}

# Build & containerize: bentoml build && bentoml containerize churn_classifier:latest
# Deploy: docker run -p 3000:3000 churn_classifier:latest

Q7: How Does Great Expectations Validate ML Data Quality?

Answer:

Great Expectations (GX) is an open-source data validation framework that defines, tests, and documents data quality expectations. In MLOps, it validates training data, feature pipelines, and inference inputs — catching data issues before they degrade model performance. Expectations are defined as code and integrated into CI/CD and pipeline steps.
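
The core idea, expectations as unit tests for data, fits in a few lines of plain Python. The sketch below does not use the GX API: `expect_not_null`, `expect_between`, and `validate` are hypothetical helpers, and the sample rows are made up. It shows the shape of the workflow, which is to declare rules once, run them against a batch, and gate the pipeline on the result.

```python
def expect_not_null(column):
    """Build a check that fails for rows where `column` is missing or None."""
    def check(rows):
        bad = [i for i, row in enumerate(rows) if row.get(column) is None]
        return {"rule": f"{column} is not null", "success": not bad, "bad_rows": bad}
    return check


def expect_between(column, low, high):
    """Build a check that fails for non-null values outside [low, high]."""
    def check(rows):
        bad = [
            i for i, row in enumerate(rows)
            if row.get(column) is not None and not (low <= row[column] <= high)
        ]
        return {"rule": f"{column} in [{low}, {high}]", "success": not bad, "bad_rows": bad}
    return check


def validate(rows, suite):
    """Run every check in the suite and report overall success plus per-rule results."""
    results = [check(rows) for check in suite]
    return all(r["success"] for r in results), results


suite = [expect_not_null("customer_id"), expect_between("age", 18, 120)]
rows = [
    {"customer_id": "c001", "age": 34},
    {"customer_id": None, "age": 51},
    {"customer_id": "c003", "age": 7},
]

passed, results = validate(rows, suite)
for result in results:
    print(result)
if not passed:
    print("validation failed: block the training step")
```

Here both rules fail (row 1 has no `customer_id`, row 2 has an out-of-range `age`), so the pipeline step would stop before training on bad data.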

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph GX["Great Expectations"]
        SUITE["Expectation Suite<br/>(set of validation rules)"]
        CHECKPOINT["Checkpoint<br/>(run validations)"]
        DATASOURCE["Data Source<br/>(Pandas, Spark, SQL)"]
        DOCS["Data Docs<br/>(HTML reports)"]
        PROFILER["Profiler<br/>(auto-generate expectations)"]
    end

    subgraph Pipeline_GX["ML Pipeline Integration"]
        TRAIN_DATA["Training Data<br/>(validate before training)"]
        FEATURE_DATA["Feature Pipeline<br/>(validate transforms)"]
        INFERENCE_DATA["Inference Input<br/>(validate at serving)"]
    end

    subgraph Actions["On Failure"]
        BLOCK["Block Pipeline<br/>(fail step)"]
        ALERT["Alert Team<br/>(Slack, email)"]
        LOG_GX["Log to Monitoring"]
    end

    DATASOURCE --> SUITE
    SUITE --> CHECKPOINT
    CHECKPOINT --> DOCS
    CHECKPOINT -->|"Fail"| Actions

    TRAIN_DATA --> CHECKPOINT
    FEATURE_DATA --> CHECKPOINT
    INFERENCE_DATA --> CHECKPOINT

    style GX fill:#6cc3d5,stroke:#333,color:#fff
    style Actions fill:#ff6b6b,stroke:#333,color:#fff
    style Pipeline_GX fill:#fff

Great Expectations Core Concepts

Concept	Description	Example
Expectation	Single data assertion (like a unit test for data)	`expect_column_values_to_not_be_null("age")`
Expectation Suite	Collection of expectations for a dataset	“training_data_suite” with 50 rules
Validator	Applies expectations to a batch of data	Runs suite against DataFrame
Checkpoint	Orchestrates validation + actions on results	Run suite, generate docs, alert on failure
Data Source	Connection to data (Pandas, Spark, SQL, file)	PostgreSQL, S3 Parquet, BigQuery
Data Docs	Auto-generated HTML documentation of results	Hosted on S3/GCS for team access
Profiler	Auto-generates expectations from sample data (a pre-1.0 feature; not part of the GX Core 1.x API used below)	Bootstraps initial suite
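
To make the "unit test for data" idea concrete, here is a minimal sketch in plain Python that does not use the GX library. The function names (`expect_not_null`, `expect_between`, `run_suite`) and the result shape are assumptions made for illustration; they only mimic the Expectation, Expectation Suite, and validation-result concepts from the table.

```python
from typing import Any, Callable

Row = dict[str, Any]
Result = dict[str, Any]
Check = Callable[[list[Row]], Result]


def expect_not_null(column: str) -> Check:
    """Build a check that fails if any row has None in `column`."""
    def check(rows: list[Row]) -> Result:
        bad = sum(1 for row in rows if row.get(column) is None)
        return {"expectation": f"not_null({column})", "success": bad == 0, "unexpected_count": bad}
    return check


def expect_between(column: str, min_value: float, max_value: float) -> Check:
    """Build a check that fails if a non-null value is outside [min_value, max_value]."""
    def check(rows: list[Row]) -> Result:
        bad = sum(
            1 for row in rows
            if row.get(column) is not None and not (min_value <= row[column] <= max_value)
        )
        return {"expectation": f"between({column})", "success": bad == 0, "unexpected_count": bad}
    return check


def run_suite(rows: list[Row], suite: list[Check]) -> Result:
    """Apply every check in the suite; the suite passes only if all checks pass."""
    results = [check(rows) for check in suite]
    return {"success": all(r["success"] for r in results), "results": results}


# Hypothetical sample batch with one under-age row and one missing ID
rows = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 17},
    {"customer_id": None, "age": 51},
]
suite = [expect_not_null("customer_id"), expect_between("age", 18, 120)]
report = run_suite(rows, suite)
failed = [r["expectation"] for r in report["results"] if not r["success"]]
```

Each check returns a structured result instead of raising, which is what lets a checkpoint-style runner collect every failure before deciding whether to block the pipeline.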

Defining Expectations

import great_expectations as gx

# Connect to data context
context = gx.get_context()

# Add data source
datasource = context.data_sources.add_pandas_filesystem(
    name="training_data",
    base_directory="data/processed/",
)
data_asset = datasource.add_csv_asset(name="train_csv")

# GX 1.x attaches the file-matching regex to a batch definition, not to the asset
batch_definition = data_asset.add_batch_definition_yearly(
    name="train_by_year",
    regex=r"train_(?P<year>\d{4})\.csv",
)
batch = batch_definition.get_batch()  # a single batch, handy for interactive checks

# Create expectation suite
suite = context.suites.add(gx.ExpectationSuite(name="training_data_quality"))

# Define expectations
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="target"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="age", min_value=18, max_value=120
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="monthly_spend", min_value=0, max_value=50000
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeInSet(
    column="contract_type", value_set=["month-to-month", "one_year", "two_year"]
))
suite.add_expectation(gx.expectations.ExpectColumnMeanToBeBetween(
    column="tenure_months", min_value=10, max_value=40
))
suite.add_expectation(gx.expectations.ExpectTableRowCountToBeBetween(
    min_value=10000, max_value=1000000
))
suite.add_expectation(gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="customer_id", min_value=0.99, max_value=1.0
))

# Save suite
suite.save()

Running Validations in Pipelines

# A validation definition binds a batch definition (not a single batch) to a suite
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="validate_training",
        data=batch_definition,
        suite=suite,
    )
)

# Run checkpoint (in pipeline step)
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="training_data_checkpoint",
        validation_definitions=[validation_definition],
        actions=[
            gx.checkpoint.UpdateDataDocsAction(name="update_docs"),
        ],
    )
)
result = checkpoint.run()

# Check result in pipeline
if not result.success:
    failed_expectations = [
        expectation_result.expectation_config.type
        for validation_result in result.run_results.values()
        for expectation_result in validation_result.results
        if not expectation_result.success
    ]
    raise ValueError(f"Data validation failed: {failed_expectations}")

Common ML Data Expectations

Category	Expectations	Purpose
Completeness	No nulls in critical columns	Prevent training on missing data
Range validity	Values within expected bounds	Catch data pipeline errors
Schema	Column types, names, count match	Detect schema drift
Distribution	Mean, stddev, quantiles within range	Detect distribution shift
Uniqueness	ID columns are unique	Prevent duplicate records
Freshness	Max timestamp within expected window	Ensure data is recent
Referential	Foreign keys exist in reference table	Data integrity
Volume	Row count within expected range	Detect data loss/explosion
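
As a rough illustration of the freshness, distribution, uniqueness, and volume categories above, the following sketch expresses each as a small plain-Python predicate. It is not GX code; the helper names, thresholds, and sample values are hypothetical and chosen only to show what each category actually checks.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def is_fresh(timestamps, now, max_age):
    """Freshness: the newest record must be no older than `max_age`."""
    return now - max(timestamps) <= max_age


def mean_in_range(values, low, high):
    """Distribution: a coarse shift detector based on the column mean."""
    return low <= mean(values) <= high


def ids_are_unique(ids):
    """Uniqueness: no ID may appear more than once."""
    return len(set(ids)) == len(ids)


def row_count_ok(row_count, min_rows, max_rows):
    """Volume: catch both data loss and unexpected explosions."""
    return min_rows <= row_count <= max_rows


# Fixed reference time so the example is deterministic
now = datetime(2026, 5, 21, tzinfo=timezone.utc)
event_times = [now - timedelta(days=3), now - timedelta(hours=6)]

checks = {
    "freshness": is_fresh(event_times, now, max_age=timedelta(days=1)),
    "distribution": mean_in_range([60, 72, 84], low=10, high=40),
    "uniqueness": ids_are_unique([1, 2, 2]),
    "volume": row_count_ok(12_500, min_rows=10_000, max_rows=1_000_000),
}
failed = sorted(name for name, ok in checks.items() if not ok)
```

A mean-in-range check is a deliberately blunt instrument; it catches gross shifts cheaply, while quantile or statistical-distance checks are better suited to subtle drift.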

Q8: How Does Apache Airflow Orchestrate ML Workflows?

Answer:

Apache Airflow is one of the most widely used open-source workflow orchestrators. Although it is not ML-specific, it is commonly used to orchestrate ML pipelines: data ingestion, feature engineering, model training, evaluation, and deployment are scheduled as tasks in DAGs (Directed Acyclic Graphs). Airflow provides rich scheduling, retry logic, SLA monitoring, and provider packages that integrate with the major cloud services and many ML tools.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Airflow["Apache Airflow"]
        SCHEDULER["Scheduler<br/>(trigger DAGs on schedule)"]
        WEBSERVER["Web UI<br/>(monitor, trigger, debug)"]
        EXECUTOR["Executor<br/>(run tasks)"]
        META["Metadata DB<br/>(PostgreSQL)"]
    end

    subgraph Executors["Executor Types"]
        LOCAL["Local Executor<br/>(single machine)"]
        CELERY["Celery Executor<br/>(distributed workers)"]
        K8S_EX["Kubernetes Executor<br/>(pod per task)"]
    end

    subgraph ML_DAG["ML DAG"]
        INGEST["Ingest Data"]
        VALIDATE["Validate (GX)"]
        FEATURE["Feature Engineering"]
        TRAIN_AF["Train Model"]
        EVAL_AF["Evaluate"]
        DEPLOY_AF["Deploy"]
    end

    SCHEDULER --> EXECUTOR --> ML_DAG
    EXECUTOR --> Executors

    style Airflow fill:#6cc3d5,stroke:#333,color:#fff
    style ML_DAG fill:#56cc9d,stroke:#333,color:#fff
    style Executors fill:#fff

Airflow for ML — Key Concepts

Concept	Description	ML Use
DAG	Directed Acyclic Graph of tasks	ML pipeline (train → eval → deploy)
Operator	Template for a task type	BashOperator, PythonOperator, KubernetesPodOperator
Sensor	Wait for external condition	S3KeySensor (new data arrival)
XCom	Pass data between tasks	Model metrics, S3 paths
Connections	Store external service credentials	AWS, GCP, database connections
Variables	Store configuration values	Model thresholds, feature versions
Pools	Limit concurrent tasks	GPU pool (max 4 concurrent training)
TaskGroup	Organize related tasks visually	Group all feature engineering tasks
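
The DAG concept can be illustrated without Airflow itself. The sketch below uses the standard library's `graphlib` to derive a valid execution order from task dependencies and to reject a cyclic graph. It is only an analogy for what "directed acyclic graph" implies, not a description of Airflow's scheduler; the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks)
dependencies = {
    "validate": {"ingest"},
    "build_features": {"validate"},
    "train": {"build_features"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# A topological order guarantees every task runs after all of its upstreams
order = list(TopologicalSorter(dependencies).static_order())


def has_cycle(deps):
    """Return True if the dependency graph contains a cycle (so it is not a DAG)."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return True
    return False


# Making the first task depend on the last one creates a loop
cyclic = {**dependencies, "ingest": {"deploy"}}
```

The acyclic constraint is what makes retries and backfills well defined: every task has a finite, known set of upstream tasks that must succeed first.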

ML Training DAG Example

from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.providers.amazon.aws.operators.sagemaker import (
    SageMakerTrainingOperator,
    SageMakerEndpointOperator,
)
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from datetime import datetime, timedelta

default_args = {
    "owner": "ml-team",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,
    "email": ["ml-team@company.com"],
}

with DAG(
    dag_id="churn_model_training",
    default_args=default_args,
    schedule="@weekly",  # "schedule" replaces the deprecated schedule_interval (Airflow 2.4+)
    start_date=datetime(2026, 1, 1),
    catchup=False,
    tags=["ml", "churn", "production"],
) as dag:

    # Wait for new data
    wait_for_data = S3KeySensor(
        task_id="wait_for_data",
        bucket_name="data-lake",
        bucket_key="churn/weekly/{{ ds }}/_SUCCESS",
        timeout=3600,
    )

    # Validate data quality
    validate_data = KubernetesPodOperator(
        task_id="validate_data",
        image="myregistry/data-validator:v2",
        cmds=["python", "validate.py"],
        arguments=["--date={{ ds }}", "--suite=training_quality"],
        namespace="ml-pipelines",
        get_logs=True,
    )

    # Feature engineering
    build_features = KubernetesPodOperator(
        task_id="build_features",
        image="myregistry/feature-builder:v3",
        cmds=["python", "build_features.py"],
        arguments=["--date={{ ds }}", "--output=s3://features/churn/{{ ds }}/"],
        namespace="ml-pipelines",
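        # Dict form accepted by older cncf.kubernetes provider releases; newer releases
        # expect container_resources with a kubernetes V1ResourceRequirements object instead
        resources={"request_memory": "8Gi", "request_cpu": "4"},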
    )

    # Train model
    train_model = SageMakerTrainingOperator(
        task_id="train_model",
        config={
            "TrainingJobName": "churn-{{ ds_nodash }}",
            "AlgorithmSpecification": {"TrainingImage": "...", "TrainingInputMode": "File"},
            "InputDataConfig": [{"ChannelName": "train", "DataSource": {...}}],
            "OutputDataConfig": {"S3OutputPath": "s3://models/churn/"},
            "ResourceConfig": {"InstanceType": "ml.p3.2xlarge", "InstanceCount": 1},
        },
    )

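    # Evaluate model (evaluate_model is a user-defined callable, not shown; it must
    # return the metrics dict so the branch task below can read it from XCom)
    evaluate = PythonOperator(
        task_id="evaluate_model",
        python_callable=evaluate_model,
        # SageMaker writes artifacts to <S3OutputPath>/<TrainingJobName>/output/
        op_kwargs={"model_path": "s3://models/churn/churn-{{ ds_nodash }}/output/model.tar.gz"},
    )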

    # Branch: deploy or alert
    def check_metrics(**context):
        metrics = context["ti"].xcom_pull(task_ids="evaluate_model")
        if metrics["f1_score"] >= 0.85:
            return "deploy_model"
        return "alert_team"

    branch = BranchPythonOperator(
        task_id="check_metrics",
        python_callable=check_metrics,
    )

    deploy = SageMakerEndpointOperator(
        task_id="deploy_model",
        operation="update",
        config={...},
    )

    alert = PythonOperator(
        task_id="alert_team",
        python_callable=send_slack_alert,
    )

    # DAG dependencies
    wait_for_data >> validate_data >> build_features >> train_model >> evaluate >> branch
    branch >> [deploy, alert]
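
The branching rule in `check_metrics` can be pulled out into a pure function so it is unit-testable without an Airflow runtime. The sketch below is one possible refactoring, not part of the DAG above; `choose_branch` and its parameters are hypothetical names. It also handles the case where `xcom_pull` returns `None` because the upstream task pushed no metrics.

```python
from typing import Mapping, Optional


def choose_branch(
    metrics: Optional[Mapping[str, float]],
    threshold: float = 0.85,
    deploy_task_id: str = "deploy_model",
    alert_task_id: str = "alert_team",
) -> str:
    """Return the task_id to follow, defaulting to the alert path when metrics are missing."""
    if not metrics or "f1_score" not in metrics:
        # Missing metrics should never result in a deployment
        return alert_task_id
    return deploy_task_id if metrics["f1_score"] >= threshold else alert_task_id


# Inside the DAG, the branch callable would then reduce to something like:
#   def check_metrics(**context):
#       return choose_branch(context["ti"].xcom_pull(task_ids="evaluate_model"))
```

Failing closed (alerting rather than deploying when metrics are absent) keeps a broken evaluation step from silently promoting a model.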

Airflow ML Provider Packages

Provider	Operators	Use Case
amazon	SageMaker (Training, Endpoint, Transform)	AWS ML jobs
google	Vertex AI (Training, Prediction, AutoML)	GCP ML jobs
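microsoft.azure	Azure Data Factory, Azure Batch, Container Instances (Azure ML operators ship in a separate community provider)	Azure data and compute jobs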
cncf.kubernetes	KubernetesPodOperator	Any containerized task
databricks	DatabricksRunNowOperator, DatabricksSubmitRunOperator	Spark/ML on Databricks
dbt.cloud	DbtCloudRunJobOperator	Data transformation (dbt Cloud jobs)

Airflow vs ML-Specific Orchestrators

Feature	Apache Airflow	Kubeflow Pipelines	SageMaker Pipelines
Scope	General workflow orchestration	ML-specific on K8s	ML-specific on AWS
Scheduling	Rich (cron, sensors, data-aware)	Basic (cron, manual)	EventBridge, API
ML integration	Via operators/providers	Native (KFP components)	Native (step types)
Caching	Manual (check before run)	Built-in (step-level)	Built-in (step-level)
Data lineage	Via plugins (OpenLineage)	Built-in artifacts	Built-in
Learning curve	Moderate	High (K8s + KFP)	Low (SDK)
Best for	Mixed workloads (data + ML)	K8s-native ML teams	AWS-native ML teams
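
The "Manual (check before run)" caching entry for Airflow can be sketched as follows: derive a key from the step name, its parameters, and a fingerprint of its input, then skip the step when a result for that key already exists. This is a minimal local-filesystem illustration under assumed names (`cache_key`, `run_cached`); a real pipeline would more likely keep the marker in object storage.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(step_name, params, input_fingerprint):
    """Hash everything that should invalidate the cached result when it changes."""
    payload = json.dumps(
        {"step": step_name, "params": params, "input": input_fingerprint},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(step_name, params, input_fingerprint, cache_dir, fn):
    """Run `fn(params)` unless a cached result exists; return (result, cache_hit)."""
    marker = Path(cache_dir) / f"{cache_key(step_name, params, input_fingerprint)}.json"
    if marker.exists():
        return json.loads(marker.read_text()), True
    result = fn(params)
    marker.write_text(json.dumps(result))
    return result, False


calls = []


def train(params):
    """Stand-in for an expensive training step; records each real invocation."""
    calls.append(params)
    return {"f1_score": 0.9, "max_depth": params["max_depth"]}


with tempfile.TemporaryDirectory() as tmp:
    first = run_cached("train", {"max_depth": 6}, "data-v1", tmp, train)
    second = run_cached("train", {"max_depth": 6}, "data-v1", tmp, train)  # same key: skipped
    third = run_cached("train", {"max_depth": 8}, "data-v1", tmp, train)   # new params: reruns
```

Kubeflow Pipelines and SageMaker Pipelines apply the same idea automatically at the step level, which is the difference the table row points at.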

Q9: How Do You Use Terraform/Pulumi for ML Infrastructure as Code?

Answer:

Infrastructure as Code (IaC) for ML ensures reproducible, version-controlled environments across development, staging, and production. Terraform (HCL) and Pulumi (Python/TypeScript) define ML infrastructure — compute clusters, model endpoints, feature stores, networking, IAM — as code that’s reviewed, tested, and deployed through CI/CD.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph IaC["Infrastructure as Code"]
        TF["Terraform<br/>(HCL, declarative)"]
        PULUMI["Pulumi<br/>(Python/TS, general-purpose languages)"]
        CDK["AWS CDK / Bicep<br/>(cloud-specific)"]
    end

    subgraph MLInfra["ML Infrastructure"]
        COMPUTE["Compute<br/>(GPU clusters, K8s, VMs)"]
        STORAGE_I["Storage<br/>(S3, GCS, ADLS)"]
        NETWORK["Networking<br/>(VPC, subnets, endpoints)"]
        SERVE_I["Serving<br/>(endpoints, load balancers)"]
        MONITOR_I["Monitoring<br/>(CloudWatch, Prometheus)"]
        IAM_I["IAM<br/>(roles, policies)"]
    end

    subgraph Workflow_IaC["IaC Workflow"]
        CODE_I["Write Code<br/>(tf / pulumi)"]
        PLAN["Plan<br/>(preview changes)"]
        REVIEW["Code Review<br/>(PR approval)"]
        APPLY["Apply<br/>(provision infra)"]
    end

    IaC --> MLInfra
    CODE_I --> PLAN --> REVIEW --> APPLY

    style IaC fill:#6cc3d5,stroke:#333,color:#fff
    style MLInfra fill:#56cc9d,stroke:#333,color:#fff
    style Workflow_IaC fill:#fff

Why IaC for ML?

Challenge	Without IaC	With IaC
Environment drift	Manual setup differs between envs	Identical infrastructure everywhere
Audit trail	Who changed what?	Git history tracks all changes
Disaster recovery	Manual rebuild	`terraform apply` recreates the infrastructure (data still needs its own backups)
Team onboarding	Undocumented setup steps	Self-documenting code
Cost control	Forgotten resources running	Destroy unused envs: `terraform destroy`
Compliance	Manual security checks	Policy-as-code (Sentinel, OPA)
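
To give a feel for the policy-as-code row, the sketch below checks a Terraform plan (as produced by `terraform show -json`) for resources that are missing a required tag. It is a simplified stand-in for what OPA or Sentinel would enforce, not a replacement for them. The sample plan is hand-written and only mirrors the `resource_changes` structure; the tag name and resource types are assumptions.

```python
REQUIRED_TAG = "Environment"
TAGGED_TYPES = {"aws_s3_bucket", "aws_sagemaker_endpoint"}


def find_violations(plan: dict) -> list[str]:
    """Return addresses of planned resources of the checked types that lack the required tag."""
    violations = []
    for resource_change in plan.get("resource_changes", []):
        after = resource_change.get("change", {}).get("after")
        # `after` is None for resources that are being destroyed
        if resource_change.get("type") not in TAGGED_TYPES or after is None:
            continue
        tags = after.get("tags") or {}
        if REQUIRED_TAG not in tags:
            violations.append(resource_change["address"])
    return violations


# Hand-written sample mirroring the shape of a Terraform JSON plan
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket.ml_artifacts",
            "type": "aws_s3_bucket",
            "change": {"actions": ["create"], "after": {"tags": {"Environment": "dev"}}},
        },
        {
            "address": "aws_sagemaker_endpoint.churn",
            "type": "aws_sagemaker_endpoint",
            "change": {"actions": ["create"], "after": {"tags": None}},
        },
        {
            "address": "aws_iam_role.sagemaker_execution",
            "type": "aws_iam_role",
            "change": {"actions": ["create"], "after": {}},
        },
    ]
}
violations = find_violations(sample_plan)
```

Running a check like this in CI against the plan, before `apply`, is what turns a written policy into an enforced one.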

Terraform for SageMaker

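# main.tf - SageMaker ML infrastructure
# Assumes the var.* inputs, aws_kms_key.ml_key, and aws_security_group.sagemaker are declared elsewhere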
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "terraform-state-ml"
    key    = "ml-platform/terraform.tfstate"
    region = "us-east-1"
  }
}

# IAM Role for SageMaker
resource "aws_iam_role" "sagemaker_execution" {
  name = "sagemaker-execution-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "sagemaker_full" {
  role       = aws_iam_role.sagemaker_execution.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

# S3 bucket for ML artifacts
resource "aws_s3_bucket" "ml_artifacts" {
  bucket = "ml-artifacts-${var.environment}"
  tags   = { Environment = var.environment, Team = "ml-platform" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "ml_artifacts" {
  bucket = aws_s3_bucket.ml_artifacts.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.ml_key.arn
    }
  }
}

# SageMaker Domain (Studio)
resource "aws_sagemaker_domain" "ml_studio" {
  domain_name = "ml-studio-${var.environment}"
  auth_mode   = "IAM"
  vpc_id      = var.vpc_id
  subnet_ids  = var.private_subnet_ids

  default_user_settings {
    execution_role = aws_iam_role.sagemaker_execution.arn
    security_groups = [aws_security_group.sagemaker.id]
  }
}

# SageMaker Model (for endpoint)
resource "aws_sagemaker_model" "churn" {
  name               = "churn-model-${var.model_version}"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = var.inference_image
    model_data_url = "s3://${aws_s3_bucket.ml_artifacts.id}/models/churn/${var.model_version}/model.tar.gz"
  }

  vpc_config {
    subnets            = var.private_subnet_ids
    security_group_ids = [aws_security_group.sagemaker.id]
  }
}

# SageMaker Endpoint
resource "aws_sagemaker_endpoint_configuration" "churn" {
  name = "churn-endpoint-config-${var.model_version}"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.churn.name
    initial_instance_count = var.endpoint_instance_count
    instance_type          = var.endpoint_instance_type
  }

  data_capture_config {
    enable_capture              = true
    initial_sampling_percentage = 20
    destination_s3_uri          = "s3://${aws_s3_bucket.ml_artifacts.id}/data-capture/"
    capture_options { capture_mode = "Input" }
    capture_options { capture_mode = "Output" }
  }
}

resource "aws_sagemaker_endpoint" "churn" {
  name                 = "churn-prediction-${var.environment}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.churn.name
  tags                 = { Environment = var.environment }
}

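# Auto-scaling: registers the variant as a scalable target; an aws_appautoscaling_policy
# (e.g. target tracking on invocations per instance) is still needed to trigger scaling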
resource "aws_appautoscaling_target" "endpoint" {
  max_capacity       = 10
  min_capacity       = var.environment == "production" ? 2 : 1
  resource_id        = "endpoint/${aws_sagemaker_endpoint.churn.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

Pulumi for ML (Python)

import pulumi
import pulumi_aws as aws
import pulumi_kubernetes as k8s

# Configuration
config = pulumi.Config()
environment = config.require("environment")

# S3 bucket for ML artifacts
ml_bucket = aws.s3.Bucket(
    f"ml-artifacts-{environment}",
    bucket=f"ml-artifacts-{environment}",
    server_side_encryption_configuration={
        "rule": {"apply_server_side_encryption_by_default": {"sse_algorithm": "aws:kms"}}
    },
    tags={"Environment": environment, "Team": "ml-platform"},
)

# Kubernetes namespace for ML workloads
ml_namespace = k8s.core.v1.Namespace(
    "ml-serving",
    metadata={"name": f"ml-serving-{environment}"},
)

# Deploy KServe InferenceService via Pulumi K8s
inference_service = k8s.apiextensions.CustomResource(
    "churn-model",
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata={"name": "churn-classifier", "namespace": ml_namespace.metadata.name},
    spec={
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": pulumi.Output.concat("s3://", ml_bucket.id, "/models/churn/latest"),
            },
            "minReplicas": 1 if environment == "dev" else 2,
            "maxReplicas": 10,
        }
    },
)

pulumi.export("inference_service_name", inference_service.metadata.name)
pulumi.export("bucket_name", ml_bucket.id)
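
One practical benefit of writing infrastructure in a general-purpose language is that resource specs can be built by ordinary functions and unit-tested without running the Pulumi engine. The sketch below factors the `InferenceService` spec from the program above into such a function; `build_inference_service_spec` is a hypothetical helper, and it takes the storage URI as a plain string for testing, whereas the Pulumi program passes an `Output`.

```python
def build_inference_service_spec(
    environment: str,
    storage_uri: str,
    max_replicas: int = 10,
) -> dict:
    """Build the KServe InferenceService spec; non-dev environments get two replicas minimum."""
    return {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": storage_uri,
            },
            "minReplicas": 1 if environment == "dev" else 2,
            "maxReplicas": max_replicas,
        }
    }


dev_spec = build_inference_service_spec("dev", "s3://ml-artifacts-dev/models/churn/latest")
prod_spec = build_inference_service_spec("production", "s3://ml-artifacts-production/models/churn/latest")
```

Keeping the sizing rule in one function means a change to the replica policy is reviewed and tested in a single place rather than in every stack.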

IaC Best Practices for ML

Practice	Implementation
Module per component	`modules/sagemaker-endpoint/`, `modules/feature-store/`
Environment separation	`envs/dev/`, `envs/staging/`, `envs/prod/` (different tfvars)
State locking	S3 + DynamoDB (Terraform) or Pulumi Cloud
Policy-as-code	OPA/Sentinel to enforce security (no public endpoints, encryption required)
Cost estimation	`infracost` in CI to preview cost changes
Drift detection	Scheduled `terraform plan` to detect manual changes
Secrets management	AWS Secrets Manager / Vault (never in state)
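
The drift-detection practice can be sketched as a small scheduled job built on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The wrapper below is a minimal illustration; the function names are hypothetical, and it assumes the `terraform` CLI is installed and the working directory is already initialized.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map the -detailed-exitcode result to a status label."""
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drift"
    return "error"


def detect_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` and report whether live infra differs from code."""
    completed = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # exit code 2 is an expected outcome, not a failure
    )
    return interpret_plan_exit_code(completed.returncode)
```

Note that exit code 2 signals any pending change, including code that was merged but not yet applied, so a scheduled check is most meaningful when run against the last applied revision.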

Q10: How Do You Build a Best-of-Breed Cloud-Agnostic MLOps Stack?

Answer:

A cloud-agnostic MLOps stack combines best-of-breed open-source tools to cover the full ML lifecycle — avoiding vendor lock-in while maintaining production-grade capabilities. The key is choosing tools that integrate well, have active communities, and support your deployment targets (cloud, on-prem, hybrid).

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Stack["Cloud-Agnostic MLOps Stack"]
        subgraph Versioning["Versioning & Tracking"]
            DVC_S["DVC<br/>(data/model versioning)"]
            MLFLOW_S["MLflow / W&B<br/>(experiment tracking)"]
        end

        subgraph Orchestration["Orchestration"]
            AIRFLOW_S["Airflow / Prefect<br/>(workflow scheduling)"]
            KFP_S["Kubeflow Pipelines<br/>(ML DAGs on K8s)"]
        end

        subgraph Data["Data Quality & Features"]
            GX_S["Great Expectations<br/>(data validation)"]
            FEAST_S["Feast<br/>(feature store)"]
        end

        subgraph Serving_S["Model Serving"]
            SELDON_S["Seldon / KServe<br/>(K8s inference)"]
            BENTO_S["BentoML<br/>(model packaging)"]
        end

        subgraph Monitoring_S["Monitoring"]
            EVIDENTLY["Evidently AI<br/>(drift detection)"]
            PROM["Prometheus + Grafana<br/>(metrics & dashboards)"]
        end

        subgraph Infra_S["Infrastructure"]
            TF_S["Terraform / Pulumi<br/>(IaC)"]
            K8S_S["Kubernetes<br/>(runtime platform)"]
        end
    end

    style Stack fill:#f8f9fa,stroke:#333
    style Versioning fill:#6cc3d5,stroke:#333,color:#fff
    style Orchestration fill:#56cc9d,stroke:#333,color:#fff
    style Data fill:#ffce67,stroke:#333
    style Serving_S fill:#ff6b6b,stroke:#333,color:#fff
    style Monitoring_S fill:#c3aed6,stroke:#333
    style Infra_S fill:#78c2ad,stroke:#333,color:#fff

Reference Stack by MLOps Stage

Stage	Open-Source Tool	Alternative	Purpose
Data versioning	DVC	LakeFS, Delta Lake	Track data/model versions alongside code
Experiment tracking	MLflow	W&B, Neptune, CometML	Log metrics, params, compare runs
Pipeline orchestration	Apache Airflow	Prefect, Dagster, Flyte	Schedule and orchestrate workflows
ML pipelines	Kubeflow Pipelines	Metaflow, ZenML	ML-specific DAGs with caching
Data validation	Great Expectations	Pandera, Deequ, TFDV	Validate data quality
Feature store	Feast	Hopsworks, Tecton	Consistent feature serving
Model serving	Seldon Core / KServe	BentoML, Ray Serve, Triton	Low-latency inference on K8s
Monitoring	Evidently AI	NannyML, WhyLabs, Arize	Drift detection, model quality
Infrastructure	Terraform	Pulumi, Crossplane	Provision and manage infra
Container orchestration	Kubernetes	Nomad, Docker Swarm	Run all workloads
CI/CD	GitHub Actions	GitLab CI, Jenkins, Argo CD	Build, test, deploy
Secrets	HashiCorp Vault	Sealed Secrets, SOPS	Manage credentials

Example Integration Architecture

# docker-compose.yml - Local MLOps development stack
version: "3.8"
services:
  mlflow:
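    # Release tags on ghcr.io carry a "v" prefix; the stock image also lacks psycopg2 and boto3,
    # so extend it before using the PostgreSQL backend and S3 artifact store below
    image: ghcr.io/mlflow/mlflow:v2.15.0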
    ports: ["5000:5000"]
    command: >
      mlflow server --host 0.0.0.0
      --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow
      --default-artifact-root s3://mlflow-artifacts/
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}

  feast:
    image: feastdev/feature-server:0.38.0
    ports: ["6566:6566"]
    volumes: ["./feature_repo:/feature_repo"]
    command: feast -c /feature_repo serve --host 0.0.0.0

  great-expectations:
    image: greatexpectations/great_expectations:latest
    volumes: ["./gx:/gx"]

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes: ["./grafana/dashboards:/var/lib/grafana/dashboards"]

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow

  redis:
    image: redis:7
    ports: ["6379:6379"]
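
After `docker compose up`, a quick smoke check helps confirm the services are actually listening before a pipeline starts talking to them. The sketch below polls the published ports from the compose file with plain TCP connections. The function names are hypothetical, and it assumes the stack runs on `localhost` with the port mappings shown above; a TCP connect only proves the port is open, not that the service is healthy.

```python
import socket
import time

# Host ports published by the compose file above
SERVICES = {
    "mlflow": 5000,
    "feast": 6566,
    "grafana": 3000,
    "prometheus": 9090,
    "redis": 6379,
}


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Retry a TCP connection until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def check_stack(host: str = "localhost", timeout: float = 30.0) -> dict:
    """Return a {service_name: reachable} map for every published port."""
    return {name: wait_for_port(host, port, timeout) for name, port in SERVICES.items()}
```

Calling `check_stack()` from a CI job or a Makefile target gives a simple gate before integration tests run against the local stack.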

Decision Framework: When to Use What

Scenario	Recommended Stack	Why
Startup, AWS-only	SageMaker (managed)	Fastest to production, minimal ops overhead
Enterprise, multi-cloud	Kubeflow + MLflow + Feast + Seldon	Portable, no lock-in
Small team, quick iteration	MLflow + DVC + BentoML + GitHub Actions	Simple, low overhead
Regulated industry	Cloud-managed + Terraform + OPA	Compliance, audit trail
On-prem/hybrid	Kubeflow + Feast + Airflow + Terraform	Full control, any environment
Large org, many teams	W&B + Feast + Airflow + KServe + Terraform	Collaboration, governance

Migration Strategy (Cloud → Agnostic)

Step	Action	Risk Mitigation
1	Abstract model packaging — use MLflow model format	Standard format works everywhere
2	Adopt Feast — decouple feature serving from cloud feature store	Dual-write during transition
3	Containerize training — Docker + KFP components	Runs on any K8s cluster
4	IaC everything — Terraform modules per provider	Swap providers by changing modules
5	Portable CI/CD — GitHub Actions with provider-agnostic steps	Same workflow, different targets
6	Monitoring abstraction — Evidently + Prometheus (cloud-agnostic)	Consistent metrics everywhere
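
The dual-write mitigation in step 2 can be sketched as a thin wrapper that writes every feature update to both the existing store and the new one, serves reads from the existing store, and records any disagreement. The class below uses plain dicts as stand-ins for the cloud feature store and the Feast online store; it is an illustration of the pattern under those assumptions, not Feast API code.

```python
class DualWriteFeatureStore:
    """Write to both stores; read from the primary and track mismatches with the secondary."""

    def __init__(self, primary: dict, secondary: dict):
        self.primary = primary        # stand-in for the existing cloud feature store
        self.secondary = secondary    # stand-in for the new (e.g. Feast) online store
        self.mismatches: list[str] = []

    def write(self, entity_id: str, features: dict) -> None:
        # Copy so later mutation of the caller's dict cannot desynchronize the stores
        self.primary[entity_id] = dict(features)
        self.secondary[entity_id] = dict(features)

    def read(self, entity_id: str):
        value = self.primary.get(entity_id)
        if self.secondary.get(entity_id) != value:
            self.mismatches.append(entity_id)
        return value


old_store = {"cust_1": {"tenure_months": 12}}  # written before dual-write began
new_store = {}
store = DualWriteFeatureStore(old_store, new_store)

store.write("cust_2", {"tenure_months": 30})
served_new = store.read("cust_2")   # present in both stores
served_old = store.read("cust_1")   # only in the old store until it is backfilled
```

Once the mismatch list stays empty after a backfill, reads can be switched to the new store with reasonable confidence that serving behavior will not change.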

Summary Table

#	Topic	Key Tools
1	Experiment Tracking	MLflow (Tracking, Registry, Projects, Models)
2	ML Pipelines on K8s	Kubeflow Pipelines (KFP), Katib, KServe
3	Data & Model Versioning	DVC (dvc add, dvc repro, dvc exp)
4	Experiment Management	Weights & Biases (Experiments, Sweeps, Artifacts)
5	Feature Store	Feast (online/offline stores, point-in-time joins)
6	Model Serving on K8s	Seldon Core, KServe, BentoML
7	Data Validation	Great Expectations (suites, checkpoints, Data Docs)
8	Workflow Orchestration	Apache Airflow (DAGs, operators, sensors)
9	ML Infrastructure as Code	Terraform, Pulumi (multi-cloud IaC)
10	Best-of-Breed Stack	Reference architecture combining all tools

What’s Next?

This article covered cloud-agnostic MLOps tools. For related content:

General MLOps concepts: MLOps Interview QA - 1
Azure MLOps: MLOps Interview QA - 2
GCP MLOps: MLOps Interview QA - 3
AWS MLOps: MLOps Interview QA - 4
LLMOps: LLMOps Interview QA - 1
DevOps foundations: DevOps Interview QA - 1
