DevOps Interview QA - 1

10 most-asked DevOps interview questions covering CI/CD pipelines, Docker, Kubernetes, Infrastructure as Code, GitOps, monitoring, deployment strategies, secrets management, incident response, and observability.

Author

Vectoring AI

Published

21 May 2026

Keywords

DevOps interview, CI/CD pipeline, Docker, Kubernetes, Infrastructure as Code, Terraform, GitOps, monitoring, deployment strategies, observability, Helm, Prometheus, Grafana, secrets management

Introduction

This is Part 1 of our DevOps Interview QA series, covering the 10 most frequently asked DevOps interview questions. DevOps bridges software development and IT operations to deliver software faster, more reliably, and with tighter feedback loops — emphasizing automation, collaboration, and continuous improvement.

For MLOps (ML-specific DevOps), see MLOps Interview QA - 1. For LLMOps, see LLMOps Interview QA - 1. For system design, see System Design Interview QA - 1.

Q1: What Is CI/CD and How Do You Design a Pipeline?

Answer:

CI/CD (Continuous Integration / Continuous Delivery or Deployment) is the backbone of DevOps automation. CI merges code into a shared repository frequently, and every merge triggers an automated build and test run. CD takes each validated build and automatically delivers it to staging (Continuous Delivery) or all the way to production (Continuous Deployment). Together they reduce manual errors, accelerate releases, and provide rapid feedback.

graph LR
    linkStyle default stroke:#000,color:#000
    subgraph CI["Continuous Integration"]
        COMMIT["Code Commit<br/>(Git push)"]
        COMMIT --> LINT["Lint &<br/>Static Analysis"]
        LINT --> BUILD["Build<br/>(compile, package)"]
        BUILD --> UNIT["Unit Tests"]
        UNIT --> INTEG["Integration Tests"]
    end

    subgraph CD["Continuous Delivery / Deployment"]
        INTEG --> ARTIFACT["Push Artifact<br/>(container image)"]
        ARTIFACT --> STAGING["Deploy to Staging"]
        STAGING --> E2E["E2E / Smoke Tests"]
        E2E --> GATE["Approval Gate<br/>(manual or auto)"]
        GATE --> PROD["Deploy to Production"]
        PROD --> MONITOR["Monitor &<br/>Rollback if needed"]
    end

    style CI fill:#6cc3d5,stroke:#333,color:#fff
    style CD fill:#56cc9d,stroke:#333,color:#fff

Continuous Delivery vs Continuous Deployment

Aspect	Continuous Delivery	Continuous Deployment
Definition	Code is always release-ready; deployment requires manual approval	Every change passing tests is deployed to production automatically
Human gate	Yes (manual approval before prod)	No (fully automated)
Risk	Lower (human review)	Requires robust automated testing
Speed	Fast, but gated	Fastest possible
Best for	Regulated industries, critical systems	High-velocity teams with strong test coverage

CI/CD Pipeline Best Practices

Practice	Description
Fast feedback	Unit tests run first (<5 min); slow tests run later
Fail fast	Pipeline stops on first failure, team notified immediately
Immutable artifacts	Build once, deploy same artifact to all environments
Environment parity	Dev/staging/prod are as similar as possible
Secrets isolation	Use vault/secrets manager, never hardcode credentials
Caching	Cache dependencies, Docker layers, test results
Parallelization	Run independent test suites concurrently
Idempotent deployments	Re-running deployment produces same result
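
The fail-fast and fast-feedback rows can be illustrated with a minimal Python sketch. The stage names and the `run_pipeline` helper are hypothetical and not part of any CI tool; real systems express the same ordering in their pipeline configuration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    executed: List[str] = []
    for name, step in stages:
        executed.append(name)
        if not step():
            # Later (usually slower) stages are skipped entirely.
            return False, executed
    return True, executed


# Hypothetical stages, ordered cheapest-first for fast feedback.
stages: List[Stage] = [
    ("lint", lambda: True),
    ("unit-tests", lambda: False),  # simulated failure
    ("integration-tests", lambda: True),
]

ok, executed = run_pipeline(stages)
print(ok, executed)
```

Because `unit-tests` fails, `integration-tests` never runs, so the team gets the failure signal without waiting for the slow stage.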

CI/CD Tools Comparison

Tool	Type	Key Feature	Best For
GitHub Actions	SaaS, YAML workflows	Deep GitHub integration, marketplace	GitHub-centric teams
GitLab CI/CD	Integrated, YAML	Built into GitLab, Auto DevOps	GitLab users, all-in-one
Jenkins	Self-hosted, plugins	Maximum flexibility, huge ecosystem	Complex enterprise pipelines
CircleCI	SaaS	Fast, parallelism, Docker-native	Speed-focused teams
ArgoCD	GitOps, K8s-native	Declarative, auto-sync from Git	Kubernetes deployments
Tekton	K8s-native, CRDs	Cloud-native, reusable tasks	K8s-native CI/CD

Q2: How Do Docker Containers Work and Why Are They Used in DevOps?

Answer:

Docker containers package an application with all its dependencies (code, runtime, libraries, config) into a lightweight, portable unit that runs consistently on any host with a compatible kernel and container runtime. Unlike VMs, containers share the host OS kernel, making them fast to start and resource-efficient.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph VM["Virtual Machines"]
        HW1["Hardware"]
        HW1 --> HYP["Hypervisor"]
        HYP --> OS1["Guest OS 1<br/>(full OS)"]
        HYP --> OS2["Guest OS 2<br/>(full OS)"]
        OS1 --> APP1["App A + Libs"]
        OS2 --> APP2["App B + Libs"]
    end

    subgraph Container["Docker Containers"]
        HW2["Hardware"]
        HW2 --> HOST["Host OS + Docker Engine"]
        HOST --> C1["Container 1<br/>(App A + Libs)"]
        HOST --> C2["Container 2<br/>(App B + Libs)"]
        HOST --> C3["Container 3<br/>(App C + Libs)"]
    end

    style VM fill:#6cc3d5,stroke:#333,color:#fff
    style Container fill:#56cc9d,stroke:#333,color:#fff

Docker vs Virtual Machines

Feature	Docker Containers	Virtual Machines
Startup time	Seconds	Minutes
Size	MBs (application layer only)	GBs (full OS)
Resource usage	Lightweight (shared kernel)	Heavy (dedicated OS per VM)
Isolation	Process-level (namespaces, cgroups)	Hardware-level (hypervisor)
Portability	Run anywhere Docker is installed	Tied to hypervisor
Density	100s of containers per host	10s of VMs per host
Use case	Microservices, CI/CD, dev environments	Legacy apps, strong isolation, different OS

Docker Architecture

Component	Purpose
Dockerfile	Recipe to build an image (FROM, RUN, COPY, CMD)
Image	Immutable template; layers of filesystem changes
Container	Running instance of an image
Registry	Store and distribute images (Docker Hub, Amazon ECR, Google Artifact Registry)
Docker Compose	Define multi-container applications in YAML
Docker Engine	Daemon that builds, runs, manages containers

Dockerfile Best Practices

# Multi-stage build: smaller final image
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
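# Copy installed packages and their console scripts (e.g. the uvicorn executable)
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin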
COPY . .

# Run as non-root user (security)
RUN useradd -r appuser
USER appuser

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Key Practices

1. Use multi-stage builds → smaller images
2. Pin base image versions → reproducibility
3. Run as non-root → security
4. Use .dockerignore → exclude unnecessary files
5. Order layers by change frequency → better caching
6. One process per container → composability
7. Health checks → orchestrator can detect unhealthy containers
8. No secrets in images → use runtime env vars or secrets mounts
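
Practice 5 (order layers by change frequency) is easier to see with a toy model of the build cache. The sketch below is a simplification under stated assumptions: each layer's cache key is derived from its instruction plus its parent's key, and the `@v1`/`@v2` suffixes are hypothetical stand-ins for file contents. Docker's real cache also checksums copied files, so treat this as an illustration rather than a description of the engine.

```python
import hashlib
from typing import List


def layer_keys(instructions: List[str]) -> List[str]:
    """Derive a cache key per layer from the instruction and the parent key."""
    keys: List[str] = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


def reusable_layers(old: List[str], new: List[str]) -> int:
    """Count how many leading layers can be reused from the previous build."""
    reused = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break
        reused += 1
    return reused


# "@v1"/"@v2" stand in for file contents; only the source code changes.
deps_first_v1 = ["FROM python:3.12-slim", "COPY requirements.txt@v1", "RUN pip install", "COPY src@v1"]
deps_first_v2 = ["FROM python:3.12-slim", "COPY requirements.txt@v1", "RUN pip install", "COPY src@v2"]
src_first_v1 = ["FROM python:3.12-slim", "COPY src@v1", "RUN pip install"]
src_first_v2 = ["FROM python:3.12-slim", "COPY src@v2", "RUN pip install"]

print(reusable_layers(deps_first_v1, deps_first_v2))  # dependency install stays cached
print(reusable_layers(src_first_v1, src_first_v2))    # dependency install is rebuilt
```

Under this model, copying the dependency manifest before the source code keeps three layers reusable after a code change, while copying the source first leaves only the base image layer.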

Q3: How Does Kubernetes Orchestrate Containers at Scale?

Answer:

Kubernetes (K8s) is an open-source container orchestration platform that automates deployment, scaling, self-healing, and management of containerized applications. It abstracts infrastructure into a declarative API — you describe the desired state, and Kubernetes makes it happen.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph ControlPlane["Control Plane"]
        API["API Server<br/>(kubectl, REST)"]
        ETCD["etcd<br/>(cluster state store)"]
        SCHED["Scheduler<br/>(assigns pods to nodes)"]
        CM["Controller Manager<br/>(reconciliation loops)"]
    end

    subgraph WorkerNode["Worker Node"]
        KUBELET["Kubelet<br/>(node agent)"]
        PROXY["Kube-Proxy<br/>(networking)"]
        RUNTIME["Container Runtime<br/>(containerd)"]
        POD1["Pod<br/>(container(s))"]
        POD2["Pod<br/>(container(s))"]
    end

    API --> ETCD
    API --> SCHED
    API --> CM
    SCHED --> KUBELET
    KUBELET --> RUNTIME
    RUNTIME --> POD1
    RUNTIME --> POD2

    style ControlPlane fill:#6cc3d5,stroke:#333,color:#fff
    style WorkerNode fill:#56cc9d,stroke:#333,color:#fff

Core Kubernetes Objects

Object	Purpose	Example
Pod	Smallest deployable unit (1+ containers)	Single app instance
Deployment	Manages ReplicaSets, rolling updates, rollbacks	Stateless web app
Service	Stable network endpoint for pods (ClusterIP, NodePort, LoadBalancer)	Internal or external access
ConfigMap	Non-sensitive configuration data	App settings, feature flags
Secret	Sensitive data (base64-encoded, not encrypted by default)	DB passwords, API keys
Ingress	HTTP/S routing rules, TLS termination	Domain-based routing
StatefulSet	Ordered, persistent pods with stable IDs	Databases, message queues
DaemonSet	One pod per node	Log collectors, monitoring agents
Job / CronJob	Run-to-completion tasks	Batch processing, scheduled tasks
HPA	Horizontal Pod Autoscaler	Scale pods by CPU/memory/custom
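
The HPA row can be made concrete with the scaling rule given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function name and the min/max defaults below are assumptions for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling rule, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: avoid flapping.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=90, target_metric=60))   # scale up
print(desired_replicas(4, current_metric=63, target_metric=60))   # within tolerance
print(desired_replicas(4, current_metric=30, target_metric=60))   # scale down
```

With four pods at 90% average CPU against a 60% target, the rule asks for six pods; at 63% it leaves the count unchanged.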

Kubernetes Self-Healing

Mechanism	What It Does
Liveness probe	Restarts container if health check fails
Readiness probe	Removes pod from service if not ready
ReplicaSet	Ensures desired number of pods always running
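Node failure	Node controller marks the node unhealthy; controllers create replacement pods, which the scheduler places on healthy nodes
PodDisruptionBudget	Ensures a minimum number of pods stay available during voluntary disruptions (node drains, cluster upgrades)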
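
The ReplicaSet row describes a reconciliation loop: compare desired state with observed state and emit the actions that close the gap. A minimal sketch of one loop iteration follows; the `reconcile` function and the pod names are hypothetical, and a real controller works against the API server rather than a dictionary.

```python
from typing import Dict, List


def reconcile(desired_replicas: int, pods: Dict[str, bool]) -> List[str]:
    """Return the actions that move the observed pods toward the desired count.

    `pods` maps a pod name to whether that pod is still running.
    """
    actions: List[str] = []
    running = [name for name, alive in pods.items() if alive]

    # Failed pods are replaced rather than repaired.
    for name, alive in pods.items():
        if not alive:
            actions.append(f"delete {name}")

    difference = desired_replicas - len(running)
    if difference > 0:
        actions.extend(["create pod"] * difference)
    else:
        # Scale down surplus pods (no-op when difference == 0).
        actions.extend(f"delete {name}" for name in running[desired_replicas:])
    return actions


print(reconcile(3, {"web-a": True, "web-b": False}))
```

Running the same function again after the actions are applied returns an empty list, which is what makes the loop safe to repeat continuously.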

Kubernetes Networking

Concept	Description
Pod-to-Pod	All pods can communicate without NAT (flat network)
Service	Virtual IP + DNS name load-balanced across pods
Ingress	L7 routing (path/host-based) from external traffic to services
NetworkPolicy	Firewall rules between pods (namespace/label selectors)
Service Mesh	Sidecar proxies for mTLS, retries, observability (Istio, Linkerd)

Q4: What Is Infrastructure as Code (IaC) and How Do You Use Terraform?

Answer:

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than manual processes. It makes infrastructure reproducible, version-controlled, auditable, and testable — treating infrastructure the same way you treat application code.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Traditional["Manual Infrastructure"]
        MANUAL["Click in Console<br/>(AWS/GCP/Azure)"]
        MANUAL --> DRIFT["Configuration Drift<br/>(snowflake servers)"]
        DRIFT --> UNDOC["Undocumented<br/>Changes"]
    end

    subgraph IaC["Infrastructure as Code"]
        CODE["Define in Code<br/>(Terraform/CloudFormation)"]
        CODE --> GIT["Version Control<br/>(Git)"]
        GIT --> REVIEW["Code Review<br/>(PR/MR)"]
        REVIEW --> PLAN["Plan<br/>(preview changes)"]
        PLAN --> APPLY["Apply<br/>(provision infra)"]
        APPLY --> STATE["State File<br/>(tracks what exists)"]
    end

    style Traditional fill:#ff6b6b,stroke:#333,color:#fff
    style IaC fill:#56cc9d,stroke:#333,color:#fff

IaC Tools Comparison

Tool	Approach	Language	State	Best For
Terraform	Declarative, multi-cloud	HCL	Remote state file	Multi-cloud, cloud-agnostic
AWS CloudFormation	Declarative, AWS-only	JSON/YAML	Managed by AWS	AWS-only shops
Pulumi	Declarative model written in general-purpose languages, multi-cloud	Python/TypeScript/Go	Managed or self-hosted	Devs who prefer real code
Ansible	Procedural, config management	YAML (playbooks)	Stateless	Server config, provisioning
OpenTofu	Declarative, open-source Terraform fork	HCL	Remote state file	Terraform without licensing concerns

Terraform Workflow

Step	Command	What Happens
Init	`terraform init`	Downloads providers, initializes backend
Plan	`terraform plan`	Shows what will be created/changed/destroyed
Apply	`terraform apply`	Executes the plan, provisions infrastructure
Destroy	`terraform destroy`	Tears down all managed resources
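
Conceptually, the plan step is a diff between the desired configuration and the recorded state. The sketch below shows that idea only; it is not how Terraform is implemented, and the resource address `aws_instance.legacy` and the attribute values are made up for the example.

```python
from typing import Dict, List


def plan(desired: Dict[str, dict], state: Dict[str, dict]) -> Dict[str, List[str]]:
    """Classify each resource address as create, update, or destroy."""
    return {
        "create": sorted(desired.keys() - state.keys()),
        "update": sorted(
            addr for addr in desired.keys() & state.keys() if desired[addr] != state[addr]
        ),
        "destroy": sorted(state.keys() - desired.keys()),
    }


# What the configuration files declare.
desired = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_eks_cluster.app": {"version": "1.29"},
    "aws_eks_node_group.workers": {"desired_size": 3},
}
# What the state file says already exists.
state = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_eks_cluster.app": {"version": "1.28"},
    "aws_instance.legacy": {"instance_type": "t3.micro"},
}

print(plan(desired, state))
```

Resources that match in both places produce no action, which is why re-running apply on unchanged code is a no-op.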

Terraform Example

# Define provider
provider "aws" {
  region = "us-east-1"
}

# Create VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = { Name = "production-vpc" }
}

# Create EKS cluster
# (aws_iam_role.eks, aws_iam_role.node, and aws_subnet.private are assumed to be defined elsewhere)
resource "aws_eks_cluster" "app" {
  name     = "production-cluster"
  role_arn = aws_iam_role.eks.arn
  version  = "1.29"

  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

# Managed node group for worker nodes
resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.app.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id
  instance_types  = ["m5.large"]

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 1
  }
}

IaC Best Practices

Practice	Description
Modularize	Reusable modules for common patterns (VPC, EKS, RDS)
Remote state	Store state in S3/GCS with locking (DynamoDB/GCS)
State isolation	Separate state per environment (dev/staging/prod)
Plan in CI	Auto-run `terraform plan` on PRs for review
Drift detection	Periodically compare actual vs desired state
Secrets out of code	Use variables, vault references, or encrypted values
Tagging	Tag all resources (team, env, cost-center)
Blast radius	Small, focused modules limit impact of mistakes

Q5: What Are Deployment Strategies and When Do You Use Each?

Answer:

A deployment strategy defines how new application versions are rolled out to production. The right strategy depends on risk tolerance, rollback requirements, infrastructure complexity, and team capabilities.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph BlueGreen["Blue-Green"]
        BG_OLD["Blue (v1)<br/>100% traffic"]
        BG_NEW["Green (v2)<br/>0% traffic"]
        BG_OLD -->|"Switch LB"| BG_SWITCH["Green (v2)<br/>100% traffic"]
    end

    subgraph Canary["Canary"]
        C_OLD["v1: 95% traffic"]
        C_NEW["v2: 5% traffic"]
        C_NEW -->|"Gradually increase"| C_FULL["v2: 100% traffic"]
    end

    subgraph Rolling["Rolling Update"]
        R1["Instance 1: v1 → v2"]
        R2["Instance 2: v1 → v2"]
        R3["Instance 3: v1 → v2"]
    end

    style BlueGreen fill:#6cc3d5,stroke:#333,color:#fff
    style Canary fill:#56cc9d,stroke:#333,color:#fff
    style Rolling fill:#ffce67,stroke:#333

Deployment Strategies Comparison

Strategy	How It Works	Downtime	Rollback Speed	Cost	Risk
Recreate	Kill all old → start all new	Yes	Slow (redeploy)	Low	High
Rolling update	Replace instances one-by-one	No	Medium (roll forward/back)	Low	Medium
Blue-green	Two environments; switch traffic	No	Instant (switch back)	High (2x infra)	Low
Canary	Route small % to new version	No	Instant (route back)	Medium	Low
Shadow (dark launch)	New version gets copy of traffic, results discarded	No	N/A (not serving)	Medium	Very low (provided side effects such as writes are isolated)
A/B testing	Split users by segment	No	Instant	Medium	Low
Feature flags	Toggle features in code without deploy	No	Instant (flip flag)	Low	Low

When to Use Each Strategy

Strategy	Use When
Recreate	Dev/test environments; can tolerate downtime
Rolling	Standard choice for K8s; good test coverage exists
Blue-green	Need instant rollback; can afford 2x infrastructure
Canary	High-risk changes; want to validate with real traffic
Shadow	Major rewrites; need production validation without risk
Feature flags	Decouple deployment from release; gradual feature rollout
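
One common way to implement the canary split is deterministic bucketing: hash a stable identifier into 100 buckets and send the lowest buckets to the new version. The sketch below assumes a hypothetical `route` helper keyed on a user ID; in practice the split is usually configured in a load balancer or service mesh rather than in application code.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Send a stable slice of users to v2; everyone else stays on v1."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "v2" if bucket < canary_percent else "v1"


users = [f"user-{i}" for i in range(10_000)]
on_canary = sum(route(user, 5) == "v2" for user in users)
print(f"{on_canary / len(users):.1%} of users see v2 at a 5% canary")
```

Because the bucket depends only on the user ID, a user who sees v2 at 5% keeps seeing v2 as the percentage grows, and setting the percentage back to 0 is the rollback.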

Zero-Downtime Deployment Requirements

For zero-downtime deployments, ensure:
  1. Backward-compatible API changes (old clients must still work)
  2. Database migrations are non-breaking (add column, NOT rename)
  3. Health checks configured (readiness + liveness probes)
  4. Graceful shutdown (drain connections before terminating)
  5. Load balancer removes unhealthy instances automatically
  6. Session handling is stateless (or externalized to Redis)
  7. Enough capacity to serve traffic during rollout

Q6: How Do You Implement GitOps?

Answer:

GitOps is an operational framework that uses Git as the single source of truth for both application code and infrastructure declarations. Changes are made via pull requests, and an operator (ArgoCD, Flux) continuously reconciles the cluster state to match what’s declared in Git.

graph TD
    linkStyle default stroke:#000,color:#000
    DEV["Developer"]
    DEV -->|"1. Push code"| APP_REPO["App Repo<br/>(source code)"]
    APP_REPO -->|"2. CI pipeline<br/>builds image"| REGISTRY["Container<br/>Registry"]

    DEV -->|"3. Update manifest"| CONFIG_REPO["Config Repo<br/>(K8s manifests / Helm)"]

    CONFIG_REPO -->|"4. Operator detects<br/>change"| OPERATOR["GitOps Operator<br/>(ArgoCD / Flux)"]
    OPERATOR -->|"5. Sync to cluster"| CLUSTER["Kubernetes<br/>Cluster"]

    CLUSTER -->|"6. Drift detected?"| OPERATOR
    OPERATOR -->|"Auto-remediate"| CLUSTER

    style APP_REPO fill:#6cc3d5,stroke:#333,color:#fff
    style CONFIG_REPO fill:#56cc9d,stroke:#333,color:#fff
    style OPERATOR fill:#ffce67,stroke:#333

GitOps Principles

Principle	Description
Declarative	Desired system state is described declaratively (YAML/Helm)
Versioned	All state changes tracked in Git (full audit trail)
Automated	Approved changes are automatically applied to the system
Continuously reconciled	Operator ensures actual state == desired state; auto-heals drift
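
The "continuously reconciled" principle depends on detecting drift between what Git declares and what the cluster reports. A minimal sketch follows, assuming manifests are plain nested dictionaries with a simplified shape; `find_drift` is a hypothetical helper, not an ArgoCD or Flux API. Only fields declared in Git are compared, so cluster-populated fields such as `status` are ignored.

```python
from typing import List


def find_drift(desired: dict, live: dict, path: str = "") -> List[str]:
    """Return dotted paths of declared fields whose live value differs."""
    drifted: List[str] = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict):
            drifted += find_drift(want, have if isinstance(have, dict) else {}, here)
        elif have != want:
            drifted.append(here)
    return drifted


# Declared in the config repo.
desired = {"spec": {"replicas": 3, "image": "myapp:1.4.2"}}
# Observed in the cluster after someone scaled the workload by hand.
live = {"spec": {"replicas": 5, "image": "myapp:1.4.2"}, "status": {"readyReplicas": 5}}

print(find_drift(desired, live))
```

An operator with auto-remediation enabled would respond to a non-empty result by re-applying the declared manifest.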

GitOps vs Traditional DevOps

Aspect	Traditional CI/CD	GitOps
Deployment trigger	CI pipeline pushes to cluster	Git commit triggers reconciliation
Source of truth	Pipeline scripts + cluster state	Git repository
Drift handling	Manual detection and fix	Auto-remediation by operator
Audit trail	CI logs (may be lost)	Git history (permanent)
Rollback	Re-run old pipeline or manual	`git revert` + auto-sync
Access control	CI system needs cluster credentials	Only operator has cluster access

GitOps Tools

Tool	Type	Key Feature
ArgoCD	Pull-based, K8s-native	Web UI, app-of-apps pattern, multi-cluster
Flux	Pull-based, K8s-native	Lightweight, Helm/Kustomize support, image automation
Jenkins X	CI/CD + GitOps	Full pipeline + GitOps for K8s
Weave GitOps	Enterprise GitOps	Policy enforcement, multi-tenancy

GitOps Repository Structure

config-repo/
├── base/                    # Shared manifests
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── overlays/
│   ├── dev/                 # Dev-specific patches
│   │   ├── kustomization.yaml
│   │   └── replicas-patch.yaml
│   ├── staging/             # Staging config
│   │   └── kustomization.yaml
│   └── production/          # Production config
│       ├── kustomization.yaml
│       ├── replicas-patch.yaml
│       └── hpa.yaml
└── argocd/
    └── application.yaml     # ArgoCD app definition

Q7: How Do You Implement Monitoring and Observability?

Answer:

Observability is the ability to understand a system’s internal state from its external outputs. It combines three pillars (metrics, logs, and traces) to give visibility into distributed systems. Monitoring alerts on known failure modes; observability lets you investigate failures you did not anticipate (unknown unknowns).

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph ThreePillars["Three Pillars of Observability"]
        METRICS["Metrics<br/>(time-series numbers)<br/>Prometheus, Datadog"]
        LOGS["Logs<br/>(structured events)<br/>ELK, Loki, CloudWatch"]
        TRACES["Traces<br/>(request flow across services)<br/>Jaeger, Tempo, Zipkin"]
    end

    METRICS --> DASHBOARD["Dashboards<br/>(Grafana)"]
    LOGS --> SEARCH["Search & Analyze<br/>(Kibana, Grafana)"]
    TRACES --> FLOW["Request Flow<br/>Visualization"]

    DASHBOARD --> ALERT["Alerting<br/>(PagerDuty, OpsGenie)"]
    SEARCH --> ALERT
    FLOW --> DEBUG["Root Cause<br/>Analysis"]

    style ThreePillars fill:#6cc3d5,stroke:#333,color:#fff
    style ALERT fill:#ff6b6b,stroke:#333,color:#fff

Three Pillars of Observability

Pillar	What It Captures	Format	Tools
Metrics	Numeric measurements over time (counters, gauges, histograms)	Time-series	Prometheus, Datadog, CloudWatch
Logs	Discrete events with context (structured JSON preferred)	Text/JSON	ELK Stack, Loki, Fluentd
Traces	Request path across services with timing	Spans + trace ID	Jaeger, Tempo, OpenTelemetry

Key Metrics to Monitor (USE/RED)

Method	Metrics	Apply To
USE (Utilization, Saturation, Errors)	CPU %, queue depth, error count	Infrastructure (servers, disks, network)
RED (Rate, Errors, Duration)	Requests/sec, error rate %, p99 latency	Services (APIs, microservices)
Four Golden Signals (Google SRE)	Latency, traffic, errors, saturation	Any production system
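
The RED row can be computed directly from request records. The sketch below assumes each record is a hypothetical `(http_status, duration_ms)` pair, counts 5xx responses as errors, and uses the nearest-rank method for p99; production systems normally derive these from histograms in the metrics backend.

```python
from typing import Dict, List, Tuple


def red_metrics(requests: List[Tuple[int, float]], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration (p99) for one time window."""
    if not requests:
        raise ValueError("no requests observed in the window")
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integers.
    rank = (99 * len(durations) + 99) // 100
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p99_ms": durations[rank - 1],
    }


# 200 requests in a 10-second window: mostly fast, a few slow, two failures.
sample = [(200, 50.0)] * 190 + [(200, 400.0)] * 8 + [(500, 900.0)] * 2
print(red_metrics(sample, window_seconds=10))
```

The median here is 50 ms, but p99 is 400 ms, which is why latency alerts are usually set on high percentiles rather than averages.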

Monitoring Stack

Component	Purpose	Tool Options
Metric collection	Scrape/push metrics from services	Prometheus, Telegraf, StatsD
Log aggregation	Centralize logs from all services	Fluentd/Fluent Bit → Loki/Elasticsearch
Distributed tracing	Track requests across microservices	OpenTelemetry → Jaeger/Tempo
Visualization	Dashboards and exploration	Grafana, Kibana, Datadog
Alerting	Notify on-call when thresholds breach	Alertmanager, PagerDuty, OpsGenie
SLO tracking	Monitor service level objectives	Sloth, Nobl9, custom Prometheus rules

Alerting Best Practices

Practice	Description
Alert on symptoms, not causes	Alert on “API error rate > 5%” not “CPU > 80%”
Reduce noise	Group related alerts; avoid duplicate pages
Actionable alerts	Every alert should have a clear runbook/response
Severity levels	Critical (page), Warning (ticket), Info (dashboard)
SLO-based alerts	Burn rate alerts: “burning error budget 10x faster than the rate that would exactly exhaust it over the SLO window”
Test your alerts	Periodically verify alerts fire correctly
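
Burn rate is the observed error ratio divided by the error ratio the SLO allows (1 − SLO). A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it proportionally faster. A short sketch, with function names chosen for illustration:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Time until a full budget is gone if the current burn rate continues."""
    return float("inf") if rate <= 0 else window_days / rate


rate = burn_rate(error_ratio=0.01, slo=0.999)  # 1% errors against a 99.9% SLO
print(round(rate, 2), round(days_until_exhausted(rate), 2))
```

A 1% error ratio against a 99.9% SLO is a burn rate of about 10, which would consume a 30-day budget in roughly three days and is a reasonable candidate for paging.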

Q8: How Do You Manage Secrets in DevOps?

Answer:

Secrets management ensures sensitive data (API keys, database passwords, TLS certificates, tokens) is stored securely, accessed with least privilege, rotated regularly, and never exposed in code or logs.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph Bad["Anti-Patterns ✗"]
        HARDCODE["Hardcoded in code"]
        ENV_FILE[".env committed to Git"]
        PLAIN["Plain text ConfigMap"]
    end

    subgraph Good["Best Practices ✓"]
        VAULT["Secrets Manager<br/>(Vault, AWS SM)"]
        INJECT["Runtime Injection<br/>(env vars, volumes)"]
        ROTATE["Auto-Rotation<br/>(scheduled renewal)"]
        AUDIT["Audit Logging<br/>(who accessed what)"]
    end

    Bad -->|"Migrate to"| Good

    style Bad fill:#ff6b6b,stroke:#333,color:#fff
    style Good fill:#56cc9d,stroke:#333,color:#fff

Secrets Management Tools

Tool	Type	Key Feature	Best For
HashiCorp Vault	Self-hosted/SaaS	Dynamic secrets, PKI, transit encryption	Enterprise, multi-cloud
AWS Secrets Manager	Managed (AWS)	Auto-rotation for RDS, Lambda integration	AWS-native
AWS SSM Parameter Store	Managed (AWS)	Free tier, hierarchical keys	Simple AWS use cases
Azure Key Vault	Managed (Azure)	HSM-backed, RBAC integration	Azure-native
GCP Secret Manager	Managed (GCP)	IAM integration, versioning	GCP-native
Sealed Secrets	K8s-native	Encrypt secrets in Git, decrypt in cluster	GitOps workflows
External Secrets Operator	K8s-native	Sync secrets from external vault into K8s	Multi-provider
SOPS	CLI tool	Encrypt YAML/JSON files with cloud KMS	Config files in Git

Secrets in Kubernetes

Approach	Security Level	Complexity
K8s Secret (base64)	Low (not encrypted at rest by default)	Simple
K8s Secret + etcd encryption	Medium	Moderate
Sealed Secrets	Medium-High (encrypted in Git)	Moderate
External Secrets Operator	High (pulls from Vault/SM at runtime)	Higher
CSI Secrets Store Driver	High (mounts secrets as volumes)	Higher

Secrets Management Principles

1. Never store secrets in source code or container images
2. Use separate secrets per environment (dev/staging/prod)
3. Apply least-privilege access (RBAC, IAM policies)
4. Rotate secrets automatically on a schedule
5. Audit all secret access (who, when, from where)
6. Encrypt secrets at rest AND in transit
7. Use short-lived credentials where possible (dynamic secrets)
8. Detect secrets in code with pre-commit hooks (gitleaks, trufflehog)
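
Principle 8 relies on pattern matching over source text. The sketch below shows the idea with three illustrative regular expressions; it is far less thorough than gitleaks or trufflehog, and the pattern set is an assumption for the example. The key in the sample is the placeholder value used in AWS documentation, not a real credential.

```python
import re
from typing import List, Tuple

PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]{4,}['\"]"),
}


def scan(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings: List[Tuple[int, str]] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


sample = "\n".join([
    'region = "us-east-1"',
    'aws_key = "AKIAIOSFODNN7EXAMPLE"',
    'password = "hunter2-not-real"',
    'db_password = os.environ["DB_PASSWORD"]',  # runtime injection: not flagged
])

print(scan(sample))
```

Wired into a pre-commit hook, a non-empty result would block the commit before the secret reaches Git history.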

Q9: How Do You Handle Incident Response and Post-Mortems?

Answer:

Incident response is the structured process of detecting, diagnosing, resolving, and learning from production failures. DevOps teams need clear processes, defined roles, and a blame-free culture to handle incidents effectively and prevent recurrence.

graph LR
    linkStyle default stroke:#000,color:#000
    DETECT["1. Detect<br/>(alert fires)"]
    DETECT --> TRIAGE["2. Triage<br/>(severity, impact)"]
    TRIAGE --> RESPOND["3. Respond<br/>(diagnose, mitigate)"]
    RESPOND --> RESOLVE["4. Resolve<br/>(fix or rollback)"]
    RESOLVE --> REVIEW["5. Post-Mortem<br/>(learn, prevent)"]
    REVIEW --> IMPROVE["6. Improve<br/>(action items)"]

    style DETECT fill:#ff6b6b,stroke:#333,color:#fff
    style RESPOND fill:#ffce67,stroke:#333
    style REVIEW fill:#56cc9d,stroke:#333,color:#fff

Incident Severity Levels

Severity	Impact	Response Time	Example
SEV-1 (Critical)	Complete outage, all users affected	Immediate page, war room	Production database down
SEV-2 (Major)	Partial outage, significant degradation	Page within 15 min	Payment processing failing
SEV-3 (Minor)	Limited impact, workaround available	Next business hours	Slow dashboard loading
SEV-4 (Low)	Cosmetic or future risk	Backlog ticket	Deprecated library warning

Incident Response Process

Phase	Actions	Tools
Detection	Monitoring alerts, user reports, synthetic checks	PagerDuty, OpsGenie, Grafana Alerting
Triage	Assess severity, assign incident commander	Incident management platform
Communication	Status page update, stakeholder notification	Statuspage, Slack channel
Diagnosis	Check dashboards, logs, traces; identify root cause	Grafana, Kibana, Jaeger
Mitigation	Rollback, feature flag off, scale up, failover	ArgoCD, kubectl, feature flags
Resolution	Deploy fix, verify recovery, close incident	CI/CD pipeline
Post-mortem	Blameless review, timeline, action items	Confluence, Google Docs

Post-Mortem Template

## Incident Post-Mortem: [Title]

**Date:** 2026-05-21
**Duration:** 47 minutes (10:15 - 11:02 UTC)
**Severity:** SEV-2
**Impact:** 30% of users experienced 500 errors on checkout

### Timeline
- 10:15 — Alert: checkout error rate > 10%
- 10:18 — On-call engineer acknowledged
- 10:25 — Root cause identified: bad config deployment
- 10:32 — Rollback initiated
- 11:02 — Error rate returned to baseline

### Root Cause
A config change removed the database connection pool setting,
causing connection exhaustion under load.

### What Went Well
- Alert fired within 2 minutes of impact
- Rollback was initiated within 20 minutes of the alert

### What Could Be Improved
- Config changes lacked validation tests
- No canary stage for config deployments

### Action Items
1. [ ] Add schema validation for config files (Owner: Alice, Due: May 28)
2. [ ] Canary deploy for config changes (Owner: Bob, Due: June 4)
3. [ ] Add integration test for DB pool settings (Owner: Carol, Due: May 25)

SRE Concepts

Concept	Definition
SLI (Service Level Indicator)	Metric measuring service quality (e.g., % requests < 200ms)
SLO (Service Level Objective)	Target value for SLI (e.g., 99.9% requests < 200ms)
SLA (Service Level Agreement)	Contract with consequences for missing SLO
Error Budget	1 - SLO = allowed downtime (e.g., 99.9% SLO → 43 min/month)
MTTR	Mean Time To Recovery
MTTF	Mean Time To Failure
MTBF	Mean Time Between Failures (MTTF + MTTR)
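
The error budget and MTBF rows reduce to simple arithmetic, sketched below. The function names are illustrative; the availability formula MTTF / (MTTF + MTTR) follows from the MTBF definition in the table.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def mtbf_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Mean Time Between Failures = time to failure + time to recover."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up across one failure cycle."""
    return mttf_hours / mtbf_hours(mttf_hours, mttr_hours)


print(round(error_budget_minutes(0.999), 1))   # 99.9% over 30 days
print(round(error_budget_minutes(0.9999), 2))  # 99.99% over 30 days
print(availability(mttf_hours=999, mttr_hours=1))
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime; adding one more nine shrinks that to about 4.3 minutes, which is why each extra nine is expensive.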

Q10: How Do You Secure a DevOps Pipeline (DevSecOps)?

Answer:

DevSecOps integrates security practices into every stage of the DevOps lifecycle — “shifting security left” so vulnerabilities are caught early rather than discovered in production. Security is automated, continuous, and everyone’s responsibility.

graph LR
    linkStyle default stroke:#000,color:#000
    subgraph ShiftLeft["Shift Left Security"]
        PLAN["Plan<br/>(threat modeling)"]
        CODE["Code<br/>(SAST, secrets scan)"]
        BUILD["Build<br/>(dependency scan,<br/>image scan)"]
        TEST["Test<br/>(DAST, pen test)"]
        DEPLOY["Deploy<br/>(policy gates,<br/>signed images)"]
        OPERATE["Operate<br/>(runtime security,<br/>monitoring)"]
    end

    PLAN --> CODE --> BUILD --> TEST --> DEPLOY --> OPERATE

    style ShiftLeft fill:#56cc9d,stroke:#333,color:#fff

Security at Each Pipeline Stage

Stage	Security Practice	Tools
Code	Static analysis (SAST), secrets detection	SonarQube, Semgrep, gitleaks, trufflehog
Dependencies	Vulnerability scanning (SCA)	Snyk, Dependabot, Trivy, OWASP Dependency-Check
Container images	Image scanning, base image verification	Trivy, Grype, Docker Scout
Infrastructure	IaC security scanning	Checkov, tfsec, KICS
Deployment	Image signing, admission control	Cosign/Sigstore, OPA Gatekeeper, Kyverno
Runtime	Runtime threat detection, network policies	Falco, Sysdig, Calico
Access	RBAC, least privilege, MFA	Cloud IAM, K8s RBAC, Vault

CI/CD Security Pipeline

# Example: GitHub Actions security pipeline
name: Security Checks
on: [push, pull_request]

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      # 0. Check out the repository (full history so secret scanning sees every commit)
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # 1. Secret scanning
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      # 2. SAST - Static Application Security Testing
      - uses: returntocorp/semgrep-action@v1
        with:
          config: p/owasp-top-ten

      # 3. Dependency vulnerability scan (non-zero exit code fails the job)
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          scan-ref: .
          severity: HIGH,CRITICAL
          exit-code: "1"

      # 4. Container image scan (also fails the job on HIGH/CRITICAL findings)
      - run: docker build -t myapp:${{ github.sha }} .
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: "1"

      # 5. IaC security scan
      - uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/

DevSecOps Principles

Principle	Implementation
Shift left	Find vulnerabilities in dev, not production
Automate everything	Security checks run automatically in CI/CD
Least privilege	Minimal permissions for services, users, CI runners
Defense in depth	Multiple security layers (network, application, data)
Immutable infrastructure	Don’t patch servers; replace with new secure images
Zero trust	Verify every request; no implicit trust by network location
Supply chain security	Sign artifacts, verify provenance, pin dependencies
Continuous compliance	Policy-as-code enforced automatically
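
Policy-as-code gates usually come down to a small decision function over scanner output. The sketch below assumes a hypothetical findings format (a list of dictionaries with `id`, `severity`, and `package` keys) and placeholder identifiers; real scanners each have their own report schema, so the parsing step would differ.

```python
from typing import Iterable, List, Mapping, Set

BLOCKING_SEVERITIES: Set[str] = {"HIGH", "CRITICAL"}


def gate(findings: Iterable[Mapping[str, str]], blocking: Set[str] = BLOCKING_SEVERITIES) -> int:
    """Return a process exit code: 1 if any blocking finding exists, else 0."""
    blocked: List[Mapping[str, str]] = [
        finding for finding in findings if finding["severity"].upper() in blocking
    ]
    for finding in blocked:
        print(f"BLOCKED: {finding['id']} ({finding['severity']}) in {finding['package']}")
    return 1 if blocked else 0


# Placeholder findings; the IDs and package names are not real advisories.
findings = [
    {"id": "EXAMPLE-0001", "severity": "LOW", "package": "libfoo"},
    {"id": "EXAMPLE-0002", "severity": "critical", "package": "libbar"},
]

exit_code = gate(findings)
print("exit code:", exit_code)
```

In a pipeline step the script would end with `sys.exit(exit_code)`, so a blocking finding fails the job while lower severities only appear in the report.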

Summary Table

#	Topic	Key Concepts
1	CI/CD Pipelines	Build → test → deploy automation, fast feedback, immutable artifacts
2	Docker Containers	Lightweight packaging, multi-stage builds, image security
3	Kubernetes	Orchestration, self-healing, declarative state, HPA, probes
4	Infrastructure as Code	Terraform, declarative infra, state management, modules
5	Deployment Strategies	Blue-green, canary, rolling, shadow, feature flags
6	GitOps	Git as source of truth, ArgoCD/Flux, auto-reconciliation
7	Monitoring & Observability	Metrics/logs/traces, USE/RED, SLO-based alerting
8	Secrets Management	Vault, rotation, least privilege, External Secrets Operator
9	Incident Response	Severity levels, post-mortems, SRE concepts, error budgets
10	DevSecOps	Shift left, SAST/DAST/SCA, image scanning, policy-as-code

What’s Next?

This article covered core DevOps concepts and practices. For related content:

MLOps (ML + DevOps): MLOps Interview QA - 1
LLMOps (LLM-specific ops): LLMOps Interview QA - 1
System design foundations: System Design Interview QA - 1
Infrastructure deep dives: System Design Interview QA - 2
Design patterns: Design Pattern Interview QA - 1
