DevOps Interview QA - 1

10 most-asked DevOps interview questions covering CI/CD pipelines, Docker, Kubernetes, Infrastructure as Code, GitOps, monitoring, deployment strategies, secrets management, incident response, and observability.
Published: 21 May 2026

Keywords: DevOps interview, CI/CD pipeline, Docker, Kubernetes, Infrastructure as Code, Terraform, GitOps, monitoring, deployment strategies, observability, Helm, Prometheus, Grafana, secrets management

Introduction

This is Part 1 of our DevOps Interview QA series, covering the 10 most frequently asked DevOps interview questions. DevOps bridges software development and IT operations to deliver software faster, more reliably, and with tighter feedback loops — emphasizing automation, collaboration, and continuous improvement.

For MLOps (ML-specific DevOps), see MLOps Interview QA - 1. For LLMOps, see LLMOps Interview QA - 1. For system design, see System Design Interview QA - 1.


Q1: What Is CI/CD and How Do You Design a Pipeline?

Answer:

CI/CD (Continuous Integration / Continuous Delivery or Deployment) is the backbone of DevOps automation. CI merges code frequently into a shared repository, with automated builds and tests on every change. CD automates the path from a validated build to staging and production: Continuous Delivery keeps every build release-ready behind a manual approval, while Continuous Deployment ships it to production automatically. Together they reduce manual errors, accelerate releases, and provide rapid feedback.

graph LR
    subgraph CI["Continuous Integration"]
        COMMIT["Code Commit<br/>(Git push)"]
        COMMIT --> LINT["Lint &<br/>Static Analysis"]
        LINT --> BUILD["Build<br/>(compile, package)"]
        BUILD --> UNIT["Unit Tests"]
        UNIT --> INTEG["Integration Tests"]
    end

    subgraph CD["Continuous Delivery / Deployment"]
        INTEG --> ARTIFACT["Push Artifact<br/>(container image)"]
        ARTIFACT --> STAGING["Deploy to Staging"]
        STAGING --> E2E["E2E / Smoke Tests"]
        E2E --> GATE["Approval Gate<br/>(manual or auto)"]
        GATE --> PROD["Deploy to Production"]
        PROD --> MONITOR["Monitor &<br/>Rollback if needed"]
    end

    style CI fill:#6cc3d5,stroke:#333,color:#fff
    style CD fill:#56cc9d,stroke:#333,color:#fff

Continuous Delivery vs Continuous Deployment

| Aspect | Continuous Delivery | Continuous Deployment |
| --- | --- | --- |
| Definition | Code is always release-ready; deployment requires manual approval | Every change passing tests is deployed to production automatically |
| Human gate | Yes (manual approval before prod) | No (fully automated) |
| Risk | Lower (human review) | Requires robust automated testing |
| Speed | Fast, but gated | Fastest possible |
| Best for | Regulated industries, critical systems | High-velocity teams with strong test coverage |

CI/CD Pipeline Best Practices

| Practice | Description |
| --- | --- |
| Fast feedback | Unit tests run first (<5 min); slow tests run later |
| Fail fast | Pipeline stops on first failure, team notified immediately |
| Immutable artifacts | Build once, deploy same artifact to all environments |
| Environment parity | Dev/staging/prod are as similar as possible |
| Secrets isolation | Use vault/secrets manager, never hardcode credentials |
| Caching | Cache dependencies, Docker layers, test results |
| Parallelization | Run independent test suites concurrently |
| Idempotent deployments | Re-running deployment produces same result |
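
The "fast feedback" and "fail fast" practices can be illustrated with a minimal sketch. The `Stage` class, `run_pipeline` function, and stage names are hypothetical and only model the ordering logic; a real CI system adds caching, parallelism, and notifications.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True on success


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    executed: List[str] = []
    for stage in stages:
        executed.append(stage.name)
        if not stage.run():
            return False, executed  # later stages never start
    return True, executed


# Cheap checks come first so a broken commit is rejected quickly.
stages = [
    Stage("lint", lambda: True),
    Stage("unit-tests", lambda: False),  # simulated failure
    Stage("integration-tests", lambda: True),
    Stage("deploy-staging", lambda: True),
]

ok, executed = run_pipeline(stages)
print(ok, executed)
```

Because the simulated unit-test stage fails, the integration tests and the staging deployment are never started, which is the behavior the table describes.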

CI/CD Tools Comparison

| Tool | Type | Key Feature | Best For |
| --- | --- | --- | --- |
| GitHub Actions | SaaS, YAML workflows | Deep GitHub integration, marketplace | GitHub-centric teams |
| GitLab CI/CD | Integrated, YAML | Built into GitLab, Auto DevOps | GitLab users, all-in-one |
| Jenkins | Self-hosted, plugins | Maximum flexibility, huge ecosystem | Complex enterprise pipelines |
| CircleCI | SaaS | Fast, parallelism, Docker-native | Speed-focused teams |
| ArgoCD | GitOps, K8s-native | Declarative, auto-sync from Git | Kubernetes deployments |
| Tekton | K8s-native, CRDs | Cloud-native, reusable tasks | K8s-native CI/CD |

Q2: How Do Docker Containers Work and Why Are They Used in DevOps?

Answer:

Docker containers package an application with all its dependencies (code, runtime, libraries, config) into a lightweight, portable unit that runs consistently across any environment. Unlike VMs, containers share the host OS kernel, making them fast to start and resource-efficient.

graph TD
    subgraph VM["Virtual Machines"]
        HW1["Hardware"]
        HW1 --> HYP["Hypervisor"]
        HYP --> OS1["Guest OS 1<br/>(full OS)"]
        HYP --> OS2["Guest OS 2<br/>(full OS)"]
        OS1 --> APP1["App A + Libs"]
        OS2 --> APP2["App B + Libs"]
    end

    subgraph Container["Docker Containers"]
        HW2["Hardware"]
        HW2 --> HOST["Host OS + Docker Engine"]
        HOST --> C1["Container 1<br/>(App A + Libs)"]
        HOST --> C2["Container 2<br/>(App B + Libs)"]
        HOST --> C3["Container 3<br/>(App C + Libs)"]
    end

    style VM fill:#6cc3d5,stroke:#333,color:#fff
    style Container fill:#56cc9d,stroke:#333,color:#fff

Docker vs Virtual Machines

| Feature | Docker Containers | Virtual Machines |
| --- | --- | --- |
| Startup time | Seconds | Minutes |
| Size | MBs (application layer only) | GBs (full OS) |
| Resource usage | Lightweight (shared kernel) | Heavy (dedicated OS per VM) |
| Isolation | Process-level (namespaces, cgroups) | Hardware-level (hypervisor) |
| Portability | Run anywhere Docker is installed | Tied to hypervisor |
| Density | 100s of containers per host | 10s of VMs per host |
| Use case | Microservices, CI/CD, dev environments | Legacy apps, strong isolation, different OS |

Docker Architecture

| Component | Purpose |
| --- | --- |
| Dockerfile | Recipe to build an image (FROM, RUN, COPY, CMD) |
| Image | Immutable template; layers of filesystem changes |
| Container | Running instance of an image |
| Registry | Store and distribute images (Docker Hub, ECR, GCR) |
| Docker Compose | Define multi-container applications in YAML |
| Docker Engine | Daemon that builds, runs, manages containers |

Dockerfile Best Practices

# Multi-stage build: smaller final image
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY . .

# Run as non-root user (security)
RUN useradd -r appuser
USER appuser

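EXPOSE 8000
# Only site-packages is copied from the builder, not the console scripts in /usr/local/bin,
# so start uvicorn through the interpreter rather than the `uvicorn` executable
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]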

Key Practices

1. Use multi-stage builds → smaller images
2. Pin base image versions → reproducibility
3. Run as non-root → security
4. Use .dockerignore → exclude unnecessary files
5. Order layers by change frequency → better caching
6. One process per container → composability
7. Health checks → orchestrator can detect unhealthy containers
8. No secrets in images → use runtime env vars or secrets mounts

Q3: How Does Kubernetes Orchestrate Containers at Scale?

Answer:

Kubernetes (K8s) is an open-source container orchestration platform that automates deployment, scaling, self-healing, and management of containerized applications. It abstracts infrastructure into a declarative API — you describe the desired state, and Kubernetes makes it happen.

graph TD
    subgraph ControlPlane["Control Plane"]
        API["API Server<br/>(kubectl, REST)"]
        ETCD["etcd<br/>(cluster state store)"]
        SCHED["Scheduler<br/>(assigns pods to nodes)"]
        CM["Controller Manager<br/>(reconciliation loops)"]
    end

    subgraph WorkerNode["Worker Node"]
        KUBELET["Kubelet<br/>(node agent)"]
        PROXY["Kube-Proxy<br/>(networking)"]
        RUNTIME["Container Runtime<br/>(containerd)"]
        POD1["Pod<br/>(container(s))"]
        POD2["Pod<br/>(container(s))"]
    end

    API --> ETCD
    API --> SCHED
    API --> CM
    API --> KUBELET
    KUBELET --> RUNTIME
    RUNTIME --> POD1
    RUNTIME --> POD2

    style ControlPlane fill:#6cc3d5,stroke:#333,color:#fff
    style WorkerNode fill:#56cc9d,stroke:#333,color:#fff

Core Kubernetes Objects

| Object | Purpose | Example |
| --- | --- | --- |
| Pod | Smallest deployable unit (1+ containers) | Single app instance |
| Deployment | Manages ReplicaSets, rolling updates, rollbacks | Stateless web app |
| Service | Stable network endpoint for pods (ClusterIP, NodePort, LoadBalancer) | Internal or external access |
| ConfigMap | Non-sensitive configuration data | App settings, feature flags |
| Secret | Sensitive data (base64-encoded, which is not encryption) | DB passwords, API keys |
| Ingress | HTTP/S routing rules, TLS termination | Domain-based routing |
| StatefulSet | Ordered, persistent pods with stable IDs | Databases, message queues |
| DaemonSet | One pod per node | Log collectors, monitoring agents |
| Job / CronJob | Run-to-completion tasks | Batch processing, scheduled tasks |
| HPA | Horizontal Pod Autoscaler | Scale pods by CPU/memory/custom metrics |
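
To make the HPA row concrete, here is a small sketch of the scaling rule documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function name and the min/max defaults are assumptions for illustration; the real controller also handles missing metrics, unready pods, and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling rule, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 60% target scale out.
scale_out = desired_replicas(4, current_metric=90, target_metric=60)
# 4 pods averaging 30% CPU against a 60% target scale in.
scale_in = desired_replicas(4, current_metric=30, target_metric=60)
# A large spike is capped at max_replicas.
capped = desired_replicas(3, current_metric=200, target_metric=50)
print(scale_out, scale_in, capped)
```

In an interview it is worth pointing out that the ratio-based rule means the HPA reacts proportionally to how far the metric is from its target, rather than adding one pod at a time.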

Kubernetes Self-Healing

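| Mechanism | What It Does |
| --- | --- |
| Liveness probe | Restarts container if health check fails |
| Readiness probe | Removes pod from service if not ready |
| ReplicaSet | Ensures desired number of pods always running |
| Node failure | Pods on the failed node are evicted; their controller creates replacements, which the scheduler places on healthy nodes |
| PodDisruptionBudget | Ensures minimum available pods during voluntary disruptions (node drains, cluster upgrades) |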
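
Probes do not act on a single failed check. The sketch below models the threshold logic using the Kubernetes defaults `failureThreshold: 3` and `successThreshold: 1`; the `ProbeTracker` class is a hypothetical illustration, not kubelet code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeTracker:
    failure_threshold: int = 3  # consecutive failures before "unhealthy"
    success_threshold: int = 1  # consecutive successes before "healthy"
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, success: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


tracker = ProbeTracker()
results = [True, False, False, True, False, False, False, True]
states: List[bool] = [tracker.record(r) for r in results]
print(states)
```

Two isolated failures are tolerated, while three in a row flip the state. For a liveness probe that transition triggers a container restart; for a readiness probe it removes the pod from the Service endpoints until a check succeeds again.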

Kubernetes Networking

| Concept | Description |
| --- | --- |
| Pod-to-Pod | All pods can communicate without NAT (flat network) |
| Service | Virtual IP + DNS name load-balanced across pods |
| Ingress | L7 routing (path/host-based) from external traffic to services |
| NetworkPolicy | Firewall rules between pods (namespace/label selectors) |
| Service Mesh | Sidecar proxies for mTLS, retries, observability (Istio, Linkerd) |

Q4: What Is Infrastructure as Code (IaC) and How Do You Use Terraform?

Answer:

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than manual processes. It makes infrastructure reproducible, version-controlled, auditable, and testable — treating infrastructure the same way you treat application code.

graph TD
    subgraph Traditional["Manual Infrastructure"]
        MANUAL["Click in Console<br/>(AWS/GCP/Azure)"]
        MANUAL --> DRIFT["Configuration Drift<br/>(snowflake servers)"]
        DRIFT --> UNDOC["Undocumented<br/>Changes"]
    end

    subgraph IaC["Infrastructure as Code"]
        CODE["Define in Code<br/>(Terraform/CloudFormation)"]
        CODE --> GIT["Version Control<br/>(Git)"]
        GIT --> REVIEW["Code Review<br/>(PR/MR)"]
        REVIEW --> PLAN["Plan<br/>(preview changes)"]
        PLAN --> APPLY["Apply<br/>(provision infra)"]
        APPLY --> STATE["State File<br/>(tracks what exists)"]
    end

    style Traditional fill:#ff6b6b,stroke:#333,color:#fff
    style IaC fill:#56cc9d,stroke:#333,color:#fff

IaC Tools Comparison

| Tool | Approach | Language | State | Best For |
| --- | --- | --- | --- | --- |
| Terraform | Declarative, multi-cloud | HCL | State file (local or remote backend) | Multi-cloud, cloud-agnostic |
| AWS CloudFormation | Declarative, AWS-only | JSON/YAML | Managed by AWS | AWS-only shops |
| Pulumi | Declarative engine driven by general-purpose code, multi-cloud | Python/TypeScript/Go | Managed or self-hosted | Devs who prefer real code |
| Ansible | Procedural, config management | YAML (playbooks) | Stateless | Server config, provisioning |
| OpenTofu | Declarative, open-source Terraform fork | HCL | State file (local or remote backend) | Terraform without licensing concerns |

Terraform Workflow

| Step | Command | What Happens |
| --- | --- | --- |
| Init | terraform init | Downloads providers, initializes backend |
| Plan | terraform plan | Shows what will be created/changed/destroyed |
| Apply | terraform apply | Executes the plan, provisions infrastructure |
| Destroy | terraform destroy | Tears down all managed resources |
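
Conceptually, the plan step compares the desired configuration with the recorded state. The following is a simplified sketch of that comparison, not Terraform's actual implementation: the `plan` function and the `aws_instance.legacy` resource are hypothetical, and real Terraform also refreshes state, orders changes by dependency, and distinguishes in-place updates from replacements.

```python
from typing import Dict, List

Resources = Dict[str, dict]  # resource address -> attributes


def plan(desired: Resources, state: Resources) -> Dict[str, List[str]]:
    """Classify each resource address as create, update, or destroy."""
    create = sorted(addr for addr in desired if addr not in state)
    destroy = sorted(addr for addr in state if addr not in desired)
    update = sorted(
        addr for addr in desired
        if addr in state and desired[addr] != state[addr]
    )
    return {"create": create, "update": update, "destroy": destroy}


desired = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_eks_cluster.app": {"version": "1.29"},
}
state = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_eks_cluster.app": {"version": "1.28"},
    "aws_instance.legacy": {"instance_type": "t3.micro"},
}

result = plan(desired, state)
print(result)
```

The unchanged VPC produces no action, the cluster version bump shows up as an update, and the resource that no longer appears in code is scheduled for destruction. This is why reviewing the plan output in a pull request catches accidental deletions.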

Terraform Example

# Define provider
provider "aws" {
  region = "us-east-1"
}

# Create VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = { Name = "production-vpc" }
}

# Create EKS cluster
resource "aws_eks_cluster" "app" {
  name     = "production-cluster"
  role_arn = aws_iam_role.eks.arn
  version  = "1.29"

  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

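# Managed node group for worker nodes
# (aws_iam_role.node is assumed to be defined elsewhere, like aws_iam_role.eks and aws_subnet.private)
resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.app.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.node.arn       # required argument
  subnet_ids      = aws_subnet.private[*].id    # required argument
  instance_types  = ["m5.large"]

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 1
  }
}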

IaC Best Practices

| Practice | Description |
| --- | --- |
| Modularize | Reusable modules for common patterns (VPC, EKS, RDS) |
| Remote state | Store state in S3/GCS with locking (DynamoDB/GCS) |
| State isolation | Separate state per environment (dev/staging/prod) |
| Plan in CI | Auto-run terraform plan on PRs for review |
| Drift detection | Periodically compare actual vs desired state |
| Secrets out of code | Use variables, vault references, or encrypted values |
| Tagging | Tag all resources (team, env, cost-center) |
| Blast radius | Small, focused modules limit impact of mistakes |

Q5: What Are Deployment Strategies and When Do You Use Each?

Answer:

A deployment strategy defines how new application versions are rolled out to production. The right strategy depends on risk tolerance, rollback requirements, infrastructure complexity, and team capabilities.

graph TD
    subgraph BlueGreen["Blue-Green"]
        BG_OLD["Blue (v1)<br/>100% traffic"]
        BG_NEW["Green (v2)<br/>0% traffic"]
        BG_OLD -->|"Switch LB"| BG_SWITCH["Green (v2)<br/>100% traffic"]
    end

    subgraph Canary["Canary"]
        C_OLD["v1: 95% traffic"]
        C_NEW["v2: 5% traffic"]
        C_NEW -->|"Gradually increase"| C_FULL["v2: 100% traffic"]
    end

    subgraph Rolling["Rolling Update"]
        R1["Instance 1: v1 → v2"]
        R2["Instance 2: v1 → v2"]
        R3["Instance 3: v1 → v2"]
    end

    style BlueGreen fill:#6cc3d5,stroke:#333,color:#fff
    style Canary fill:#56cc9d,stroke:#333,color:#fff
    style Rolling fill:#ffce67,stroke:#333

Deployment Strategies Comparison

| Strategy | How It Works | Downtime | Rollback Speed | Cost | Risk |
| --- | --- | --- | --- | --- | --- |
| Recreate | Kill all old → start all new | Yes | Slow (redeploy) | Low | High |
| Rolling update | Replace instances one-by-one | No | Medium (roll forward/back) | Low | Medium |
| Blue-green | Two environments; switch traffic | No | Instant (switch back) | High (2x infra) | Low |
| Canary | Route small % to new version | No | Instant (route back) | Medium | Low |
| Shadow (dark launch) | New version gets copy of traffic, results discarded | No | N/A (not serving) | Medium | Minimal (no user-facing responses) |
| A/B testing | Split users by segment | No | Instant | Medium | Low |
| Feature flags | Toggle features in code without deploy | No | Instant (flip flag) | Low | Low |

When to Use Each Strategy

| Strategy | Use When |
| --- | --- |
| Recreate | Dev/test environments; can tolerate downtime |
| Rolling | Standard choice for K8s; good test coverage exists |
| Blue-green | Need instant rollback; can afford 2x infrastructure |
| Canary | High-risk changes; want to validate with real traffic |
| Shadow | Major rewrites; need production validation without risk |
| Feature flags | Decouple deployment from release; gradual feature rollout |
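
Canary releases and percentage-based feature flags both need a stable way to decide which users see the new version. One common approach, sketched here under the assumption that requests carry a user ID, is to hash the ID into a bucket from 0 to 99. The `bucket` and `route` helpers are hypothetical names; in practice a load balancer, service mesh, or flag service usually makes this decision.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below the canary percentage to v2."""
    return "v2" if bucket(user_id) < canary_percent else "v1"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 5) == "v2" for u in users) / len(users)
print(f"canary share at 5%: {share:.3f}")
```

Because the bucket depends only on the user ID, a user keeps seeing the same version across requests, and raising the percentage only adds users to the canary group without moving anyone back to v1.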

Zero-Downtime Deployment Requirements

For zero-downtime deployments, ensure:
  1. Backward-compatible API changes (old clients must still work)
  2. Database migrations are non-breaking (add column, NOT rename)
  3. Health checks configured (readiness + liveness probes)
  4. Graceful shutdown (drain connections before terminating)
  5. Load balancer removes unhealthy instances automatically
  6. Session handling is stateless (or externalized to Redis)
  7. Enough capacity to serve traffic during rollout

Q6: How Do You Implement GitOps?

Answer:

GitOps is an operational framework that uses Git as the single source of truth for both application code and infrastructure declarations. Changes are made via pull requests, and an operator (ArgoCD, Flux) continuously reconciles the cluster state to match what’s declared in Git.

graph TD
    DEV["Developer"]
    DEV -->|"1. Push code"| APP_REPO["App Repo<br/>(source code)"]
    APP_REPO -->|"2. CI pipeline<br/>builds image"| REGISTRY["Container<br/>Registry"]

    DEV -->|"3. Update manifest"| CONFIG_REPO["Config Repo<br/>(K8s manifests / Helm)"]

    CONFIG_REPO -->|"4. Operator detects<br/>change"| OPERATOR["GitOps Operator<br/>(ArgoCD / Flux)"]
    OPERATOR -->|"5. Sync to cluster"| CLUSTER["Kubernetes<br/>Cluster"]

    CLUSTER -->|"6. Drift detected?"| OPERATOR
    OPERATOR -->|"Auto-remediate"| CLUSTER

    style APP_REPO fill:#6cc3d5,stroke:#333,color:#fff
    style CONFIG_REPO fill:#56cc9d,stroke:#333,color:#fff
    style OPERATOR fill:#ffce67,stroke:#333

GitOps Principles

| Principle | Description |
| --- | --- |
| Declarative | Desired system state is described declaratively (YAML/Helm) |
| Versioned | All state changes tracked in Git (full audit trail) |
| Automated | Approved changes are automatically applied to the system |
| Continuously reconciled | Operator ensures actual state == desired state; auto-heals drift |
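
The "continuously reconciled" principle can be sketched as a loop body that finds declared fields whose live value differs and overwrites them. This is a toy model: `find_drift`, `apply_desired`, and `reconcile` are hypothetical helpers, and operators such as ArgoCD or Flux work against the Kubernetes API with far more nuance (pruning, sync waves, ignore rules).

```python
import copy
from typing import Any, Dict, List


def find_drift(desired: Dict[str, Any], live: Dict[str, Any], path: str = "") -> List[str]:
    """Return dotted paths of declared fields whose live value differs."""
    drift: List[str] = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.extend(find_drift(want, have, here))
        elif want != have:
            drift.append(here)
    return drift


def apply_desired(desired: Dict[str, Any], live: Dict[str, Any]) -> None:
    """Overwrite declared fields in place; undeclared fields are left alone."""
    for key, want in desired.items():
        if isinstance(want, dict) and isinstance(live.get(key), dict):
            apply_desired(want, live[key])
        else:
            live[key] = copy.deepcopy(want)


def reconcile(desired: Dict[str, Any], live: Dict[str, Any]) -> List[str]:
    """One reconciliation pass: detect drift, then remediate it."""
    drift = find_drift(desired, live)
    if drift:
        apply_desired(desired, live)
    return drift


desired = {"spec": {"replicas": 3, "image": "myapp:1.4.2"}}  # from Git
live = {
    "spec": {"replicas": 5, "image": "myapp:1.4.2"},  # someone scaled by hand
    "status": {"readyReplicas": 5},
}

first_pass = reconcile(desired, live)
second_pass = reconcile(desired, live)
print(first_pass, second_pass)
```

The first pass reports and reverts the manual scaling; the second pass finds nothing to do, which is the idempotent steady state an operator aims for.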

GitOps vs Traditional DevOps

| Aspect | Traditional CI/CD | GitOps |
| --- | --- | --- |
| Deployment trigger | CI pipeline pushes to cluster | Git commit triggers reconciliation |
| Source of truth | Pipeline scripts + cluster state | Git repository |
| Drift handling | Manual detection and fix | Auto-remediation by operator |
| Audit trail | CI logs (may be lost) | Git history (permanent) |
| Rollback | Re-run old pipeline or manual | git revert + auto-sync |
| Access control | CI system needs cluster credentials | Only operator has cluster access |

GitOps Tools

| Tool | Type | Key Feature |
| --- | --- | --- |
| ArgoCD | Pull-based, K8s-native | Web UI, app-of-apps pattern, multi-cluster |
| Flux | Pull-based, K8s-native | Lightweight, Helm/Kustomize support, image automation |
| Jenkins X | CI/CD + GitOps | Full pipeline + GitOps for K8s |
| Weave GitOps | Enterprise GitOps | Policy enforcement, multi-tenancy |

GitOps Repository Structure

config-repo/
├── base/                    # Shared manifests
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── overlays/
│   ├── dev/                 # Dev-specific patches
│   │   ├── kustomization.yaml
│   │   └── replicas-patch.yaml
│   ├── staging/             # Staging config
│   │   └── kustomization.yaml
│   └── production/          # Production config
│       ├── kustomization.yaml
│       ├── replicas-patch.yaml
│       └── hpa.yaml
└── argocd/
    └── application.yaml     # ArgoCD app definition

Q7: How Do You Implement Monitoring and Observability?

Answer:

Observability is the ability to understand a system’s internal state from its external outputs. It combines three pillars — metrics, logs, and traces — to provide complete visibility into distributed systems. Monitoring is proactive alerting on known failure modes; observability enables investigation of unknown unknowns.

graph TD
    subgraph ThreePillars["Three Pillars of Observability"]
        METRICS["Metrics<br/>(time-series numbers)<br/>Prometheus, Datadog"]
        LOGS["Logs<br/>(structured events)<br/>ELK, Loki, CloudWatch"]
        TRACES["Traces<br/>(request flow across services)<br/>Jaeger, Tempo, Zipkin"]
    end

    METRICS --> DASHBOARD["Dashboards<br/>(Grafana)"]
    LOGS --> SEARCH["Search & Analyze<br/>(Kibana, Grafana)"]
    TRACES --> FLOW["Request Flow<br/>Visualization"]

    DASHBOARD --> ALERT["Alerting<br/>(PagerDuty, OpsGenie)"]
    SEARCH --> ALERT
    FLOW --> DEBUG["Root Cause<br/>Analysis"]

    style ThreePillars fill:#6cc3d5,stroke:#333,color:#fff
    style ALERT fill:#ff6b6b,stroke:#333,color:#fff

Three Pillars of Observability

| Pillar | What It Captures | Format | Tools |
| --- | --- | --- | --- |
| Metrics | Numeric measurements over time (counters, gauges, histograms) | Time-series | Prometheus, Datadog, CloudWatch |
| Logs | Discrete events with context (structured JSON preferred) | Text/JSON | ELK Stack, Loki, Fluentd |
| Traces | Request path across services with timing | Spans + trace ID | Jaeger, Tempo, OpenTelemetry |

Key Metrics to Monitor (USE/RED)

| Method | Metrics | Apply To |
| --- | --- | --- |
| USE (Utilization, Saturation, Errors) | CPU %, queue depth, error count | Infrastructure (servers, disks, network) |
| RED (Rate, Errors, Duration) | Requests/sec, error rate %, p99 latency | Services (APIs, microservices) |
| Four Golden Signals (Google SRE) | Latency, traffic, errors, saturation | Any production system |
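
As a worked example of the RED method, the sketch below derives rate, error rate, and p99 latency from raw request records. The `Request` type and the nearest-rank percentile are illustrative assumptions; production systems typically compute these from histograms in a metrics backend such as Prometheus rather than from raw samples.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def red_metrics(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration for one observation window."""
    if not requests:
        raise ValueError("need at least one request")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p99_ms": percentile([r.duration_ms for r in requests], 99),
    }


# 100 requests in a 10-second window; every 20th one fails with HTTP 500.
sample = [
    Request(duration_ms=float(i), status=500 if i % 20 == 0 else 200)
    for i in range(1, 101)
]
metrics = red_metrics(sample, window_seconds=10)
print(metrics)
```

Reporting p99 rather than the mean matters because a handful of slow requests can hide behind a healthy-looking average.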

Monitoring Stack

| Component | Purpose | Tool Options |
| --- | --- | --- |
| Metric collection | Scrape/push metrics from services | Prometheus, Telegraf, StatsD |
| Log aggregation | Centralize logs from all services | Fluentd/Fluent Bit → Loki/Elasticsearch |
| Distributed tracing | Track requests across microservices | OpenTelemetry → Jaeger/Tempo |
| Visualization | Dashboards and exploration | Grafana, Kibana, Datadog |
| Alerting | Notify on-call when thresholds breach | Alertmanager, PagerDuty, OpsGenie |
| SLO tracking | Monitor service level objectives | Sloth, Nobl9, custom Prometheus rules |

Alerting Best Practices

| Practice | Description |
| --- | --- |
| Alert on symptoms, not causes | Alert on “API error rate > 5%” not “CPU > 80%” |
| Reduce noise | Group related alerts; avoid duplicate pages |
| Actionable alerts | Every alert should have a clear runbook/response |
| Severity levels | Critical (page), Warning (ticket), Info (dashboard) |
| SLO-based alerts | Burn rate alerts: “burning error budget 10x faster than normal” |
| Test your alerts | Periodically verify alerts fire correctly |
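
Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 10 means the budget is being spent ten times faster than planned. A minimal sketch follows; the two-window check and the threshold of 10 are assumptions chosen to match the example in the table, not fixed industry values.

```python
import math


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo  # allowed error rate, e.g. 0.001 for a 99.9% SLO
    return error_rate / budget


def should_page(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo: float,
    threshold: float = 10.0,
) -> bool:
    """Page only if both windows burn fast, so recovered incidents stay quiet."""
    return (
        burn_rate(long_window_error_rate, slo) >= threshold
        and burn_rate(short_window_error_rate, slo) >= threshold
    )


slo = 0.999
ongoing = should_page(0.02, 0.03, slo)       # still failing right now
recovered = should_page(0.02, 0.0005, slo)   # long window is bad, short window is fine
print(ongoing, recovered)
```

Requiring the short window to agree is what keeps the alert from paging about an incident that has already been mitigated.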
Test your alerts Periodically verify alerts fire correctly

Q8: How Do You Manage Secrets in DevOps?

Answer:

Secrets management ensures sensitive data (API keys, database passwords, TLS certificates, tokens) is stored securely, accessed with least privilege, rotated regularly, and never exposed in code or logs.

graph TD
    subgraph Bad["Anti-Patterns ✗"]
        HARDCODE["Hardcoded in code"]
        ENV_FILE[".env committed to Git"]
        PLAIN["Plain text ConfigMap"]
    end

    subgraph Good["Best Practices ✓"]
        VAULT["Secrets Manager<br/>(Vault, AWS SM)"]
        INJECT["Runtime Injection<br/>(env vars, volumes)"]
        ROTATE["Auto-Rotation<br/>(scheduled renewal)"]
        AUDIT["Audit Logging<br/>(who accessed what)"]
    end

    Bad -->|"Migrate to"| Good

    style Bad fill:#ff6b6b,stroke:#333,color:#fff
    style Good fill:#56cc9d,stroke:#333,color:#fff

Secrets Management Tools

| Tool | Type | Key Feature | Best For |
| --- | --- | --- | --- |
| HashiCorp Vault | Self-hosted/SaaS | Dynamic secrets, PKI, transit encryption | Enterprise, multi-cloud |
| AWS Secrets Manager | Managed (AWS) | Auto-rotation for RDS, Lambda integration | AWS-native |
| AWS SSM Parameter Store | Managed (AWS) | Free tier, hierarchical keys | Simple AWS use cases |
| Azure Key Vault | Managed (Azure) | HSM-backed, RBAC integration | Azure-native |
| GCP Secret Manager | Managed (GCP) | IAM integration, versioning | GCP-native |
| Sealed Secrets | K8s-native | Encrypt secrets in Git, decrypt in cluster | GitOps workflows |
| External Secrets Operator | K8s-native | Sync secrets from external vault into K8s | Multi-provider |
| SOPS | CLI tool | Encrypt YAML/JSON files with cloud KMS | Config files in Git |

Secrets in Kubernetes

| Approach | Security Level | Complexity |
| --- | --- | --- |
| K8s Secret (base64) | Low (not encrypted at rest by default) | Simple |
| K8s Secret + etcd encryption | Medium | Moderate |
| Sealed Secrets | Medium-High (encrypted in Git) | Moderate |
| External Secrets Operator | High (pulls from Vault/SM at runtime) | Higher |
| CSI Secrets Store Driver | High (mounts secrets as volumes) | Higher |
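
The "Low" rating for plain Kubernetes Secrets follows from the fact that base64 is an encoding, not encryption. The short sketch below shows that anyone who can read the Secret manifest can recover the value without a key; the helper names and the sample password are made up for illustration.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret's `data` field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; note that no key or password is required."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret_value("s3cr3t-password")
recovered = decode_secret_value(encoded)
print(encoded, recovered)
```

This is why committing a Secret manifest to Git is equivalent to committing the plaintext, and why the stronger options in the table add real encryption or keep the value outside the cluster.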

Secrets Management Principles

1. Never store secrets in source code or container images
2. Use separate secrets per environment (dev/staging/prod)
3. Apply least-privilege access (RBAC, IAM policies)
4. Rotate secrets automatically on a schedule
5. Audit all secret access (who, when, from where)
6. Encrypt secrets at rest AND in transit
7. Use short-lived credentials where possible (dynamic secrets)
8. Detect secrets in code with pre-commit hooks (gitleaks, trufflehog)
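
Principle 8 can be illustrated with a heavily simplified scanner. The two patterns below (the `AKIA` prefix of AWS access key IDs and PEM private-key headers) are well-known formats, but the rule set and the `scan` function are assumptions for illustration; tools such as gitleaks and trufflehog ship far larger rule sets plus entropy checks and history scanning.

```python
import re
from typing import List, Tuple

PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a rule."""
    findings: List[Tuple[int, str]] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = "\n".join([
    "region = us-east-1",
    "aws_access_key_id = AKIAIOSFODNN7EXAMPLE",
    "debug = true",
])
findings = scan(sample)
print(findings)
```

A pre-commit hook built on this idea would exit with a non-zero status whenever `findings` is non-empty, blocking the commit before the secret reaches the repository.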

Q9: How Do You Handle Incident Response and Post-Mortems?

Answer:

Incident response is the structured process of detecting, diagnosing, resolving, and learning from production failures. DevOps teams need clear processes, defined roles, and a blame-free culture to handle incidents effectively and prevent recurrence.

graph LR
    DETECT["1. Detect<br/>(alert fires)"]
    DETECT --> TRIAGE["2. Triage<br/>(severity, impact)"]
    TRIAGE --> RESPOND["3. Respond<br/>(diagnose, mitigate)"]
    RESPOND --> RESOLVE["4. Resolve<br/>(fix or rollback)"]
    RESOLVE --> REVIEW["5. Post-Mortem<br/>(learn, prevent)"]
    REVIEW --> IMPROVE["6. Improve<br/>(action items)"]

    style DETECT fill:#ff6b6b,stroke:#333,color:#fff
    style RESPOND fill:#ffce67,stroke:#333
    style REVIEW fill:#56cc9d,stroke:#333,color:#fff

Incident Severity Levels

| Severity | Impact | Response Time | Example |
| --- | --- | --- | --- |
| SEV-1 (Critical) | Complete outage, all users affected | Immediate page, war room | Production database down |
| SEV-2 (Major) | Partial outage, significant degradation | Page within 15 min | Payment processing failing |
| SEV-3 (Minor) | Limited impact, workaround available | Next business hours | Slow dashboard loading |
| SEV-4 (Low) | Cosmetic or future risk | Backlog ticket | Deprecated library warning |

Incident Response Process

| Phase | Actions | Tools |
| --- | --- | --- |
| Detection | Monitoring alerts, user reports, synthetic checks | PagerDuty, OpsGenie, Grafana Alerting |
| Triage | Assess severity, assign incident commander | Incident management platform |
| Communication | Status page update, stakeholder notification | Statuspage, Slack channel |
| Diagnosis | Check dashboards, logs, traces; identify root cause | Grafana, Kibana, Jaeger |
| Mitigation | Rollback, feature flag off, scale up, failover | ArgoCD, kubectl, feature flags |
| Resolution | Deploy fix, verify recovery, close incident | CI/CD pipeline |
| Post-mortem | Blameless review, timeline, action items | Confluence, Google Docs |

Post-Mortem Template

## Incident Post-Mortem: [Title]

**Date:** 2026-05-21
**Duration:** 47 minutes (10:15 - 11:02 UTC)
**Severity:** SEV-2
**Impact:** 30% of users experienced 500 errors on checkout

### Timeline
- 10:15 — Alert: checkout error rate > 10%
- 10:18 — On-call engineer acknowledged
- 10:25 — Root cause identified: bad config deployment
- 10:32 — Rollback initiated
- 11:02 — Error rate returned to baseline

### Root Cause
A config change removed the database connection pool setting,
causing connection exhaustion under load.

### What Went Well
- Alert fired within 2 minutes of impact
- Root cause was identified within 10 minutes of the alert

### What Could Be Improved
- Config changes lacked validation tests
- No canary stage for config deployments

### Action Items
1. [ ] Add schema validation for config files (Owner: Alice, Due: May 28)
2. [ ] Canary deploy for config changes (Owner: Bob, Due: June 4)
3. [ ] Add integration test for DB pool settings (Owner: Carol, Due: May 25)

SRE Concepts

| Concept | Definition |
| --- | --- |
| SLI (Service Level Indicator) | Metric measuring service quality (e.g., % requests < 200ms) |
| SLO (Service Level Objective) | Target value for SLI (e.g., 99.9% requests < 200ms) |
| SLA (Service Level Agreement) | Contract with consequences for missing SLO |
| Error Budget | 1 - SLO = allowed downtime (e.g., 99.9% SLO → 43 min/month) |
| MTTR | Mean Time To Recovery |
| MTTF | Mean Time To Failure |
| MTBF | Mean Time Between Failures (MTTF + MTTR) |
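
The arithmetic behind the error budget and the MTTF/MTTR relationship is easy to verify. This sketch assumes a 30-day month and uses the table's definition MTBF = MTTF + MTTR; the function names are illustrative.

```python
import math


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by MTBF (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


budget_999 = error_budget_minutes(0.999)    # roughly 43 minutes per 30 days
budget_9999 = error_budget_minutes(0.9999)  # roughly 4 minutes per 30 days
avail = availability(mttf_hours=999, mttr_hours=1)
print(round(budget_999, 1), round(budget_9999, 2), avail)
```

Each extra "nine" cuts the budget by a factor of ten, and the availability formula shows that halving MTTR helps roughly as much as doubling MTTF, which is why fast rollback is such a high-leverage investment.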

Q10: How Do You Secure a DevOps Pipeline (DevSecOps)?

Answer:

DevSecOps integrates security practices into every stage of the DevOps lifecycle — “shifting security left” so vulnerabilities are caught early rather than discovered in production. Security is automated, continuous, and everyone’s responsibility.

graph LR
    subgraph ShiftLeft["Shift Left Security"]
        PLAN["Plan<br/>(threat modeling)"]
        CODE["Code<br/>(SAST, secrets scan)"]
        BUILD["Build<br/>(dependency scan,<br/>image scan)"]
        TEST["Test<br/>(DAST, pen test)"]
        DEPLOY["Deploy<br/>(policy gates,<br/>signed images)"]
        OPERATE["Operate<br/>(runtime security,<br/>monitoring)"]
    end

    PLAN --> CODE --> BUILD --> TEST --> DEPLOY --> OPERATE

    style ShiftLeft fill:#56cc9d,stroke:#333,color:#fff

Security at Each Pipeline Stage

| Stage | Security Practice | Tools |
| --- | --- | --- |
| Code | Static analysis (SAST), secrets detection | SonarQube, Semgrep, gitleaks, trufflehog |
| Dependencies | Vulnerability scanning (SCA) | Snyk, Dependabot, Trivy, OWASP Dependency-Check |
| Container images | Image scanning, base image verification | Trivy, Grype, Docker Scout |
| Infrastructure | IaC security scanning | Checkov, tfsec, KICS |
| Deployment | Image signing, admission control | Cosign/Sigstore, OPA Gatekeeper, Kyverno |
| Runtime | Runtime threat detection, network policies | Falco, Sysdig, Calico |
| Access | RBAC, least privilege, MFA | Cloud IAM, K8s RBAC, Vault |

CI/CD Security Pipeline

# Example: GitHub Actions security pipeline
name: Security Checks
on: [push, pull_request]

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      # 0. Check out the repository (full history so the secret scan covers all commits)
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # 1. Secret scanning
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      # 2. SAST - Static Application Security Testing
      - uses: returntocorp/semgrep-action@v1
        with:
          config: p/owasp-top-ten

      # 3. Dependency vulnerability scan (report only)
      #    Assumes the trivy CLI has been installed on the runner in an earlier step.
      - run: trivy fs --severity HIGH,CRITICAL .

      # 4. Container image scan (report only)
      - run: |
          docker build -t myapp:${{ github.sha }} .
          trivy image --severity HIGH,CRITICAL myapp:${{ github.sha }}

      # 5. IaC security scan
      - uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/

      # 6. Fail pipeline if critical vulnerabilities found
      #    (--exit-code 1 makes trivy return non-zero when it finds any)
      - run: trivy image --exit-code 1 --severity CRITICAL myapp:${{ github.sha }}
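
The gate in step 6 boils down to counting findings by severity and returning a non-zero exit code when a blocking severity is present. The sketch below assumes a report shaped like a scanner's JSON output, with a `Results` list whose entries carry a `Vulnerabilities` list and a `Severity` field; that shape, the function names, and the vulnerability IDs are assumptions for illustration, so check the actual schema of the scanner in use.

```python
from collections import Counter
from typing import Iterable


def count_by_severity(report: dict) -> Counter:
    """Tally vulnerabilities per severity across all scanned targets."""
    counts: Counter = Counter()
    for result in report.get("Results", []):
        # A target with no findings may omit the list or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def gate(report: dict, blocking: Iterable[str] = ("CRITICAL",)) -> int:
    """Return a process exit code: 1 if any blocking severity is present."""
    counts = count_by_severity(report)
    return 1 if any(counts[severity] > 0 for severity in blocking) else 0


# Hypothetical report; the IDs are placeholders, not real CVEs.
report = {
    "Results": [
        {
            "Target": "myapp (debian)",
            "Vulnerabilities": [
                {"VulnerabilityID": "EXAMPLE-0001", "Severity": "CRITICAL"},
                {"VulnerabilityID": "EXAMPLE-0002", "Severity": "HIGH"},
            ],
        },
        {"Target": "requirements.txt", "Vulnerabilities": None},
    ]
}

counts = count_by_severity(report)
exit_code = gate(report)
print(dict(counts), exit_code)
```

Keeping the blocking severities configurable lets a team start by gating only on critical findings and tighten the policy to include high-severity ones once the backlog is under control.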

DevSecOps Principles

| Principle | Implementation |
| --- | --- |
| Shift left | Find vulnerabilities in dev, not production |
| Automate everything | Security checks run automatically in CI/CD |
| Least privilege | Minimal permissions for services, users, CI runners |
| Defense in depth | Multiple security layers (network, application, data) |
| Immutable infrastructure | Don’t patch servers; replace with new secure images |
| Zero trust | Verify every request; no implicit trust by network location |
| Supply chain security | Sign artifacts, verify provenance, pin dependencies |
| Continuous compliance | Policy-as-code enforced automatically |

Summary Table

| # | Topic | Key Concepts |
| --- | --- | --- |
| 1 | CI/CD Pipelines | Build → test → deploy automation, fast feedback, immutable artifacts |
| 2 | Docker Containers | Lightweight packaging, multi-stage builds, image security |
| 3 | Kubernetes | Orchestration, self-healing, declarative state, HPA, probes |
| 4 | Infrastructure as Code | Terraform, declarative infra, state management, modules |
| 5 | Deployment Strategies | Blue-green, canary, rolling, shadow, feature flags |
| 6 | GitOps | Git as source of truth, ArgoCD/Flux, auto-reconciliation |
| 7 | Monitoring & Observability | Metrics/logs/traces, USE/RED, SLO-based alerting |
| 8 | Secrets Management | Vault, rotation, least privilege, External Secrets Operator |
| 9 | Incident Response | Severity levels, post-mortems, SRE concepts, error budgets |
| 10 | DevSecOps | Shift left, SAST/DAST/SCA, image scanning, policy-as-code |

What’s Next?

This article covered core DevOps concepts and practices. For related content: