Chapter 36: Cloud and Container Security

"The cloud is just someone else's computer. And now you need to secure someone else's computer while only controlling half of it." --- Common security wisdom, uncomfortably accurate

The S3 Bucket That Leaked 100 Million Records

In 2019, Capital One suffered a breach that exposed the personal data of approximately 100 million customers and applicants. The attacker, a former AWS employee, exploited a misconfigured web application firewall to perform a Server-Side Request Forgery (SSRF) attack against the EC2 metadata service. This gave her temporary credentials from an IAM role with excessive permissions. That role could read any S3 bucket in the account. The data --- names, addresses, credit scores, Social Security numbers --- was sitting in S3 without additional encryption controls beyond AWS defaults.

Here is the critical distinction: AWS infrastructure was never compromised. The vulnerability was in how Capital One configured their WAF, how they designed their IAM roles, and how they stored sensitive data. This distinction is the entire foundation of cloud security. It is called the shared responsibility model, and misunderstanding it is the number one cause of cloud security incidents.


The Shared Responsibility Model

Every major cloud provider operates on the same principle: the provider secures the infrastructure, and the customer secures what they put on it. The boundary shifts depending on the service model.

graph TD
    subgraph "Customer Responsibility"
        C1["Data Classification & Encryption"]
        C2["IAM: Users, Roles, Policies, MFA"]
        C3["Application Security & Patching"]
        C4["OS Configuration & Patching (IaaS)"]
        C5["Network Config: Security Groups, NACLs"]
        C6["Firewall Rules, VPC Design"]
    end

    subgraph "Shared"
        S1["Network Controls"]
        S2["Encryption Options"]
        S3["Logging & Monitoring"]
    end

    subgraph "Provider Responsibility"
        P1["Physical Security of Data Centers"]
        P2["Hardware: Servers, Storage, Networking"]
        P3["Hypervisor / Virtualization Layer"]
        P4["Managed Service Infrastructure"]
        P5["Global Network Infrastructure"]
        P6["Environmental Controls (power, cooling)"]
    end

    C1 --- S1
    S1 --- P1

    style C1 fill:#3498db,color:#fff
    style C2 fill:#3498db,color:#fff
    style C3 fill:#3498db,color:#fff
    style C4 fill:#3498db,color:#fff
    style C5 fill:#3498db,color:#fff
    style C6 fill:#3498db,color:#fff
    style P1 fill:#e67e22,color:#fff
    style P2 fill:#e67e22,color:#fff
    style P3 fill:#e67e22,color:#fff
    style P4 fill:#e67e22,color:#fff
    style P5 fill:#e67e22,color:#fff
    style P6 fill:#e67e22,color:#fff

The boundary shifts significantly between IaaS, PaaS, and SaaS:

| Layer | IaaS (EC2) | PaaS (RDS, Lambda) | SaaS (Office 365) |
|-------------|------------|--------------------|-------------------|
| Data | Customer | Customer | Customer |
| Application | Customer | Customer | Provider |
| Runtime | Customer | Provider | Provider |
| OS | Customer | Provider | Provider |
| Network | Shared | Shared | Provider |
| Hardware | Provider | Provider | Provider |
| Physical | Provider | Provider | Provider |

When you use a managed database like RDS, you do not patch the OS, but you are still responsible for database access controls and encryption. This is where people make mistakes. They assume "managed" means "secured." AWS will patch the underlying PostgreSQL engine for your RDS instance, but it will not stop you from setting the admin password to "password123," making the instance publicly accessible, or granting the database role rds_superuser to your application service account.


IAM: The Foundation of Cloud Security

Identity and Access Management is the single most critical control in cloud environments. If you get IAM wrong, nothing else matters --- the attacker with admin credentials can disable every other security control you have configured.

The Principle of Least Privilege in Practice

// BAD: The Capital One problem
// This policy says "do anything to any S3 bucket in the account"
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": "*"
  }]
}

// GOOD: Specific actions on specific resources with conditions
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::my-app-data-bucket",
      "arn:aws:s3:::my-app-data-bucket/*"
    ],
    "Condition": {
      "StringEquals": {
        "aws:PrincipalTag/Team": "backend",
        "s3:ExistingObjectTag/classification": "internal"
      },
      "IpAddress": {
        "aws:SourceIp": "10.0.0.0/8"
      },
      "Bool": {
        "aws:MultiFactorAuthPresent": "true"
      }
    }
  }]
}

Yes, the second policy is longer and more complex. But the overhead is absolutely worth it. The first policy says "this identity can do anything to any S3 bucket in the account." When attached to a compromised role, you lose everything. The second says "this identity can read objects from one specific bucket, only if tagged as the backend team, only objects classified as internal, only from the internal network, and only with MFA." When compromised, you lose read access to one classification level of one bucket from the internal network. The complexity of the policy is proportional to the value of the data it protects.
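The way conditions combine is worth internalizing: every condition operator block must match (AND), while multiple values under a single key are alternatives (OR). A toy evaluator makes the logic concrete. This is purely illustrative, with an invented function and request shape; the real IAM engine supports dozens of operators plus policy-merging logic:

```python
# Toy model of IAM condition evaluation: all condition operators must
# match (AND); multiple values under one key are OR-ed.
# Illustrative only -- the real IAM evaluator is far richer.
from ipaddress import ip_address, ip_network

def conditions_match(conditions: dict, request: dict) -> bool:
    for operator, pairs in conditions.items():
        for key, expected in pairs.items():
            actual = request.get(key)
            values = expected if isinstance(expected, list) else [expected]
            if operator in ("StringEquals", "Bool"):
                if str(actual) not in [str(v) for v in values]:
                    return False
            elif operator == "IpAddress":
                if actual is None or not any(
                    ip_address(actual) in ip_network(v) for v in values
                ):
                    return False
            else:
                return False  # unknown operator: deny in this sketch
    return True

policy_conditions = {
    "StringEquals": {"aws:PrincipalTag/Team": "backend"},
    "IpAddress": {"aws:SourceIp": "10.0.0.0/8"},
    "Bool": {"aws:MultiFactorAuthPresent": "true"},
}

# Backend-team caller, internal network, MFA present: allowed
ok = conditions_match(policy_conditions, {
    "aws:PrincipalTag/Team": "backend",
    "aws:SourceIp": "10.1.2.3",
    "aws:MultiFactorAuthPresent": "true",
})

# Same caller from an external IP: denied
bad = conditions_match(policy_conditions, {
    "aws:PrincipalTag/Team": "backend",
    "aws:SourceIp": "203.0.113.7",
    "aws:MultiFactorAuthPresent": "true",
})
print(ok, bad)  # True False
```

Note that failing any single operator block denies the whole statement, which is what makes stacking conditions such an effective blast-radius limiter.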

Service Accounts and Machine Identity

Applications running in the cloud need credentials to access other services. How you manage these credentials determines your exposure:

graph TD
    B["BEST: Cloud-native identity<br/>AWS IAM Roles / Azure Managed Identity<br/>GCP Workload Identity<br/>──────────────────<br/>No credentials to manage, rotate, or leak<br/>Instance/pod automatically receives<br/>temporary tokens (15min-12hr lifetime)"] --> G["GOOD: Secrets manager<br/>AWS Secrets Manager / HashiCorp Vault<br/>──────────────────<br/>Centralized, audited, auto-rotated<br/>secrets with access logging"]
    G --> A["ACCEPTABLE: Environment variables<br/>In managed runtime (Lambda, ECS)<br/>──────────────────<br/>Not in code, but visible to process<br/>and in memory dumps"]
    A --> BA["BAD: Config files on disk<br/>──────────────────<br/>Readable by anyone with<br/>file system access"]
    BA --> T["TERRIBLE: Hard-coded in source<br/>──────────────────<br/>Visible in version control<br/>to everyone, forever"]
    T --> C["CATASTROPHIC: Committed to<br/>public GitHub repository<br/>──────────────────<br/>Automated scanners find these<br/>in under 60 seconds"]

    style B fill:#27ae60,color:#fff
    style G fill:#2ecc71,color:#fff
    style A fill:#f39c12,color:#fff
    style BA fill:#e67e22,color:#fff
    style T fill:#e74c3c,color:#fff
    style C fill:#c0392b,color:#fff

# Audit your AWS IAM posture
# List users with console access but no MFA
$ aws iam generate-credential-report && sleep 5
$ aws iam get-credential-report --output text --query Content | \
    base64 -d | awk -F, '$4=="true" && $8=="false" {print "NO MFA: "$1}'

# List active access keys and when each was last rotated (flag keys older than 90 days)
$ aws iam get-credential-report --output text --query Content | \
    base64 -d | awk -F, '$9=="true" {print $1, "Key rotated:", $10}'

# Find IAM policies with wildcards (overly permissive)
$ aws iam list-policies --only-attached --query 'Policies[*].Arn' --output text | \
    tr '\t' '\n' | while read arn; do
      version=$(aws iam get-policy --policy-arn "$arn" --query 'Policy.DefaultVersionId' --output text)
      aws iam get-policy-version --policy-arn "$arn" --version-id "$version" \
        --query 'PolicyVersion.Document' --output json | \
        grep -q '"Action": "\*"' && echo "WILDCARD: $arn"
    done

A startup committed their AWS access keys to a public GitHub repository. Within 90 seconds --- not minutes, seconds --- automated scanners detected the keys and began spinning up cryptocurrency mining instances. By the time the developer noticed and revoked the keys (4 hours later), the attacker had launched over 200 c5.18xlarge instances across multiple regions. The AWS bill for those 4 hours: $47,000.

This is not unusual. Research from GitGuardian shows that over 10 million secrets were detected in public GitHub repositories in 2022 alone. AWS has since added credential scanning that automatically quarantines exposed keys, and tools like git-secrets, gitleaks, truffleHog, and GitHub's built-in secret scanning can prevent commits containing secrets. But the first rule remains: never commit credentials to version control, period.
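The detection technique behind these scanners is mostly pattern matching on known credential formats. A minimal sketch of the idea, using simplified versions of the well-known AWS key patterns (real scanners like gitleaks ship hundreds of rules plus entropy analysis; the key values below are AWS's own documentation examples):

```python
# Minimal sketch of how secret scanners detect AWS keys: pattern-match
# known credential formats in text before it is committed.
# Patterns are simplified for illustration.
import re

PATTERNS = {
    # AWS access key IDs start with AKIA (long-term) or ASIA (temporary)
    "aws-access-key-id": re.compile(r"\b(AKIA|ASIA)[0-9A-Z]{16}\b"),
    # 40-char base64-ish value assigned to a secret-key-looking name
    "aws-secret-access-key": re.compile(
        r"aws_secret_access_key\s*[=:]\s*['\"]?([A-Za-z0-9/+=]{40})", re.I
    ),
}

def scan(text: str) -> list[str]:
    """Return the names of all secret patterns found in `text`."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]

# AWS's documented example credentials -- not real secrets
leaked = ('aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
          'aws_secret_access_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"')
print(scan(leaked))                  # both patterns fire
print(scan("print('hello world')"))  # []
```

Running a check like this as a pre-commit hook is cheap insurance: the scan happens before the secret ever reaches version control, where removal is painful and revocation is mandatory.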

Network Security in the Cloud

VPC Architecture with Defense in Depth

graph TD
    subgraph "VPC 10.0.0.0/16"
        subgraph "Public Subnet 10.0.1.0/24"
            IGW["Internet Gateway"] --> ALB["Application<br/>Load Balancer<br/>SG: 80,443 from 0.0.0.0/0"]
            NAT["NAT Gateway<br/>(outbound only)"]
        end

        subgraph "Private Subnet 10.0.2.0/24"
            APP1["App Server 1<br/>SG: 8080 from ALB-SG"]
            APP2["App Server 2<br/>SG: 8080 from ALB-SG"]
        end

        subgraph "Data Subnet 10.0.3.0/24"
            RDS1["RDS Primary<br/>SG: 5432 from App-SG"]
            RDS2["RDS Replica<br/>SG: 5432 from App-SG"]
        end

        ALB -->|"Port 8080 only"| APP1
        ALB -->|"Port 8080 only"| APP2
        APP1 -->|"Port 5432 only"| RDS1
        APP2 -->|"Port 5432 only"| RDS1
        APP1 -->|"Outbound via NAT"| NAT
    end

    NACL1["NACL: Public<br/>Allow 80,443 inbound<br/>from 0.0.0.0/0"] -.-> ALB
    NACL2["NACL: Private<br/>Allow from 10.0.1.0/24 only"] -.-> APP1
    NACL3["NACL: Data<br/>Allow from 10.0.2.0/24 only"] -.-> RDS1

    style ALB fill:#3498db,color:#fff
    style APP1 fill:#f39c12,color:#fff
    style APP2 fill:#f39c12,color:#fff
    style RDS1 fill:#27ae60,color:#fff
    style RDS2 fill:#27ae60,color:#fff

Key design principles:

  • Databases are never in public subnets and have no public IP addresses
  • App servers can only be reached from the load balancer, not directly from the internet
  • Security groups reference each other by group ID, not by IP address --- adding a new app server to the App-SG automatically grants it database access
  • NACLs provide a second layer of subnet-level filtering (stateless, evaluated in rule order)
  • Outbound internet access for private subnets goes through NAT Gateway, providing a single egress point for monitoring
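The "reference security groups by ID" principle looks like this in Terraform (a sketch; the resource names are invented for illustration):

```hcl
# Allow the app tier to reach PostgreSQL, expressed as a reference to
# the app security group rather than to IP addresses. Any instance
# later attached to the source group inherits the access automatically.
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.database.id  # target: Data-SG
  source_security_group_id = aws_security_group.app.id       # source: App-SG
}
```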

Security Groups vs. NACLs

| Feature | Security Groups | Network ACLs |
|------------|-----------------|--------------|
| State | Stateful (return traffic auto-allowed) | Stateless (must allow return traffic explicitly) |
| Level | Instance/ENI level | Subnet level |
| Rules | Allow rules only | Allow AND deny rules |
| Evaluation | All rules evaluated | Rules evaluated in number order, first match wins |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
| Use case | Fine-grained per-instance control | Broad subnet-level guardrails |
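The stateful/stateless distinction in the table is easiest to see in code. This toy model (invented class names, purely illustrative) shows why a security group auto-allows return traffic while a NACL needs explicit rules for it:

```python
# Toy model of the stateful vs. stateless distinction. A stateful
# filter (security-group-like) tracks outbound connections and allows
# the return traffic automatically; a stateless filter (NACL-like)
# judges every packet against numbered rules with no memory.

class StatefulFilter:
    """Security-group-like: allow rules only, connection tracking."""
    def __init__(self, allowed_outbound_ports):
        self.allowed = set(allowed_outbound_ports)
        self.connections = set()          # (remote_host, remote_port)

    def outbound(self, host, port):
        if port in self.allowed:
            self.connections.add((host, port))
            return True
        return False

    def inbound(self, host, port):
        # Return traffic for a tracked connection is auto-allowed
        return (host, port) in self.connections

class StatelessFilter:
    """NACL-like: numbered rules, first match wins, no memory."""
    def __init__(self, rules):
        self.rules = sorted(rules)        # (number, port, action)

    def evaluate(self, port):
        for _, rule_port, action in self.rules:
            if rule_port in ("*", port):
                return action == "allow"
        return False                      # implicit deny at the end

sg = StatefulFilter(allowed_outbound_ports={443})
sg.outbound("93.184.216.34", 443)         # open an HTTPS connection
print(sg.inbound("93.184.216.34", 443))   # True: tracked return traffic
print(sg.inbound("203.0.113.9", 443))     # False: no such connection

# A NACL has no connection table, so return traffic on ephemeral ports
# needs its own explicit allow rule -- which is why stateless rules
# tend to be broad subnet-level guardrails rather than precise filters.
nacl = StatelessFilter([(100, 443, "allow"), (200, "*", "deny")])
print(nacl.evaluate(443))    # True: rule 100 matches first
print(nacl.evaluate(51515))  # False: falls through to rule 200
```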

Container Security

Containers add another layer of abstraction --- and another attack surface. Understanding container isolation mechanisms is essential for securing containerized workloads.

Container Isolation: Not a VM

graph TD
    subgraph "Virtual Machine Isolation"
        VA["App A + Libraries"] --> VOS1["Guest OS (full)"]
        VB["App B + Libraries"] --> VOS2["Guest OS (full)"]
        VOS1 --> HV["Hypervisor<br/>(hardware-level isolation)"]
        VOS2 --> HV
        HV --> HW1["Host Hardware"]
    end

    subgraph "Container Isolation"
        CA["App A + Libraries"] --> CR["Container Runtime<br/>(namespaces + cgroups)<br/>SHARED KERNEL"]
        CB["App B + Libraries"] --> CR
        CR --> HW2["Host OS + Kernel"]
    end

    style HV fill:#27ae60,color:#fff
    style CR fill:#e67e22,color:#fff

The critical difference: containers share the host kernel. A kernel exploit in one container can compromise ALL containers on the same host. VMs have a complete hardware abstraction boundary between guests.

Linux Namespaces

Namespaces give each container its own isolated view of system resources:

| Namespace | Isolates | Security Implication |
|-----------|----------|----------------------|
| PID | Process IDs | Container sees only its own processes |
| NET | Network stack | Own IP, ports, routing table |
| MNT | Filesystem mounts | Own root filesystem |
| UTS | Hostname | Own hostname and domain |
| IPC | Inter-process communication | Own semaphores, message queues |
| USER | User/group IDs | Root in container != root on host (when enabled) |
| CGROUP | Cgroup root | Own cgroup hierarchy |

Control Groups (cgroups)

While namespaces isolate what a container can see, cgroups limit what it can use:

# Run a container with strict resource limits:
#   512MB RAM with no swap (memory-swap equal to memory disables swap),
#   max 1 CPU core, max 100 processes (prevents fork bombs), read-only
#   root filesystem with a small noexec /tmp, no privilege escalation,
#   and all capabilities dropped except NET_BIND_SERVICE
$ docker run -d \
    --memory=512m \
    --memory-swap=512m \
    --cpus=1.0 \
    --pids-limit=100 \
    --read-only \
    --tmpfs /tmp:rw,noexec,nosuid,size=64m \
    --security-opt=no-new-privileges \
    --cap-drop=ALL \
    --cap-add=NET_BIND_SERVICE \
    myapp:v1.2.3

Seccomp Profiles

Seccomp (Secure Computing Mode) restricts which system calls a container can make. Docker's default seccomp profile blocks approximately 44 of 300+ syscalls, including dangerous ones like mount, reboot, kexec_load, and ptrace.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "stat",
                "fstat", "poll", "lseek", "mmap", "mprotect",
                "munmap", "brk", "ioctl", "access", "pipe",
                "select", "dup2", "nanosleep", "getpid",
                "socket", "connect", "accept", "sendto",
                "recvfrom", "bind", "listen", "exit_group",
                "futex", "epoll_wait", "epoll_ctl",
                "clone", "execve", "openat", "newfstatat"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Docker Security Best Practices

The most common Docker security mistakes are running as root, using latest tags, not scanning images, and mounting the Docker socket. Each one deserves attention.

1. Do Not Run as Root

# BAD: Runs as root by default
FROM node:18
COPY . /app
CMD ["node", "server.js"]

# GOOD: Create and use a non-root user
FROM node:18-alpine
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY --chown=appuser:appgroup . .
USER appuser
CMD ["node", "server.js"]

2. Pin Image Versions

# BAD: "latest" can change at any time, breaking reproducibility
FROM node:latest

# GOOD: Pin to specific version
FROM node:18.19.1-alpine3.19

# BEST: Pin to digest (immutable, guaranteed same image)
FROM node@sha256:a1f3c5e22e5d89f15e6b3c2...

3. Scan Images for Vulnerabilities

# Scan with Trivy (open source, comprehensive)
$ trivy image myapp:v1.2.3
myapp:v1.2.3 (alpine 3.19.1)
Total: 3 (HIGH: 2, CRITICAL: 1)

# Integrate into CI/CD to block deployment of vulnerable images
$ trivy image --exit-code 1 --severity CRITICAL myapp:v1.2.3
# Exit code 1 = critical vulnerabilities found, fail the build

4. Never Mount the Docker Socket

# DANGEROUS: Gives the container full control over the Docker daemon
$ docker run -v /var/run/docker.sock:/var/run/docker.sock myapp

# This container can now:
# - Start/stop any container on the host
# - Create privileged containers (full host access)
# - Mount the host filesystem into a new container
# - Effectively has root on the host

5. Complete Secure Dockerfile

# Multi-stage build: build tools don't end up in production image
FROM node:18.19.1-alpine3.19 AS builder
WORKDIR /build
COPY package*.json ./
# Full install: the build step needs devDependencies
RUN npm ci
COPY . .
# Build, then prune dev dependencies before node_modules is copied forward
RUN npm run build && npm prune --omit=dev

# Production stage: minimal image
FROM node:18.19.1-alpine3.19
RUN apk add --no-cache dumb-init && \
    addgroup -S appgroup && \
    adduser -S appuser -G appgroup

WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /build/dist ./dist
COPY --from=builder --chown=appuser:appgroup /build/node_modules ./node_modules

USER appuser
EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -qO- http://localhost:3000/health || exit 1

ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/server.js"]

Docker's default configuration is insecure in several ways:

  • Containers run as root by default
  • A permissive default set of Linux capabilities is granted (use `--cap-drop ALL` and `--cap-add` back only what is needed)
  • Networking is bridged with full outbound access by default
  • No resource limits are applied by default (a container can consume all host CPU and memory)
  • The Docker daemon runs as root, so any container escape means root on the host

Every Docker deployment should have a hardening baseline that addresses these defaults. CIS Docker Benchmark provides a comprehensive checklist.

Kubernetes Security

Kubernetes adds orchestration on top of containers --- and with it, a significant expansion of the attack surface. The Kubernetes API server, etcd datastore, kubelet, and service mesh all present potential attack vectors.

Kubernetes RBAC

Role-Based Access Control determines who can do what to which resources. The most common mistake: granting cluster-admin to CI/CD service accounts "because it is easier."

# Role: Allow reading pods and logs in the "production" namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]

---
# Bind the role to a specific user
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods-binding
subjects:
- kind: User
  name: dev@company.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

# Audit RBAC: find who has cluster-admin (this is usually too broad)
$ kubectl get clusterrolebindings -o json | \
    jq '.items[] | select(.roleRef.name=="cluster-admin") | .subjects[]'

# Check what a specific user can do
$ kubectl auth can-i --list --as=dev@company.com -n production

# Find roles with wildcard permissions (dangerous)
$ kubectl get roles,clusterroles -A -o json | \
    jq '.items[] | select(.rules[]?.resources[]? == "*" or .rules[]?.verbs[]? == "*") | .metadata.name'

Kubernetes Network Policies

By default, all pods in a Kubernetes cluster can communicate with all other pods. This is the equivalent of a flat network. Network policies implement microsegmentation.

# Default deny all ingress traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress

---
# Allow traffic only from frontend to backend on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

Pod Security Standards

# Hardened pod security context
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v1.2.3@sha256:abc123...
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    resources:
      limits:
        memory: "256Mi"
        cpu: "500m"
      requests:
        memory: "128Mi"
        cpu: "250m"

The cloud instance metadata service at 169.254.169.254 is a frequent attack target in Kubernetes clusters. In cloud environments (AWS, GCP, Azure), this service provides instance credentials, including IAM roles with cloud permissions. A compromised pod that can reach this IP can obtain credentials for the node's IAM role and access cloud resources far beyond what the pod should have.

Mitigations:
1. **Network policies** blocking pod access to 169.254.169.254
2. **IRSA (IAM Roles for Service Accounts)** on AWS --- provides pod-specific IAM credentials without the metadata service
3. **Workload Identity** on GCP --- same concept, binds K8s service accounts to GCP service accounts
4. **IMDSv2** on AWS --- requires a session token obtained via PUT request, making SSRF exploitation harder (but not impossible)
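The first mitigation can be expressed as an egress NetworkPolicy (a sketch; the namespace name is illustrative, and enforcement requires a CNI plugin that supports egress policies, such as Calico or Cilium):

```yaml
# Deny pod egress to the cloud metadata service while allowing
# all other outbound traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-metadata-service
  namespace: production
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32
```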

Audit your Kubernetes security posture:

1. List all ClusterRoleBindings: `kubectl get clusterrolebindings`
2. Identify any that grant `cluster-admin` to non-admin users or service accounts
3. Check for pods running as root: `kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: runAsNonRoot={.spec.securityContext.runAsNonRoot}{"\n"}{end}'`
4. Verify network policies exist: `kubectl get networkpolicies -A`
5. Check for privileged containers: `kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true) | .metadata.name'`

The most dangerous Kubernetes misconfiguration: an anonymous or overly-permissive API server accessible from the internet. Run `kubectl cluster-info` and verify the API endpoint is not publicly accessible without authentication.

Container Image Supply Chain Security

Container image supply chain attacks are a growing threat. If an attacker compromises a base image or a popular library image, every application built on it is compromised.

flowchart LR
    subgraph "Image Supply Chain"
        BI["Base Image Selection<br/>Official images only<br/>Minimal base (Alpine, distroless)"]
        BUILD["Build<br/>Multi-stage builds<br/>Pin versions + digests<br/>No secrets in layers"]
        SCAN["Scan<br/>Trivy, Grype in CI/CD<br/>Block critical vulns<br/>Generate SBOM"]
        SIGN["Sign<br/>cosign (Sigstore)<br/>Cryptographic signature<br/>Provenance attestation"]
        STORE["Store<br/>Private registry<br/>Access controls<br/>Retention policies"]
        ADMIT["Admit<br/>Verify signature at deploy<br/>OPA/Gatekeeper policies<br/>Block unsigned images"]
        RUN["Runtime<br/>Falco monitoring<br/>Read-only filesystem<br/>No privilege escalation"]
    end

    BI --> BUILD --> SCAN --> SIGN --> STORE --> ADMIT --> RUN

    style SCAN fill:#e74c3c,color:#fff
    style SIGN fill:#3498db,color:#fff
    style ADMIT fill:#27ae60,color:#fff

# Sign a container image with cosign (Sigstore)
$ cosign sign --key cosign.key myregistry.com/myapp:v1.2.3

# Verify a signed image before deployment
$ cosign verify --key cosign.pub myregistry.com/myapp:v1.2.3
Verification for myregistry.com/myapp:v1.2.3 --
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - The signatures were verified against the specified public key

# Generate an SBOM (Software Bill of Materials)
$ syft myapp:v1.2.3 -o spdx-json > sbom.json

# Scan the SBOM for vulnerabilities
$ grype sbom:sbom.json

Cloud-Native Security Tools

| Category | Tools | Purpose |
|----------|-------|---------|
| CSPM (Cloud Security Posture) | AWS Security Hub, Azure Defender, GCP SCC, Prowler (open source) | Detect misconfigurations across cloud accounts |
| IaC Scanning | Checkov, tfsec, Terrascan, KICS | Scan Terraform/CloudFormation before deployment |
| Container Scanning | Trivy, Grype, Snyk Container | Image vulnerability scanning |
| Runtime Security | Falco, Tracee, Tetragon | eBPF-based runtime anomaly detection |
| Image Signing | cosign/Sigstore, Notary v2 | Cryptographic image signing and verification |
| Policy Enforcement | OPA/Gatekeeper, Kyverno | K8s admission control policies |
| Secret Management | Vault, AWS Secrets Manager, External Secrets Operator | Centralized, audited, rotated secrets |

# Run Prowler to audit AWS account security
$ prowler aws --severity critical high
# FAIL: S3 bucket "analytics-data" has public access enabled
# FAIL: IAM root account has no MFA enabled
# FAIL: CloudTrail logging is disabled in us-west-2
# PASS: VPC Flow Logs are enabled in all VPCs
# PASS: RDS instances are not publicly accessible

# Scan Terraform before applying
$ checkov -d ./terraform/
Passed checks: 42, Failed checks: 3, Skipped checks: 0
Check: CKV_AWS_18: "Ensure the S3 bucket has access logging enabled"
  FAILED for resource: aws_s3_bucket.data_bucket
  File: /main.tf:23-31

The key insight with cloud security tools is shift left --- find misconfigurations before they are deployed, not after. Scanning your Terraform code in CI/CD is cheaper and faster than discovering the misconfiguration in production during a breach investigation. A checkov run in your pull request pipeline takes 30 seconds and catches the S3 bucket that would otherwise be public for months before someone notices.
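Wiring that gate into a pull-request pipeline takes only a few lines (a sketch in GitHub Actions syntax; the workflow name and directory are illustrative, and Checkov is assumed installable from PyPI, where it is published):

```yaml
# Fail the pull request when Checkov finds a misconfiguration.
# Checkov exits non-zero on failed checks, which fails the job
# and blocks the merge.
name: iac-security-scan
on: pull_request

jobs:
  checkov:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan Terraform with Checkov
        run: |
          pip install checkov
          checkov -d terraform/
```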


What You've Learned

In this chapter, you explored security in cloud and container environments:

  • Shared responsibility model: The cloud provider secures the infrastructure; you secure your configuration, data, and access. The boundary shifts between IaaS, PaaS, and SaaS. Most cloud breaches result from customer misconfigurations, not provider failures.

  • IAM is the foundation. Least privilege with specific actions on specific resources with conditions. No permanent credentials in code --- use cloud-native identity (IAM Roles, Managed Identity, Workload Identity). MFA for all humans. Regular access reviews. The Capital One breach was fundamentally an IAM problem.

  • VPC network architecture. Public, private, and data subnets with security groups referencing each other by ID. NACLs for subnet-level guardrails. Defense in depth: the database should never be reachable from the internet, even through multiple layers.

  • Container isolation mechanisms. Namespaces isolate visibility, cgroups limit resource consumption, seccomp restricts system calls, AppArmor/SELinux provide mandatory access control. But containers share the host kernel --- they are not as isolated as VMs.

  • Docker security. Do not run as root. Pin image versions by digest. Scan for vulnerabilities in CI/CD. Never mount the Docker socket. Use read-only filesystems with explicit tmpfs mounts. Drop all capabilities and add back only what is needed.

  • Kubernetes security. RBAC with least privilege (never cluster-admin for CI/CD). Network policies for pod-to-pod segmentation (default deny). Pod security contexts with runAsNonRoot, readOnlyRootFilesystem, and capability dropping. Protect the metadata service.

  • Image supply chain. Sign images with cosign, verify at admission with OPA/Gatekeeper, generate SBOMs with syft, scan continuously with Trivy/Grype. Use private registries with access controls.

  • Cloud-native tools. CSPM (Prowler) for cloud misconfigurations, IaC scanning (Checkov) for pre-deployment checks, Falco for runtime detection, and Vault for secret management. Shift left: catch misconfigurations before deployment.

  • Logging and monitoring. CloudTrail records every API call in AWS. VPC Flow Logs capture network metadata. Kubernetes audit logs record every API server request. GuardDuty and equivalent services provide managed threat detection. Without comprehensive logging, cloud compromise detection is effectively impossible --- you cannot investigate what you did not record.

  • Multi-cloud and hybrid considerations. Organizations running workloads across AWS, Azure, and GCP face multiplicative complexity. Each provider has different IAM models, different logging systems, and different security tooling. A consistent security posture requires either provider-agnostic tooling (Terraform + Checkov, Falco, Vault) or dedicated expertise in each platform. The worst outcome is inconsistent security across providers, where the least-secured environment becomes the attack path to the others.

Three cloud breaches that illustrate the most common failure patterns:

1. **Capital One (2019, 100M records):** A misconfigured WAF allowed SSRF to the EC2 metadata service (169.254.169.254), which returned IAM role credentials. The role had excessive permissions to S3 buckets containing customer data. Root causes: overprivileged IAM role, missing IMDSv2 enforcement (which requires session tokens and blocks SSRF), and insufficient monitoring of S3 data access patterns.

2. **SolarWinds / Microsoft Cloud (2020-2021):** After the initial supply chain compromise, APT29 pivoted to cloud environments. They forged SAML tokens using stolen signing certificates (Golden SAML attack) to access Office 365 mailboxes and Azure AD. Root cause: the SAML signing certificate was stored on an on-premises server that was already compromised. Lesson: your cloud security is only as strong as the on-premises infrastructure that manages cloud identity.

3. **Uber (2022):** An attacker obtained credentials through social engineering (MFA fatigue --- repeatedly sending push notifications until the employee approved one). From there, they found a PowerShell script on a network share containing hardcoded admin credentials for the PAM system, which gave them access to AWS, GCP, and internal dashboards. Root causes: MFA fatigue (mitigated by number-matching), hardcoded credentials in scripts, and insufficient network segmentation.

The common thread across all three breaches: the cloud infrastructure itself was not hacked. The cryptography was not broken. The provider did not fail. In every case, the customer misconfigured access controls, stored credentials insecurely, or failed to detect anomalous activity in time. This is the shared responsibility model in action --- the provider secures the infrastructure, but the customer is responsible for securing what runs on it.

The attack surface in cloud and container environments is enormous. Where should you start?

Three things, in this order. First, lock down IAM --- MFA everywhere, no wildcard permissions, no long-lived credentials, audit who has admin access. Second, enable logging --- CloudTrail, VPC Flow Logs, Kubernetes audit logs. You cannot detect what you cannot see. Third, scan your configuration --- run Prowler or Checkov and fix everything rated critical. Those three actions address the vast majority of cloud security breaches. Then iterate from there.

For containers, the same principle applies: start with the highest-impact items. Run containers as non-root. Scan images in CI/CD before they reach production. Set up a default-deny NetworkPolicy in every namespace. Use Pod Security Standards (the replacement for PodSecurityPolicy) to enforce baseline or restricted security profiles. And never mount the Docker socket into a container --- that gives the container full control over the host. Get these basics right before chasing more exotic threats.
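Enforcing Pod Security Standards is a per-namespace label change (a sketch; the namespace name is illustrative):

```yaml
# Reject pods that violate the "restricted" profile (root users,
# privilege escalation, missing seccomp profile) at admission time.
# "warn" and "audit" surface violations without blocking, which is
# useful for trialing a stricter level before enforcing it.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```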

How often should you audit your cloud configuration? Continuously. Run Prowler and Checkov in your CI/CD pipeline so misconfigurations are caught before deployment. Schedule weekly automated scans of your production accounts. Review IAM access quarterly. And after every incident, audit the configuration that was exploited and add a check to prevent recurrence. Cloud environments drift fast --- what was secure last month may not be secure today because someone added an overprivileged role or opened a security group for "temporary" debugging and forgot to close it.