Chapter 36: Cloud and Container Security
"The cloud is just someone else's computer. And now you need to secure someone else's computer while only controlling half of it." --- Common security wisdom, uncomfortably accurate
The S3 Bucket That Leaked 100 Million Records
In 2019, Capital One suffered a breach that exposed the personal data of approximately 100 million customers and applicants. The attacker, a former AWS employee, exploited a misconfigured web application firewall to perform a Server-Side Request Forgery (SSRF) attack against the EC2 metadata service. This gave her temporary credentials from an IAM role with excessive permissions. That role could read any S3 bucket in the account. The data --- names, addresses, credit scores, Social Security numbers --- was sitting in S3 without additional encryption controls beyond AWS defaults.
Here is the critical distinction: AWS infrastructure was never compromised. The vulnerability was in how Capital One configured their WAF, how they designed their IAM roles, and how they stored sensitive data. This distinction is the entire foundation of cloud security. It is called the shared responsibility model, and misunderstanding it is the number one cause of cloud security incidents.
The Shared Responsibility Model
Every major cloud provider operates on the same principle: the provider secures the infrastructure, and the customer secures what they put on it. The boundary shifts depending on the service model.
graph TD
subgraph "Customer Responsibility"
C1["Data Classification & Encryption"]
C2["IAM: Users, Roles, Policies, MFA"]
C3["Application Security & Patching"]
C4["OS Configuration & Patching (IaaS)"]
C5["Network Config: Security Groups, NACLs"]
C6["Firewall Rules, VPC Design"]
end
subgraph "Shared"
S1["Network Controls"]
S2["Encryption Options"]
S3["Logging & Monitoring"]
end
subgraph "Provider Responsibility"
P1["Physical Security of Data Centers"]
P2["Hardware: Servers, Storage, Networking"]
P3["Hypervisor / Virtualization Layer"]
P4["Managed Service Infrastructure"]
P5["Global Network Infrastructure"]
P6["Environmental Controls (power, cooling)"]
end
C1 --- S1
S1 --- P1
style C1 fill:#3498db,color:#fff
style C2 fill:#3498db,color:#fff
style C3 fill:#3498db,color:#fff
style C4 fill:#3498db,color:#fff
style C5 fill:#3498db,color:#fff
style C6 fill:#3498db,color:#fff
style P1 fill:#e67e22,color:#fff
style P2 fill:#e67e22,color:#fff
style P3 fill:#e67e22,color:#fff
style P4 fill:#e67e22,color:#fff
style P5 fill:#e67e22,color:#fff
style P6 fill:#e67e22,color:#fff
The boundary shifts significantly between IaaS, PaaS, and SaaS:
| Layer | IaaS (EC2) | PaaS (RDS, Lambda) | SaaS (Office 365) |
|---|---|---|---|
| Data | Customer | Customer | Customer |
| Application | Customer | Customer | Provider |
| Runtime | Customer | Provider | Provider |
| OS | Customer | Provider | Provider |
| Network | Shared | Shared | Provider |
| Hardware | Provider | Provider | Provider |
| Physical | Provider | Provider | Provider |
When you use a managed database like RDS, you do not patch the OS, but you are still responsible for database access controls and encryption. This is where people make mistakes. They assume "managed" means "secured." AWS will patch the underlying PostgreSQL engine for your RDS instance, but it will not stop you from setting the admin password to "password123," making the instance publicly accessible, or granting the database role rds_superuser to your application service account.
IAM: The Foundation of Cloud Security
Identity and Access Management is the single most critical control in cloud environments. If you get IAM wrong, nothing else matters --- the attacker with admin credentials can disable every other security control you have configured.
The Principle of Least Privilege in Practice
// BAD: The Capital One problem
// This policy says "do anything to any S3 bucket in the account"
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": "s3:*",
"Resource": "*"
}]
}
// GOOD: Specific actions on specific resources with conditions
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-app-data-bucket",
"arn:aws:s3:::my-app-data-bucket/*"
],
"Condition": {
"StringEquals": {
"aws:PrincipalTag/Team": "backend",
"s3:ExistingObjectTag/classification": "internal"
},
"IpAddress": {
"aws:SourceIp": "10.0.0.0/8"
},
"Bool": {
"aws:MultiFactorAuthPresent": "true"
}
}
}]
}
Yes, the second policy is longer and more complex. But the overhead is absolutely worth it. The first policy says "this identity can do anything to any S3 bucket in the account." When attached to a compromised role, you lose everything. The second says "this identity can read objects from one specific bucket, only when the calling principal is tagged as belonging to the backend team, only for objects classified as internal, only from the internal network, and only with MFA." When compromised, the blast radius shrinks to read access on one classification level of one bucket, from the internal network. The complexity of the policy is proportional to the value of the data it protects.
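AWS's real policy evaluation engine also processes explicit denies, SCPs, permission boundaries, and resource-based policies, but the core deny-by-default logic can be sketched in a few lines. This toy evaluator (a simplification, not the real algorithm) shows why the wildcard statement is so dangerous: deny-by-default protects nothing once a single allow statement matches everything.

```python
import fnmatch

def is_allowed(policy, action, resource):
    """Toy IAM evaluator: deny by default, allow if any statement matches.
    Real AWS evaluation also handles explicit denies, conditions, SCPs,
    and resource policies; this sketch covers only identity-based Allows."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        if any(fnmatch.fnmatch(action, a) for a in actions) and \
           any(fnmatch.fnmatch(resource, r) for r in resources):
            return True
    return False

wildcard = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
scoped = {"Statement": [{"Effect": "Allow",
                         "Action": ["s3:GetObject", "s3:ListBucket"],
                         "Resource": ["arn:aws:s3:::my-app-data-bucket",
                                      "arn:aws:s3:::my-app-data-bucket/*"]}]}

# The wildcard policy allows destructive actions on unrelated buckets...
print(is_allowed(wildcard, "s3:DeleteObject", "arn:aws:s3:::payroll/q3.csv"))   # True
# ...while the scoped policy permits only reads on one bucket.
print(is_allowed(scoped, "s3:DeleteObject", "arn:aws:s3:::payroll/q3.csv"))     # False
print(is_allowed(scoped, "s3:GetObject", "arn:aws:s3:::my-app-data-bucket/x"))  # True
```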
Service Accounts and Machine Identity
Applications running in the cloud need credentials to access other services. How you manage these credentials determines your exposure:
graph TD
B["BEST: Cloud-native identity<br/>AWS IAM Roles / Azure Managed Identity<br/>GCP Workload Identity<br/>──────────────────<br/>No credentials to manage, rotate, or leak<br/>Instance/pod automatically receives<br/>temporary tokens (15min-12hr lifetime)"] --> G["GOOD: Secrets manager<br/>AWS Secrets Manager / HashiCorp Vault<br/>──────────────────<br/>Centralized, audited, auto-rotated<br/>secrets with access logging"]
G --> A["ACCEPTABLE: Environment variables<br/>In managed runtime (Lambda, ECS)<br/>──────────────────<br/>Not in code, but visible to process<br/>and in memory dumps"]
A --> BA["BAD: Config files on disk<br/>──────────────────<br/>Readable by anyone with<br/>file system access"]
BA --> T["TERRIBLE: Hard-coded in source<br/>──────────────────<br/>Visible in version control<br/>to everyone, forever"]
T --> C["CATASTROPHIC: Committed to<br/>public GitHub repository<br/>──────────────────<br/>Automated scanners find these<br/>in under 60 seconds"]
style B fill:#27ae60,color:#fff
style G fill:#2ecc71,color:#fff
style A fill:#f39c12,color:#fff
style BA fill:#e67e22,color:#fff
style T fill:#e74c3c,color:#fff
style C fill:#c0392b,color:#fff
# Audit your AWS IAM posture
# List users with console access but no MFA
$ aws iam generate-credential-report && sleep 5
$ aws iam get-credential-report --output text --query Content | \
base64 -d | awk -F, '$4=="true" && $8=="false" {print "NO MFA: "$1}'
# List active access keys with their last-rotation date
# (review any rotated more than 90 days ago)
$ aws iam get-credential-report --output text --query Content | \
base64 -d | awk -F, '$9=="true" {print $1, "Key last rotated:", $10}'
# Find attached IAM policies that allow all actions (overly permissive)
$ aws iam list-policies --only-attached --query 'Policies[*].Arn' --output text | \
tr '\t' '\n' | while read -r arn; do
  version=$(aws iam get-policy --policy-arn "$arn" --query 'Policy.DefaultVersionId' --output text)
  if aws iam get-policy-version --policy-arn "$arn" --version-id "$version" \
       --query 'PolicyVersion.Document' --output json | grep -q '"Action": "\*"'; then
    echo "WILDCARD: $arn"
  fi
done
A startup committed their AWS access keys to a public GitHub repository. Within 90 seconds --- not minutes, seconds --- automated scanners detected the keys and began spinning up cryptocurrency mining instances. By the time the developer noticed and revoked the keys (4 hours later), the attacker had launched over 200 c5.18xlarge instances across multiple regions. The AWS bill for those 4 hours: $47,000.
This is not unusual. Research from GitGuardian shows that over 10 million secrets were detected in public GitHub repositories in 2022 alone. AWS has since added credential scanning that automatically quarantines exposed keys, and tools like git-secrets, gitleaks, truffleHog, and GitHub's built-in secret scanning can prevent commits containing secrets. But the first rule remains: never commit credentials to version control, period.
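Secret scanners like gitleaks and truffleHog work largely by pattern-matching known credential formats in content before it is committed. A minimal sketch of the idea (the patterns here are illustrative; real tools ship hundreds of rules plus entropy analysis for generic high-randomness strings):

```python
import re

# Illustrative patterns only; real scanners ship far more rules.
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat":        re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private-key":       re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan(text):
    """Return (rule_name, line_number) for every suspected secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno))
    return hits

# AKIAIOSFODNN7EXAMPLE is AWS's documented placeholder key, safe to print.
diff = """\
+AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
+AWS_SECRET_ACCESS_KEY=wJalrFake...
+DEBUG=true
"""
print(scan(diff))  # [('aws-access-key-id', 1)]
```

Wired into a pre-commit hook, a non-empty result blocks the commit, which is exactly when the secret is still cheap to remove.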
Network Security in the Cloud
VPC Architecture with Defense in Depth
graph TD
subgraph "VPC 10.0.0.0/16"
subgraph "Public Subnet 10.0.1.0/24"
IGW["Internet Gateway"] --> ALB["Application<br/>Load Balancer<br/>SG: 80,443 from 0.0.0.0/0"]
NAT["NAT Gateway<br/>(outbound only)"]
end
subgraph "Private Subnet 10.0.2.0/24"
APP1["App Server 1<br/>SG: 8080 from ALB-SG"]
APP2["App Server 2<br/>SG: 8080 from ALB-SG"]
end
subgraph "Data Subnet 10.0.3.0/24"
RDS1["RDS Primary<br/>SG: 5432 from App-SG"]
RDS2["RDS Replica<br/>SG: 5432 from App-SG"]
end
ALB -->|"Port 8080 only"| APP1
ALB -->|"Port 8080 only"| APP2
APP1 -->|"Port 5432 only"| RDS1
APP2 -->|"Port 5432 only"| RDS1
APP1 -->|"Outbound via NAT"| NAT
end
NACL1["NACL: Public<br/>Allow 80,443 inbound<br/>from 0.0.0.0/0"] -.-> ALB
NACL2["NACL: Private<br/>Allow from 10.0.1.0/24 only"] -.-> APP1
NACL3["NACL: Data<br/>Allow from 10.0.2.0/24 only"] -.-> RDS1
style ALB fill:#3498db,color:#fff
style APP1 fill:#f39c12,color:#fff
style APP2 fill:#f39c12,color:#fff
style RDS1 fill:#27ae60,color:#fff
style RDS2 fill:#27ae60,color:#fff
Key design principles:
- Databases are never in public subnets and have no public IP addresses
- App servers can only be reached from the load balancer, not directly from the internet
- Security groups reference each other by group ID, not by IP address --- adding a new app server to the App-SG automatically grants it database access
- NACLs provide a second layer of subnet-level filtering (stateless, evaluated in rule order)
- Outbound internet access for private subnets goes through NAT Gateway, providing a single egress point for monitoring
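The third principle --- security groups referencing each other by group ID --- is worth making concrete. A toy model of the semantics (not the real EC2 implementation; group and instance names are invented):

```python
# Rules allow a *source group*, not an IP, so membership changes
# propagate automatically -- no rule edits when instances come and go.
groups = {
    "alb-sg": {"members": {"alb-1"}},
    "app-sg": {"members": {"app-1", "app-2"}},
    "db-sg":  {"members": {"rds-primary"}},
}
# Inbound rules per group: (port, allowed source group)
ingress = {
    "app-sg": [(8080, "alb-sg")],
    "db-sg":  [(5432, "app-sg")],
}

def allowed(src_instance, dst_instance, port):
    dst_groups = [g for g, v in groups.items() if dst_instance in v["members"]]
    for g in dst_groups:
        for rule_port, src_group in ingress.get(g, []):
            if rule_port == port and src_instance in groups[src_group]["members"]:
                return True
    return False

print(allowed("app-1", "rds-primary", 5432))  # True
print(allowed("alb-1", "rds-primary", 5432))  # False: ALB can't reach the DB

# Scale out: a new app server joins app-sg and immediately gets
# database access, with zero rule changes.
groups["app-sg"]["members"].add("app-3")
print(allowed("app-3", "rds-primary", 5432))  # True
```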
Security Groups vs. NACLs
| Feature | Security Groups | Network ACLs |
|---|---|---|
| State | Stateful (return traffic auto-allowed) | Stateless (must allow return traffic explicitly) |
| Level | Instance/ENI level | Subnet level |
| Rules | Allow rules only | Allow AND deny rules |
| Evaluation | All rules evaluated | Rules evaluated in number order, first match wins |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
| Use case | Fine-grained per-instance control | Broad subnet-level guardrails |
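The evaluation differences in this table can be made concrete. A toy NACL evaluator (a simplification of the real behavior, with invented rule numbers):

```python
# Toy NACL: stateless, numbered rules, first match wins.
# Contrast with security groups, which are stateful and evaluate
# the union of all allow rules.
nacl = [  # (rule_number, action, matching port range)
    (100, "ALLOW", range(80, 81)),
    (110, "ALLOW", range(443, 444)),
    (200, "DENY",  range(0, 65536)),   # explicit catch-all deny
]

def nacl_decision(port):
    for _, action, ports in sorted(nacl):   # evaluate in rule-number order
        if port in ports:
            return action                   # first match wins
    return "DENY"                           # implicit deny if nothing matches

print(nacl_decision(443))    # ALLOW
print(nacl_decision(8080))   # DENY
# Statelessness bites here: return traffic arrives on a high ephemeral
# port, and this NACL has no rule for it, so responses to outbound
# requests are silently dropped.
print(nacl_decision(49152))  # DENY
```

A stateful security group would track the outbound connection and admit the return traffic automatically; with NACLs you must add an explicit ephemeral-port allow rule.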
Container Security
Containers add another layer of abstraction --- and another attack surface. Understanding container isolation mechanisms is essential for securing containerized workloads.
Container Isolation: Not a VM
graph TD
subgraph "Virtual Machine Isolation"
VA["App A + Libraries"] --> VOS1["Guest OS (full)"]
VB["App B + Libraries"] --> VOS2["Guest OS (full)"]
VOS1 --> HV["Hypervisor<br/>(hardware-level isolation)"]
VOS2 --> HV
HV --> HW1["Host Hardware"]
end
subgraph "Container Isolation"
CA["App A + Libraries"] --> CR["Container Runtime<br/>(namespaces + cgroups)<br/>SHARED KERNEL"]
CB["App B + Libraries"] --> CR
CR --> HW2["Host OS + Kernel"]
end
style HV fill:#27ae60,color:#fff
style CR fill:#e67e22,color:#fff
The critical difference: containers share the host kernel. A kernel exploit in one container can compromise ALL containers on the same host. VMs have a complete hardware abstraction boundary between guests.
Linux Namespaces
Namespaces give each container its own isolated view of system resources:
| Namespace | Isolates | Security Implication |
|---|---|---|
| PID | Process IDs | Container sees only its own processes |
| NET | Network stack | Own IP, ports, routing table |
| MNT | Filesystem mounts | Own root filesystem |
| UTS | Hostname | Own hostname and domain |
| IPC | Inter-process communication | Own semaphores, message queues |
| USER | User/group IDs | Root in container != root on host (when enabled) |
| CGROUP | Cgroup root | Own cgroup hierarchy |
Control Groups (cgroups)
While namespaces isolate what a container can see, cgroups limit what it can use:
# Run a container with strict resource limits.
# (Flags annotated here rather than inline: text after a trailing "\"
#  would be passed to docker as arguments, not treated as a comment.)
#   --memory / --memory-swap=512m      max 512MB RAM; equal values = no swap
#   --cpus=1.0                         max 1 CPU core
#   --pids-limit=100                   max 100 processes (prevents fork bombs)
#   --read-only                        read-only root filesystem
#   --tmpfs /tmp:...                   writable /tmp, noexec, capped at 64MB
#   --security-opt=no-new-privileges   prevent privilege escalation
#   --cap-drop=ALL / --cap-add=...     drop all capabilities, add back only what's needed
$ docker run -d \
    --memory=512m \
    --memory-swap=512m \
    --cpus=1.0 \
    --pids-limit=100 \
    --read-only \
    --tmpfs /tmp:rw,noexec,nosuid,size=64m \
    --security-opt=no-new-privileges \
    --cap-drop=ALL \
    --cap-add=NET_BIND_SERVICE \
    myapp:v1.2.3
Seccomp Profiles
Seccomp (Secure Computing Mode) restricts which system calls a container can make. Docker's default seccomp profile blocks approximately 44 of 300+ syscalls, including dangerous ones like mount, reboot, kexec_load, and ptrace.
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64"],
"syscalls": [
{
"names": ["read", "write", "open", "close", "stat",
"fstat", "poll", "lseek", "mmap", "mprotect",
"munmap", "brk", "ioctl", "access", "pipe",
"select", "dup2", "nanosleep", "getpid",
"socket", "connect", "accept", "sendto",
"recvfrom", "bind", "listen", "exit_group",
"futex", "epoll_wait", "epoll_ctl",
"clone", "execve", "openat", "newfstatat"],
"action": "SCMP_ACT_ALLOW"
}
]
}
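The kernel enforces this as a compiled BPF filter checked on every system call; reduced to its decision table, a profile like the one above behaves roughly like this sketch (a simplification, not the real kernel mechanism):

```python
import json

# Seccomp decision logic in miniature: anything not on the allow list
# gets the defaultAction -- here SCMP_ACT_ERRNO, i.e. the syscall fails
# with an error instead of executing.
profile = json.loads("""{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [{"names": ["read", "write", "openat", "close", "execve"],
                "action": "SCMP_ACT_ALLOW"}]
}""")

def seccomp_action(profile, syscall):
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]

print(seccomp_action(profile, "read"))    # SCMP_ACT_ALLOW
print(seccomp_action(profile, "mount"))   # SCMP_ACT_ERRNO: blocked
print(seccomp_action(profile, "ptrace"))  # SCMP_ACT_ERRNO: blocked
```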
Docker Security Best Practices
The most common Docker security mistakes are running as root, using latest tags, not scanning images, and mounting the Docker socket. Each one deserves attention.
1. Do Not Run as Root
# BAD: Runs as root by default
FROM node:18
COPY . /app
CMD ["node", "server.js"]
# GOOD: Create and use a non-root user
FROM node:18-alpine
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
COPY --chown=appuser:appgroup . .
USER appuser
CMD ["node", "server.js"]
2. Pin Image Versions
# BAD: "latest" can change at any time, breaking reproducibility
FROM node:latest
# GOOD: Pin to specific version
FROM node:18.19.1-alpine3.19
# BEST: Pin to digest (immutable, guaranteed same image)
FROM node@sha256:a1f3c5e22e5d89f15e6b3c2...
3. Scan Images for Vulnerabilities
# Scan with Trivy (open source, comprehensive)
$ trivy image myapp:v1.2.3
myapp:v1.2.3 (alpine 3.19.1)
Total: 3 (HIGH: 2, CRITICAL: 1)
# Integrate into CI/CD to block deployment of vulnerable images
$ trivy image --exit-code 1 --severity CRITICAL myapp:v1.2.3
# Exit code 1 = critical vulnerabilities found, fail the build
4. Never Mount the Docker Socket
# DANGEROUS: Gives the container full control over the Docker daemon
$ docker run -v /var/run/docker.sock:/var/run/docker.sock myapp
# This container can now:
# - Start/stop any container on the host
# - Create privileged containers (full host access)
# - Mount the host filesystem into a new container
# - Effectively has root on the host
5. Complete Secure Dockerfile
# Multi-stage build: build tools don't end up in production image
FROM node:18.19.1-alpine3.19 AS builder
WORKDIR /build
COPY package*.json ./
# Install all deps (the build step needs devDependencies),
# then prune to production-only before the final-stage copy
RUN npm ci && npm cache clean --force
COPY . .
RUN npm run build && npm prune --production
# Production stage: minimal image
FROM node:18.19.1-alpine3.19
RUN apk add --no-cache dumb-init && \
addgroup -S appgroup && \
adduser -S appuser -G appgroup
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /build/dist ./dist
COPY --from=builder --chown=appuser:appgroup /build/node_modules ./node_modules
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget -qO- http://localhost:3000/health || exit 1
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/server.js"]
Docker's default configuration is insecure in several ways:
- Containers run as root by default
- All capabilities are granted by default (use `--cap-drop ALL --cap-add` only what is needed)
- Network is bridged with full outbound access by default
- No resource limits by default (a container can consume all host CPU and memory)
- The Docker daemon runs as root, so any container escape = root on host
Every Docker deployment should have a hardening baseline that addresses these defaults. CIS Docker Benchmark provides a comprehensive checklist.
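Several of these defaults can be checked mechanically. A toy audit in the spirit of docker-bench-security, using field names as they appear in `docker inspect` output (the container dict here is a fabricated example, not real inspect output):

```python
# A container launched with plain `docker run` would look roughly like this:
container = {
    "Config": {"User": ""},                   # empty => runs as root
    "HostConfig": {
        "Privileged": False,
        "ReadonlyRootfs": False,
        "Memory": 0,                          # 0 => unlimited
        "CapDrop": [],
        "Binds": ["/var/run/docker.sock:/var/run/docker.sock"],
    },
}

def audit(c):
    issues = []
    if not c["Config"]["User"]:
        issues.append("runs as root")
    if c["HostConfig"]["Privileged"]:
        issues.append("privileged mode")
    if not c["HostConfig"]["ReadonlyRootfs"]:
        issues.append("writable root filesystem")
    if c["HostConfig"]["Memory"] == 0:
        issues.append("no memory limit")
    if "ALL" not in c["HostConfig"]["CapDrop"]:
        issues.append("capabilities not dropped")
    if any("docker.sock" in b for b in c["HostConfig"].get("Binds", [])):
        issues.append("docker socket mounted (container can root the host)")
    return issues

for issue in audit(container):
    print("FAIL:", issue)
```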
Kubernetes Security
Kubernetes adds orchestration on top of containers --- and with it, a significant expansion of the attack surface. The Kubernetes API server, etcd datastore, kubelet, and service mesh all present potential attack vectors.
Kubernetes RBAC
Role-Based Access Control determines who can do what to which resources. The most common mistake: granting cluster-admin to CI/CD service accounts "because it is easier."
# Role: Allow reading pods and logs in the "production" namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: production
name: pod-reader
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
---
# Bind the role to a specific user
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
namespace: production
name: read-pods-binding
subjects:
- kind: User
name: dev@company.com
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
# Audit RBAC: find who has cluster-admin (this is usually too broad)
# (".subjects[]?" skips bindings that have no subjects list)
$ kubectl get clusterrolebindings -o json | \
jq '.items[] | select(.roleRef.name=="cluster-admin") | .subjects[]?'
# Check what a specific user can do
$ kubectl auth can-i --list --as=dev@company.com -n production
# Find roles with wildcard permissions (dangerous)
$ kubectl get roles,clusterroles -A -o json | \
jq '.items[] | select(.rules[]?.resources[]? == "*" or .rules[]?.verbs[]? == "*") | .metadata.name'
Kubernetes Network Policies
By default, all pods in a Kubernetes cluster can communicate with all other pods. This is the equivalent of a flat network. Network policies implement microsegmentation.
# Default deny all ingress traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
---
# Allow traffic only from frontend to backend on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-to-backend
namespace: production
spec:
podSelector:
matchLabels:
app: backend
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
Pod Security Standards
# Hardened pod security context
apiVersion: v1
kind: Pod
metadata:
name: secure-app
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:v1.2.3@sha256:abc123...
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
resources:
limits:
memory: "256Mi"
cpu: "500m"
requests:
memory: "128Mi"
cpu: "250m"
The cloud instance metadata service at 169.254.169.254 is a frequent attack target. In cloud environments (AWS, GCP, Azure), this service provides instance credentials, including IAM roles with cloud permissions. A compromised pod that can reach this IP can obtain credentials for the node's IAM role and access cloud resources far beyond what the pod should have.
Mitigations:
1. **Network policies** blocking pod access to 169.254.169.254
2. **IRSA (IAM Roles for Service Accounts)** on AWS --- provides pod-specific IAM credentials without the metadata service
3. **Workload Identity** on GCP --- same concept, binds K8s service accounts to GCP service accounts
4. **IMDSv2** on AWS --- requires a session token obtained via PUT request, making SSRF exploitation harder (but not impossible)
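IMDSv2's protection comes from the token round-trip. A sketch of the two-step flow, with the HTTP transport injected as a plain function so the logic runs without a real metadata service (the role name and fake responses are invented; the header names and paths follow AWS's documented IMDSv2 API):

```python
METADATA = "http://169.254.169.254"

def fetch_role_credentials(http, role="my-node-role"):
    # Step 1: obtain a session token via PUT -- a GET-only SSRF
    # (like the Capital One attack) cannot perform this step.
    token = http("PUT", f"{METADATA}/latest/api/token",
                 headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"})
    # Step 2: the token must accompany every metadata request.
    return http("GET",
                f"{METADATA}/latest/meta-data/iam/security-credentials/{role}",
                headers={"X-aws-ec2-metadata-token": token})

# Fake transport standing in for an IMDSv2-enforcing metadata service:
def fake_http(method, url, headers):
    if method == "PUT" and url.endswith("/api/token"):
        return "tok-123"
    if headers.get("X-aws-ec2-metadata-token") == "tok-123":
        return '{"AccessKeyId": "ASIA...", "Expiration": "..."}'
    raise PermissionError("401: token required (IMDSv2 enforced)")

print(fetch_role_credentials(fake_http))  # credentials, via the token flow
```

Note the hedge in the list above: IMDSv2 raises the bar, but an SSRF primitive that can issue arbitrary methods and headers can still complete both steps.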
Audit your Kubernetes security posture:
1. List all ClusterRoleBindings: `kubectl get clusterrolebindings`
2. Identify any that grant `cluster-admin` to non-admin users or service accounts
3. Check for pods running as root: `kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: runAsNonRoot={.spec.securityContext.runAsNonRoot}{"\n"}{end}'`
4. Verify network policies exist: `kubectl get networkpolicies -A`
5. Check for privileged containers: `kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true) | .metadata.name'`
The most dangerous Kubernetes misconfiguration: an anonymous or overly-permissive API server accessible from the internet. Run `kubectl cluster-info` and verify the API endpoint is not publicly accessible without authentication.
Container Image Supply Chain Security
Container image supply chain attacks are a growing threat. If an attacker compromises a base image or a popular library image, every application built on it is compromised.
flowchart LR
subgraph "Image Supply Chain"
BI["Base Image Selection<br/>Official images only<br/>Minimal base (Alpine, distroless)"]
BUILD["Build<br/>Multi-stage builds<br/>Pin versions + digests<br/>No secrets in layers"]
SCAN["Scan<br/>Trivy, Grype in CI/CD<br/>Block critical vulns<br/>Generate SBOM"]
SIGN["Sign<br/>cosign (Sigstore)<br/>Cryptographic signature<br/>Provenance attestation"]
STORE["Store<br/>Private registry<br/>Access controls<br/>Retention policies"]
ADMIT["Admit<br/>Verify signature at deploy<br/>OPA/Gatekeeper policies<br/>Block unsigned images"]
RUN["Runtime<br/>Falco monitoring<br/>Read-only filesystem<br/>No privilege escalation"]
end
BI --> BUILD --> SCAN --> SIGN --> STORE --> ADMIT --> RUN
style SCAN fill:#e74c3c,color:#fff
style SIGN fill:#3498db,color:#fff
style ADMIT fill:#27ae60,color:#fff
# Sign a container image with cosign (Sigstore)
$ cosign sign --key cosign.key myregistry.com/myapp:v1.2.3
# Verify a signed image before deployment
$ cosign verify --key cosign.pub myregistry.com/myapp:v1.2.3
Verification for myregistry.com/myapp:v1.2.3 --
The following checks were performed on each of these signatures:
- The cosign claims were validated
- The signatures were verified against the specified public key
# Generate an SBOM (Software Bill of Materials)
$ syft myapp:v1.2.3 -o spdx-json > sbom.json
# Scan the SBOM for vulnerabilities
$ grype sbom:sbom.json
Cloud-Native Security Tools
| Category | Tools | Purpose |
|---|---|---|
| CSPM (Cloud Security Posture) | AWS Security Hub, Azure Defender, GCP SCC, Prowler (open source) | Detect misconfigurations across cloud accounts |
| IaC Scanning | Checkov, tfsec, Terrascan, KICS | Scan Terraform/CloudFormation before deployment |
| Container Scanning | Trivy, Grype, Snyk Container | Image vulnerability scanning |
| Runtime Security | Falco, Tracee, Tetragon | eBPF-based runtime anomaly detection |
| Image Signing | cosign/Sigstore, Notary v2 | Cryptographic image signing and verification |
| Policy Enforcement | OPA/Gatekeeper, Kyverno | K8s admission control policies |
| Secret Management | Vault, AWS Secrets Manager, External Secrets Operator | Centralized, audited, rotated secrets |
# Run Prowler to audit AWS account security
$ prowler aws --severity critical high
# FAIL: S3 bucket "analytics-data" has public access enabled
# FAIL: IAM root account has no MFA enabled
# FAIL: CloudTrail logging is disabled in us-west-2
# PASS: VPC Flow Logs are enabled in all VPCs
# PASS: RDS instances are not publicly accessible
# Scan Terraform before applying
$ checkov -d ./terraform/
Passed checks: 42, Failed checks: 3, Skipped checks: 0
Check: CKV_AWS_18: "Ensure the S3 bucket has access logging enabled"
FAILED for resource: aws_s3_bucket.data_bucket
File: /main.tf:23-31
The key insight with cloud security tools is shift left --- find misconfigurations before they are deployed, not after. Scanning your Terraform code in CI/CD is cheaper and faster than discovering the misconfiguration in production during a breach investigation. A checkov run in your pull request pipeline takes 30 seconds and catches the S3 bucket that would otherwise be public for months before someone notices.
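Stripped to its essence, a Checkov-style check is just a policy function over parsed configuration. A toy version (the resource model here is invented for illustration, not Checkov's real internals):

```python
# Parsed IaC reduced to a dict: resource type -> name -> attributes.
plan = {
    "aws_s3_bucket": {
        "data_bucket":   {"acl": "private",     "logging": False},
        "public_assets": {"acl": "public-read", "logging": True},
    }
}

def check_s3(plan):
    """Flag public ACLs and missing access logging before deployment."""
    findings = []
    for name, cfg in plan.get("aws_s3_bucket", {}).items():
        if cfg.get("acl", "private") != "private":
            findings.append(f"FAILED {name}: bucket ACL is {cfg['acl']}")
        if not cfg.get("logging"):
            findings.append(f"FAILED {name}: access logging disabled")
    return findings

for finding in check_s3(plan):
    print(finding)
# FAILED data_bucket: access logging disabled
# FAILED public_assets: bucket ACL is public-read
```

The point is the placement, not the sophistication: this runs in seconds in a pull-request pipeline, against configuration that has not yet touched production.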
What You've Learned
In this chapter, you explored security in cloud and container environments:
- Shared responsibility model: The cloud provider secures the infrastructure; you secure your configuration, data, and access. The boundary shifts between IaaS, PaaS, and SaaS. Most cloud breaches result from customer misconfigurations, not provider failures.
- IAM is the foundation. Least privilege with specific actions on specific resources with conditions. No permanent credentials in code --- use cloud-native identity (IAM Roles, Managed Identity, Workload Identity). MFA for all humans. Regular access reviews. The Capital One breach was fundamentally an IAM problem.
- VPC network architecture. Public, private, and data subnets with security groups referencing each other by ID. NACLs for subnet-level guardrails. Defense in depth: the database should never be reachable from the internet, even through multiple layers.
- Container isolation mechanisms. Namespaces isolate visibility, cgroups limit resource consumption, seccomp restricts system calls, AppArmor/SELinux provide mandatory access control. But containers share the host kernel --- they are not as isolated as VMs.
- Docker security. Do not run as root. Pin image versions by digest. Scan for vulnerabilities in CI/CD. Never mount the Docker socket. Use read-only filesystems with explicit tmpfs mounts. Drop all capabilities and add back only what is needed.
- Kubernetes security. RBAC with least privilege (never cluster-admin for CI/CD). Network policies for pod-to-pod segmentation (default deny). Pod security contexts with runAsNonRoot, readOnlyRootFilesystem, and capability dropping. Protect the metadata service.
- Image supply chain. Sign images with cosign, verify at admission with OPA/Gatekeeper, generate SBOMs with syft, scan continuously with Trivy/Grype. Use private registries with access controls.
- Cloud-native tools. CSPM (Prowler) for cloud misconfigurations, IaC scanning (Checkov) for pre-deployment checks, Falco for runtime detection, and Vault for secret management. Shift left: catch misconfigurations before deployment.
- Logging and monitoring. CloudTrail records every API call in AWS. VPC Flow Logs capture network metadata. Kubernetes audit logs record every API server request. GuardDuty and equivalent services provide managed threat detection. Without comprehensive logging, cloud compromise detection is effectively impossible --- you cannot investigate what you did not record.
- Multi-cloud and hybrid considerations. Organizations running workloads across AWS, Azure, and GCP face multiplicative complexity. Each provider has different IAM models, different logging systems, and different security tooling. A consistent security posture requires either provider-agnostic tooling (Terraform + Checkov, Falco, Vault) or dedicated expertise in each platform. The worst outcome is inconsistent security across providers, where the least-secured environment becomes the attack path to the others.
Three cloud breaches that illustrate the most common failure patterns:
1. **Capital One (2019, 100M records):** A misconfigured WAF allowed SSRF to the EC2 metadata service (169.254.169.254), which returned IAM role credentials. The role had excessive permissions to S3 buckets containing customer data. Root causes: overprivileged IAM role, missing IMDSv2 enforcement (which requires session tokens and blocks SSRF), and insufficient monitoring of S3 data access patterns.
2. **SolarWinds / Microsoft Cloud (2020-2021):** After the initial supply chain compromise, APT29 pivoted to cloud environments. They forged SAML tokens using stolen signing certificates (Golden SAML attack) to access Office 365 mailboxes and Azure AD. Root cause: the SAML signing certificate was stored on an on-premises server that was already compromised. Lesson: your cloud security is only as strong as the on-premises infrastructure that manages cloud identity.
3. **Uber (2022):** An attacker obtained credentials through social engineering (MFA fatigue --- repeatedly sending push notifications until the employee approved one). From there, they found a PowerShell script on a network share containing hardcoded admin credentials for the PAM system, which gave them access to AWS, GCP, and internal dashboards. Root causes: MFA fatigue (mitigated by number-matching), hardcoded credentials in scripts, and insufficient network segmentation.
The common thread across all three breaches: the cloud infrastructure itself was not hacked. The cryptography was not broken. The provider did not fail. In every case, the customer misconfigured access controls, stored credentials insecurely, or failed to detect anomalous activity in time. This is the shared responsibility model in action --- the provider secures the infrastructure, but the customer is responsible for securing what runs on it.
The attack surface in cloud and container environments is enormous. Where should you start?
Three things, in this order. First, lock down IAM --- MFA everywhere, no wildcard permissions, no long-lived credentials, audit who has admin access. Second, enable logging --- CloudTrail, VPC Flow Logs, Kubernetes audit logs. You cannot detect what you cannot see. Third, scan your configuration --- run Prowler or Checkov and fix everything rated critical. Those three actions address the vast majority of cloud security breaches. Then iterate from there.
For containers, the same principle applies: start with the highest-impact items. Run containers as non-root. Scan images in CI/CD before they reach production. Set up a default-deny NetworkPolicy in every namespace. Use Pod Security Standards (the replacement for PodSecurityPolicy) to enforce baseline or restricted security profiles. And never mount the Docker socket into a container --- that gives the container full control over the host. Get these basics right before chasing more exotic threats.
How often should you audit your cloud configuration? Continuously. Run Prowler and Checkov in your CI/CD pipeline so misconfigurations are caught before deployment. Schedule weekly automated scans of your production accounts. Review IAM access quarterly. And after every incident, audit the configuration that was exploited and add a check to prevent recurrence. Cloud environments drift fast --- what was secure last month may not be secure today because someone added an overprivileged role or opened a security group for "temporary" debugging and forgot to close it.