Chapter 33: Security Monitoring and SIEM
"You can't protect what you can't see. And in most organizations, the security team is flying blind." --- Anton Chuvakin, former Gartner VP Analyst
The Breach That Went Unnoticed for 277 Days
Here is a number that should make you uncomfortable. According to IBM's Cost of a Data Breach Report, the average time to identify and contain a data breach is 277 days. That is nine months. An attacker has nine months of free rein inside your network before you even know they are there. The SolarWinds backdoor went undetected for over nine months. The Marriott breach of 500 million records lasted four years.
So what are security teams doing that whole time? In many cases, they are drowning. Drowning in logs they are not reading, alerts they have tuned out, and dashboards nobody checks. The problem is not a lack of data --- it is too much data and not enough signal extraction. The average enterprise generates 10,000 to 50,000 security events per second. This chapter is about how to build a monitoring program that actually detects threats, not just collects dust.
What to Log and Why
Not all logs are created equal. The first decision in any monitoring program is what to collect. Collecting everything sounds appealing until you are paying $50,000 per month for log storage and your SIEM is choking on noise.
The Logging Hierarchy by Layer
graph TD
subgraph "Network Layer"
N1["Firewall allow/deny events"]
N2["IDS/IPS alerts"]
N3["DNS query logs"]
N4["Proxy/web filter logs"]
N5["NetFlow/IPFIX records"]
N6["VPN connection/disconnection"]
end
subgraph "Host Layer"
H1["Authentication: success AND failure"]
H2["Privilege escalation / sudo"]
H3["Process creation (Sysmon Event 1)"]
H4["PowerShell Script Block Logging"]
H5["File access on sensitive shares"]
H6["Service install / driver load"]
end
subgraph "Application Layer"
A1["Web server access/error logs"]
A2["Database authentication + queries"]
A3["Application auth events"]
A4["API access logs"]
A5["Email gateway events"]
end
subgraph "Cloud Layer"
C1["AWS CloudTrail API calls"]
C2["VPC Flow Logs"]
C3["Azure Activity Log"]
C4["GCP Cloud Audit Logs"]
C5["K8s audit logs"]
C6["SaaS audit logs (O365, Okta)"]
end
style N1 fill:#e74c3c,color:#fff
style H1 fill:#e74c3c,color:#fff
style H2 fill:#e74c3c,color:#fff
style C1 fill:#e74c3c,color:#fff
style N3 fill:#e74c3c,color:#fff
Why log successful authentications, not just failures? Because you need to establish what is normal before you can detect what is abnormal. If someone tells you that user "jsmith" logged in from 185.234.72.19, is that suspicious? You have no idea unless you know that jsmith normally logs in from 10.0.1.50 in the New York office between 8 AM and 6 PM. Successful authentication logs build the behavioral baseline that makes anomaly detection possible. Without them, you are flying blind on the most critical question: "Is this legitimate user activity or an attacker using stolen credentials?"
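The baseline idea can be sketched in a few lines of Python --- a toy profiler, not production code, using hypothetical event tuples of (user, source IP, login hour):

```python
from collections import defaultdict

def build_baseline(events):
    """Record the source IPs and login hours seen per user."""
    baseline = defaultdict(lambda: {"ips": set(), "hours": set()})
    for user, src_ip, hour in events:
        baseline[user]["ips"].add(src_ip)
        baseline[user]["hours"].add(hour)
    return baseline

def is_anomalous(baseline, user, src_ip, hour):
    """Flag logins from a never-seen IP or outside observed hours."""
    profile = baseline.get(user)
    if profile is None:
        return True  # no baseline at all -> investigate
    return src_ip not in profile["ips"] or hour not in profile["hours"]

# Successful logins observed over the baseline period
history = [("jsmith", "10.0.1.50", h) for h in range(8, 18)]
baseline = build_baseline(history)

print(is_anomalous(baseline, "jsmith", "10.0.1.50", 9))      # normal pattern
print(is_anomalous(baseline, "jsmith", "185.234.72.19", 3))  # new IP, odd hour
```

Real UEBA products weight many more dimensions (device, geography, access patterns), but the principle is the same: the successful logins are what make the comparison possible.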
Critical Log Sources and What They Reveal
System Logs:
- Linux: /var/log/auth.log (authentication), /var/log/syslog (system events), journald (systemd journal), auditd logs (syscall-level auditing)
- Windows: Security Event Log (logon events 4624/4625, privilege use 4672, account management 4720-4738), Sysmon (process creation, network connections, file creation), PowerShell Script Block Logging (Event 4104)
- Network devices: Syslog from routers, switches, firewalls, wireless controllers
Application Logs:
- Web servers: Apache/Nginx access and error logs (HTTP method, status code, user agent, referrer)
- Databases: Query logs (especially admin operations, schema changes, bulk exports), authentication logs, slow query logs
- Custom applications: Structured event logs following a schema like ECS (Elastic Common Schema)
Cloud Logs:
- AWS: CloudTrail (every API call with caller identity, source IP, timestamp), VPC Flow Logs (network connection metadata), GuardDuty findings (managed threat detection)
- Azure: Activity Log (management plane), NSG Flow Logs, Entra ID sign-in logs
- GCP: Cloud Audit Logs (admin activity, data access), VPC Flow Logs, Security Command Center
# Enable critical Windows security auditing
> auditpol /set /category:"Logon/Logoff" /success:enable /failure:enable
> auditpol /set /category:"Account Logon" /success:enable /failure:enable
> auditpol /set /category:"Account Management" /success:enable /failure:enable
> auditpol /set /category:"Privilege Use" /success:enable /failure:enable
> auditpol /set /category:"Detailed Tracking" /success:enable /failure:enable
# Enable PowerShell Script Block Logging (critical for fileless malware detection)
# Via registry:
# HKLM\SOFTWARE\Policies\Microsoft\Windows\PowerShell\ScriptBlockLogging
# EnableScriptBlockLogging = 1
# EnableScriptBlockInvocationLogging = 1
# Enable command-line process auditing (captures full command lines in Event 4688)
# Group Policy: Computer Configuration > Administrative Templates >
# System > Audit Process Creation > Include command line in process creation events
Sysmon (System Monitor) is a free Windows system service from Microsoft's Sysinternals suite that dramatically improves Windows endpoint visibility. A well-configured Sysmon installation logs: process creation with full command lines and parent process information (Event 1), file creation time changes (Event 2 --- malware backdating files to hide), network connections by process with destination IP/port (Event 3), driver loads (Event 6), image/DLL loads into processes (Event 7), CreateRemoteThread (Event 8 --- used in process injection), raw disk access (Event 9), process access (Event 10 --- used in credential dumping), file stream creation (Event 15 --- alternate data streams), named pipe creation and connection (Events 17, 18 --- used in lateral movement), and DNS queries per process (Event 22).
The community-maintained sysmon-config by SwiftOnSecurity (github.com/SwiftOnSecurity/sysmon-config) provides an excellent baseline configuration that filters out noise while capturing security-relevant events. The olafhartong sysmon-modular configuration provides a more granular, tag-based approach. If you run Windows and do not have Sysmon deployed, you are operating with one eye closed.
Log Pipeline Architecture
The journey from raw log event to actionable alert involves multiple stages, each of which can introduce failures, delays, or data loss.
flowchart LR
subgraph "Collection"
S1["Syslog<br/>(UDP/TCP 514)"]
S2["Filebeat<br/>(file-based)"]
S3["Winlogbeat<br/>(Windows Event)"]
S4["Cloud APIs<br/>(CloudTrail, etc.)"]
S5["Sysmon<br/>(Windows)"]
end
subgraph "Transport & Buffer"
K["Message Queue<br/>Kafka / Redis<br/>Buffer against<br/>burst and downtime"]
end
subgraph "Processing"
L["Parse & Normalize<br/>Logstash / Fluent Bit<br/>Grok patterns<br/>Field extraction<br/>Timestamp normalization<br/>Schema mapping (ECS)"]
E["Enrich<br/>GeoIP lookup<br/>Asset DB lookup<br/>Threat intel IOC match<br/>User identity resolution"]
end
subgraph "Storage & Analysis"
ES["Elasticsearch /<br/>Splunk Indexer<br/>Index<br/>Store<br/>Search"]
end
subgraph "Outputs"
AL["Alerts<br/>PagerDuty, Slack,<br/>SOAR playbooks"]
DA["Dashboards<br/>Kibana, Grafana,<br/>Splunk Dashboards"]
HU["Hunt Interface<br/>Analyst workbench<br/>for ad-hoc queries"]
end
S1 --> K
S2 --> K
S3 --> K
S4 --> K
S5 --> K
K --> L
L --> E
E --> ES
ES --> AL
ES --> DA
ES --> HU
A Practical Logstash Configuration
# /etc/logstash/conf.d/security-pipeline.conf
input {
beats {
port => 5044
ssl => true
ssl_certificate => "/etc/logstash/certs/logstash.crt"
ssl_key => "/etc/logstash/certs/logstash.key"
}
syslog {
port => 5514
type => "syslog"
}
}
filter {
if [type] == "syslog" {
grok {
match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:log_message}" }
}
# Parse SSH authentication events
if [program] == "sshd" {
grok {
match => { "log_message" => "Failed password for %{USER:username} from %{IP:src_ip} port %{INT:src_port}" }
add_tag => ["ssh_failed_auth"]
}
grok {
match => { "log_message" => "Accepted %{WORD:auth_method} for %{USER:username} from %{IP:src_ip} port %{INT:src_port}" }
add_tag => ["ssh_success_auth"]
}
}
}
# GeoIP enrichment for all source IPs
if [src_ip] {
geoip {
source => "src_ip"
target => "geoip"
}
}
# Threat intelligence enrichment
if [src_ip] {
translate {
source => "src_ip"
target => "threat_intel"
dictionary_path => "/etc/logstash/threat_feeds/malicious_ips.yml"
fallback => "clean"
}
}
}
output {
elasticsearch {
hosts => ["https://elasticsearch:9200"]
index => "security-%{+YYYY.MM.dd}"
ssl_certificate_verification => true
user => "logstash_writer"
password => "${ES_PASSWORD}"
}
# Forward high-priority events to alerting
if "ssh_failed_auth" in [tags] or ([threat_intel] and [threat_intel] != "clean") {
http {
url => "https://alerting-service.internal/api/events"
http_method => "post"
format => "json"
}
}
}
Log pipelines carry sensitive data --- authentication events, IP addresses, user activity, sometimes PII. Secure your SIEM infrastructure:
- Encrypt log data in transit (TLS for syslog, Beats, and Logstash connections)
- Encrypt log data at rest (disk encryption on Elasticsearch nodes)
- Restrict access to the SIEM with RBAC --- not everyone needs to search all logs
- Implement retention policies with automatic deletion --- balance security needs against GDPR, privacy regulations
- Monitor the SIEM itself --- if an attacker compromises your SIEM, they can erase their tracks and blind you completely
- Buffer logs through Kafka or similar --- if Elasticsearch goes down, you do not want to lose events during the outage
SIEM: The Brain of Your Security Operations
A Security Information and Event Management (SIEM) system collects, normalizes, correlates, and analyzes log data from across your environment.
SIEM Platforms
Commercial:
- Splunk: The dominant commercial SIEM. Powerful search language (SPL). Expensive at scale --- pricing based on daily data ingestion volume. Typically $2-5 per GB/day
- Microsoft Sentinel: Cloud-native SIEM built on Azure Log Analytics. Uses KQL (Kusto Query Language). Pay-per-GB pricing with free tier for some Azure data
- CrowdStrike LogScale (formerly Humio): High-performance log management with streaming architecture. Excels at high-volume ingestion
- IBM QRadar: Strong correlation engine. Popular in enterprises with IBM ecosystems
Open Source:
- Elastic Security (ELK Stack): Elasticsearch + Kibana + detection rules. Not a SIEM out of the box, but Elastic Security adds prebuilt detection rules, case management, and timeline investigation
- Wazuh: Open-source security monitoring with SIEM, HIDS, compliance monitoring, and vulnerability detection. Active community
- OSSIM (AlienVault Open Source): Community edition of AT&T's USM platform. Integrates multiple open-source tools
Why don't more companies just use the ELK stack instead of paying for Splunk? Because running ELK at scale requires significant engineering effort. Elasticsearch clusters need tuning for performance, index lifecycle management, and storage optimization. You need to build your own detection rules, dashboards, alerting logic, and case management workflows. Splunk gives you much of that out of the box with vendor support. The real cost of "free" open-source is engineering time. For a 10-person security team, the ELK trade-off might be right. For a 3-person team, the engineering burden of maintaining the platform can consume all your capacity, leaving no time for actual security work.
Writing SIEM Queries
Understanding your SIEM's query language is essential. Here are detection-focused examples across platforms:
# ============================================
# SPLUNK SPL QUERIES
# ============================================
# Detect brute force: >10 failed logins from same source in 5 minutes
index=auth sourcetype=linux_secure "Failed password"
| bin _time span=5m
| stats count by src_ip, _time
| where count > 10
| sort -count
# Impossible travel: same user, two distant locations, short time
index=auth action=success
| iplocation src_ip
| stats earliest(_time) as first_login latest(_time) as last_login
values(src_ip) as ips values(City) as cities dc(Country) as country_count by user
| where country_count > 1
| eval time_diff_hours=(last_login-first_login)/3600
| where time_diff_hours < 8
# Detect credential dumping: LSASS access
index=sysmon EventCode=10 TargetImage="*\\lsass.exe"
| search NOT SourceImage IN ("*\\svchost.exe","*\\csrss.exe")
| stats count by SourceImage, Computer
| sort -count
# Large data exfiltration: >500MB to single external IP
index=proxy action=allowed
| where NOT (cidrmatch("10.0.0.0/8",dest_ip) OR cidrmatch("172.16.0.0/12",dest_ip) OR cidrmatch("192.168.0.0/16",dest_ip))
| stats sum(bytes_out) as total_bytes by src_ip, dest_ip
| eval total_MB=round(total_bytes/1024/1024,2)
| where total_MB > 500
| sort -total_MB
# ============================================
# ELASTICSEARCH KQL QUERIES
# ============================================
# Failed authentication from external IPs
event.category: "authentication" AND event.outcome: "failure"
AND NOT source.ip: (10.0.0.0/8 OR 172.16.0.0/12 OR 192.168.0.0/16)
# PowerShell download cradle execution
process.name: "powershell.exe" AND process.command_line:
(*DownloadString* OR *DownloadFile* OR *IEX* OR *Invoke-Expression*
OR *Net.WebClient* OR *Start-BitsTransfer* OR *Invoke-WebRequest*)
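The impossible-travel logic above can also be expressed directly: given two geolocated logins for the same user, compute the great-circle distance and the implied travel speed. A minimal Python sketch --- the coordinates and the 900 km/h airliner ceiling are illustrative assumptions, and real detections must also account for VPN egress points:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(login_a, login_b, max_speed_kmh=900):
    """True if the implied speed between two logins exceeds an airliner's."""
    (t1, lat1, lon1), (t2, lat2, lon2) = sorted([login_a, login_b])
    hours = max((t2 - t1) / 3600, 1e-6)  # timestamps in seconds
    distance = haversine_km(lat1, lon1, lat2, lon2)
    return distance / hours > max_speed_kmh

# Same user: New York, then Kyiv 30 minutes later
ny = (0, 40.71, -74.01)
kyiv = (1800, 50.45, 30.52)
print(impossible_travel(ny, kyiv))  # implied speed far beyond 900 km/h
```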
Alert Fatigue: The Silent Killer of Security Programs
Here is a statistic that explains why breaches go undetected for 277 days: the average SOC receives over 10,000 alerts per day. The average analyst can meaningfully investigate maybe 20-30 per day. Do the math. That means 99.7% of alerts go uninvestigated in many organizations. This is alert fatigue, and it is the single biggest operational problem in security monitoring. It is not a technology problem --- it is a signal-to-noise problem.
The Alert Fatigue Cycle
stateDiagram-v2
[*] --> TooManyAlerts: Default rules, no tuning
TooManyAlerts --> AnalystsOverwhelmed: 10,000+ alerts/day
AnalystsOverwhelmed --> StartIgnoring: Cannot investigate all
StartIgnoring --> RealThreats: True positives buried in noise
RealThreats --> BreachDetectedLate: 277 days average
StartIgnoring --> NoFeedback: No time to tune rules
NoFeedback --> RulesStayNoisy: Rules never improve
RulesStayNoisy --> TooManyAlerts: Cycle continues
BreachDetectedLate --> PostMortem: "Why didn't we catch this?"
PostMortem --> TuningInitiative: Finally invest in tuning
TuningInitiative --> [*]: Break the cycle
note right of TuningInitiative: The solution is TUNING,\nnot more analysts
Why Alert Fatigue Happens
- Default rules left untuned: Out-of-box detection rules generate alerts for everything. They do not know your environment, your baselines, or your business context.
- Low-fidelity rules: "Alert on any failed login" generates thousands of alerts from typos, expired passwords, misconfigured service accounts, and password rotation.
- No context enrichment: An alert saying "suspicious process on 10.0.1.47" means nothing without knowing what that IP is. Is it the CEO's laptop or a test VM in the development lab?
- Alert duplication: Multiple security tools detecting the same event each generate their own alert. The firewall alerts, the IDS alerts, the EDR alerts --- all for the same port scan.
- False positive acceptance: When 95% of alerts are false positives, analysts develop "alert blindness" and stop investigating entirely. The 5% of real threats are buried.
At one organization, a security team inherited a SIEM with 847 active detection rules. In the first week, they received 74,000 alerts. The team reviewed every single rule with one criterion: "Can we articulate what specific attack technique this detects AND what investigation steps an analyst should take when it fires?" If not, the rule was disabled.
They disabled 612 rules. Alert volume dropped to 3,200 per week. Detection coverage actually improved, because analysts could now investigate the alerts that mattered instead of drowning in noise. One of those previously-buried alerts led to the discovery of a compromised service account that had been quietly exfiltrating data for two months.
Fewer, better rules beat more, noisier rules every single time.
Tuning: The Most Important and Most Neglected Activity
Tuning strategies:
- Whitelist known-good activity: If your backup server generates "high volume data transfer" alerts every night at 2 AM, whitelist it by source IP and time window --- but log the whitelist exception so you can review it.
- Increase specificity: Instead of "alert on any PowerShell execution," try "alert on PowerShell execution that downloads content from the internet AND is not signed by a known publisher AND is not launched by a known automation tool."
- Add context enrichment: Enrich alerts with asset information from your CMDB. An alert involving a domain controller is critical. The same alert on a developer workstation is high. On a test VM, it is medium.
- Use threshold tuning: Instead of alerting on a single failed SSH login, alert on 20 failures from the same source in 5 minutes followed by a success (indicating successful brute force).
- Implement risk scoring: Instead of alerting on individual events, assign risk scores and alert when cumulative risk for a user or host exceeds a threshold within a time window.
Risk Scoring Example:
Failed VPN login from unusual country +20 points
Successful login after failures +15 points
Access to sensitive file share +10 points
Large data upload to cloud storage +25 points
New scheduled task created +15 points
Individual events: No alert (each is explainable)
Combined score for user "jsmith" in 1 hour: 85 points
ALERT: Possible account compromise and data exfiltration
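That risk-scoring logic is straightforward to prototype. A sketch using the example weights above, with an assumed 60-point threshold and one-hour window --- both are tunable illustrations, not canonical values:

```python
from collections import deque

ALERT_THRESHOLD = 60       # assumed threshold for this sketch
WINDOW_SECONDS = 3600      # one-hour sliding window

# Illustrative weights matching the example above
RISK_WEIGHTS = {
    "vpn_login_unusual_country": 20,
    "login_after_failures": 15,
    "sensitive_share_access": 10,
    "large_cloud_upload": 25,
    "scheduled_task_created": 15,
}

class RiskScorer:
    """Accumulate per-entity risk within a sliding time window."""
    def __init__(self):
        self.events = {}  # entity -> deque of (timestamp, points)

    def record(self, entity, event_type, timestamp):
        q = self.events.setdefault(entity, deque())
        q.append((timestamp, RISK_WEIGHTS.get(event_type, 0)))
        # Drop events that have aged out of the window
        while q and timestamp - q[0][0] > WINDOW_SECONDS:
            q.popleft()
        score = sum(points for _, points in q)
        return score if score >= ALERT_THRESHOLD else None

scorer = RiskScorer()
t = 0
for event in ["vpn_login_unusual_country", "login_after_failures",
              "sensitive_share_access", "large_cloud_upload",
              "scheduled_task_created"]:
    t += 300  # events five minutes apart
    alert = scorer.record("jsmith", event, t)

print(alert)  # cumulative score of 85 crosses the threshold
```

No single event fires; the combination does --- which is exactly the point of risk scoring.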
Detection Engineering: Writing Rules That Actually Work
Detection engineering is becoming a discipline of its own. It requires understanding both the attack techniques you want to detect and the data sources available to detect them. Good detection rules are not just queries --- they are complete packages with documentation, response playbooks, and maintenance plans.
The MITRE ATT&CK Framework
MITRE ATT&CK is a knowledge base of adversary tactics, techniques, and procedures (TTPs) based on real-world observations. It provides a common language for describing what attackers do and a framework for mapping your detection coverage.
graph LR
subgraph "MITRE ATT&CK Tactics (Enterprise)"
T1["Reconnaissance"] --> T2["Resource<br/>Development"]
T2 --> T3["Initial<br/>Access"]
T3 --> T4["Execution"]
T4 --> T5["Persistence"]
T5 --> T6["Privilege<br/>Escalation"]
T6 --> T7["Defense<br/>Evasion"]
T7 --> T8["Credential<br/>Access"]
T8 --> T9["Discovery"]
T9 --> T10["Lateral<br/>Movement"]
T10 --> T11["Collection"]
T11 --> T12["C2"]
T12 --> T13["Exfiltration"]
T13 --> T14["Impact"]
end
style T3 fill:#e74c3c,color:#fff
style T4 fill:#e74c3c,color:#fff
style T8 fill:#e67e22,color:#fff
style T10 fill:#e67e22,color:#fff
style T12 fill:#8e44ad,color:#fff
style T13 fill:#8e44ad,color:#fff
Each tactic contains multiple techniques. For example, Initial Access includes: T1566 Phishing, T1190 Exploit Public-Facing Application, T1078 Valid Accounts, T1195 Supply Chain Compromise, and others. Each technique has sub-techniques, real-world examples, detection guidance, and mitigation recommendations.
Sigma: Vendor-Agnostic Detection Rules
Sigma is an open standard for writing detection rules that can be converted to any SIEM's query language --- think of it as YARA for log events, or Snort rules for SIEM.
# Sigma rule: Detect suspicious PowerShell download cradle
title: Suspicious PowerShell Download and Execute
id: 3b6ab547-8ec2-4991-b9ce-2f4e10893c64
status: stable
description: |
Detects PowerShell commands that download content from the internet
and execute it, commonly used in initial access and execution phases.
author: Security Team
date: 2026-03-01
modified: 2026-03-12
references:
- https://attack.mitre.org/techniques/T1059/001/
- https://attack.mitre.org/techniques/T1105/
logsource:
category: process_creation
product: windows
detection:
selection_process:
ParentImage|endswith:
- '\powershell.exe'
- '\pwsh.exe'
Image|endswith:
- '\powershell.exe'
- '\pwsh.exe'
selection_pattern:
CommandLine|contains:
- 'IEX'
- 'Invoke-Expression'
- 'DownloadString'
- 'DownloadFile'
- 'Net.WebClient'
- 'Start-BitsTransfer'
- 'Invoke-WebRequest'
- 'iwr '
- 'wget '
- 'curl '
condition: selection_process and selection_pattern
falsepositives:
- Legitimate admin scripts that download content (document and whitelist)
- Package managers (chocolatey, winget)
- System management tools (SCCM, Intune)
level: high
tags:
- attack.execution
- attack.t1059.001
- attack.t1105
# Convert Sigma rule to Splunk SPL
$ sigma convert -t splunk -p sysmon rule.yml
source="WinEventLog:Microsoft-Windows-Sysmon/Operational"
EventCode=1
(ParentImage="*\\powershell.exe" OR ParentImage="*\\pwsh.exe")
(CommandLine="*IEX*" OR CommandLine="*Invoke-Expression*"
OR CommandLine="*DownloadString*" OR CommandLine="*Net.WebClient*")
# Convert to an Elasticsearch (Lucene) query
$ sigma convert -t lucene rule.yml
# The SigmaHQ repository (github.com/SigmaHQ/sigma) contains
# 3000+ community-contributed rules covering most ATT&CK techniques
Detection Rule Quality Checklist
Every detection rule should meet these criteria before deployment:
- Maps to a specific MITRE ATT&CK technique with ID
- Has a clear description of what attack behavior it detects
- Specifies the exact log source and event requirements
- Has been tested against both malicious and benign activity in your environment
- Documents known false positive scenarios and how to triage them
- Includes a response playbook (what to do when it fires, step by step)
- Has an assigned owner responsible for tuning and maintenance
- Specifies severity/priority level based on asset criticality and attack stage
- Has been validated against real attack simulations (MITRE Caldera, Atomic Red Team)
- Has a review date for periodic reassessment
Every detection rule should have a corresponding playbook for what to do when it fires. An alert without a response playbook is just noise with a notification. If your analysts do not know what to investigate when a rule triggers, the rule is useless regardless of how well-crafted the detection logic is. The playbook should specify: what to check first, where to look for context, what constitutes a true positive vs. false positive, and what containment actions to take.
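Part of this checklist can be enforced mechanically before a rule ships. A sketch that treats a rule as a plain dict with hypothetical metadata keys (this is an internal convention to illustrate the idea, not a Sigma-spec field list):

```python
REQUIRED_FIELDS = [
    "attack_technique",  # MITRE ATT&CK technique ID, e.g. T1059.001
    "description",
    "logsource",
    "false_positives",
    "playbook",          # response steps for when the rule fires
    "owner",
    "severity",
    "review_date",
]

def validate_rule(rule):
    """Return the list of checklist fields a rule is missing."""
    return [f for f in REQUIRED_FIELDS if not rule.get(f)]

rule = {
    "title": "Suspicious PowerShell Download and Execute",
    "attack_technique": "T1059.001",
    "description": "Detects PowerShell download cradles",
    "logsource": "windows/process_creation",
    "severity": "high",
}
missing = validate_rule(rule)
print(missing)  # fields to fill in before the rule ships
```

Wiring a check like this into the CI pipeline for your detection-rules repository makes "no playbook, no deploy" a gate rather than a guideline.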
Threat Hunting: Proactive Detection
While SIEM alerting is reactive --- it waits for patterns to match --- threat hunting is proactive. Hunters hypothesize that a specific attack technique is being used in their environment and then search for evidence.
The Threat Hunting Loop
flowchart TD
H["1. HYPOTHESIS<br/>Based on threat intel, ATT&CK,<br/>or organizational risk assessment<br/><br/>Example: 'An attacker may be using<br/>DNS tunneling to exfiltrate data'"] --> D["2. DATA COLLECTION<br/>Gather relevant log data<br/>DNS query logs for past 30 days<br/>Identify all unique domains queried"]
D --> A["3. ANALYSIS<br/>Look for anomalies:<br/>High-entropy subdomain strings<br/>Unusually long domain labels<br/>High query volume to single domain<br/>TXT record queries to obscure domains"]
A --> F{"4. FINDINGS<br/>Suspicious<br/>activity found?"}
F -->|Yes| I["5a. INVESTIGATE<br/>Determine scope, timeline,<br/>affected systems, root cause"]
F -->|No| R["5b. DOCUMENT<br/>Record negative finding<br/>Refine hypothesis for next hunt"]
I --> AU["6. AUTOMATE<br/>Convert hunting insight<br/>into detection rule<br/>for continuous monitoring"]
R --> H
AU --> H
style H fill:#3498db,color:#fff
style A fill:#e67e22,color:#fff
style I fill:#e74c3c,color:#fff
style AU fill:#27ae60,color:#fff
Practical Threat Hunting Queries
# ============================================
# HUNT: DNS Tunneling Detection
# ============================================
# DNS tunneling encodes data in subdomain labels
# Look for domains with unusually long subdomain strings
# Splunk:
index=dns
| eval subdomain=mvindex(split(query,"."),0)
| eval subdomain_length=len(subdomain)
| where subdomain_length > 30
| stats count values(query) as sample_queries by src_ip
| where count > 100
| sort -count
# ============================================
# HUNT: C2 Beaconing Detection
# ============================================
# C2 beacons communicate at regular intervals
# Look for connections with suspiciously consistent timing
# Splunk:
index=proxy
| sort _time
| streamstats current=f last(_time) as prev_time by src_ip, dest
| eval interval=_time-prev_time
| stats count stdev(interval) as jitter avg(interval) as avg_interval
by src_ip, dest
| where count > 50 AND jitter < 5 AND avg_interval > 30 AND avg_interval < 3600
| eval beacon_score=round(100-(jitter/avg_interval*100),1)
| where beacon_score > 90
| sort -beacon_score
# beacon_score near 100 = very consistent timing = likely C2
# ============================================
# HUNT: Lateral Movement via SMB
# ============================================
# Single source connecting to many destinations on 445 = scanning or spreading
# Splunk:
index=firewall dest_port=445 action=allowed
| stats dc(dest_ip) as unique_targets values(dest_ip) as targets by src_ip
| where unique_targets > 20
| sort -unique_targets
# ============================================
# HUNT: Credential Dumping (LSASS Access)
# ============================================
# Processes accessing LSASS memory = likely credential dumping
# Splunk (Sysmon Event 10):
index=sysmon EventCode=10 TargetImage="*\\lsass.exe"
| search NOT SourceImage IN ("*\\svchost.exe","*\\csrss.exe",
"*\\services.exe","*\\wininit.exe","*\\MsMpEng.exe")
| stats count by SourceImage Computer
| sort -count
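The DNS tunneling hunt above uses label length as a proxy for encoded data; Shannon entropy measures the randomness directly, since base64- or hex-encoded payloads score much higher than human-chosen names. A small helper (the example labels are illustrative):

```python
import math
from collections import Counter

def shannon_entropy(label):
    """Bits of entropy per character in a DNS label."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Human-chosen labels cluster low; encoded data scores high
for label in ["mail", "autodiscover", "a81f3c9e02bb47d6a81f3c9e02bb47d6"]:
    print(f"{label:36s} {shannon_entropy(label):.2f}")
```

Combining length and entropy cuts false positives from long-but-meaningful labels like content-delivery hostnames.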
Start a simple threat hunt today:
1. Pull the last 7 days of DNS query logs
2. Group queries by base domain (strip subdomains)
3. Count unique subdomains per base domain
4. Sort by count --- domains with thousands of unique subdomains are suspicious
5. Check the top results against threat intelligence (VirusTotal, AbuseIPDB)
6. If you find something suspicious, congratulations --- you just completed your first hunt
Common benign results to filter out: CDN domains (akamai, cloudfront), analytics (google-analytics), email services (outlook.com), and update services (windowsupdate.com) generate many unique subdomains legitimately. Document these as your "known-good" list and filter them in future hunts.
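The hunt steps above can be sketched end to end, assuming DNS queries have already been parsed out of the logs. The base-domain extraction here is deliberately naive (last two labels --- a real hunt would use the Public Suffix List), and the known-good list mirrors the benign domains just mentioned:

```python
from collections import defaultdict

KNOWN_GOOD = {"akamai.net", "cloudfront.net", "google-analytics.com",
              "outlook.com", "windowsupdate.com"}

def base_domain(query):
    """Naive registrable-domain extraction: last two labels."""
    return ".".join(query.lower().rstrip(".").split(".")[-2:])

def hunt(queries, top_n=5):
    """Count unique subdomains per base domain, filtering known-good."""
    subdomains = defaultdict(set)
    for q in queries:
        base = base_domain(q)
        if base not in KNOWN_GOOD:
            subdomains[base].add(q)
    ranked = sorted(subdomains.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(base, len(subs)) for base, subs in ranked[:top_n]]

# Simulated 7 days of queries: hex-encoded labels to one domain stand out
logs = [f"{i:032x}.exfiltunnel.com" for i in range(500)]
logs += [f"img{i}.cloudfront.net" for i in range(1000)]  # filtered as known-good
logs += ["www.example.org", "mail.example.org"]

print(hunt(logs))  # the high-subdomain-count domain floats to the top
```

From here, step 5 is manual: take the top base domains and check them against threat intelligence before drawing conclusions.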
SOC Operations Structure
A Security Operations Center (SOC) is the organizational function responsible for continuous security monitoring and incident response.
SOC Tier Structure
graph TD
subgraph "SOC Organization"
T1["Tier 1: Alert Triage<br/>──────────────────<br/>Monitor incoming alerts<br/>Initial triage (TP/FP)<br/>Follow runbooks<br/>Escalate to Tier 2<br/>Close FPs with documentation"]
T2["Tier 2: Investigation<br/>──────────────────<br/>Deep-dive investigation<br/>Correlate across data sources<br/>Determine scope and impact<br/>Initial containment<br/>Escalate to Tier 3"]
T3["Tier 3: Advanced Analysis<br/>──────────────────<br/>Malware reverse engineering<br/>Proactive threat hunting<br/>Detection rule development<br/>Incident response lead<br/>Threat intel integration"]
MGR["SOC Manager<br/>──────────────────<br/>Staffing and scheduling<br/>Metrics and reporting<br/>Process improvement<br/>Stakeholder communication<br/>Budget and tooling"]
end
T1 -->|"Escalate complex alerts"| T2
T2 -->|"Escalate advanced threats"| T3
T3 -->|"New detection rules"| T1
T3 -->|"Hunting findings"| T2
MGR --> T1
MGR --> T2
MGR --> T3
style T1 fill:#3498db,color:#fff
style T2 fill:#e67e22,color:#fff
style T3 fill:#e74c3c,color:#fff
style MGR fill:#2c3e50,color:#fff
SOC Metrics That Matter
If you are running a SOC, measure the right things. Vanity metrics like "total alerts processed" are meaningless. Here are the metrics that actually matter.
Mean Time to Detect (MTTD): From when an attack begins to when the SOC is aware of it. The 277-day figure cited earlier combines detection with containment, and detection accounts for most of it. Your goal is to drive MTTD below hours, ideally minutes for critical assets.
Mean Time to Respond (MTTR): From detection to containment. Even if you detect quickly, a slow response gives attackers time to escalate privileges, move laterally, and achieve their objectives.
Mean Time to Acknowledge (MTTA): From alert creation to analyst assignment. If alerts sit in a queue for hours before someone looks at them, your detection rules are wasted.
Alert-to-Incident Ratio: What percentage of alerts result in actual incidents? If less than 5%, your rules are too noisy. If above 50%, you might not be catching enough lower-severity activity.
False Positive Rate per Rule: Track this for each rule individually. Rules with consistently high false positive rates need tuning or removal.
Detection Coverage (ATT&CK Heatmap): Mapped against MITRE ATT&CK, what percentage of techniques do you have detection rules for? Most organizations cover less than 20% when they first measure. The DeTT&CT framework helps visualize this.
Strategies to Reduce MTTD
- Invest in high-fidelity detections. Ten excellent rules that fire rarely and are always investigated beat a thousand noisy ones that are ignored.
- Automate triage with SOAR. Security Orchestration, Automation, and Response platforms can automatically enrich alerts, check threat intelligence, query additional data sources, and classify common alert types --- freeing analysts for complex investigation.
- Hunt proactively. Do not wait for alerts. Hypothesize and search. Hunting finds what automated detection misses.
- Deploy deception technology. Honeypots, honey credentials, honey files, and canary tokens detect attackers that bypass all other controls. If anyone touches a decoy, it is definitionally malicious.
- Measure and improve. For every confirmed incident, retrospectively ask: "Could we have detected this sooner? What data source or rule would have caught it earlier? What log were we missing?"
Set up canary tokens right now --- they take 5 minutes and cost nothing:
1. Go to canarytokens.org
2. Create a "DNS token" --- you will get a unique hostname
3. Embed it in a document called "passwords.xlsx" on a sensitive file share
4. If anyone opens that document, the token fires and you get an email alert
5. Create a "Web bug" token and embed it in a fake AWS credentials file in a honeypot directory
6. Create a "Windows folder" token that alerts when a directory is browsed
Canary tokens exploit the one thing attackers must do: interact with your environment. A token on a share called "IT-Admin-Passwords" will catch any attacker performing discovery. They are free, take minutes to deploy, and can detect attackers that evade every other control.
Log Retention and Compliance
Different regulations require different retention periods:
| Regulation | Minimum Retention | Notes |
|---|---|---|
| PCI DSS | 1 year (3 months immediately available) | Systems processing card data |
| HIPAA | 6 years | Audit logs for PHI access |
| SOX (Sarbanes-Oxley) | 7 years | Financial system logs |
| GDPR | "As short as possible" | Tension: retain for security vs. minimize for privacy |
| NIST 800-171 | Not specified but "sufficient for investigation" | Federal contractor requirements |
| Recommended minimum | 1 year hot, 5 years cold | Balance cost, compliance, and investigative needs |
The tension between security retention and privacy minimization is real. Security teams want to keep logs forever because you never know when you will need to investigate a historical event. Privacy regulations say minimize data collection and retention. The answer is a clear retention policy with automatic deletion, strong access controls, and a process for legal hold when needed.
What You've Learned
In this chapter, you explored the full lifecycle of security monitoring:
- What to log: Prioritize authentication events, privilege changes, process creation, DNS queries, and cloud API calls. Log at every layer --- network, host, application, cloud. Not all logs are equally valuable; focus on the sources that enable detection of the most common attack techniques.
- Log pipeline architecture: Collection, buffering, parsing, normalization, enrichment, storage, and alerting form a pipeline where any failure creates blind spots. Use message queues for resilience, structured schemas for consistency, and threat intelligence for enrichment.
- SIEM fundamentals: Collection, normalization, correlation, alerting, and search form the core functions. Commercial platforms (Splunk, Sentinel) trade cost for convenience; open source (ELK, Wazuh) trades engineering time for cost savings.
- Alert fatigue: The number one operational problem in security monitoring. 10,000+ daily alerts with 95% false positives mean real threats are buried. The solution is aggressive tuning, risk scoring, and context enrichment --- not more analysts.
- Detection engineering: Good detection rules map to MITRE ATT&CK techniques, have documented response playbooks, are tested against benign and malicious activity, and are continuously tuned. Sigma provides a vendor-agnostic format for sharing detection rules.
- Threat hunting: Proactive, hypothesis-driven searching for threats that evade automated detection. Hunting findings become automated detections, creating a virtuous cycle.
- SOC operations: Tiered structure (triage, investigation, advanced analysis) with clear escalation paths. Measure MTTD, MTTR, MTTA, and false positive rates. Deploy deception technology as a last line of detection.
The ideal is to collect the right data, write focused detection rules, tune relentlessly, hunt proactively, and measure everything. The reality is that most organizations are somewhere on the journey between "we have a SIEM but nobody looks at it" and "our detection coverage mapped against ATT&CK is 60% and improving." The important thing is to be moving in the right direction. Start with authentication logs, Sysmon, DNS logs, and three good detection rules. Add a few canary tokens. You will be ahead of most organizations. Then iterate from there.