Hashing, MACs, and Digital Signatures
"A hash function is the duct tape of cryptography — it holds everything together, and when used wrong, everything falls apart." — Adapted from Bruce Schneier
The Fingerprint That Can't Be Forged
How do you verify that a file you downloaded hasn't been tampered with? You compare checksums — the SHA-256 hash listed on the download page against the hash you compute locally. But do you actually understand what that hash represents and why it works?
The concept behind hashing is one of the most powerful ideas in computer science. It underpins everything from password storage to blockchain to git to TLS certificate verification. Get hashing wrong, and you can break systems in ways that are invisible until it's too late.
A hash function takes any amount of input data — one byte or one terabyte — and produces a fixed-size output called a hash or digest. The same input always produces the same output. But even the tiniest change in the input produces a completely different output.
# Same input, same hash — always deterministic
$ echo -n "Hello, World!" | openssl dgst -sha256
SHA2-256(stdin)= dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
$ echo -n "Hello, World!" | openssl dgst -sha256
SHA2-256(stdin)= dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
# Change ONE character (! to .) — the "avalanche effect"
$ echo -n "Hello, World." | openssl dgst -sha256
SHA2-256(stdin)= 27981ebdd89071b807e581e1bc0e93e4b7a7ed1a4e6bf4140523af55e9e76e3e
# Completely different hash from a single character change
# ~50% of the output bits changed — this is by design
# Hash of empty string — even no input has a specific hash
$ echo -n "" | openssl dgst -sha256
SHA2-256(stdin)= e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
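The avalanche effect shown above can be measured directly. A minimal Python sketch that counts how many of the 256 output bits differ between the two "Hello, World" hashes:

```python
import hashlib

a = hashlib.sha256(b"Hello, World!").digest()
b = hashlib.sha256(b"Hello, World.").digest()

# XOR each byte pair and count the set bits = number of differing output bits
diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
print(f"{diff} of 256 bits changed")  # close to 128, i.e. ~50%
```

For a well-designed hash, each output bit flips with probability 1/2 when any input bit changes, so the count lands near 128.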
Properties of Cryptographic Hash Functions
Not all hash functions are cryptographic. CRC32 is a hash function used for error detection, but it's trivially reversible and useless for security. MurmurHash is great for hash tables but has no cryptographic properties. A cryptographic hash function must satisfy three specific security properties:
1. Pre-image Resistance (One-Way)
Given a hash output H, it must be computationally infeasible to find any input M such that hash(M) = H.
This is the "one-way" property. You can go from input to hash in microseconds, but you cannot go from hash back to input — a brute-force search would need to try about 2^256 inputs on average for SHA-256, far beyond any realistic computing power. It's like scrambling an egg — easy to go from egg to scrambled egg, impossible to go from scrambled egg back to egg.
But what about rainbow tables? Rainbow tables are precomputed lookup tables for common inputs — typically passwords. They don't "reverse" the hash function; they precompute hashes for millions of likely inputs and store the mappings. This only works for short, predictable inputs like passwords. For a random 256-bit key, a rainbow table would need to store 2^256 entries — more entries than there are atoms in the observable universe. This is why salts are used for password hashing — a random value added to each password before hashing makes precomputed tables useless.
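The effect of a salt on precomputation can be sketched in a few lines. This builds a toy lookup table of common-password hashes (the passwords are illustrative) and shows it fails the moment a salt is involved:

```python
import hashlib
import os

# Toy "rainbow table": precomputed SHA-256 hashes of likely passwords
table = {hashlib.sha256(p.encode()).hexdigest(): p
         for p in ["123456", "password", "qwerty", "letmein"]}

# Unsalted hash from a leaked database: one dictionary lookup recovers it
leaked = hashlib.sha256(b"letmein").hexdigest()
print(table.get(leaked))         # letmein

# Salted hash: the same password no longer matches any precomputed entry
salt = os.urandom(16)            # unique random salt per user
leaked_salted = hashlib.sha256(salt + b"letmein").hexdigest()
print(table.get(leaked_salted))  # None
```

With per-user salts, an attacker must brute-force each stored hash individually instead of amortizing one table across every leaked database.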
2. Second Pre-image Resistance
Given an input M1, it must be computationally infeasible to find a different input M2 such that hash(M1) = hash(M2).
In plain language: given a specific file, you can't create a different file with the same hash. This is what makes hash-based integrity verification work. If you have a file and its SHA-256 hash, you can verify no one modified the file, because finding a modified version with the same hash requires ~2^256 operations.
3. Collision Resistance
It must be computationally infeasible to find any two different inputs M1 and M2 such that hash(M1) = hash(M2).
How is this different from second pre-image resistance? Subtly but critically. Second pre-image resistance: given a specific M1, find M2 with the same hash. Collision resistance: find any pair (M1, M2) that hash to the same value. The attacker has complete freedom to choose both inputs. This freedom makes collision attacks much easier than second pre-image attacks — roughly 2^(n/2) operations instead of 2^n, due to the birthday paradox.
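The birthday effect is easy to see at toy scale. Truncating SHA-256 to 24 bits gives a 2^24 hash space, yet a collision search succeeds after only a few thousand inputs — roughly the square root of the space, as the birthday bound predicts:

```python
import hashlib
from itertools import count

def toy_hash(data: bytes) -> bytes:
    """First 3 bytes of SHA-256 — a 24-bit toy hash for demonstration."""
    return hashlib.sha256(data).digest()[:3]

def find_collision() -> tuple[bytes, bytes, int]:
    """Hash sequential inputs until two different ones share a toy-hash value."""
    seen: dict[bytes, bytes] = {}
    for i in count():
        msg = str(i).encode()
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i
        seen[h] = msg

m1, m2, tries = find_collision()
print(f"collision after {tries} inputs")  # thousands, not millions
```

Finding a *second pre-image* for a specific input would still take ~2^24 attempts; finding *any* colliding pair takes ~2^12. The same gap is why SHA-256's collision resistance is 2^128, not 2^256.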
graph TD
subgraph PREIMAGE["Pre-image Resistance"]
H1["Given: H = 0xabcd..."] --> Q1{"Find ANY M where<br/>hash(M) = H?"}
Q1 -->|"~2^256 operations"| HARD1["Computationally<br/>infeasible"]
end
subgraph SECOND["Second Pre-image Resistance"]
M1["Given: M1 and<br/>hash(M1) = 0xabcd..."] --> Q2{"Find M2 ≠ M1 where<br/>hash(M2) = 0xabcd...?"}
Q2 -->|"~2^256 operations"| HARD2["Computationally<br/>infeasible"]
end
subgraph COLLISION["Collision Resistance"]
FREE["Attacker chooses<br/>BOTH inputs freely"] --> Q3{"Find ANY M1, M2 where<br/>hash(M1) = hash(M2)?"}
Q3 -->|"~2^128 operations<br/>(birthday paradox)"| HARD3["Harder to guarantee"]
end
NOTE["For SHA-256:<br/>Pre-image: 2^256 work<br/>Second pre-image: 2^256 work<br/>Collision: 2^128 work<br/><br/>MD5 (128-bit): collision in seconds<br/>SHA-1 (160-bit): collision demonstrated"]
style HARD1 fill:#38a169,color:#fff
style HARD2 fill:#38a169,color:#fff
style HARD3 fill:#d69e2e,color:#fff
style NOTE fill:#fff3cd,color:#1a202c
Collision resistance is where MD5 and SHA-1 failed catastrophically. Their stories aren't just historical curiosities — they have real implications for systems running today.
The Death of MD5 and SHA-1
MD5: Dead Since 2004, Still Found in Production
MD5 produces a 128-bit hash. In 2004, Xiaoyun Wang and her team demonstrated practical collision attacks against MD5; within a few years, improved techniques could generate collisions in seconds on commodity hardware. By 2008, researchers demonstrated the devastating real-world impact.
The most devastating MD5 collision attack was the rogue CA certificate attack (2008). Researchers from CWI Amsterdam and other institutions generated two X.509 certificates with identical MD5 hashes but different contents. One was a legitimate-looking end-entity certificate that a Certificate Authority would sign. The other was a CA certificate — a certificate that could issue other certificates. Since both had the same MD5 hash, the CA's signature on the first certificate was also a valid signature on the second.
The result: they created a rogue Certificate Authority trusted by every browser. They could issue certificates for any website on the internet. This attack was the final nail in MD5's coffin for security purposes.
But the story doesn't end there. In 2012, the Flame malware — attributed to state-sponsored actors — used a novel MD5 collision attack against Microsoft's Windows Update certificates. The attackers found an MD5 collision that allowed them to create a fraudulent Microsoft code-signing certificate, enabling them to distribute malware through Windows Update itself. This wasn't a theoretical paper — it was weaponized cryptanalysis deployed against real targets.
# DO NOT use MD5 for security purposes
$ echo -n "test" | openssl dgst -md5
MD5(stdin)= 098f6bcd4621d373cade4e832627b4f6
# MD5 collisions can be generated in seconds on a modern laptop
# Tools like HashClash can produce MD5 collisions in under a minute
# Check if your codebase still uses MD5:
$ grep -r "MD5\|md5" --include="*.py" --include="*.java" --include="*.js" .
# Every result needs to be evaluated for security impact
MD5 is broken for all cryptographic purposes. Do not use it for:
- Integrity verification of downloads or files
- Digital signatures or certificate fingerprints
- Password hashing (broken AND too fast)
- HMAC (technically HMAC-MD5 isn't broken due to HMAC's construction, but there's no reason to use it)
MD5 remains acceptable only for non-security uses like data deduplication, cache keys, or checksums where collision attacks are not in your threat model. Even then, prefer SHA-256 — there's no performance reason to use MD5 on modern hardware. SHA-256 with hardware acceleration is faster than MD5 on many platforms.
SHA-1: Dead Since 2017
SHA-1 produces a 160-bit hash. Theoretical attacks were known since 2005 (Wang's team again), but the first practical collision was demonstrated by Google and CWI Amsterdam in 2017 — the SHAttered attack. They created two different PDF files with identical SHA-1 hashes.
The attack required approximately 2^63 SHA-1 computations, which Google estimated cost about $110,000 in cloud computing resources. That's expensive for an individual but trivial for nation-states, well-funded criminal organizations, or even venture-funded startups.
# SHA-1 — broken, don't use for new applications
$ echo -n "test" | openssl dgst -sha1
SHA1(stdin)= a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
# The SHAttered collision PDFs are available at shattered.io
# Both files have SHA-1 hash: 38762cf7f55934b34d179ae6a4c80cadccbb7f0a
# but completely different content
# In 2020, a "chosen-prefix collision" attack was demonstrated
# (SHA-1 is in Shambles — Leurent & Peyrin)
# Cost: estimated at $45,000 in cloud resources
# This is FAR more dangerous than identical-prefix collisions:
# the attacker can choose arbitrary prefixes for both files
The chosen-prefix collision is particularly dangerous because it enables practical attacks against real protocols. An attacker can create two certificates with the same SHA-1 hash where the first is a legitimate certificate and the second has attacker-chosen content. This directly attacks any system that uses SHA-1 for certificate signatures, PGP key IDs, or code signing.
SHA-256 and SHA-3: The Current Standards
# SHA-256 (SHA-2 family) — the workhorse, use this
$ echo -n "test" | openssl dgst -sha256
SHA2-256(stdin)= 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
# SHA-512 — longer output, slightly different performance profile
# Faster than SHA-256 on 64-bit processors (operates on 64-bit words)
$ echo -n "test" | openssl dgst -sha512
SHA2-512(stdin)= ee26b0dd4af7e749aa1a8ee3c10ae9923f618980772e473f8819a5d4...
# SHA-3 (Keccak) — different internal design, backup standard
# Uses sponge construction instead of Merkle-Damgard
# Not vulnerable to length extension attacks (unlike SHA-256)
$ echo -n "test" | openssl dgst -sha3-256
SHA3-256(stdin)= 36f028580bb02cc8272a9a020f4200e346e276ae664e45ee80745574e2f5ab80
# BLAKE2 — faster than SHA-256, used in Argon2 password hashing
# Not yet in NIST standards but widely trusted
$ echo -n "test" | b2sum
SHA-256 is the standard for virtually all modern security applications. SHA-3 exists as a hedge — if a structural weakness is found in the SHA-2 family (which uses the Merkle-Damgard construction), SHA-3's completely different internal design (sponge construction) would likely be unaffected. Using SHA-3 also avoids length extension attacks, which are discussed later in this chapter.
How Git Uses Hashing for Integrity
Git is fundamentally a content-addressable filesystem. Every object in git — every file (blob), directory listing (tree), commit, and tag — is identified by the SHA-1 hash of its contents (SHA-256 in newer repositories). This creates a Merkle tree structure where changing any single byte in the repository changes all hashes above it in the tree.
graph TD
C3["Commit c6b2a91<br/>tree: 789abc<br/>parent: 7f3d0e2<br/>msg: 'Update deps'"] --> C2
C2["Commit 7f3d0e2<br/>tree: def456<br/>parent: 1a8c5b4<br/>msg: 'Refactor user model'"] --> C1
C1["Commit 1a8c5b4<br/>tree: abc123<br/>parent: none<br/>msg: 'Initial commit'"]
C3 --> T3["Tree 789abc"]
C2 --> T2["Tree def456"]
C1 --> T1["Tree abc123"]
T3 --> B3a["Blob: README.md<br/>hash: e69de2..."]
T3 --> B3b["Blob: main.py<br/>hash: 3f4a7c..."]
NOTE["If you modify Commit 7f3d0e2:<br/>→ Its hash changes<br/>→ C3's parent hash changes<br/>→ C3's hash changes<br/>→ ALL subsequent commits change<br/><br/>This is the same principle as blockchain:<br/>cryptographic hash chains create<br/>tamper-evident history"]
style NOTE fill:#fff3cd,color:#1a202c
style C3 fill:#3182ce,color:#fff
style C2 fill:#3182ce,color:#fff
style C1 fill:#3182ce,color:#fff
# See the hash of a file in git
$ git hash-object README.md
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
# Every commit is identified by its hash
$ git log --oneline -5
a3f7b2c Fix authentication bug
9d1e4f8 Add rate limiting to API
c6b2a91 Update dependencies
7f3d0e2 Refactor user model
1a8c5b4 Initial commit
# The commit hash includes:
# - Hash of the tree (directory state)
# - Hash of the parent commit(s)
# - Author info and timestamp
# - Committer info and timestamp
# - Commit message
#
# Change ANY of these and the commit hash changes.
# This creates a tamper-evident chain.
# Verify git's integrity
$ git fsck --full
# Checks all object hashes in the repository
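Git's object hashing is simple enough to replicate yourself. A blob's ID is the SHA-1 of a small header plus the content — a sketch for SHA-1 repositories (SHA-256 repositories swap the hash function):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Replicate `git hash-object` for a blob: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The empty blob — the e69de2... hash git reports for any empty file
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Note that the length is part of the hashed data — two different files never share an object ID merely because one is a prefix of the other.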
Git originally used SHA-1. After the SHAttered attack, the git project began migrating to SHA-256. The concern wasn't that someone would forge a git commit tomorrow, but that SHA-1 collisions would become cheaper over time, and git's entire integrity model depends on collision resistance. If an attacker could create two different source code trees with the same SHA-1 hash, they could substitute malicious code that passes git's integrity checks.
The migration from SHA-1 to SHA-256 in git is a massive undertaking that illustrates why **cryptographic agility** — the ability to switch algorithms without rewriting your system — is important. Every tool that interacts with git (GitHub, GitLab, CI systems, IDE plugins, merge tools) needs to handle both hash formats. Git has implemented SHA-256 support with a compatibility layer that can translate between SHA-1 and SHA-256 object names.
The lesson: design your systems to be algorithm-agile from day one. Avoiding hard-coded hash assumptions — no fixed-length comparisons, store the hash type alongside the hash value, abstract hash computation behind an interface — makes future migration dramatically easier. Systems that assumed SHA-1 forever are now paying the cost of that assumption.
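A minimal sketch of that pattern — the helper names here are hypothetical, but the idea (store the algorithm name next to the digest, dispatch on it during verification) is the important part:

```python
import hashlib
import hmac

def compute_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return 'algorithm:hexdigest' so the algorithm can change later."""
    h = hashlib.new(algorithm)
    h.update(data)
    return f"{algorithm}:{h.hexdigest()}"

def verify_digest(data: bytes, stored: str) -> bool:
    """Parse the algorithm out of the stored value — nothing is hard-coded."""
    algorithm, _, expected = stored.partition(":")
    h = hashlib.new(algorithm)
    h.update(data)
    return hmac.compare_digest(h.hexdigest(), expected)

tag = compute_digest(b"release contents")
print(tag.split(":")[0])  # sha256
# Years later, new digests can use sha3_256 while old ones still verify
```

Migrating then means changing one default parameter, not rewriting every stored value and comparison in the system.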
HMAC: Hash-Based Message Authentication Code
Hashing tells you that data hasn't been modified accidentally. But it doesn't authenticate who created the hash. If you send someone a file and its SHA-256 hash, and an attacker intercepts both, the attacker can replace the file, compute a new SHA-256 hash for the modified file, and send the new file with the new hash. The recipient would verify the hash, it would match, and they'd trust the modified file.
The hash itself isn't authenticated. You need a way to verify both integrity (data hasn't changed) AND authenticity (the hash was created by someone who knows a shared secret). That's what HMAC does.
HMAC (Hash-based Message Authentication Code) combines a hash function with a secret key. Only someone who knows the key can compute or verify the HMAC.
flowchart TD
subgraph HASH_ONLY["Plain Hash — No Authentication"]
M1["Message"] --> SHA["SHA-256"] --> H1["Hash"]
NOTE1["Anyone can compute this.<br/>Attacker replaces message + hash.<br/>Recipient can't detect substitution."]
end
subgraph HMAC_AUTH["HMAC — Authenticated"]
M2["Message"] --> HMAC_FUNC["HMAC-SHA256"]
K["Secret Key<br/>(shared between parties)"] --> HMAC_FUNC
HMAC_FUNC --> TAG["Authentication Tag"]
NOTE2["Only someone with the key can:<br/>1. Compute the correct tag<br/>2. Verify a tag is correct<br/>Attacker cannot forge a valid tag."]
end
style NOTE1 fill:#e53e3e,color:#fff
style NOTE2 fill:#38a169,color:#fff
style K fill:#e53e3e,color:#fff
# Compute HMAC-SHA256
$ echo -n 'Transfer $10000 to account 12345' | \
openssl dgst -sha256 -hmac "shared_secret_key"
HMAC-SHA256(stdin)= 8b2c14a912f3e5d67c8a9b0e1f2345...
# Without the key, you can't compute the correct HMAC
$ echo -n 'Transfer $10000 to account 12345' | \
openssl dgst -sha256 -hmac "wrong_key"
HMAC-SHA256(stdin)= completely_different_value...
# And if the message is modified, the HMAC changes
$ echo -n 'Transfer $10000 to account 99999' | \
openssl dgst -sha256 -hmac "shared_secret_key"
HMAC-SHA256(stdin)= also_completely_different...
# Both the message AND the key must match for verification
Where HMACs Are Used (With Real Examples)
| Application | How HMAC is Used |
|---|---|
| AWS Signature V4 | Every AWS API request is signed with HMAC-SHA256 using your secret access key. The signature covers HTTP method, URI, headers, query parameters, and payload hash. AWS verifies the signature server-side. |
| JWT (HS256) | JWTs using the HS256 algorithm sign the header+payload with HMAC-SHA256. The server verifies with the shared secret. (RS256 uses RSA signatures instead.) |
| TLS record protocol | In TLS 1.2 with non-AEAD cipher suites, each record includes an HMAC for integrity. In TLS 1.3, AEAD modes (GCM) provide authentication directly. |
| Webhook verification | GitHub, Stripe, Slack, and others sign webhook payloads with HMAC-SHA256. Your server verifies the signature to ensure the webhook is authentic. |
| Cookie integrity | Web frameworks (Rails, Django, Express) sign session cookies with HMAC to prevent client-side tampering. |
| TOTP/HOTP | Time-based and HMAC-based one-time passwords use HMAC-SHA1 (specified by RFC 6238/4226). |
Implement webhook signature verification. This is a pattern you'll use constantly:
```python
import hmac
import hashlib
def verify_github_webhook(payload_body: bytes, signature_header: str, secret: str) -> bool:
    """Verify GitHub webhook signature (X-Hub-Signature-256 header)"""
    if not signature_header:
        return False  # header missing — reject outright
    expected = 'sha256=' + hmac.new(
        secret.encode('utf-8'),
        payload_body,
        hashlib.sha256
    ).hexdigest()
    # CRITICAL: Use constant-time comparison!
    # Regular == leaks timing information
    return hmac.compare_digest(expected, signature_header)
# In your webhook handler:
# signature = request.headers.get('X-Hub-Signature-256')
# if not verify_github_webhook(request.body, signature, WEBHOOK_SECRET):
#     return HttpResponse(status=401)  # Reject unsigned/forged webhooks
```
The hmac.compare_digest() function is critical. Regular string comparison (==) returns False as soon as it finds the first differing byte — meaning it takes less time for strings that differ early. An attacker can exploit this timing difference to reconstruct the correct HMAC one byte at a time, testing 256 values for each position. Constant-time comparison always takes the same amount of time regardless of where the strings differ.
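The logic behind a constant-time comparison fits in a few lines of pure Python — shown only to illustrate why it never exits early; in real code, always use hmac.compare_digest:

```python
def ct_equal(a: bytes, b: bytes) -> bool:
    """Compare without early exit: accumulate differences with OR."""
    if len(a) != len(b):
        return False
    result = 0
    for x, y in zip(a, b):
        result |= x ^ y   # any differing byte sets bits in result
    return result == 0    # zero only if every byte pair matched

print(ct_equal(b"secret-tag", b"secret-tag"))  # True
print(ct_equal(b"secret-tag", b"secret-taX"))  # False
```

Every byte is examined regardless of where a mismatch occurs, so the running time reveals nothing about the position of the first differing byte.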
Length Extension Attacks: Why hash(key + message) Is Broken
Why not just concatenate the key and message and hash them? Something like SHA-256(key + message)? Because of length extension attacks. This is one of the most important practical cryptographic attacks to understand, because the vulnerable construction looks intuitively correct but is catastrophically broken.
SHA-256 (and all Merkle-Damgard hash functions) processes input in blocks, maintaining an internal state. The final hash output IS the internal state after processing the last block. If you know hash(key + message) and the length of (key + message), you can resume the hash computation from that state and append additional data — computing hash(key + message + padding + attacker_data) — without knowing the key.
```mermaid
flowchart TD
subgraph VULN["Vulnerable: SHA-256(key || message)"]
K["Secret Key (16 bytes)"] --> CONCAT
M["message: 'amount=100'"] --> CONCAT
CONCAT["Concatenate"] --> SHA["SHA-256 processes<br/>block by block"]
SHA --> HASH["Final hash = internal state<br/>0xabc123..."]
end
subgraph ATTACK["Length Extension Attack"]
HASH2["Attacker knows:<br/>1. Hash value (0xabc123...)<br/>2. Length of key+message<br/>(doesn't need the key!)"]
HASH2 --> RESUME["Resume SHA-256<br/>from internal state 0xabc123..."]
EXTRA["Append: '&admin=true'"] --> RESUME
RESUME --> NEW_HASH["Valid hash for:<br/>key || 'amount=100' || padding || '&admin=true'<br/>WITHOUT KNOWING THE KEY"]
end
subgraph SAFE["Safe: HMAC-SHA256(key, message)"]
HMAC_CONST["HMAC(K, M) = H((K' XOR opad) || H((K' XOR ipad) || M))"]
HMAC_NOTE["Double hashing + XOR with padding constants<br/>makes length extension impossible.<br/>The outer hash prevents the attack because<br/>the attacker can't access the intermediate state."]
end
style VULN fill:#e53e3e,color:#fff
style ATTACK fill:#dd6b20,color:#fff
style SAFE fill:#38a169,color:#fff
```
This attack has been used against real APIs. In 2009, Thai Duong and Juliano Rizzo demonstrated length extension attacks against Flickr's API authentication, which used MD5(secret + parameters). They could append arbitrary API parameters and compute a valid signature. The fix: use HMAC, which was specifically designed to resist this attack.
What about SHA-3 — is it vulnerable to length extension? No. SHA-3 uses a sponge construction instead of Merkle-Damgard, and its internal state is larger than its output. The output doesn't reveal the internal state, making length extension attacks impossible. This is one of the advantages of SHA-3 over SHA-2. However, HMAC-SHA-256 is also safe — HMAC's double-hashing construction prevents length extension regardless of the underlying hash function.
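The HMAC construction shown in the diagram is short enough to implement directly. This sketch follows RFC 2104 and can be checked against Python's hmac module:

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC per RFC 2104: H((K' ^ opad) || H((K' ^ ipad) || M))."""
    block_size = 64                               # SHA-256 block size in bytes
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()        # long keys are hashed first
    key = key.ljust(block_size, b"\x00")          # then zero-padded to K'
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + message).digest()
    return hashlib.sha256(opad + inner).digest()  # outer hash blocks extension

# Matches the standard library implementation
assert hmac_sha256(b"key", b"msg") == hmac.new(b"key", b"msg", hashlib.sha256).digest()
```

The outer hash is what defeats length extension: an attacker who extends the inner hash still can't produce the final tag without the key.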
Digital Signatures: Non-Repudiation
HMAC has a fundamental limitation: both parties share the same secret key. This means either party could have computed the MAC. If Alice sends Bob an HMAC-signed message, Bob can verify it came from someone who knows the key — but since he also knows the key, he could have created it himself. He can't prove to a third party that Alice signed it, because he had the same capability.
Digital signatures solve this problem using asymmetric cryptography. The signer uses their private key to sign. Anyone can verify with the signer's public key. Since only the signer has the private key, only they could have created the signature. This provides non-repudiation — the signer can't deny having signed, and any third party can verify the signature independently.
sequenceDiagram
participant B as Bob (Signer)
participant DOC as Document
participant A as Alice (Verifier)
participant C as Charlie (Third Party)
Note over B: Bob signs with PRIVATE key
B->>DOC: Sign(hash(document), private_key)<br/>→ signature
Note over B,A: Bob sends document + signature
B->>A: document + signature
Note over A: Alice verifies with Bob's PUBLIC key
A->>A: Verify(hash(document), signature, public_key)<br/>→ VALID
Note over A,C: Alice can prove to Charlie that Bob signed
A->>C: Here's the document, signature, and Bob's public key
C->>C: Verify(hash(document), signature, public_key)<br/>→ VALID
Note over C: Charlie independently confirms<br/>Bob signed this document.<br/>Neither Alice nor Charlie need Bob's private key.
How Digital Signatures Work
The signing process doesn't encrypt the entire message (that would be slow for large data). Instead, the message is hashed first, and the hash is then signed with the private key.
flowchart TD
subgraph SIGN["Signing"]
MSG["Message<br/>(any size)"] --> HASH_S["SHA-256"]
HASH_S --> DIGEST["Hash<br/>(32 bytes, fixed)"]
DIGEST --> SIGN_OP["Sign with<br/>PRIVATE key"]
PRIVK["Private Key"] --> SIGN_OP
SIGN_OP --> SIG["Signature<br/>(64 bytes for ECDSA P-256)"]
end
subgraph SEND["Transmit"]
MSG2["Message"] --> NET["Network"]
SIG2["Signature"] --> NET
end
subgraph VERIFY["Verification"]
MSG3["Message"] --> HASH_V["SHA-256"]
HASH_V --> DIGEST2["Hash"]
DIGEST2 --> CMP{"Compare"}
SIG3["Signature"] --> VERIFY_OP["Verify with<br/>PUBLIC key"]
PUBK["Public Key"] --> VERIFY_OP
VERIFY_OP --> RECOVERED["Recovered Hash"]
RECOVERED --> CMP
CMP -->|"Match"| VALID["VALID SIGNATURE"]
CMP -->|"No match"| INVALID["INVALID — tampered<br/>or wrong signer"]
end
style VALID fill:#38a169,color:#fff
style INVALID fill:#e53e3e,color:#fff
style PRIVK fill:#e53e3e,color:#fff
style PUBK fill:#38a169,color:#fff
# Generate an ECDSA key pair for signing (P-256 curve)
$ openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:P-256 \
-out signing_key.pem
$ openssl pkey -in signing_key.pem -pubout -out signing_key_pub.pem
# Sign a file
$ openssl dgst -sha256 -sign signing_key.pem \
-out document.sig document.pdf
# Verify the signature
$ openssl dgst -sha256 -verify signing_key_pub.pem \
-signature document.sig document.pdf
Verified OK
# Tamper with the file and verify again
$ echo "tampered" >> document.pdf
$ openssl dgst -sha256 -verify signing_key_pub.pem \
-signature document.sig document.pdf
Verification Failure
# Generate an Ed25519 key pair (modern, preferred)
$ openssl genpkey -algorithm Ed25519 -out ed25519_key.pem
$ openssl pkey -in ed25519_key.pem -pubout -out ed25519_pub.pem
# Sign with Ed25519
$ openssl pkeyutl -sign -inkey ed25519_key.pem \
-out document.ed25519sig -rawin -in document.pdf
# Verify with Ed25519
$ openssl pkeyutl -verify -pubin -inkey ed25519_pub.pem \
-sigfile document.ed25519sig -rawin -in document.pdf
Signature Verified Successfully
RSA Signatures vs ECDSA vs EdDSA
| Algorithm | Signature Size | Sign Speed | Verify Speed | Security Level |
|---|---|---|---|---|
| RSA-2048 | 256 bytes | ~1,300/sec | ~43,000/sec | 112 bits |
| RSA-4096 | 512 bytes | ~200/sec | ~14,000/sec | 128 bits |
| ECDSA P-256 | 64 bytes | ~30,000/sec | ~12,000/sec | 128 bits |
| Ed25519 | 64 bytes | ~50,000/sec | ~20,000/sec | ~128 bits |
Ed25519 is the recommended choice for new implementations. It's faster than both RSA and ECDSA, produces compact 64-byte signatures, uses a safe curve (Curve25519), and is designed to be resistant to implementation errors. Notably, Ed25519 is deterministic — it doesn't need a random nonce during signing, which eliminates the catastrophic failure mode where nonce reuse leaks the private key (as happened with the PlayStation 3 ECDSA implementation).
Where Digital Signatures Are Used
Code signing:
# Verify a GPG signature on a software release
$ gpg --verify python-3.12.0.tar.xz.asc python-3.12.0.tar.xz
gpg: Signature made Mon Oct 2 12:34:56 2023
gpg: using RSA key 7169...
gpg: Good signature from "Python Release Manager"
# Verify a macOS app's code signature
$ codesign -vvv /Applications/Firefox.app
# Verify an APK (Android package) signature
$ apksigner verify --verbose app.apk
TLS certificates: The Certificate Authority (CA) digitally signs your TLS certificate. When a browser receives the certificate, it verifies the CA's signature using the CA's public key (pre-installed in the browser/OS trust store). This is how browsers know that a certificate for yoursite.com was legitimately issued.
Git signed commits and tags:
# Configure git to sign commits with GPG or SSH
$ git config --global commit.gpgsign true
$ git config --global user.signingkey ~/.ssh/id_ed25519.pub
$ git config --global gpg.format ssh
# Sign a commit (happens automatically with gpgsign=true)
$ git commit -S -m "Fix critical vulnerability"
# Verify a signed commit
$ git log --show-signature -1
commit a3f7b2c (HEAD -> main)
Good "git" signature for user@example.com with ED25519 key SHA256:...
Author: Developer <dev@example.com>
Date: Mon Mar 10 15:30:00 2026 +0530
Fix critical vulnerability
# GitHub shows a "Verified" badge on signed commits
In 2020, SolarWinds' build system was compromised. Attackers inserted malicious code into SolarWinds' Orion software, and the compromised version was digitally signed with SolarWinds' legitimate code signing certificate — because the attackers had access to the build pipeline. The signature was technically valid because it was genuinely signed by SolarWinds' key.
This illustrates a critical point: digital signatures prove WHO signed something, not WHAT the signer intended to sign. If an attacker compromises the signing process (the build server, the CI/CD pipeline, the developer's machine), the signatures are technically valid but the content is malicious. Protecting the signing key and the entire build pipeline is paramount.
The SolarWinds attack led to a paradigm shift in software supply chain security. SLSA (Supply-chain Levels for Software Artifacts) now defines four levels of build integrity, from basic source versioning (L1) to fully hermetic, reproducible builds with verified provenance (L4). Sigstore, a project by the Linux Foundation, provides free, ephemeral code signing certificates tied to developer identities, making it easier to sign artifacts without managing long-lived keys.
Comparing Hash, HMAC, and Digital Signatures
graph TD
subgraph COMPARISON["Choose the Right Tool"]
HASH["<b>Hash</b><br/>SHA-256<br/><br/>Integrity: YES<br/>Authentication: NO<br/>Non-repudiation: NO<br/><br/>Use: File checksums,<br/>data deduplication,<br/>git object IDs"]
HMAC_BOX["<b>HMAC</b><br/>HMAC-SHA256<br/><br/>Integrity: YES<br/>Authentication: YES<br/>Non-repudiation: NO<br/><br/>Use: API auth (AWS SigV4),<br/>webhook verification,<br/>JWT (HS256), session cookies"]
DIGSIG["<b>Digital Signature</b><br/>ECDSA / Ed25519<br/><br/>Integrity: YES<br/>Authentication: YES<br/>Non-repudiation: YES<br/><br/>Use: TLS certificates,<br/>code signing, signed commits,<br/>legal documents, JWT (RS256)"]
end
HASH -.->|"Add shared secret"| HMAC_BOX
HMAC_BOX -.->|"Replace shared secret<br/>with key pair"| DIGSIG
style HASH fill:#3182ce,color:#fff
style HMAC_BOX fill:#805ad5,color:#fff
style DIGSIG fill:#38a169,color:#fff
The key decision: Do you need just integrity (hash), integrity + authentication between two parties who share a secret (HMAC), or integrity + authentication + proof to third parties (digital signature)?
Password Hashing: A Special Case
Password hashing deserves separate treatment because it has completely different requirements from data integrity hashing. The process is straightforward — hash the password, store the hash, and when the user logs in, hash their attempt and compare. But the hash function choice is critical. SHA-256 is a terrible password hash.
The Problem: Speed Kills
SHA-256 is designed to be fast. A modern GPU (RTX 4090) can compute approximately 20 billion SHA-256 hashes per second. An attacker with a stolen password database can try every 8-character password (lowercase + digits, ~2.8 trillion combinations) in about 2.3 minutes.
# How fast is SHA-256?
$ openssl speed sha256
type 16 bytes 64 bytes 256 bytes 1024 bytes
sha256 115724.37k 274474.97k 508563.29k 618924.37k
# ~600 MB/s on a single CPU core
# A modern GPU does 100-200x more
# That's tens of billions of password-length strings per second
# Hashcat benchmarks (RTX 4090):
# SHA-256: ~20,000 MH/s (20 billion hashes/second)
# bcrypt (cost 12): ~100 kH/s (100,000 hashes/second)
# Argon2id: ~10 kH/s (10,000 hashes/second)
#
# That's a 200,000x to 2,000,000x slowdown
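The arithmetic behind these numbers is worth checking yourself — a quick sketch using the rates quoted above:

```python
# 8-character passwords from lowercase letters + digits (36 symbols)
keyspace = 36 ** 8       # ~2.8 trillion candidates
sha256_rate = 20e9       # ~20 billion H/s (GPU figure from the text)
bcrypt_rate = 100e3      # ~100 thousand H/s (bcrypt figure from the text)

print(f"SHA-256: {keyspace / sha256_rate:.0f} seconds")            # 141 seconds
print(f"bcrypt:  {keyspace / bcrypt_rate / 86400 / 365:.1f} years")  # 0.9 years
```

Minutes versus the better part of a year, for the same keyspace — and a memory-hard function like Argon2id, combined with longer passwords, pushes the attack out by further orders of magnitude.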
The Solution: Intentionally Slow, Memory-Hard Hash Functions
Password-specific hash functions are designed to be slow and memory-intensive, making brute-force attacks impractical even with specialized hardware.
| Function | Year | Properties | Recommendation |
|---|---|---|---|
| bcrypt | 1999 | Adaptive cost factor (doubling work with each increment). Built-in salt. Battle-tested for 25+ years. | Good — widely available |
| scrypt | 2009 | Memory-hard (requires large amounts of RAM, defeating GPU/ASIC attacks). Configurable CPU and memory cost. | Good — but tricky to tune |
| Argon2id | 2015 | Winner of the Password Hashing Competition. Memory-hard, parallelism-aware. Hybrid: data-dependent and data-independent memory access. | BEST — use for all new applications |
The key concept is work factor. A password hash function should take about 100-500 milliseconds to compute on the server. That's imperceptible to a user logging in, but devastating to an attacker. At 250ms per hash on the server, and assuming the attacker has hardware that's 1000x faster, they'd still be limited to ~4000 guesses per second per GPU. At that rate, cracking a random 10-character password would take centuries.
Salting: Why It Matters
A salt is a random value unique to each password, stored alongside the hash.
```bash
# Without salt: two users with same password → same hash
# Attacker cracks one, gets both
$ echo -n "password123" | openssl dgst -sha256
# Same hash every time

# With salt (bcrypt example):
# Each user gets a unique random salt
# Even identical passwords produce different hashes:
# alice: $2b$12$LJ3m4y/VQN4tR2xBKM5DPeZqYvhT8nW1T6Qy2P/mR7K.abc123
# bob:   $2b$12$9k2Pf8X.YCN3dR5aBHM8Ou1d3Q2wZ4v5T6Uy8P/aR9L.def456
# Same password "password123", completely different hashes
# Rainbow tables are useless — attacker must brute-force each individually
```
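The salting behavior is easy to demonstrate with openssl's `passwd` subcommand. This uses sha512crypt rather than bcrypt, but the salting mechanism works the same way:

```bash
# Each run of openssl passwd picks a fresh random salt, so the same
# password hashes differently every time (bcrypt embeds its salt in
# the $2b$... string the same way)
openssl passwd -6 "password123"
openssl passwd -6 "password123"
# Two different hashes for one password

# Pin the salt and the output becomes reproducible, which is exactly
# why salts must be random and unique per user
openssl passwd -6 -salt fixedsalt "password123"
openssl passwd -6 -salt fixedsalt "password123"
# Identical hashes: same salt + same password = same hash
```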
Common password hashing mistakes (all of these appear in production audits):
1. **Using SHA-256/SHA-512 for passwords** — Too fast. GPU crackable in minutes for common passwords.
2. **Using MD5 for passwords** — Too fast AND broken. Please stop.
3. **Using a single global salt** — If the global salt leaks, all passwords are vulnerable. Use per-user random salts.
4. **Not increasing the work factor over time** — bcrypt cost 10 was appropriate in 2010. In 2026, use cost 12-14. Hardware gets faster; your work factor should increase.
5. **Storing passwords in plaintext** — Still happens. In 2019, Facebook disclosed that hundreds of millions of passwords were stored in plaintext in internal logs.
6. **Encrypting passwords instead of hashing** — Encryption is reversible. If the encryption key is compromised, all passwords are exposed. Hashing is one-way.
7. **Using pepper without proper implementation** — A pepper (a server-side secret added to passwords before hashing) is good defense-in-depth, but it must be stored in an HSM or KMS, not in the application config.
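One well-known way to apply a pepper correctly is to HMAC the password with it before the slow hash. A minimal sketch, with placeholder values; in a real deployment the pepper would be fetched from a KMS or HSM at startup:

```bash
# Pepper-then-slow-hash sketch. PLACEHOLDER VALUES: in production the
# pepper comes from a KMS/HSM at startup, never from config or source.
PEPPER="placeholder-pepper-from-kms"
PASSWORD="password123"

# Step 1: keyed pre-hash with the pepper (fast HMAC)
PREHASH=$(echo -n "$PASSWORD" | openssl dgst -sha256 -hmac "$PEPPER" | cut -d' ' -f2)

# Step 2: slow, salted hash of the pre-hash
# (sha512crypt here as a stand-in for bcrypt/Argon2id)
openssl passwd -6 "$PREHASH"
# A leaked database now requires BOTH the per-user salt and the
# server-side pepper to attack
```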
Practical Integrity Verification
Practice hash-based integrity verification in real scenarios:
**1. Verify a downloaded file:**
```bash
# Download a file and its checksum
curl -O https://example.com/release-v2.0.tar.gz
curl -O https://example.com/release-v2.0.tar.gz.sha256
# Verify — the checksum file contains "hash filename"
sha256sum -c release-v2.0.tar.gz.sha256
release-v2.0.tar.gz: OK
# On macOS (no sha256sum):
shasum -a 256 -c release-v2.0.tar.gz.sha256
```

**2. Create and verify signed git commits:**

```bash
# Set up SSH signing (simpler than GPG)
git config --global gpg.format ssh
git config --global user.signingkey ~/.ssh/id_ed25519.pub
git config --global commit.gpgsign true
# Every commit is now signed
git commit -m "Signed commit"
# Verify signatures in log
git log --show-signature -5
# Set up allowed signers file for verification
echo "user@example.com $(cat ~/.ssh/id_ed25519.pub)" > ~/.ssh/allowed_signers
git config --global gpg.ssh.allowedSignersFile ~/.ssh/allowed_signers
```

**3. Build a file integrity baseline (poor man's Tripwire):**

```bash
# Create integrity manifest for critical files
for f in /etc/ssh/sshd_config /etc/passwd /etc/shadow /etc/hosts; do
    sha256sum "$f"
done > /root/integrity_baseline.txt
# Later, verify nothing changed
sha256sum -c /root/integrity_baseline.txt
/etc/ssh/sshd_config: OK
/etc/passwd: OK
/etc/shadow: FAILED    # <-- ALERT: This file was modified!
/etc/hosts: OK
# For production: use AIDE, OSSEC, or Tripwire
# They do this automatically with scheduling and alerting
```

**4. Compute and verify webhook signatures:**

```bash
# Simulate a GitHub webhook verification
SECRET="webhook_secret_123"
PAYLOAD='{"event":"push","ref":"refs/heads/main"}'
# Compute the signature (what GitHub would send)
SIGNATURE=$(echo -n "$PAYLOAD" | openssl dgst -sha256 -hmac "$SECRET" | cut -d' ' -f2)
echo "X-Hub-Signature-256: sha256=$SIGNATURE"
# Verify (what your server would do)
echo -n "$PAYLOAD" | openssl dgst -sha256 -hmac "$SECRET"
# Compare with the signature header
```
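Putting the two halves together, here is a minimal end-to-end sketch that accepts the genuine payload and rejects a tampered one. The `[ = ]` string comparison is illustrative only; a real server should use a constant-time comparison such as Python's `hmac.compare_digest` to avoid timing side channels:

```bash
#!/bin/sh
# End-to-end sketch: sign a payload, then verify it server-side.
SECRET="webhook_secret_123"
PAYLOAD='{"event":"push","ref":"refs/heads/main"}'

# HMAC-SHA256 of a payload, hex-encoded
sign() {
    printf '%s' "$1" | openssl dgst -sha256 -hmac "$SECRET" | cut -d' ' -f2
}

# What the sender would transmit alongside the payload
HEADER="sha256=$(sign "$PAYLOAD")"

# Server side: recompute and compare
# (NOT constant-time; use a timing-safe compare in production)
verify() {
    [ "$2" = "sha256=$(sign "$1")" ]
}

verify "$PAYLOAD" "$HEADER" && echo "genuine payload: accepted"
verify '{"event":"push","ref":"refs/heads/evil"}' "$HEADER" || echo "tampered payload: rejected"
```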
---
## The Bigger Picture: How These Pieces Fit Together
To connect everything covered in the last two chapters, here is how TLS uses all of these primitives together in a single connection.
```mermaid
flowchart TD
subgraph HANDSHAKE["TLS Handshake"]
CERT["Server Certificate<br/><b>Digital Signature</b> (Ch. 4)<br/>CA signs server's public key"]
KEX["Key Exchange<br/><b>Asymmetric Crypto</b> (Ch. 3)<br/>ECDHE establishes shared secret"]
VERIFY["Handshake Verification<br/><b>Hash</b> (Ch. 4)<br/>Hash of all handshake messages"]
SERVER_SIG["Server Authenticates<br/><b>Digital Signature</b> (Ch. 4)<br/>Server signs DH parameters"]
end
subgraph DATA["Data Transfer"]
ENCRYPT["Bulk Encryption<br/><b>Symmetric Crypto</b> (Ch. 3)<br/>AES-256-GCM encrypts all data"]
INTEGRITY["Record Integrity<br/><b>AEAD / MAC</b> (Ch. 4)<br/>GCM auth tag on every record"]
end
CERT --> KEX
KEX --> VERIFY
SERVER_SIG --> VERIFY
VERIFY --> ENCRYPT
ENCRYPT --> INTEGRITY
RESULT["Secure Connection<br/>All 5 primitives working together:<br/>Hash + HMAC + Signature + Symmetric + Asymmetric"]
INTEGRITY --> RESULT
style RESULT fill:#38a169,color:#fff
style CERT fill:#805ad5,color:#fff
style KEX fill:#3182ce,color:#fff
style ENCRYPT fill:#dd6b20,color:#fff
style INTEGRITY fill:#d69e2e,color:#fff
```

TLS is a composition of all these building blocks. Understanding each one is essential before diving into TLS itself. But first, there's one more critical topic to cover: key exchange and perfect forward secrecy. How do two parties agree on a shared secret without ever sending the secret across the network?
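That composition is visible directly in OpenSSL's cipher listing, where each column names one of these primitives. The suite below is just a common TLS 1.2 example; exact formatting varies between OpenSSL versions:

```bash
# Decompose a TLS cipher suite into its primitives:
#   Kx  = key exchange (asymmetric)
#   Au  = authentication (digital signature)
#   Enc = bulk encryption (symmetric; AEAD here)
#   Mac = record integrity
openssl ciphers -v 'ECDHE-RSA-AES256-GCM-SHA384'
# Typical output:
# ECDHE-RSA-AES256-GCM-SHA384 TLSv1.2 Kx=ECDH Au=RSA Enc=AESGCM(256) Mac=AEAD
```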
What You've Learned
This chapter covered the cryptographic tools for data integrity and authentication:
- Cryptographic hash functions (SHA-256) provide a fixed-size fingerprint of arbitrary data. They must be pre-image resistant (one-way), second pre-image resistant (can't find a substitute), and collision resistant (can't find any two colliding inputs). MD5 is broken (collisions in seconds, used in the Flame malware attack). SHA-1 is broken (SHAttered: $110K, chosen-prefix collision: $45K). SHA-256 is the current standard.
- Git uses SHA hashes to create a Merkle tree of content-addressed objects. Changing any byte changes all hashes above it. Git is migrating from SHA-1 to SHA-256. This demonstrates the importance of cryptographic agility.
- HMAC combines hashing with a shared secret key, providing both integrity and authentication. Used in AWS Signature V4, JWT (HS256), webhook verification, and TLS record integrity. Always use HMAC, never hash(key + message) — length extension attacks make the naive construction dangerous.
- Digital signatures use asymmetric cryptography (private key signs, public key verifies) to provide integrity, authentication, and non-repudiation. Ed25519 is recommended for new implementations. The SolarWinds attack showed that signatures prove who signed, not what was intended — protect the signing process.
- Password hashing requires intentionally slow, memory-hard functions (Argon2id recommended, bcrypt acceptable) with per-user salts. Standard hash functions like SHA-256 are billions of times too fast for password storage.
- Length extension attacks make SHA-256(key + message) fundamentally broken. HMAC's double-hashing construction prevents this. SHA-3 is immune by design.
Next up is key exchange — how two parties agree on a shared secret key when communicating over a network that anyone can eavesdrop on. The answer involves Diffie-Hellman, and by the end of the next chapter, you'll understand why the "E" in ECDHE is the most important letter in modern cryptography.