What Makes a File Valid? Magic Bytes & File Signatures Explained
How programs really know what a file is — not from its extension, but from the raw bytes at the start. A developer-friendly look at magic numbers, MIME sniffing, structural validity, and why renaming a .txt file to .pdf does not make a PDF.
Rename a .txt file to .pdf. Open it in a PDF reader. You’ll get an error — or, on a permissive reader, a blank page. The file didn’t change. The bytes inside are still Hello world. But something knows it’s not a real PDF.
That something is file signature detection. It has nothing to do with the extension, and everything to do with the first few bytes of the file. Once you understand how it works, a whole class of confusing behavior in tools, browsers, and file upload validators suddenly makes sense.
The extension is just a label
When you save a file as report.pdf, the .pdf part is metadata attached to the filename by the operating system. It’s a hint — to the OS, to the user, to associated apps — but it carries no authority over the actual file contents. Rename that same file to report.xyz and the bytes inside are identical.
This is why macOS and Windows hide extensions by default. It’s also why security tools don’t trust them: an attacker can rename a malicious executable to invoice.pdf, and your file manager will happily show it with a PDF icon.
The real type of a file lives inside the file itself.
Magic bytes: the file’s actual identity
Almost every binary format reserves the first few bytes of the file for a fixed signature — a magic number — that identifies the format unambiguously. Programs check these bytes before they do anything else with the file.
Here’s a table of the most common ones:
| Format | Hex signature | Human-readable |
|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | \x89PNG\r\n\x1a\n |
| JPEG | FF D8 FF | (binary prefix, ends FF D9) |
| GIF | 47 49 46 38 39 61 | GIF89a |
| PDF | 25 50 44 46 2D | %PDF- |
| ZIP | 50 4B 03 04 | PK\x03\x04 |
| GZIP | 1F 8B 08 | (binary) |
| BMP | 42 4D | BM |
Notice that some are printable ASCII (GIF89a, %PDF-, BM) and others are binary sequences chosen specifically to be non-printable and unlikely to appear at the start of ordinary text. The PNG signature 89 50 4E 47 0D 0A 1A 0A was deliberately designed to detect common file transfer corruption: the \r\n sequence catches DOS line-ending conversions, the final \n catches the reverse conversion, the \x1a (Ctrl-Z) stops file display under MS-DOS, and 89 (high bit set) detects 7-bit stripping.
A magic byte check is simple: open the file, read the first N bytes, and compare. If they don’t match the expected sequence, reject the file immediately — no need to parse the rest.
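The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete detector: the signature list is copied from the table, and the names `SIGNATURES`, `detect_format`, and `detect_file` are made up for this example.

```python
from typing import Optional

# Signatures taken from the table above (hypothetical helper names).
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (bytes.fromhex("504B0304"), "ZIP"),
    (bytes.fromhex("1F8B08"), "GZIP"),
    (b"BM", "BMP"),
]


def detect_format(header: bytes) -> Optional[str]:
    """Return the name of the first signature that prefixes `header`, or None."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return None


def detect_file(path: str) -> Optional[str]:
    """Read only the leading bytes of a file and match them; the extension is ignored."""
    with open(path, "rb") as f:
        # Eight bytes is enough for the longest signature in the table (PNG).
        return detect_format(f.read(8))
```

A renamed text file containing Hello world returns `None` here whatever its extension says. Short signatures are weak evidence on their own: any text file that happens to begin with BM would be reported as BMP, which is one reason a signature match is only the first layer of validation.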
How MIME sniffing works
Browsers take this a step further with MIME sniffing — attempting to determine a resource’s actual content type regardless of what the server declares in Content-Type. The MIME Sniffing Standard defines an explicit algorithm: the browser reads up to 1445 bytes from the response body and pattern-matches them against a table of known signatures and byte sequences.
The implications are significant. A server can send Content-Type: text/plain, but if the first bytes of the response look like a PNG, a browser may treat it as an image. This behavior was the root of several historical cross-site scripting vulnerabilities, which is why the X-Content-Type-Options: nosniff response header was introduced — it instructs the browser to trust the declared MIME type and skip sniffing entirely.
For file upload validation, the same principle applies: checking file.type in JavaScript is insufficient, because browsers derive that value from the file’s extension rather than from its contents. Genuine server-side validation reads the raw bytes.
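As a rough sketch of that server-side check, the function below accepts an upload only when the declared type is on an allow-list and the leading bytes agree with it. The allow-list contents and the names `ALLOWED_UPLOADS` and `validate_upload` are assumptions for illustration; a real service would pick its own types and would usually add structural checks on top.

```python
# Hypothetical allow-list: declared MIME type -> bytes the body must start with.
ALLOWED_UPLOADS = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/pdf": b"%PDF-",
}


def validate_upload(declared_type: str, body: bytes) -> bool:
    """Accept only if the declared type is allowed AND the bytes match it."""
    # Drop parameters such as "; charset=..." and normalise case.
    mime = declared_type.split(";", 1)[0].strip().lower()
    signature = ALLOWED_UPLOADS.get(mime)
    if signature is None:
        return False  # type not on the allow-list at all
    return body.startswith(signature)
```

The declared type is treated as a claim to verify, never as evidence: a body of Hello world labelled application/pdf is rejected because the bytes do not back up the label.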
Why “rename a .txt to .pdf” doesn’t work
Let’s trace what actually happens when you open a fake PDF. A PDF reader like Preview or Adobe Acrobat doesn’t just find the %PDF- signature at byte 0 and stop — that’s the minimum bar for “maybe a PDF.” It then expects to find:
- A header line like %PDF-1.7 or %PDF-2.0
- A body of objects: dictionaries, streams, cross-reference tables
- A cross-reference table or xref stream mapping object numbers to byte offsets
- A trailer dictionary with a /Root key pointing to the document catalog
- The string %%EOF at or near the very end of the file
A text file of Hello world has exactly zero of these structures. The reader either bails immediately at the missing %PDF- or, if it somehow got past that, hits a parse error within the first dozen bytes of the body.
Structural validity is the second layer of checking, after the magic bytes.
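A cheap approximation of that second layer might look like the sketch below. It goes slightly beyond the list above by relying on the `startxref` keyword, which in a PDF precedes %%EOF and gives the byte offset of the cross-reference section. The function name `looks_like_pdf` and the 1024-byte tail window are arbitrary choices for this example, and passing these checks does not prove a file will render.

```python
import re


def looks_like_pdf(data: bytes) -> bool:
    """Cheap structural plausibility checks; a real PDF parser does far more."""
    # Layer 1: the magic bytes.
    if not data.startswith(b"%PDF-"):
        return False

    # Layer 2: "startxref <offset> %%EOF" must appear near the end of the file.
    tail = data[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        return False
    xref_offset = int(matches[-1])  # the last one wins after incremental updates
    if xref_offset >= len(data):
        return False

    # The offset must land on a classic "xref" table or on an xref stream object.
    at_offset = data[xref_offset:xref_offset + 32]
    has_xref = (at_offset.startswith(b"xref")
                or re.match(rb"\d+\s+\d+\s+obj", at_offset) is not None)

    # Some dictionary must name the document catalog via /Root.
    return has_xref and b"/Root" in data
```

A text file fails at the first test, and a file that merely has %PDF- pasted in front of Hello world fails at the second.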
Trailers and required structure
Several formats require not just a valid header but a valid trailer or index structure at the end of the file:
PDF must end with %%EOF (or close to it; some implementations tolerate a few trailing bytes). More importantly, the cross-reference table must be present and internally consistent. A strict reader cannot render a PDF without a valid xref table even if the header is correct, though lenient readers may try to rebuild the table by scanning for objects.
ZIP is even more interesting. A ZIP archive is structured backwards: the Central Directory — the master index of all files, their names, sizes, compression methods, and offsets — lives at the end of the file, followed by the End of Central Directory record (50 4B 05 06). Individual file entries (50 4B 03 04, the “local file headers”) appear at the start, but extractors navigate the Central Directory first. This is why you can append files to a ZIP without rewriting the whole archive, and why a truncated ZIP (missing the end) is unreadable even if the local headers are intact.
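To make the backwards layout concrete, here is a small sketch that locates the End of Central Directory record the way an extractor would, by searching from the end of the file. It assumes a plain single-disk archive without ZIP64 extensions; `find_central_directory` and the constant names are hypothetical.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # 50 4B 05 06
EOCD_MIN_SIZE = 22               # size of the fixed part of the record
MAX_COMMENT = 0xFFFF             # an archive comment may follow the record


def find_central_directory(data: bytes):
    """Return (entry_count, cd_size, cd_offset), or None if no EOCD record is found."""
    # The record sits at the very end, possibly followed by a comment,
    # so only the last 22 + 65535 bytes need to be searched.
    start = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIGNATURE, start)
    if pos == -1 or pos + EOCD_MIN_SIZE > len(data):
        return None
    # Little-endian: signature, disk numbers, entry counts, size, offset, comment length.
    (_sig, _disk, _cd_disk, _entries_on_disk, entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    return entries, cd_size, cd_offset
```

If the tail of the archive is cut off, the function returns `None` even though every local file header is still present, which matches the truncation behavior described above.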
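GZIP requires a valid header block (magic bytes, compression method, flags) and ends with an 8-byte trailer: a CRC32 of the uncompressed data followed by ISIZE, the uncompressed size modulo 2^32. If the checksum is missing or does not match, the decompressor reports an error.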
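The GZIP trailer check can be reproduced by hand. The sketch below is deliberately limited: it assumes a single-member file whose flag byte is zero (no filename, comment, or extra field), so the header is exactly 10 bytes. The name `verify_gzip_trailer` is made up for this example, and Python's own `gzip` module already performs the same verification internally.

```python
import struct
import zlib


def verify_gzip_trailer(blob: bytes) -> bool:
    """Check CRC32 and ISIZE of a single-member gzip blob with no optional header fields."""
    # 10-byte header + 8-byte trailer is the minimum; 1F 8B 08 is magic + DEFLATE.
    if len(blob) < 18 or blob[:3] != b"\x1f\x8b\x08":
        return False
    if blob[3] != 0:
        # FLG byte set: optional header fields present, not handled in this sketch.
        return False

    # Trailer: CRC32 of the uncompressed data, then its length modulo 2**32.
    stored_crc, stored_size = struct.unpack("<II", blob[-8:])
    try:
        # wbits=-15 selects raw DEFLATE with no zlib or gzip wrapper.
        payload = zlib.decompress(blob[10:-8], wbits=-15)
    except zlib.error:
        return False
    return (zlib.crc32(payload) == stored_crc
            and (len(payload) & 0xFFFFFFFF) == stored_size)
```

Flipping a single bit in the last eight bytes is enough to make this return `False`, even though the compressed data itself is untouched.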
PNG uses a chunk-based structure: each chunk consists of a 4-byte length, a 4-byte type code, the data, and a CRC32 computed over the type code and data. The file must end with an IEND chunk. A PNG missing IEND is technically malformed, though many decoders are lenient about it.
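A chunk walker is short enough to show in full. The following is a sketch of a strict validator, stricter than the lenient decoders mentioned above: it checks the signature, verifies each chunk's CRC32, and insists on a final IEND. `list_png_chunks` is a hypothetical name.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def list_png_chunks(data: bytes) -> list:
    """Return chunk type names in file order; raise ValueError if the file is malformed."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    names = []
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        # Big-endian 4-byte length, then the 4-byte type code.
        length, ctype = struct.unpack_from(">I4s", data, pos)
        body_end = pos + 8 + length
        if body_end + 4 > len(data):
            raise ValueError("truncated chunk")
        (stored_crc,) = struct.unpack_from(">I", data, body_end)
        # The CRC covers the type code and the data, but not the length field.
        if zlib.crc32(data[pos + 4:body_end]) != stored_crc:
            raise ValueError("bad CRC in chunk " + ctype.decode("latin-1"))
        names.append(ctype.decode("latin-1"))
        pos = body_end + 4
        if ctype == b"IEND":
            break  # anything after IEND is ignored
    if not names or names[-1] != "IEND":
        raise ValueError("missing IEND chunk")
    return names
```

For a typical small image the result is a list that starts with IHDR, contains one or more IDAT entries, and ends with IEND.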
The pattern is consistent: format validity = correct magic bytes + required internal structure + (for some formats) a valid trailer or index.
Real-world consequences
This matters in several practical places:
- File upload validators on the server must read actual bytes, not trust the client’s Content-Type or filename extension.
- Security scanners detect disguised malware by checking magic bytes against the declared extension.
- CDNs and image processors use magic bytes to route files to the right processing pipeline.
- Browser resource loading can fail with a MIME-type error when the server declares the wrong Content-Type, particularly under X-Content-Type-Options: nosniff, where a script or stylesheet served with a mismatched type is blocked rather than sniffed.
- Test harnesses that need real, parseable files — for integration tests, format converters, or load testing — can’t use zero-padded blobs. They need the real magic bytes and internal structure.
That last point is why the Sample File Generator builds files the right way. When you generate a 5 MB PDF, it produces a real PDF with a valid %PDF- header, a proper object tree, a cross-reference table, and %%EOF — not a 5 MB file of zeros with a .pdf extension. The same goes for ZIP, GZIP, JPEG, PNG, and the other supported formats. Those files pass signature checks, pass structural validators, and open correctly in the applications that read them. That’s the only kind of test file that’s worth having.
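To show what "a valid header, a proper object tree, a cross-reference table, and %%EOF" amounts to in bytes, here is a sketch that assembles a one-page blank PDF by hand. It is not the Sample File Generator's code; `build_minimal_pdf` is a hypothetical helper, and the page size and object layout are arbitrary choices.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page blank PDF with a correct xref table and trailer."""
    # Object tree: catalog -> page tree -> a single empty page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)
```

Every offset in the xref table is computed from the bytes actually written, which is the part a zero-filled file with a .pdf extension can never satisfy.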
Quick reference: how to check magic bytes yourself
If you have a file and want to verify its actual type, open a terminal:
```shell
# macOS / Linux: read the first 4 bytes as hex
xxd -l 4 somefile.bin

# Or use the file command (reads and matches against a signatures database)
file somefile.bin
```
The file command on Unix systems maintains its own database of magic byte patterns (the magic database) and applies them in priority order, reporting the best match. It’s the command-line equivalent of MIME sniffing — and it completely ignores the extension.
Understanding what the first few bytes of a file say about its type is one of those pieces of knowledge that unlocks a cleaner mental model of how file formats, browsers, and validators actually work. The extension is just clothing. The magic bytes are the identity.