Your Free Online PDF Parser Is Leaking Your Contracts. Here's the Fix.
Every Free Online PDF Parser Is Doing the Same Suspicious Thing
Open any "free online PDF to text" tool — smallpdf, ilovepdf, pdftotext.com, the half-dozen others that own the front page of Google for that query — and walk through what they actually do:
- You drag a PDF onto their site.
- The browser uploads the full file to their backend.
- Their server runs pdftotext (or similar), grabs the result.
- You download the text.
That's three steps you never see. Your file is now on someone else's machine. Their access log records your IP, browser, and the exact timestamp. The connection metadata — file size, destination, timing — is observable by your ISP, your VPN provider, your employer's network, and anyone with a hop on the path between you and the server. None of this is theoretical. All of it is the normal, boring behaviour of every server-backed file converter.
For a public PDF — a research paper, a press release, an old book chapter — none of this matters. For anything containing private data, it absolutely does. Contracts. Payslips. Medical records. Internal memos. Bank statements. Unreleased work. Legal correspondence. Tax returns. That's the long tail of what people actually feed into "convert PDF to text" tools, and it's exactly the wrong dataset to be shipping off to a third party every time.
"But They Say They Delete It After an Hour"
Maybe they do. Maybe they don't. You have no way to verify it. And "delete after an hour" doesn't help with the four things that already happened before the delete:
- Your file sat on their disk long enough to be backed up to an off-site snapshot. Those snapshots are rarely scrubbed in the same window as the live file.
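- Their CDN or reverse proxy may have buffered the upload payload at an edge node. Uploads are not normally cached the way static assets are, but a buffered copy on a Cloudflare or Fastly edge is outside the application's delete routine and might live for hours.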
- Their error-reporting tool (Sentry, Bugsnag, Datadog) snapshotted any request that hit a parser error. Sample requests get stored for debugging — sometimes indefinitely.
- The conversion produced an output file. Even when the input is deleted, the output is often retained as a "convenience" link for the user. The output text of a contract is just as sensitive as the contract itself.
This isn't a smear on any specific service. It's how server-side conversion infrastructure works. The privacy promise is structurally weak — there's no mechanism the user can verify.
The Fix: Run the Parser in the Browser Instead
The trick is that PDF parsing doesn't need a server. It needs CPU and memory, both of which your browser has plenty of. For years the obstacle was that the best parsers could not run in a tab: pdftotext and pdfium are native C++ code, and pdfminer.six and pdfplumber are Python libraries that need a Python runtime. You couldn't ship them to a browser.
WebAssembly changes that. Take a C/Rust parser, compile it to a .wasm binary, ship it as a static asset, and run it inside the user's tab. The bytes flow exactly as far as the user's RAM. No server. No log. No ISP visibility beyond the one-time WASM download.
Liteparse is one of these: a Rust-based PDF parser from the team behind LlamaIndex, compiled to WebAssembly and published as @llamaindex/liteparse-wasm on npm under Apache 2.0. It does spatial text extraction with bounding boxes, runs at usable speed for documents up to a few hundred pages, and weighs about 4 MB compressed. The module is cacheable, so that download is paid once rather than on every parse. We've wired it into a free FindUtils tool: the PDF Text Extractor.
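To make the "compile to .wasm, run it in the tab" idea concrete, here is a minimal sketch that runs a WebAssembly module entirely in local memory. The module is a tiny hand-assembled stand-in that exports a single `add` function. It is not liteparse and says nothing about liteparse's actual API; the names `ADD_MODULE` and `instantiateLocally` are made up for illustration.

```typescript
// A hand-assembled WebAssembly module exporting add(a, b) -> a + b.
// It stands in for a real parser binary purely for illustration.
const ADD_MODULE = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm" magic, version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // one function, using type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export it as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local.get 0, local.get 1, i32.add
]);

// Compile and instantiate from bytes already in memory. No network is involved.
function instantiateLocally(bytes: Uint8Array): (a: number, b: number) => number {
  const compiled = new WebAssembly.Module(bytes);
  const instance = new WebAssembly.Instance(compiled, {});
  return instance.exports.add as (a: number, b: number) => number;
}

const add = instantiateLocally(ADD_MODULE);
console.log(add(2, 3));
```

Synchronous compilation is used here only because the module is tiny. A real page would more likely fetch the binary once and use `WebAssembly.instantiateStreaming`, after which every call into the module stays inside the tab.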
What "Zero Upload" Actually Looks Like
Open the PDF Text Extractor and the FindUtils brand pattern is right there: 🆓 Free · ⚡ No install · 🔓 No sign-up. Drop a PDF. Three tabs populate — Text, JSON, Pages — without anything leaving your browser. To prove it:
- Open your browser's developer tools (Cmd+Option+I on Mac, F12 on Windows).
- Switch to the Network tab and clear it.
- Drop a PDF.
- Watch the Network tab. You see exactly one request: the WebAssembly module (around 4 MB, cached after the first parse). Your PDF bytes never appear in any outgoing request.
You can take it further: disconnect from the internet entirely, then drop a PDF. Parsing still works because the WASM module was already cached. That's the strongest possible proof — local code, local execution, local result. No marketing copy on the page can match a Network-tab check the user does themselves.
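If you would rather script part of that check, one rough option is to wrap `fetch` and log what leaves the page. This is only a sketch: `recordFetches`, `bodySize`, and `OutgoingRequest` are hypothetical names, and the wrapper does not see XMLHttpRequest, `sendBeacon`, or WebSocket traffic, so the Network tab remains the authoritative view.

```typescript
// One record per outgoing fetch() call seen by the wrapper.
interface OutgoingRequest {
  url: string;
  method: string;
  bodyBytes: number | null; // null: size not inspected (streams, FormData, Request bodies)
}

// Best-effort body size without consuming any stream.
function bodySize(body: unknown): number | null {
  if (body === undefined || body === null) return 0;
  if (typeof body === "string") return new TextEncoder().encode(body).byteLength;
  if (body instanceof ArrayBuffer) return body.byteLength;
  if (ArrayBuffer.isView(body)) return body.byteLength;
  if (body instanceof Blob) return body.size;
  return null;
}

// Replaces globalThis.fetch with a logging wrapper; returns a function that restores it.
function recordFetches(log: OutgoingRequest[]): () => void {
  const original = globalThis.fetch;
  globalThis.fetch = ((input: any, init?: any) => {
    const isRequest = typeof input === "object" && input !== null && "url" in input;
    const url: string = isRequest ? input.url : String(input);
    const method = String(init?.method ?? (isRequest ? input.method : "GET")).toUpperCase();
    // A Request object may carry its own body, which this sketch does not inspect.
    const bodyBytes = init?.body === undefined && isRequest ? null : bodySize(init?.body);
    log.push({ url, method, bodyBytes });
    return original.call(globalThis, input, init);
  }) as typeof fetch;
  return () => {
    globalThis.fetch = original;
  };
}
```

Install the recorder before dropping a PDF, then look for entries where `bodyBytes` is `null` or greater than zero. For a parser that really runs locally, the log should contain no request carrying a file-sized body.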
What You Get in Three Tabs
A single parse produces three outputs:
Text: layout-preserved plain text. Paragraphs and columns stay intact. This is "copy text out of a PDF" mode, the use case that drives 80% of online traffic for the keyword.
JSON: every text span on every page, with its bounding box and page index. This is the output you actually want if you're building anything downstream: RAG indexes, table extractors, citation renderers, structured-data scrapers. Server-based competitors usually skip this tier because their backend is wired for text-only.
Pages: per-page rendered PNGs, available when liteparse emits screenshots. Useful for archive thumbnails or for verifying which pages of a long PDF you actually parsed.
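As an illustration of why the span-level JSON is useful downstream, the sketch below regroups spans into text lines using their bounding boxes. The `Span` shape (`page`, `text`, `x`, `y`, `width`, `height`) and a top-left origin with `y` growing downward are assumptions made for this example, not liteparse's documented schema. Check the real output before relying on any field name.

```typescript
// Hypothetical span shape: one piece of text plus its bounding box.
interface Span {
  page: number;
  text: string;
  x: number; // left edge
  y: number; // top edge (assumed to grow downward)
  width: number;
  height: number;
}

// Groups spans whose top edges are within `tolerance` units into one line,
// then joins each line left to right.
function spansToLines(spans: readonly Span[], tolerance = 2): string[] {
  const sorted = [...spans].sort((a, b) => a.page - b.page || a.y - b.y || a.x - b.x);
  const lines: Span[][] = [];
  for (const span of sorted) {
    const current = lines.length > 0 ? lines[lines.length - 1] : undefined;
    if (
      current !== undefined &&
      current[0].page === span.page &&
      Math.abs(current[0].y - span.y) <= tolerance
    ) {
      current.push(span);
    } else {
      lines.push([span]);
    }
  }
  return lines.map((line) =>
    [...line].sort((a, b) => a.x - b.x).map((s) => s.text).join(" "),
  );
}
```

The same grouping idea extends to table extraction (cluster by `x` as well) or to citation rendering (keep the box so a highlight can be drawn over the page image).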
Click Copy on any tab to push it to your clipboard. Click Download to save it as .txt, .json, or .png. All operations are local DOM events with no network involvement.
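A download that never touches the network usually comes down to a Blob and an object URL. The sketch below shows that general browser technique. It is not taken from the FindUtils source, and `toObjectUrl` is a hypothetical helper name.

```typescript
// Wraps already-extracted text in a Blob and returns a local blob: URL for it.
// The URL points at memory owned by this page, not at any server.
function toObjectUrl(
  text: string,
  mime: "text/plain" | "application/json",
): { url: string; revoke: () => void } {
  const blob = new Blob([text], { type: mime });
  const url = URL.createObjectURL(blob);
  // Revoke when done so the browser can free the memory.
  return { url, revoke: () => URL.revokeObjectURL(url) };
}
```

In a page, the returned `url` would typically be set as the `href` of a link with a `download` attribute and then clicked programmatically. Copy can use `navigator.clipboard.writeText`, which likewise involves no request.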
When You Should Still Use a Server-Based Tool
I'm not going to pretend the browser approach wins every category. Three situations where a server makes sense:
- Heavy OCR on scanned PDFs. Image-based PDFs need an OCR pass, which recognizes characters in raster images rather than parsing embedded text. Running Tesseract or a hosted OCR API in a browser tab works but is much slower than a beefy server with GPU access.
- Files larger than 50 MB. Browser memory is generous but not infinite. The FindUtils extractor caps at 50 MB to keep parsing reliable on mid-range laptops and phones. Larger PDFs benefit from a server with more RAM. (You can also split locally and parse each piece.)
- Bulk batch processing. If you're parsing 10,000 PDFs as a one-off job, you want a server job runner with parallel workers, not 10,000 manual tab loads. For batch work, run liteparse server-side yourself — it's the same Apache 2.0 codebase.
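For the bulk case, the core of a "job runner with parallel workers" is a concurrency-limited map. The sketch below is generic: `mapWithConcurrency` is a hypothetical helper, and the per-file parse function is whatever you wire in yourself. No liteparse call is shown because its server-side API is not covered here.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

A call such as `mapWithConcurrency(paths, 8, parseOne)` keeps eight parses in flight, where `parseOne` is your own function. If any parse rejects, the whole call rejects, so wrap `parseOne` in a try/catch if one bad PDF should not stop the batch.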
For the manual, single-document, "I just need the text out of this thing" workflow, the browser parser is a strict improvement.
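For that single-document path, a size cap like the 50 MB one mentioned above can be enforced locally before any parsing starts. The sketch below is an assumption-laden illustration: it reads "50 MB" as 50 MiB, adds a strict check for the `%PDF-` header at byte 0 that the article does not mention, and uses the hypothetical names `precheckPdf` and `MAX_PDF_BYTES`.

```typescript
const MAX_PDF_BYTES = 50 * 1024 * 1024; // assumption: "50 MB" interpreted as 50 MiB
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // the ASCII bytes of "%PDF-"

type Precheck = { ok: true } | { ok: false; reason: string };

// Cheap local checks on bytes already in memory; nothing is sent anywhere.
function precheckPdf(bytes: Uint8Array, maxBytes: number = MAX_PDF_BYTES): Precheck {
  if (bytes.byteLength > maxBytes) {
    return { ok: false, reason: `file is ${bytes.byteLength} bytes; limit is ${maxBytes}` };
  }
  // Strict check: header must be at byte 0 (a simplification, some readers are more lenient).
  const hasMagic = PDF_MAGIC.every((b, i) => bytes[i] === b);
  if (!hasMagic) {
    return { ok: false, reason: "missing %PDF- header" };
  }
  return { ok: true };
}
```

In a page you would get the bytes with `new Uint8Array(await file.arrayBuffer())` and hand them to the parser only when the precheck passes.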
Side-by-Side Comparison
| Capability | smallpdf / ilovepdf / similar | FindUtils PDF Text Extractor |
|---|---|---|
| File uploaded to a server | Yes | Never |
| Upload visible to ISP | Yes: destination, file size, timing | No, only the 4 MB WASM fetch |
| Tracker pixels on page | Common | None |
| Sign-up required | Optional / nagged | Never |
| Output formats | Text only | Text + structured JSON + PNG previews |
| File size cap (free) | 5–10 MB typical | 50 MB |
| Works offline | No | Yes, after first WASM load |
| Source of parser | Closed | Apache 2.0 — verifiable |
| Per-document price | Free with watermark / quota | Free, no quota, no watermark |
What This Means for Your Workflow
If you've been pasting contracts, medical records, or any private PDF into a free online tool to "just get the text out," stop. There's a better path that takes the same number of clicks, gives richer output, and proves the privacy guarantee with the Network tab.
Open the PDF Text Extractor, drop a PDF, and verify it yourself. Then bookmark the page — the parser is now a local utility, and the next parse starts instantly because the WebAssembly module is already cached.
Related Reading
- PDF Text Extractor Guide — the deeper how-to with bounding-box explanations and downstream workflow examples.
- JSON Formatter — pretty-print the JSON output before piping it into the next step.
- npm Supply Chain Attacks: How to Secure npm install With Docker Sandboxing — same privacy-first philosophy, applied to npm installs.
- Liteparse on GitHub — read the parser source yourself. Apache 2.0.
The web has always been good at "submit a form and get a result." It's just now finally getting good at "run the result locally and submit nothing." For PDFs, the FindUtils extractor is what that future looks like.