PDF → Text
Drop one or more PDFs and extract the text into one plain-text file per input. Page boundaries are marked with a customisable separator.
When to use this tool
Pull text out of a PDF for editing, searching, translating, pasting into a chat, or feeding to another tool. The output is plain UTF-8 text — no formatting, no layout. Great for "I just need the words" tasks.
Step by step
- Drop the PDFs. Each input becomes one .txt file in the output.
- Customise the page separator if you need a specific divider between pages. The default is --- Page {n} --- on its own line; use {n} for the page number.
- Click "Extract & download". Each result has a thumbnail-free download row; click it to save the .txt.
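The separator behaviour in step two can be sketched in a few lines of Python (join_pages is a hypothetical helper for illustration, not part of the tool; page texts are assumed to be plain strings):

```python
def join_pages(pages, separator="--- Page {n} ---"):
    """Join per-page text with a numbered separator line before each page.
    {n} in the separator is replaced with the 1-based page number."""
    parts = []
    for i, text in enumerate(pages, start=1):
        parts.append(separator.replace("{n}", str(i)))
        parts.append(text)
    return "\n".join(parts)
```

So join_pages(["first page", "second page"]) yields the separator line, the first page, the next separator line, and the second page, each on its own line.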
Common use cases
- Translation. Paste extracted text into Google Translate or DeepL; PDFs are awkward to translate directly.
- Searching. grep / ripgrep across a folder of .txt files is much faster than scanning PDFs.
- Quoting. Pull a paragraph out of a research paper to drop into your notes.
- LLM context. Most AI chatbots accept text better than PDFs — extract first, paste in.
- Word-counts and analysis. Run linguistic tools (NLTK, spaCy) on the extracted text.
- Plain-text archive. Convert a folder of PDFs to text for long-term, format-independent storage.
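The searching use case needs no special tooling once the text is out. A minimal grep-style sketch in Python (search_txt is a hypothetical helper, assuming the tool's default UTF-8 output):

```python
from pathlib import Path

def search_txt(folder, needle):
    """Case-insensitive search across every .txt file in a folder,
    returning (filename, line number, line) hits, grep-style."""
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle.lower() in line.lower():
                hits.append((path.name, lineno, line.strip()))
    return hits
```

For large archives, ripgrep (`rg`) will of course be much faster; this is just the shape of the task.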
Common mistakes
- Scanned PDFs return empty. If the file is an image of text (a scan), there's no text layer — extraction returns nothing. Use an OCR tool first (ocrmypdf, Tesseract).
- Expecting layout fidelity. Columns get flattened and tables come out as runs of words. For structured output try PDF → Markdown; for tabular data, use a dedicated table-extraction tool.
- Encoding issues. PDFs sometimes use unusual encodings, so the extracted text may contain odd or garbled characters in places. Check the output and fix them as needed.
FAQ
Why is my PDF returning no text?
Most likely it's a scan. Open it in a viewer — if you can't select / copy any text by hand, this tool can't extract any either. Run OCR first.
Is OCR coming?
Tesseract.js is on the wishlist. It's heavy (10+ MB WASM) so the implementation needs to load it on demand and only when you ask. If you have a strong opinion, drop me a note via contact.
Can I get the text per-page in separate files?
Not yet — the output is one .txt per source file with page separators. To split per page, run the .txt through any split-on-marker script.
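With the default separator, a split-on-marker script is only a few lines of Python (split_pages is a hypothetical helper; adjust the regex if you customised the separator):

```python
import re

# Matches the default separator on a line of its own, capturing the page number.
SEP = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)

def split_pages(text):
    """Split extracted text on the default page separator.
    Returns {page_number: page_text}."""
    pieces = SEP.split(text)
    # pieces = [preamble, "1", page 1 text, "2", page 2 text, ...]
    return {int(pieces[i]): pieces[i + 1].strip("\n")
            for i in range(1, len(pieces) - 1, 2)}
```

Write each value to its own file (e.g. page_3.txt) to get one file per page.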
Does it preserve hyphenation and line breaks?
Line breaks within a paragraph become spaces (so words don't run together). Page-final hyphens are not joined automatically — you may see a trailing dash on some words.
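If the trailing dashes bother you, a rough de-hyphenation pass is easy to sketch in Python (a heuristic, not part of the tool: it will also merge genuine hyphenated compounds that happen to sit before whitespace):

```python
import re

# A word character, a hyphen, whitespace, then another word character:
# the signature of a word split across a removed line break.
HYPHEN_GAP = re.compile(r"(\w)-\s+(\w)")

def dehyphenate(text):
    """Rejoin words left split by a trailing hyphen plus whitespace."""
    return HYPHEN_GAP.sub(r"\1\2", text)
```

For example, dehyphenate("auto- matic transla- tion") gives "automatic translation". Review the result if your text uses many real hyphenated compounds.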