PDF → Text
Drop one or more PDFs and extract the text into one plain-text file per input. Page boundaries are marked with a customisable separator.
When to use this tool
Pull text out of a PDF for editing, searching, translating, pasting into a chat, or feeding to another tool. The output is plain UTF-8 text — no formatting, no layout. Great for "I just need the words" tasks.
Step by step
- Drop the PDFs. Each input becomes one .txt file in the output.
- Customise the page separator if you need a specific divider between pages. The default is --- Page {n} --- on its own line; use {n} for the page number.
- Click "Extract & download". Each result has a thumbnail-free download row; click it to save the .txt.
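The separator behaviour in step two can be sketched in a few lines of Python (join_pages is a hypothetical helper for illustration, not part of the tool; page texts are assumed to be plain strings):

```python
def join_pages(pages, separator="--- Page {n} ---"):
    """Join per-page text with a numbered separator line before each page.
    {n} in the separator is replaced with the 1-based page number."""
    parts = []
    for i, text in enumerate(pages, start=1):
        parts.append(separator.replace("{n}", str(i)))
        parts.append(text)
    return "\n".join(parts)
```

So join_pages(["first page", "second page"]) yields the separator line, the first page, the next separator line, and the second page, each on its own line.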
Common use cases
- Translation. Paste extracted text into Google Translate or DeepL; PDFs are awkward to translate directly.
- Searching. grep / ripgrep across a folder of .txt files is much faster than scanning PDFs.
- Quoting. Pull a paragraph out of a research paper to drop into your notes.
- LLM context. Most AI chatbots accept text better than PDFs — extract first, paste in.
- Word-counts and analysis. Run linguistic tools (NLTK, spaCy) on the extracted text.
- Plain-text archive. Convert a folder of PDFs to text for long-term, format-independent storage.
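The searching use case needs no special tooling once the text is out. A minimal grep-style sketch in Python (search_txt is a hypothetical helper, assuming the tool's default UTF-8 output):

```python
from pathlib import Path

def search_txt(folder, needle):
    """Case-insensitive search across every .txt file in a folder,
    returning (filename, line number, line) hits, grep-style."""
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle.lower() in line.lower():
                hits.append((path.name, lineno, line.strip()))
    return hits
```

For large archives, ripgrep (`rg`) will of course be much faster; this is just the shape of the task.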
Common mistakes
- Scanned PDFs return empty. If the file is an image of text (a scan), there's no text layer — extraction returns nothing. Use an OCR tool first (ocrmypdf, Tesseract).
- Expecting layout fidelity. Columns get flattened and tables come out as runs of words. For structured output try PDF → Markdown; for tabular data, use a dedicated table-extraction tool.
- Encoding issues. PDFs sometimes use unusual encodings, so the extracted text may contain odd or garbled characters in places. Check the output and fix them as needed.
FAQ
Why is my PDF returning no text?
Most likely it's a scan. Open it in a viewer — if you can't select / copy any text by hand, this tool can't extract any either. Run OCR first.
Is OCR coming?
Tesseract.js is on the wishlist. It's heavy (10+ MB WASM) so the implementation needs to load it on demand and only when you ask. If you have a strong opinion, drop me a note via contact.
Can I get the text per-page in separate files?
Not yet — the output is one .txt per source file with page separators. To split per page, run the .txt through any split-on-marker script.
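With the default separator, a split-on-marker script is only a few lines of Python (split_pages is a hypothetical helper; adjust the regex if you customised the separator):

```python
import re

# Matches the default separator on a line of its own, capturing the page number.
SEP = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)

def split_pages(text):
    """Split extracted text on the default page separator.
    Returns {page_number: page_text}."""
    pieces = SEP.split(text)
    # pieces = [preamble, "1", page 1 text, "2", page 2 text, ...]
    return {int(pieces[i]): pieces[i + 1].strip("\n")
            for i in range(1, len(pieces) - 1, 2)}
```

Write each value to its own file (e.g. page_3.txt) to get one file per page.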
Does it preserve hyphenation and line breaks?
Line breaks within a paragraph become spaces (so words don't run together). Page-final hyphens are not joined automatically — you may see a trailing dash on some words.
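If the trailing dashes bother you, a rough de-hyphenation pass is easy to sketch in Python (a heuristic, not part of the tool: it will also merge genuine hyphenated compounds that happen to sit before whitespace):

```python
import re

# A word character, a hyphen, whitespace, then another word character:
# the signature of a word split across a removed line break.
HYPHEN_GAP = re.compile(r"(\w)-\s+(\w)")

def dehyphenate(text):
    """Rejoin words left split by a trailing hyphen plus whitespace."""
    return HYPHEN_GAP.sub(r"\1\2", text)
```

For example, dehyphenate("auto- matic transla- tion") gives "automatic translation". Review the result if your text uses many real hyphenated compounds.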