WikiPlus

FAQ: PDF Text Extraction Answered

PDF text extraction seems like it should be simple — get the text that is already in the file — but it generates a surprising number of questions. Why does extraction sometimes produce empty output? Why does the text appear in the wrong order? Why do some characters come out as boxes or question marks? Why does a PDF that looks perfectly readable produce no extractable text? This FAQ answers the most common questions about PDF text extraction, organized by the type of problem you are experiencing.

Questions About Why Extraction Fails or Produces No Output

Why does the tool produce empty output even though I can see text in the PDF?
Your PDF is most likely a scanned-image PDF. The pages are stored as raster images (photographs of the text), not as encoded character data. There is no text to extract, only pixel data. To get the text, you need OCR (Optical Character Recognition) software to analyze the images and infer the characters. See the PDF to Text vs OCR guide for tool recommendations.
Why does extraction produce output on some pages but nothing on others?
This is a mixed or hybrid PDF. Some pages were created digitally (and contain text data) while others were scanned or contain embedded images. The extraction tool correctly retrieves text from the digital pages and produces nothing for the image-only pages. If text from the image pages is important, apply OCR specifically to those pages.
The PDF says 'text could not be extracted'. What does that mean?
Some PDFs use font encoding schemes that map PDF character codes to Unicode incorrectly or not at all. When the extraction engine encounters character codes it cannot map to Unicode, it cannot produce meaningful text output. This occurs more often in older PDFs or PDFs generated by specialized or legacy software with non-standard character encoding. In these cases, OCR of the visual page may actually produce better results than direct extraction.
I can search the PDF with Ctrl+F but the extractor produces nothing.
This is unusual. It may indicate the PDF uses a hidden text layer for search that does not correspond to the main content stream. Try a different extraction engine (for example, Adobe Acrobat's Export to Text feature) to compare. It may also indicate the PDF uses the accessibility ActualText property to store a different text representation than what appears visually: the viewer searches this alternative text, but some extractors do not read it.
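For the mixed-PDF case, one way to decide which pages to send to OCR is to check how much text each page actually produced. The sketch below assumes you already have one extracted string per page from whatever extractor you use; the function name `pages_needing_ocr` and the `min_chars` threshold of 20 are illustrative choices, not part of any tool.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is empty or nearly empty.

    page_texts: list of strings, one per page, in page order (assumed input).
    min_chars:  hypothetical threshold; pages with fewer non-whitespace
                characters than this are treated as image-only.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # Count only non-whitespace characters so stray newlines do not
        # make a scanned page look like it contains text.
        if len("".join(text.split())) < min_chars
    ]
```

A threshold above zero also flags pages where the only extractable text is a page number or a short stamp; tune it to your documents.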

Questions About Text Quality and Order

Why is the extracted text out of order or mixing content from different columns?
PDF content streams store text in the order it was written during PDF creation, not necessarily in visual reading order. For multi-column layouts, text from both columns may be interleaved in the stream. Enable reading-order reconstruction in the tool if available. This sorts text blocks by their page coordinates (top-to-bottom, left-to-right) to approximate visual reading order. For complex layouts, the result may still require manual reordering of some sections.
Why does each line of the original PDF appear as a separate paragraph in the output?
The PDF positions each visual line separately rather than storing flowing paragraphs, so the extractor emits a line break at the end of every visual line. These line breaks can look like paragraph breaks in the output. Use a text editor's find-and-replace with the regex pattern `(?<!\n)\n(?!\n)` (a single newline that is neither preceded nor followed by another newline) replaced with a space to join lines within paragraphs while preserving paragraph breaks (double newlines). The simpler pattern `\n(?!\n)` is not enough: it also matches the second newline of each paragraph break and collapses the break.
Some words appear with a hyphen in the middle. Why?
These are end-of-line hyphens from the original PDF's typesetting. The text was hyphenated at a line break for visual formatting, and the extraction preserves the hyphen as it appears in the text stream. Use the regex `-(\r?\n)` replaced with an empty string to rejoin hyphenated words. Run it before joining lines, because it relies on the line break still being present, and test carefully to avoid removing intentional hyphens in compound words that happen to fall at a line end.
The numbers in the extracted text look wrong or are missing decimal points.
Some PDF creators encode numbers using specialized fonts where the standard character code mapping does not apply correctly. This is more common in PDFs generated by financial systems, CAD software, or specialized databases. Try extracting with a different tool to see if results differ. Adobe Acrobat's extraction handles some of these edge cases better than other engines.
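The line-joining and de-hyphenation cleanup described above can also be scripted. The following is a minimal Python sketch, not the tool's own behavior. Two details are assumptions added here: the newline pattern includes a lookbehind, `(?<!\n)\n(?!\n)`, so that the second newline of a paragraph break is left alone, and the hyphen rule only fires when the next line starts with a lowercase letter, a heuristic to spare some intentional hyphens.

```python
import re

def clean_extracted_text(text: str) -> str:
    """Rejoin end-of-line hyphenation and unwrap lines within paragraphs."""
    # Normalize Windows line endings so the patterns below only deal with \n.
    text = text.replace("\r\n", "\n")
    # Step 1: remove a hyphen at a line end when it sits between a word
    # character and a lowercase letter ("extrac-\ntion" -> "extraction").
    # The lowercase check is a heuristic, not a guarantee.
    text = re.sub(r"(?<=\w)-\n(?=[a-z])", "", text)
    # Step 2: turn single newlines into spaces. Newlines that are part of a
    # blank-line paragraph break (two or more in a row) are not matched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

The order matters: de-hyphenation has to run first, because once the lines are joined there is no line break left for the hyphen pattern to match.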

Questions About Specific PDF Types

Does the tool work on password-protected PDFs?
It depends on the type of protection. Owner-password PDFs (which restrict copying and editing in viewers) can be processed: the tool reads content streams directly, which accessibility tools are permitted to do. User-password PDFs (which encrypt the content and require a password to open) cannot be processed without the password. Enter the password when prompted.
Does the tool extract text from PDF forms with filled-in data?
Yes. Filled form field values are stored in the PDF's form data structure. The extractor reads both the fixed text content (form labels, instructions) and the interactive form field values (filled-in text, selected options). All text visible in the form, both static and data-filled, is extracted.
Can I extract text from a PDF that is mostly charts and graphs?
Text in chart labels, axis labels, legend text, and titles extracts correctly, because these are text elements in the PDF. The actual chart graphics (bars, lines, pie slices) are image or vector data that cannot be extracted as text. For a report that is 90 percent charts, the extracted text file will contain chart titles and labels but not the data values represented by the chart graphics unless those values also appear as text annotations.
Does the tool handle right-to-left languages like Arabic and Hebrew?
Yes. The MuPDF engine correctly handles RTL text direction and bidirectional text. Arabic and Hebrew characters extract with correct Unicode encoding and text direction. The plain text output file stores RTL text as Unicode sequences. Text editors and downstream tools that support RTL rendering will display it correctly; tools that only render LTR may show the characters in visual reverse order.
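If you are unsure whether an extracted file contains right-to-left text, and therefore needs an RTL-aware editor, the Unicode bidirectional class of each character gives a quick answer. This is a small sketch using Python's standard `unicodedata` module; the function name `contains_rtl` is illustrative.

```python
import unicodedata

def contains_rtl(text: str) -> bool:
    """Return True if any character has a right-to-left bidirectional class.

    "R" covers Hebrew letters and similar scripts; "AL" covers Arabic letters.
    """
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
```

A `True` result only says RTL characters are present. It does not check that the display order is correct in a particular viewer.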

Questions About Using Extracted Text

What is the best text editor for working with extracted PDF text?
For files under 10 MB (most extracted text files), any editor works. For large files (10 MB to 1 GB, from very long documents), use a code editor: VS Code, Sublime Text, Notepad++ (Windows), or BBEdit (macOS). These handle large files efficiently and provide regex find-and-replace, which is essential for post-extraction cleanup. Avoid Word or Google Docs for initial cleanup, because they add formatting overhead and can be slow with large plain text files.
How do I import extracted PDF text into Microsoft Word?
Open Word, go to File > Open, and select the .txt file. If Word prompts for an encoding, choose UTF-8. The text imports as unformatted plain text. Apply your desired formatting (heading styles, paragraph styles) to organize the document. Alternatively, copy the text from a text editor and paste it into a new Word document.
Can I extract text from a PDF and put it back into a different PDF with different formatting?
Yes, but this involves multiple steps: extract text from the original PDF, edit the text in a word processor, reformat with the desired layout, and export as a new PDF. The extraction step is just the first part of this workflow. The resulting .txt gives you the content, which you then reuse in your new document.
Is extracted PDF text appropriate for feeding to large language models (LLMs) like GPT or Claude?
Yes. Plain text is a format LLMs handle well. Extracted PDF text, with standard cleanup (boilerplate removal, line-break normalization) applied, is one of the most common input methods for AI document analysis. Most LLM interfaces accept pasted text directly. For programmatic LLM APIs, the .txt file can be read and passed as the prompt or document content.
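For the programmatic case, reading the .txt file and wrapping it in a prompt takes only a few lines. The sketch below stops short of calling any particular LLM API, since those differ by provider. The function name `build_prompt` and the prompt wording are hypothetical and only show the shape of the step.

```python
from pathlib import Path

def build_prompt(txt_path: str, question: str) -> str:
    """Read an extracted .txt file and combine it with a question."""
    # Read as UTF-8; errors="replace" substitutes any invalid byte so a
    # single bad character does not abort the whole read.
    document = Path(txt_path).read_text(encoding="utf-8", errors="replace")
    # The wording below is an illustrative template, not a required format.
    return (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{document.strip()}\n\n"
        f"Question: {question}"
    )
```

The returned string can then be passed to whichever client library you use. Very long documents may exceed a model's input limit and need to be split first.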

Frequently Asked Questions

My extracted text has lots of symbols and boxes where letters should be — how do I fix this?
This indicates a character encoding issue in the original PDF. The PDF uses a font with a non-standard character mapping that does not correctly translate to Unicode. Possible fixes: try extracting with a different tool (Adobe Acrobat's Export to Text uses a different encoding resolution approach); check if the PDF is an older version that uses a legacy encoding; or if the PDF is a scanned document, try OCR instead of direct extraction. For PDFs you created, re-exporting from the source application with proper Unicode font embedding typically resolves encoding issues.
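To judge how badly a file is affected before choosing a fix, you can measure the share of characters that look like encoding failures. This is a rough Python sketch under an assumption: that failed mappings mostly surface as the Unicode replacement character (U+FFFD), private-use code points, or unassigned code points. Other kinds of garbling, such as wrong but valid letters, will not be detected. The name `garbled_ratio` is illustrative.

```python
import unicodedata

def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look unmapped."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    suspicious = sum(
        1
        for ch in visible
        # U+FFFD is the replacement character; "Co" is private use and
        # "Cn" is unassigned in the Unicode general categories.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn")
    )
    return suspicious / len(visible)
```

A ratio near zero suggests a few isolated glyph problems. A high ratio suggests the font mapping is broken throughout and OCR or a different extraction engine is the more practical route.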
How long does it take to extract text from a 500-page PDF?
For a 500-page text-based PDF (no large embedded images), extraction typically completes in 10 to 30 seconds in a modern browser with WebAssembly. The time is dominated by parsing the PDF's internal structure and iterating through content streams. PDFs with many embedded images take longer because the parser must process image object headers even though images themselves are not extracted as text. For very large files where the browser tool is slow, command-line tools (pdftotext or mutool) on the same hardware are typically 5 to 10 times faster.
Can I extract text from a PDF that I received by email without downloading it first?
You need to download the PDF file to use the browser-based tool — it requires a local file as input. Download the attachment to your device, then drag it into the tool. If you are on a mobile device or a device where saving attachments is inconvenient, Google Drive's 'Open with Docs' feature can process a PDF directly from Gmail attachments without a separate download step.