How to Extract Text From a Scanned PDF

A scanned PDF stores each page as a pixel image, not as text data. You can see the words on screen, but there is nothing for software to grab when you try to copy or search. Extracting usable text from these documents requires OCR (optical character recognition). This guide walks you through a fast, private way to do that: a free, browser-based OCR tool powered by Tesseract.js that requires no software installation, no account, and no cloud upload. By the end you will have copyable text from your scanned PDF, usually within minutes.

Why You Cannot Copy Text From a Scanned PDF Directly

PDF is a flexible container format. It can hold vector text (characters stored as glyph codes with font information, fully selectable and searchable), raster images (photographs or scans stored as pixel grids, not selectable), or a combination of both. When a document is scanned from paper or photographed, the resulting PDF contains only the raster image of each page.

When you try to select text in a scanned PDF, your PDF viewer has nothing to select. It sees a picture of text, not actual text data. The Ctrl+C shortcut, the selection cursor, and the search function all operate on the text layer. If there is none, they have nothing to work with.

This is why scanned PDFs behave so differently from 'normal' PDFs. A document exported from Word or produced by a print-to-PDF driver is text-based: you can select it, search it, and copy it. A scan is image-based: you cannot do any of those things without first running OCR.

The distinction is visible in your PDF viewer. In Adobe Acrobat, go to Edit > Find (Ctrl+F) and search for a word you can see in the document. In a text-based PDF, it will be found immediately. In a scanned PDF with no text layer, Acrobat will report no results. This test quickly tells you whether a PDF needs OCR.

Another tell: try clicking anywhere on the page. In a text-based PDF, clicking on a word positions your cursor within the text. In a scanned PDF, clicking has no effect because the entire page is treated as a single image object. Once you have confirmed the PDF is image-based, OCR is the solution.
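The same check can be automated when you have many files to sort. The sketch below assumes you have already pulled the text of each page out with whatever PDF library you prefer (that extraction step is not shown), and it flags pages whose text layer is empty or nearly empty. The function name and the 20-character threshold are illustrative assumptions, not part of any tool described here.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is (almost) empty.

    page_texts: one string per page, as returned by a PDF text extractor.
    min_chars:  assumed threshold; pages with fewer non-whitespace-trimmed
                characters than this are treated as image-only.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if len(text.strip()) < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = ["Hello world, this is a text page.", "", " \n"]
    print(pages_needing_ocr(sample))
```

If every page is flagged, the file is a pure scan. If only some are, the PDF mixes text-based and scanned pages, and only the flagged pages need OCR.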

Step-by-Step: Extract Text Using the Browser OCR Tool

Here is the complete process from scanned PDF to extracted text using the free browser-based tool.

Step 1: Open the tool. Navigate to the PDF OCR tool in your browser. No extensions, plugins, or accounts are needed. The tool works in Chrome, Firefox, Edge, and Safari on both desktop and mobile devices.

Step 2: Load your PDF. Click the upload button or drag your scanned PDF directly onto the tool's drop zone. Files up to several hundred pages are supported. The file loads into your browser's local memory and is not sent to any server.

Step 3: Pick your language. The language selector defaults to English. If your document is in another language, choose it from the dropdown. The correct language model improves both character recognition and word assembly. For multilingual documents, select the primary language.

Step 4: Start processing. Click the Extract Text (or Process) button. A progress indicator appears as Tesseract works through each page. Processing speed depends on your device's CPU speed and the number of pages. Expect roughly 2–5 seconds per page on a modern laptop.

Step 5: Review the output. The recognized text appears in the output area, organized by page. Look over the text for any obvious recognition errors, especially in numbers, proper names, and any technical or domain-specific vocabulary.

Step 6: Copy or download. Use the Copy button to copy all extracted text to your clipboard, or use the Download button to save it as a plain .txt file. From there, paste it into Word, Google Docs, Notepad, or any other application where you need to use the text.

The entire process typically takes under two minutes for a standard 5–10 page document on a modern computer.
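To get a feel for how long a larger file might take, you can turn the 2–5 seconds per page figure into a rough range. This is only back-of-the-envelope arithmetic based on that estimate. The function name is made up for illustration, and real timings depend on your device and the scan quality.

```python
def estimate_ocr_seconds(page_count, low_per_page=2.0, high_per_page=5.0):
    """Return a (low, high) estimate of OCR processing time in seconds.

    The per-page defaults come from the rough 2-5 second figure quoted
    for a modern laptop; they are estimates, not measurements.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return (page_count * low_per_page, page_count * high_per_page)


if __name__ == "__main__":
    low, high = estimate_ocr_seconds(10)
    print(f"10 pages: about {low:.0f} to {high:.0f} seconds")
```

A 10-page document comes out at about 20 to 50 seconds of processing, which fits the "under two minutes" figure once loading and review are included.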

Common Scanned PDF Problems and How to Handle Them

Real-world scanned PDFs vary enormously in quality. Here is how to handle the most common problems.

Skewed pages: If the document was not placed perfectly straight on the scanner, pages may be slightly rotated. Tesseract tolerates small rotations and can usually recognize text that is tilted by a few degrees. For heavily skewed pages (more than 10 degrees), accuracy drops significantly. If you can re-scan, keep the document straight. Otherwise, straighten the page in an image editor before processing.

Low resolution: Scans below 200 DPI produce blurry character images that are difficult for OCR to distinguish. The standard recommendation is 300 DPI minimum for OCR. If you have the original documents, re-scan at higher resolution. If not, some image upscaling tools (including AI-based upscalers) can improve the apparent resolution, which may help OCR accuracy.

Double-page spreads: Books and magazines scanned as two-page spreads (one scan covering two facing pages) are processed as a single page. Lines from the left and right pages can be merged across the gutter, so the extracted text may not follow the true reading order. Crop the spread into individual pages before OCR.

Mixed text and images: Documents with images interspersed with text (newsletters, reports with charts) will have OCR run on the entire page. Tesseract identifies text regions automatically and skips areas it identifies as non-text (photographs, diagrams, logos). The extracted text will include the text content without the visual elements.

Tables: Tabular data extracted by OCR will appear as unstructured text, because the column alignment that was visual in the scan is lost in plain text output. For structured table extraction, you can reformat the text output manually, or use a dedicated table extraction tool, which may give better results.
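For the low-resolution case, you can check a scan against the 300 DPI recommendation with simple arithmetic: divide the image's pixel dimensions by the physical page size in inches. The sketch below assumes a US Letter page (8.5 x 11 inches) by default, and both function names are illustrative. Substitute the real paper size if you know it.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in=8.5, page_height_in=11.0):
    """Estimate scan resolution from pixel size and physical page size.

    Uses the smaller of the horizontal and vertical values, since the
    weaker axis limits how sharp the characters are.
    The default page size (US Letter) is an assumption.
    """
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def upscale_factor_for_ocr(dpi, target_dpi=300.0):
    """Return the scale factor needed to reach target_dpi (1.0 if already there)."""
    return max(1.0, target_dpi / dpi)


if __name__ == "__main__":
    dpi = effective_dpi(1700, 2200)
    print(f"Estimated resolution: {dpi:.0f} DPI")
    print(f"Suggested upscale factor: {upscale_factor_for_ocr(dpi):.2f}x")
```

A 1700 x 2200 pixel image of a Letter page works out to 200 DPI, so it would need a 1.5x upscale to reach 300 DPI. Upscaling only interpolates existing pixels, so re-scanning at a higher resolution remains the better option when the paper original is available.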

What to Do With Extracted Text: Practical Use Cases

Once you have extracted text from a scanned PDF, the range of things you can do with it is broad. Here are the most common and useful workflows.

Searching and archiving: Paste the extracted text into your document management system or notes application alongside the original scan. Future keyword searches will find the document even though the original scan has no embedded text layer.

Editing and repurposing: Paste into Microsoft Word or Google Docs to create an editable version of the document's content. You will need to clean up formatting, because the plain text output has no bolding, headings, or column structure. You should also proofread for recognition errors. Even so, most of the character content is already there, which saves significant retyping time.

Data extraction from forms and invoices: For documents like invoices, receipts, or forms, the OCR output gives you the raw text values that can then be parsed. Copy the relevant fields (invoice number, date, total amount) into a spreadsheet for record-keeping. Regular expressions or simple text parsing scripts can automate this for large volumes.

Accessibility: Converting scanned meeting notes, historical records, or institutional documents into text makes them accessible to screen readers, translatable via translation services, and indexable by content management systems.

Translation: Paste the extracted text into a translation service (Google Translate, DeepL) to get a quick translation of a foreign-language scanned document. This workflow (scan, OCR, then translate) is extremely useful for working with documents in languages you do not read.

Content indexing for websites: If you publish PDFs on a public website, running OCR on scanned PDFs and embedding the resulting text as page metadata or HTML content improves search engine indexing of that content.
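As an illustration of the invoice workflow, here is a minimal sketch of parsing three fields out of OCR text with regular expressions. The label wording ("Invoice No", "Total Due"), the ISO date format, and the sample text are all assumptions made for the example. Real invoices vary widely, so expect to adapt the patterns to your own documents.

```python
import re

# Hypothetical patterns; adjust them to the labels your invoices actually use.
INVOICE_NO = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
# \b keeps "Subtotal" from matching as "Total".
TOTAL = re.compile(
    r"\bTotal\s*(?:Due|Amount)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)


def parse_invoice(text):
    """Extract invoice number, date, and total from OCR text.

    Fields that cannot be found are returned as None.
    """
    def first(pattern):
        match = pattern.search(text)
        return match.group(1) if match else None

    total = first(TOTAL)
    return {
        "invoice_number": first(INVOICE_NO),
        "date": first(ISO_DATE),
        "total": float(total.replace(",", "")) if total else None,
    }


if __name__ == "__main__":
    sample = (
        "ACME Corp\n"
        "Invoice No: INV-2041\n"
        "Date: 2024-03-15\n"
        "Subtotal: $1,150.00\n"
        "Total Due: $1,234.50\n"
    )
    print(parse_invoice(sample))
```

Because OCR commonly confuses characters such as 0 and O or 1 and l, it is worth spot-checking parsed numbers against the original scan before relying on them.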

Frequently Asked Questions

How do I know if my PDF is scanned (image-based) or text-based?
Try to click on and select some text in your PDF viewer. If a text cursor appears and you can highlight words, the PDF has a text layer and does not need OCR. If clicking selects the entire page as an image block, or if you cannot select any text at all, the PDF is image-based and needs OCR. Another test: press Ctrl+F (or Cmd+F) and search for a word visible on the page. If no results are found despite the word being visible, the PDF is scanned.
Can I extract text from only specific pages of a large scanned PDF?
The current browser tool processes the entire PDF. For large documents where you only need text from specific pages, use a PDF page splitter to extract just the pages you need into a smaller PDF file, then run OCR on that smaller file. This is faster and produces a cleaner output focused only on the content you need.
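If you script the splitting step yourself, the first thing you need is a way to turn a page selection such as "1-3, 7" into a list of page numbers. The sketch below covers only that parsing step. The selection syntax and function name are assumptions for illustration, and the actual page extraction is left to whichever PDF splitter you use.

```python
def parse_page_ranges(spec, page_count):
    """Turn a selection like "1-3, 7" into a sorted list of 1-based pages.

    Raises ValueError for pages outside 1..page_count or reversed ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page selection: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_ranges("1-3, 7, 2", page_count=10))
```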
Will the extracted text preserve the original formatting and layout?
The output is plain text — it preserves reading order (words and sentences are in the correct sequence) but does not preserve visual formatting such as bold or italic text, column layouts, table structures, font sizes, or indentation. For a formatted output that preserves document layout, you would need a more advanced OCR tool with layout reconstruction capability, such as Adobe Acrobat's OCR feature or ABBYY FineReader.