What is PDF OCR — Scanned PDF to Text?
PDF OCR reads text out of image-only PDFs (scans, photographed pages, faxed documents, or image-based exports) so you can copy, search and edit the words instead of only looking at them. The tool uses Tesseract.js, which runs entirely in the browser, and ships with nine languages, including English, Spanish, German, French, Polish and Portuguese.

Lawyers pull clauses out of scanned contracts to quote them in a brief. Researchers unlock quotations from digitised books that libraries only publish as page images. Accountants copy numbers off scanned invoices and bank statements into a spreadsheet without retyping them. Visa applicants convert scanned birth certificates into editable text for an application.

The PDF never leaves your device: pages are rendered locally to bitmaps and fed straight into the OCR engine. Recognition accuracy depends on scan quality: a clean, high-contrast 300 DPI page reaches 97–99% character accuracy for Latin scripts, while blurry phone photos of crumpled paper sit closer to 85%.
When should I use this tool?
- Extract text from scanned legal contracts, affidavits and court filings for copy-paste into a brief.
- Digitise printed book pages or archive scans so you can search them, highlight quotations and cite passages.
- Recover invoice numbers, dates and totals from scanned PDFs that came in without a text layer.
- Unlock older academic papers or government forms distributed as image-only PDFs for analysis.
How do I run OCR on a PDF online?
- 1. Drop a scanned PDF into the upload area or click to browse for the file.
- 2. Pick the language that matches the document from the language dropdown.
- 3. Click Extract text. The browser loads the OCR engine and begins recognition.
- 4. Watch the progress bar as each page is rendered and the text is read.
- 5. Copy the recognised text to the clipboard or download it as a plain .txt file.
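The steps above amount to a sequential render-then-recognise loop. The following is a minimal sketch of that loop, not the tool's actual code: `ocrDocument`, `Bitmap`, `RenderPage`, `Recognise` and `OnProgress` are hypothetical names, and the rasteriser and OCR engine are passed in as plain functions so the sketch stays independent of any particular library API.

```typescript
// Hypothetical types: the tool's real internals are not shown on this page.
type Bitmap = { width: number; height: number };
type RenderPage = (pageIndex: number) => Promise<Bitmap>;
type Recognise = (page: Bitmap, language: string) => Promise<string>;
type OnProgress = (done: number, total: number) => void;

// Render and recognise pages one at a time, reporting progress after each page.
async function ocrDocument(
  pageCount: number,
  language: string,
  renderPage: RenderPage,
  recognise: Recognise,
  onProgress: OnProgress = () => {},
): Promise<string> {
  const pageTexts: string[] = [];
  for (let i = 0; i < pageCount; i++) {
    const bitmap = await renderPage(i); // step 4: the page is rendered
    pageTexts.push(await recognise(bitmap, language)); // step 4: the text is read
    onProgress(i + 1, pageCount); // drives the progress bar
  }
  // Step 5: one plain-text result, with a blank line between pages.
  return pageTexts.join("\n\n");
}
```

In a real page, `renderPage` would wrap the PDF rasteriser and `recognise` would wrap the OCR worker. Processing pages strictly in order keeps memory use bounded to one bitmap at a time, at the cost of no parallelism.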
Frequently asked questions
Does the PDF get uploaded to a server for OCR?
No. The complete OCR pipeline runs inside your browser tab, and no part of your document is sent to a remote server. The tool uses two WebAssembly libraries that execute locally. MuPDF, an open-source PDF engine compiled to WebAssembly, rasterises each page of the PDF into a bitmap image entirely within the browser's sandboxed memory. Tesseract.js, a WebAssembly port of the widely used Tesseract OCR engine, then receives those bitmaps and performs character recognition against its trained language model, also inside the browser sandbox. The recognised text is written into the page's DOM and offered as a plain-text download, all without leaving your device.

You can verify this by opening your browser's DevTools Network panel before dropping the PDF. During the entire OCR job, the only requests you should observe are the one-time downloads of the Tesseract engine (~3 MB), the MuPDF library, and the training data file for the language you selected. Once those assets are fetched and cached, the Network panel stays silent for the remainder of the job, including all page rendering, all recognition, and the final text output.

This architecture matters for sensitive documents. Scanned contracts awaiting signature, medical examination reports, passports and identity documents, immigration paperwork, and internal corporate filings all contain information that most organisations and individuals have strong reasons not to send to third-party servers. Because every computation happens inside your browser, WikiPlus receives no copy of your document, no copy of the recognised text, and no metadata about the file you processed.
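If you would rather check the Network panel programmatically than by eye, you can export the recorded session as a HAR file and scan it for anything that looks like an upload. The sketch below assumes the standard HAR 1.2 layout (`log.entries[].request` with `method`, `url` and `bodySize`); `findUploads` is a hypothetical helper, and the rule it applies (flag any non-GET request or any request with a body) is a simple heuristic, not a complete audit.

```typescript
// Minimal subset of the HAR 1.2 structure used by this check.
interface HarEntry {
  request: { method: string; url: string; bodySize: number };
}
interface Har {
  log: { entries: HarEntry[] };
}

// Return a description of every request that could carry document data:
// any method other than GET, or any request with a non-empty body.
// HAR uses bodySize -1 for "unknown", which this heuristic does not flag.
function findUploads(har: Har): string[] {
  return har.log.entries
    .filter(({ request }) => request.method.toUpperCase() !== "GET" || request.bodySize > 0)
    .map(({ request }) => `${request.method} ${request.url}`);
}
```

An empty result for a session that covers the whole OCR job is consistent with the behaviour described above: asset downloads appear as GET requests without a body, and nothing else is sent.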
Which languages does the OCR support and how accurate is it?
The tool ships with nine languages: English, Spanish, German, French, Italian, Dutch, Polish, Portuguese, and Russian. These cover the major Latin-script and Cyrillic-script languages used in business, legal, and academic contexts across Europe and the Americas. Each language uses Tesseract's LSTM-based neural network model trained on large corpora of printed text in that language, which gives substantially better accuracy on modern typefaces than older pattern-matching approaches.

Accuracy depends primarily on scan quality, font clarity, and page layout. A clean 300 DPI black-and-white scan of a standard book page or typed letter typically achieves 97 to 99 percent character accuracy on Latin scripts. Scanning below 200 DPI, using heavily compressed JPEG scans with visible artefacts, photographing documents at an angle with a phone camera, or processing pages with low contrast between ink and paper all pull accuracy down into the 85 to 93 percent range. Handwritten text is not reliably recognised: the models are trained on printed typefaces and perform poorly on cursive or informal handwriting regardless of language.

Multi-column layouts are processed column by column, and the tool concatenates the columns into a single text stream in reading order. Tables survive OCR as space-aligned plain text but lose their formal row-and-column structure.

If your document is in a language not listed, the English model can still read the unaccented Latin letters that many European languages share with English, though accented and language-specific characters will often be misrecognised. For Arabic, Chinese, Japanese, Korean, and other unsupported scripts, a dedicated offline tool or a commercial OCR service with appropriate training data is necessary.
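To put a number on accuracy for your own scans, you can compare the OCR output against a hand-typed reference for one page. The sketch below assumes "character accuracy" means one minus the character error rate, computed with Levenshtein edit distance; the page does not say how its percentages were measured, so treat this as one common definition. `editDistance` and `characterAccuracy` are illustrative names.

```typescript
// Levenshtein distance between two strings, compared by UTF-16 code unit
// (sufficient for the Latin and Cyrillic scripts discussed here).
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Accuracy as 1 - (edits / reference length), clamped to the range [0, 1].
function characterAccuracy(reference: string, recognised: string): number {
  if (reference.length === 0) return recognised.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(reference, recognised) / reference.length);
}
```

For example, a German word with two characters misread out of five, as happens when the wrong language model replaces ö and ß, scores 0.6 under this definition.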
How long does OCR take and does page count matter?
Processing time has two distinct phases: a one-time initialisation cost and a per-page recognition cost.

Initialisation happens only on your first OCR job in a given language. The browser must download the Tesseract WebAssembly engine (approximately 3 MB), the MuPDF rasteriser, and the training data file for your selected language (10 to 50 MB depending on the language: English is around 10 MB, German around 40 MB). On a typical home broadband connection this download takes 10 to 30 seconds. Once downloaded, the browser caches these files in IndexedDB storage, so every subsequent OCR job in the same language starts in under one second with no network request.

Per-page recognition runs at approximately 2 to 5 seconds per A4-sized page on a modern laptop, varying with text density, font complexity, and the CPU resources available to the browser. A 10-page document therefore needs 20 to 50 seconds of recognition, or roughly 30 to 80 seconds end-to-end including the first-run download. A 100-page scan runs for roughly 3 to 8 minutes. Processing time scales roughly linearly with page count because pages are processed sequentially through the same pipeline. Mobile devices run WebAssembly 2 to 3 times slower than modern laptop CPUs, so a 20-page document that takes one minute on a laptop may take 2 to 3 minutes on a mid-range smartphone.

The browser tab remains responsive during OCR because recognition runs in a Web Worker, isolated from the main UI thread. You can switch tabs, scroll the page, and continue other browser tasks without disrupting the job.
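The arithmetic behind these estimates is simple enough to sketch. The constants below are the figures quoted in this answer (2 to 5 seconds per page, a 10 to 30 second first-run download, a 2 to 3 times mobile slowdown); `estimateOcrSeconds` is a hypothetical helper, and it assumes the mobile slowdown applies to recognition only, not to the download.

```typescript
interface Range {
  min: number;
  max: number;
}

// Figures quoted in the text; real timings vary by device and document.
const SECONDS_PER_PAGE: Range = { min: 2, max: 5 };
const FIRST_RUN_DOWNLOAD_SECONDS: Range = { min: 10, max: 30 };
const MOBILE_SLOWDOWN: Range = { min: 2, max: 3 };

// Estimate total wall-clock seconds as a best-case / worst-case range.
function estimateOcrSeconds(
  pages: number,
  opts: { firstRun?: boolean; mobile?: boolean } = {},
): Range {
  const slowdown: Range = opts.mobile ? MOBILE_SLOWDOWN : { min: 1, max: 1 };
  const warmUp: Range = opts.firstRun ? FIRST_RUN_DOWNLOAD_SECONDS : { min: 0, max: 0 };
  return {
    min: warmUp.min + pages * SECONDS_PER_PAGE.min * slowdown.min,
    max: warmUp.max + pages * SECONDS_PER_PAGE.max * slowdown.max,
  };
}
```

Calling `estimateOcrSeconds(100)` gives the recognition-only range for a 100-page scan with a warm cache; passing `{ firstRun: true }` adds the one-time download on top.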
What do I do if the recognised text is garbled or wrong?
Garbled OCR output almost always traces to one of three root causes, each with a clear remedy.

The first and most common cause is a language mismatch. Tesseract's neural network models are trained on the character set and character-sequence patterns of specific languages, so running a German document through the English model, for example, produces confident but incorrect substitutions for umlauted characters such as ä, ö, and ü. Check the language selector, choose the language that matches the document, and re-run. The fix takes seconds.

The second cause is poor scan quality. Tesseract achieves its highest accuracy on scans captured at 300 DPI or above in black-and-white or greyscale mode with good contrast. JPEG compression artefacts, page skew above roughly 10 degrees, shadows from book bindings, bleed-through from reverse-side printing, and heavy background textures all degrade character recognition. Re-scanning at 300 DPI with a flatbed scanner rather than a phone camera typically restores accuracy to the 97-plus percent range.

The third cause is that the PDF already contains an embedded digital text layer and does not actually need OCR. Some PDFs appear to be scans, perhaps because they were exported from image editing software, but already have a searchable text layer beneath the visual image. In those cases the OCR engine does redundant and less accurate work compared with simply extracting the existing layer. Use the PDF to Text tool first; if it returns readable content, the PDF has a digital text layer and OCR is unnecessary.

For historical documents with unusual typefaces, heavy ornamentation, or degraded inks, specialist services with historical-text training data, such as Transkribus, produce better results than a general-purpose OCR engine.
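A quick way to check whether a scan meets the 300 DPI guideline is to compare the pixel width of the scanned image with the physical width of the PDF page. The sketch below relies on the PDF convention that one point is 1/72 inch, and it assumes the scanned image spans the full page width; `effectiveDpi` and `meetsRecommendedDpi` are illustrative names, not part of the tool.

```typescript
const POINTS_PER_INCH = 72; // PDF default user space: 1 pt = 1/72 inch

// Effective resolution of a full-width page image, in dots per inch.
function effectiveDpi(imagePixelWidth: number, pageWidthPt: number): number {
  if (imagePixelWidth <= 0 || pageWidthPt <= 0) {
    throw new RangeError("widths must be positive");
  }
  return imagePixelWidth / (pageWidthPt / POINTS_PER_INCH);
}

// True when the scan reaches the minimum DPI after rounding to a whole number.
function meetsRecommendedDpi(
  imagePixelWidth: number,
  pageWidthPt: number,
  minimumDpi: number = 300,
): boolean {
  return Math.round(effectiveDpi(imagePixelWidth, pageWidthPt)) >= minimumDpi;
}
```

A US Letter page is 612 pt (8.5 inches) wide, so a 2550-pixel-wide scan works out to 300 DPI, while a 1275-pixel-wide scan of the same page is only 150 DPI and is a likely candidate for re-scanning.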
Content on this page is available under CC BY 4.0.