What is PDF OCR (scanned PDF to text)?
PDF OCR reads text out of image-only PDFs (scans, photographed pages, faxed documents, or image-based exports) so that you can copy, search, and edit the words instead of only looking at them. Tesseract.js runs entirely in the browser and supports nine languages, including English, Spanish, German, French, Polish, and Portuguese. Lawyers pull clauses from scanned contracts to quote them in a brief. Researchers extract quotations from digitised books that libraries publish only as page images. Accountants transfer figures from scanned invoices and bank statements into a spreadsheet without retyping them. Immigrants convert scanned birth certificates into typeable text for a visa application. The PDF never leaves your device: pages are rendered to bitmaps locally and passed straight to the OCR engine. Recognition accuracy depends on scan quality: a 300 DPI page with even contrast reaches 97–99% on Latin scripts, while blurry phone photos of crumpled sheets land closer to 85%.
When should I use this tool?
- Extract text from scanned contracts, affidavits, and court files so you can copy and paste it into a brief.
- Digitise printed book pages or archive scans so you can search them, mark quotations, and cite passages.
- Recover invoice numbers, dates, and totals from scanned PDFs that have no text layer.
- Open up older academic papers or government forms that are distributed as image-only PDFs for analysis.
How do I run OCR on a PDF online?
1. Drop a scanned PDF into the upload area, or click to choose the file.
2. Select the language of the document from the dropdown.
3. Click Extract text; the browser loads the OCR engine and starts recognition.
4. Watch the progress bar while each page is rendered and its text is read.
5. Copy the recognised text to the clipboard or download it as a plain .txt file.
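The steps above amount to a page-by-page loop that reports progress as it goes. The following is a minimal sketch of that idea, not the tool's actual code: `recognizePage` is a hypothetical stand-in for "render one page to a bitmap and run OCR on it", and `ocrDocument` and `onProgress` are names invented for illustration.

```typescript
// Hypothetical stand-in for rendering one page and running OCR on the bitmap.
type RecognizePage = (pageIndex: number) => Promise<string>;

// Receives a fraction between 0 and 1 after each page, e.g. for a progress bar.
type ProgressCallback = (fraction: number) => void;

// Processes pages one at a time, in order, and joins the per-page text.
async function ocrDocument(
  pageCount: number,
  recognizePage: RecognizePage,
  onProgress: ProgressCallback = () => {},
): Promise<string> {
  const pageTexts: string[] = [];
  for (let i = 0; i < pageCount; i++) {
    // Sequential await: page i + 1 starts only after page i has finished.
    const text = await recognizePage(i);
    pageTexts.push(text.trim());
    onProgress((i + 1) / pageCount);
  }
  // A blank line between pages keeps page boundaries visible in the output.
  return pageTexts.join("\n\n");
}
```

In a real page, `recognizePage` would wrap the rasteriser and the OCR worker; injecting it as a parameter keeps the loop easy to exercise with a fake recogniser.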
Frequently asked questions
Is the PDF uploaded to a server for OCR?
No. The complete OCR pipeline runs inside your browser tab, and the document itself is never sent to a remote server. The tool uses two WebAssembly libraries that execute locally. MuPDF, an open-source PDF engine compiled to WebAssembly, rasterises each page of the PDF into a bitmap entirely within the browser's sandboxed memory. Tesseract.js, a WebAssembly port of the widely used Tesseract OCR engine, then receives those bitmaps and performs character recognition against its trained language model, in the same sandbox. The recognised text is written into the page's DOM and offered as a plain-text download, all without leaving your device.
You can verify this by opening your browser's DevTools Network panel before dropping the PDF. During the entire OCR job, the only requests you will observe are the one-time downloads of the Tesseract engine (~3 MB), the MuPDF library, and the training data file for the language you selected. Once those assets are fetched and cached in the browser's storage, the Network panel stays silent for the rest of the job, including all page rendering, all recognition, and the final text output.
This architecture matters for sensitive documents. Scanned contracts awaiting signature, medical examination reports, passports and identity documents, immigration paperwork, and internal corporate filings all contain information that most organisations and individuals have strong reasons not to send to third-party servers. Because every computation happens inside your browser, WikiPlus receives no copy of your document, no copy of the recognised text, and no metadata about the file you processed.
Which languages does the OCR support, and how accurate is it?
The tool ships with nine languages: English, Spanish, German, French, Italian, Dutch, Polish, Portuguese, and Russian. These cover the major Latin-script and Cyrillic-script languages used in business, legal, and academic contexts across Europe and the Americas. Each language uses Tesseract's LSTM-based neural network model trained on large corpora of printed text in that language, which gives substantially better accuracy on modern typefaces than older pattern-matching approaches.
Accuracy depends primarily on scan quality, font clarity, and page layout. A clean 300 DPI black-and-white scan of a standard book page or typed letter typically achieves 97 to 99 percent character accuracy on Latin scripts. Scan resolution below 200 DPI, heavily compressed JPEG scans with visible artefacts, documents captured at an angle with a phone camera, and pages with low contrast between ink and paper all pull accuracy down into the 85 to 93 percent range. Handwritten text is not reliably recognised; Tesseract's models are trained exclusively on printed typefaces and perform poorly on cursive or informal handwriting regardless of language.
Multi-column layouts are processed column by column, and the tool concatenates the columns into a single text stream in reading order. Tables survive OCR as space-aligned plain text but lose their formal row-and-column structure.
If your document is in a Latin-script language that is not listed, the English model still recognises the letters it shares with English, but accuracy on language-specific characters such as accented letters will be reduced. For Arabic, Chinese, Japanese, Korean, and other unsupported scripts, a dedicated offline tool or commercial OCR service with appropriate training data is necessary.
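To make "character accuracy" concrete, one common way to measure it is to compare the OCR output against a hand-checked reference using Levenshtein edit distance. The sketch below assumes that definition (one minus edit distance divided by reference length); the source does not say how its percentages were measured, and the function names are illustrative.

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
// Compares UTF-16 code units, which is fine for precomposed letters like "ü".
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy = 1 - (edit distance / length of the reference text).
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / reference.length);
}

// One wrong character out of six: accuracy is 5/6, about 0.83.
const mullerAccuracy = characterAccuracy("Müller", "Muller");
```

By this measure, 97 percent accuracy on a page of 2,000 characters still leaves about 60 characters to correct by hand.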
How long does OCR take, and does the page count matter?
Processing time has two distinct phases: a one-time initialisation cost and a per-page recognition cost. The initialisation phase happens only on your first OCR job in a given language. The browser must download the Tesseract WebAssembly engine (approximately 3 MB), the MuPDF rasteriser, and the training data file for your selected language (10 to 50 MB depending on the language; English is around 10 MB, German around 40 MB). On a typical home broadband connection this download takes 10 to 30 seconds. Once downloaded, the browser caches these files in IndexedDB storage, so every subsequent OCR job in the same language begins in under one second with no network request.
Per-page recognition runs at approximately 2 to 5 seconds per A4-sized page on a modern laptop, varying with the density of text, the font complexity, and the CPU resources available to the browser. A 10-page document therefore takes roughly 30 to 60 seconds end-to-end including the first-run warm-up. A 100-page scan runs for approximately 4 to 8 minutes. Page count scales roughly linearly because each page is processed sequentially through the same pipeline. Mobile devices run WebAssembly 2 to 3 times slower than modern laptop CPUs, so a 20-page document that takes one minute on a laptop may take 2 to 3 minutes on a mid-range smartphone.
The browser tab remains responsive during OCR because recognition runs in a Web Worker thread that is isolated from the main UI thread; you can switch tabs, scroll the page, and continue other browser tasks without disrupting the recognition job.
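The timing figures above can be combined into a rough estimator. This is a back-of-the-envelope sketch that simply multiplies and adds the ranges quoted in the text; the names `estimateOcrSeconds`, `Bounds`, and the option flags are invented for illustration, and real timings will vary by device and document.

```typescript
interface Bounds {
  min: number;
  max: number;
}

// Figures quoted in the text above; rough guides, not guarantees.
const SECONDS_PER_PAGE: Bounds = { min: 2, max: 5 }; // modern laptop
const FIRST_RUN_DOWNLOAD_SECONDS: Bounds = { min: 10, max: 30 }; // home broadband
const MOBILE_SLOWDOWN_FACTOR: Bounds = { min: 2, max: 3 }; // relative to a laptop

function estimateOcrSeconds(
  pageCount: number,
  options: { firstRun?: boolean; mobile?: boolean } = {},
): Bounds {
  const factorMin = options.mobile ? MOBILE_SLOWDOWN_FACTOR.min : 1;
  const factorMax = options.mobile ? MOBILE_SLOWDOWN_FACTOR.max : 1;
  const downloadMin = options.firstRun ? FIRST_RUN_DOWNLOAD_SECONDS.min : 0;
  const downloadMax = options.firstRun ? FIRST_RUN_DOWNLOAD_SECONDS.max : 0;
  // Pages are processed sequentially, so recognition time scales linearly.
  return {
    min: downloadMin + pageCount * SECONDS_PER_PAGE.min * factorMin,
    max: downloadMax + pageCount * SECONDS_PER_PAGE.max * factorMax,
  };
}

// 100 pages with assets already cached: { min: 200, max: 500 } seconds,
// i.e. about 3.3 to 8.3 minutes.
const cachedHundredPages = estimateOcrSeconds(100);
```

Because the sketch adds best cases together and worst cases together, its ranges can come out slightly wider than the rounded figures quoted in the text.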
What do I do if the recognised text is garbage or wrong?
Garbled OCR output almost always traces to one of three root causes, each with a clear remedy.
The first and most common cause is a language mismatch. Tesseract's neural network models are trained on the character set and character-sequence patterns of specific languages, so running a German document through the English model, for example, produces confident but incorrect substitutions for umlauted characters such as ä, ö, and ü. Check the language selector, choose the language that matches the document, and re-run. The fix takes seconds.
The second cause is poor scan quality. Tesseract achieves its highest accuracy on scans captured at 300 DPI or above in black-and-white or greyscale mode with good contrast. JPEG compression artefacts, page skew above roughly 10 degrees, shadows from book bindings, bleed-through from reverse-side printing, and heavy background textures all degrade character recognition. Re-scanning at 300 DPI with a flatbed scanner rather than a phone camera typically restores accuracy to the 97-plus percent range.
The third cause is that the PDF already contains an embedded digital text layer and does not actually require OCR. Some PDFs appear to be scans, perhaps because they were exported from image editing software, but already have a searchable text layer beneath the visual image. In those cases the OCR engine is doing redundant and less accurate work compared to simply extracting the existing layer. Use the PDF to Text tool first; if it returns readable content, you have a digital text PDF and OCR is unnecessary.
For historical documents with unusual typefaces, heavy ornamentation, or degraded inks, specialist services with historical-text training data such as Transkribus produce better results than a general-purpose OCR engine.
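The "try plain text extraction first" check for the third cause can be expressed as a simple heuristic. The sketch below assumes you already have the text returned by an extraction step and the page count; the threshold of 50 visible characters per page and the name `needsOcr` are hypothetical choices for illustration, not values used by the tool.

```typescript
// Hypothetical threshold: fewer visible characters per page than this
// suggests the PDF has no usable text layer. Tune it for your documents.
const MIN_VISIBLE_CHARS_PER_PAGE = 50;

// Returns true when the extracted text is too sparse to be a real text layer,
// which suggests the PDF is image-only and OCR is worth running.
function needsOcr(extractedText: string, pageCount: number): boolean {
  if (pageCount <= 0) return false; // nothing to process
  // Ignore whitespace so blank or whitespace-only layers do not count.
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars / pageCount < MIN_VISIBLE_CHARS_PER_PAGE;
}
```

A density check like this only detects an absent or near-empty text layer; it cannot tell whether an existing layer is itself the product of an earlier, poor OCR pass.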
The content of this page is available under CC BY 4.0.