WikiPlus

PDF OCR: Scanned PDF to Text

Convert scanned or image-based PDFs into searchable, copyable text with OCR in your browser. Nine languages, 100% client-side, no file uploads.

Local processing
1.4 s on average
4.8 out of 5, based on 1,247 uses

Author: Sergio Robles, Founder

Your files are processed locally in the browser. We never upload or store your data.

What is PDF OCR: Scanned PDF to Text?

PDF OCR reads text from image-based PDFs (scans, photographed pages, faxes, or image-based exports) so that you can copy, search, and edit the words instead of just looking at them. Tesseract.js runs entirely in the browser and supports nine languages, including English, Spanish, German, French, Polish, and Portuguese. Lawyers pull clauses from scanned contracts to quote them in court filings. Researchers unlock quotations from digitised books that libraries publish only as page images. Accountants copy figures from scanned invoices and bank statements into a spreadsheet without retyping them by hand. Visa applicants turn scanned birth certificates into text they can enter in an application. The PDF never leaves your device: pages are rendered locally to bitmaps and passed straight to the OCR engine. Recognition accuracy depends on scan quality: a 300 DPI page with even contrast reaches 97 to 99% on Latin script, while blurry phone photos of crumpled paper hover closer to 85%.

When should I use this tool?

  • Extract text from scanned legal contracts, statements, and court documents so you can copy it into a filing.
  • Digitise printed book pages or archival scans so you can search them, highlight quotations, and cite passages.
  • Recover invoice numbers, dates, and amounts from scanned PDFs that arrived without a text layer.
  • Unlock older research papers or official forms distributed as image-based PDFs for analysis.

How do I run OCR on a PDF online?

  1. Drop the scanned PDF onto the upload area or click to select a file.
  2. Choose the language that matches the document from the list.
  3. Click Extract text. The browser loads the OCR engine and starts recognition.
  4. Watch the progress bar as each page is rendered and its text is read.
  5. Copy the recognised text to the clipboard or download it as a plain .txt file.

Frequently asked questions

Is the PDF sent to a server for OCR?

No — the complete OCR pipeline runs inside your browser tab without transmitting a single byte to any remote server. The tool uses two WebAssembly libraries that execute locally: MuPDF, an open-source PDF engine compiled to WebAssembly, rasterises each page of the PDF into a bitmap image entirely within the browser's sandboxed memory. Tesseract.js, a WebAssembly port of the widely used Tesseract OCR engine, then receives those bitmap images and performs character recognition against its trained language model, also in the same browser sandbox. The recognised text is written into the browser's DOM and offered as a plain-text download — all without leaving your device. You can verify this concretely by opening your browser's DevTools Network panel before dropping the PDF: during the entire OCR job, the only outbound requests you will observe are the one-time downloads of the Tesseract engine (~3 MB), the MuPDF library, and the language training data file for whichever language you selected. After those assets are fetched and cached in the browser's storage, the Network tab shows complete silence for the remainder of the job, including all page rendering, all OCR recognition, and the final text output. This architecture matters for sensitive document types. Scanned contracts awaiting signature, medical examination reports, passports and identity documents, immigration paperwork, and internal corporate filings all contain information that most organisations and individuals have strong reasons not to transmit to third-party servers. Because every computation happens inside your browser, WikiPlus receives no copy of your document, no copy of the recognised text, and no metadata about the file you processed.
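The DevTools check described above can also be done on an exported network log. The following is a minimal sketch, assuming you saved the Network panel as a HAR file and loaded it into a dictionary; it relies only on the `log.entries[].request` fields `method`, `url`, and `bodySize` from the HAR format. The function name and the example.com URLs are hypothetical placeholders, not the tool's real asset addresses.

```python
from typing import Any

# HTTP methods that normally carry a request body to a server.
UPLOAD_METHODS = {"POST", "PUT", "PATCH"}


def find_possible_uploads(har: dict[str, Any]) -> list[str]:
    """Return 'METHOD url' for every request that could have sent data out."""
    suspicious = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        method = request.get("method", "").upper()
        # HAR uses -1 when the body size is unknown, so only values > 0 count.
        body_size = request.get("bodySize", 0) or 0
        if method in UPLOAD_METHODS or body_size > 0:
            suspicious.append(f"{method} {request.get('url', '')}")
    return suspicious


# Hypothetical log of a first OCR job: only asset downloads, no uploads.
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://example.com/tesseract-core.wasm", "bodySize": 0}},
            {"request": {"method": "GET", "url": "https://example.com/pol.traineddata.gz", "bodySize": 0}},
        ]
    }
}

print(find_possible_uploads(sample_har))  # prints []
```

An empty result is consistent with a client-side pipeline, but this check is deliberately narrow: it does not inspect query strings or WebSocket traffic, so treat it as a quick sanity test rather than proof. A real export can be loaded with `json.load` before being passed to the function.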

Which languages does the OCR support, and how accurate is it?

The tool ships with nine languages: English, Spanish, German, French, Italian, Dutch, Polish, Portuguese, and Russian. These cover the major Latin-script and Cyrillic-script languages used in business, legal, and academic contexts across Europe and the Americas. Each language uses Tesseract's LSTM-based neural network model trained on large corpora of printed text in that language, which provides substantially better accuracy on modern typefaces than older pattern-matching approaches. Accuracy depends primarily on scan quality, font clarity, and page layout. A clean 300 DPI black-and-white scan of a standard book page or typed letter in any of the supported languages typically achieves 97 to 99 percent character accuracy on Latin scripts under good conditions. Reducing scan resolution below 200 DPI, using heavily compressed JPEG scans with visible artefacts, capturing documents at an angle with a phone camera, or processing pages with low contrast between ink and paper all reduce accuracy into the 85 to 93 percent range. Handwritten text is not reliably recognised; Tesseract's models are trained exclusively on printed typefaces and perform poorly on cursive or informal handwriting regardless of language. Multi-column layouts are processed column by column, with the tool concatenating columns into a single text stream in reading order. Tables survive OCR as space-aligned plain text but lose their formal row-and-column structure. If your document is in a language not listed, the English model provides partial coverage for European languages that share cognates and proper nouns with English, though accuracy for language-specific characters will be reduced. For Arabic, Chinese, Japanese, Korean, and other non-supported scripts, a dedicated offline tool or commercial OCR service with appropriate training data is necessary.
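To make the percentages above concrete, here is a minimal sketch of one common way to measure character accuracy: one minus the Levenshtein edit distance divided by the length of a hand-checked reference text. This is an illustration only; the page does not say how its own figures were measured, and the function names and sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, clamped at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# A German line read with the wrong model loses both umlauts:
reference = "Zahlung fällig am 31. März"
ocr_output = "Zahlung fallig am 31. Marz"
print(f"{character_accuracy(reference, ocr_output):.1%}")  # prints 92.3%
```

Two wrong characters out of 26 already drop this line to about 92 percent. Scaled up, a page of roughly 3,000 characters (an assumed figure) contains about 90 errors at 97 percent accuracy and about 450 at 85 percent, which is why the lower range usually needs manual proofreading.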

How long does OCR take, and does the page count matter?

Processing time has two distinct phases: a one-time initialisation cost and a per-page recognition cost. The initialisation phase happens only on your first OCR job in a given language. The browser must download the Tesseract WebAssembly engine (approximately 3 MB), the MuPDF rasteriser, and the language training data file for your selected language (10 to 50 MB depending on the language; English is around 10 MB, German around 40 MB). On a typical home broadband connection this download takes 10 to 30 seconds. Once downloaded, the browser caches these files in IndexedDB storage, so every subsequent OCR job in the same language begins in under one second with no network request. Per-page recognition runs at approximately 2 to 5 seconds per A4-sized page on a modern laptop, varying with text density, font complexity, and the CPU resources available to the browser. A 10-page document therefore takes roughly 30 to 80 seconds end-to-end on a first run: 20 to 50 seconds of recognition plus the 10 to 30 second download. A 100-page scan runs for approximately 4 to 8 minutes. Page count scales roughly linearly because pages are processed sequentially through the same pipeline. Mobile devices run WebAssembly 2 to 3 times slower than modern laptop CPUs, so a 20-page document that takes one minute on a laptop may take 2 to 3 minutes on a mid-range smartphone. The browser tab remains responsive during OCR because recognition runs in a Web Worker thread that is isolated from the main UI thread; you can switch tabs, scroll the page, and continue other browser tasks without disrupting the recognition job.
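The timing model above is simple enough to write down. The sketch below is a rough estimator that uses only the figures quoted in this answer (10 to 30 seconds of first-run download, 2 to 5 seconds per page, a 2 to 3 times slowdown on mobile); the names `Range` and `estimate_seconds` are hypothetical, and real timings will vary with the device and document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float


# Figures taken from the text above; adjust them for your own hardware.
DOWNLOAD_SECONDS = Range(10, 30)   # one-time download, first job per language
SECONDS_PER_PAGE = Range(2, 5)     # A4 page on a modern laptop
MOBILE_SLOWDOWN = Range(2, 3)      # mid-range phone versus laptop


def estimate_seconds(pages: int, first_run: bool = True, mobile: bool = False) -> Range:
    """Estimate total OCR time as a (low, high) range in seconds."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    # Pages are processed one after another, so the cost is linear in page count.
    low = pages * SECONDS_PER_PAGE.low
    high = pages * SECONDS_PER_PAGE.high
    if mobile:
        low *= MOBILE_SLOWDOWN.low
        high *= MOBILE_SLOWDOWN.high
    if first_run:
        low += DOWNLOAD_SECONDS.low
        high += DOWNLOAD_SECONDS.high
    return Range(low, high)


print(estimate_seconds(10))  # prints Range(low=30, high=80)
```

The ranges come straight from the quoted bounds, so they can be somewhat wider than the rounded examples in the text. For a cached 100-page job, `estimate_seconds(100, first_run=False)` gives 200 to 500 seconds, which matches the 4 to 8 minute figure once rounded.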

What should I do if the recognised text is garbled or wrong?

Garbled OCR output almost always traces to one of three root causes, each with a clear remedy. The first and most common cause is a language mismatch: each Tesseract model is trained on the character set and letter-frequency patterns of a specific language, so running a German document through the English model, for example, produces confident but incorrect substitutions for umlauted characters such as ä, ö, and ü. Check the language selector, choose the language that matches the document, and re-run. The fix takes seconds. The second cause is poor scan quality. Tesseract achieves its highest accuracy on scans captured at 300 DPI or above in black-and-white or greyscale mode with good contrast. JPEG compression artefacts, page skew above roughly 10 degrees, shadows from book bindings, bleed-through from reverse-side printing, and heavy background textures all degrade character recognition. Re-scanning at 300 DPI with a flatbed scanner rather than a phone camera typically restores accuracy to the 97-plus percent range. The third cause is that the PDF already contains an embedded digital text layer and does not actually require OCR. Some PDFs appear to be scans, perhaps because they were exported from image editing software, but already have a searchable text layer beneath the visual image. In those cases the OCR engine does redundant and less accurate work compared with simply extracting the existing layer. Use the PDF to Text tool first; if it returns readable content, the PDF already has a digital text layer and OCR is unnecessary. For historical documents with unusual typefaces, heavy ornamentation, or degraded inks, specialist services with historical-text training data such as Transkribus produce better results than a general-purpose OCR engine.
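The three causes above can be read as a short checklist. The sketch below encodes it as a plain function, using the 300 DPI and 10 degree thresholds from this answer; `ScanReport` and `suggest_fixes` are hypothetical names, and the inputs are things you would determine yourself (the tool is not known to expose them). It checks for a text layer first because that makes OCR unnecessary altogether.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanReport:
    has_text_layer: bool         # did PDF to Text return readable content?
    document_language: str       # language the document is written in
    selected_language: str       # language chosen in the OCR language list
    dpi: Optional[int] = None    # scan resolution, if known
    skew_degrees: float = 0.0    # page rotation away from upright


def suggest_fixes(report: ScanReport) -> list[str]:
    """Return remedies for garbled output, following the checklist above."""
    if report.has_text_layer:
        return ["Skip OCR: extract the existing text layer with PDF to Text."]
    fixes = []
    if report.selected_language != report.document_language:
        fixes.append(f"Select '{report.document_language}' in the language list and re-run.")
    if report.dpi is not None and report.dpi < 300:
        fixes.append("Re-scan at 300 DPI or above, ideally on a flatbed scanner.")
    if abs(report.skew_degrees) > 10:
        fixes.append("Straighten the page: skew above roughly 10 degrees hurts recognition.")
    return fixes


report = ScanReport(has_text_layer=False, document_language="German",
                    selected_language="English", dpi=150)
for fix in suggest_fixes(report):
    print(fix)
```

For this example the function suggests switching the language to German and re-scanning at a higher resolution. An empty list means none of the three common causes applies, which points towards the harder cases such as handwriting or historical typefaces.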

The content of this page is available under the CC BY 4.0 license.