WikiPlus

PDF OCR: Scanned PDF to Text

Turn scanned or image-only PDFs into searchable, copyable text with in-browser OCR. Nine languages, 100% client-side, no upload.

Local processing
1.4 s on average
4.8 out of 5, based on 1,247 uses

By Sergio Robles, Founder

Your files are processed locally in your browser. We never upload or store your data.

What is PDF OCR (scanned PDF to text)?

PDF OCR reads the text out of image-only PDFs (scans, photographed pages, faxed documents, or image-based exports) so that you can copy, search, and edit the words instead of just looking at them. Tesseract.js runs entirely in the browser and recognises nine languages, including English, Spanish, German, French, Polish, and Portuguese. Lawyers extract clauses from scanned contracts to quote them in a filing. Researchers unlock quotations from digitised books that libraries publish only as page images. Accountants copy figures from scanned invoices and bank statements into a spreadsheet without retyping them. Immigration applicants convert scanned birth certificates into editable text for a visa application. The PDF never leaves your device: pages are rendered locally into bitmaps and handed straight to the OCR engine. Recognition accuracy depends on scan quality: a 300 DPI page with balanced contrast reaches 97 to 99% on Latin script, while blurry photographs of crumpled paper land closer to 85%.
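To make those accuracy percentages concrete, the sketch below converts a character accuracy into an expected number of misread characters per page. The function name and the 3,000-character page size are illustrative assumptions, not figures from the tool.

```typescript
// Expected number of misread characters on a page, given a character accuracy.
// `accuracy` is a fraction in [0, 1], e.g. 0.97 for 97%.
function expectedCharErrors(charCount: number, accuracy: number): number {
  if (charCount < 0 || accuracy < 0 || accuracy > 1) {
    throw new RangeError("charCount must be >= 0 and accuracy within [0, 1]");
  }
  return Math.round(charCount * (1 - accuracy));
}

// Assumed page size for illustration: about 3,000 characters of text.
const PAGE_CHARS = 3000;
const cleanScanWorst = expectedCharErrors(PAGE_CHARS, 0.97); // clean scan, low end
const cleanScanBest = expectedCharErrors(PAGE_CHARS, 0.99);  // clean scan, high end
const blurryPhoto = expectedCharErrors(PAGE_CHARS, 0.85);    // blurry phone photo
```

On a page of that assumed size, a clean scan leaves a few dozen characters to proofread, while a blurry photograph leaves several hundred.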

When should I use this tool?

  • Extract text from scanned contracts, sworn statements, and pleadings to paste into a filing.
  • Digitise printed book pages or archival scans so you can search them, highlight quotations, and cite passages.
  • Recover invoice numbers, dates, and totals from scanned PDFs that arrived without a text layer.
  • Unlock older academic papers or official forms distributed as image-only PDFs for analysis.

How do I OCR a PDF online?

  1. Drop a scanned PDF onto the drop area, or click to browse for the file.
  2. Choose the language that matches the document from the language drop-down menu.
  3. Click Extract text: the browser loads the OCR engine and starts recognition.
  4. Watch the progress bar as each page is processed and its text is read.
  5. Copy the recognised text to the clipboard or download it as a plain .txt file.
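The last two steps (progress reporting and assembling the final text) can be sketched as two small helpers. This is an illustration only: the names `progressAfterPage` and `assembleText`, and the choice to separate pages with a blank line, are assumptions and not the tool's actual code.

```typescript
// Hypothetical progress record reported after each page finishes.
interface OcrProgress {
  page: number;    // pages completed so far
  total: number;   // pages in the document
  percent: number; // rounded completion percentage
}

function progressAfterPage(page: number, total: number): OcrProgress {
  if (total <= 0 || page < 0 || page > total) {
    throw new RangeError("expected 0 <= page <= total and total > 0");
  }
  return { page, total, percent: Math.round((page / total) * 100) };
}

// Join per-page OCR results into one plain-text document.
// Pages that produced no text are skipped; the rest are separated by a blank line.
function assembleText(pageTexts: string[]): string {
  return pageTexts
    .map((text) => text.trim())
    .filter((text) => text.length > 0)
    .join("\n\n");
}
```

The string returned by `assembleText` is what would be copied to the clipboard or saved as the .txt file.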

Frequently asked questions

Is the PDF sent to a server for OCR?

No. The complete OCR pipeline runs inside your browser tab, and no part of your document is sent to a remote server. The tool uses two WebAssembly libraries that execute locally. MuPDF, an open-source PDF engine compiled to WebAssembly, rasterises each page of the PDF into a bitmap image entirely within the browser's sandboxed memory. Tesseract.js, a WebAssembly port of the widely used Tesseract OCR engine, then receives those bitmap images and performs character recognition against its trained language model, also in the same browser sandbox. The recognised text is written into the browser's DOM and offered as a plain-text download, all without leaving your device.

You can verify this by opening your browser's DevTools Network panel before dropping the PDF. During the entire OCR job, the only requests you will observe are the one-time downloads of the Tesseract engine (~3 MB), the MuPDF library, and the language training data file for whichever language you selected. After those assets are fetched and cached in the browser's storage, the Network tab stays silent for the remainder of the job, including all page rendering, all OCR recognition, and the final text output.

This architecture matters for sensitive document types. Scanned contracts awaiting signature, medical examination reports, passports and identity documents, immigration paperwork, and internal corporate filings all contain information that most organisations and individuals have strong reasons not to transmit to third-party servers. Because every computation happens inside your browser, WikiPlus receives no copy of your document, no copy of the recognised text, and no metadata about the file you processed.

Which languages does the OCR support, and how accurate is it?

The tool ships with nine languages: English, Spanish, German, French, Italian, Dutch, Polish, Portuguese, and Russian. These cover the major Latin-script languages used in business, legal, and academic contexts across Europe and the Americas, plus Cyrillic script through Russian. Each language uses Tesseract's LSTM-based neural network model trained on large corpora of printed text in that language, which provides substantially better accuracy on modern typefaces than older pattern-matching approaches.

Accuracy depends primarily on scan quality, font clarity, and page layout. A clean 300 DPI black-and-white scan of a standard book page or typed letter typically achieves 97 to 99 percent character accuracy on Latin scripts. Reducing scan resolution below 200 DPI, using heavily compressed JPEG scans with visible artefacts, capturing documents at an angle with a phone camera, or processing pages with low contrast between ink and paper all reduce accuracy into the 85 to 93 percent range. Handwritten text is not reliably recognised; Tesseract's models are trained exclusively on printed typefaces and perform poorly on cursive or informal handwriting regardless of language.

Multi-column layouts are processed column by column, with the tool concatenating columns into a single text stream in reading order. Tables survive OCR as space-aligned plain text but lose their formal row-and-column structure.

If your document is in a language not listed, the English model provides partial coverage for European languages that share cognates and proper nouns with English, though accuracy for language-specific characters will be reduced. For Arabic, Chinese, Japanese, Korean, and other unsupported scripts, a dedicated offline tool or commercial OCR service with appropriate training data is necessary.
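As an illustration of how the nine-language list could map onto Tesseract models, the sketch below pairs each language with its standard Tesseract traineddata code. The codes themselves are Tesseract's usual three-letter identifiers; whether the tool uses exactly this lookup internally is an assumption, and `modelCodeFor` is a hypothetical helper.

```typescript
// Standard Tesseract traineddata codes for the nine listed languages.
// Assumption: the tool's language menu maps onto these identifiers.
const TESSERACT_CODES = {
  English: "eng",
  Spanish: "spa",
  German: "deu",
  French: "fra",
  Italian: "ita",
  Dutch: "nld",
  Polish: "pol",
  Portuguese: "por",
  Russian: "rus",
} as const;

type SupportedLanguage = keyof typeof TESSERACT_CODES;

// Type guard: true only for names that are own keys of the table.
function isSupportedLanguage(name: string): name is SupportedLanguage {
  return Object.prototype.hasOwnProperty.call(TESSERACT_CODES, name);
}

// Returns the model code, or undefined for languages the tool does not ship.
function modelCodeFor(name: string): string | undefined {
  return isSupportedLanguage(name) ? TESSERACT_CODES[name] : undefined;
}
```

Returning `undefined` for an unlisted language, rather than silently falling back to English, leaves the caller to decide whether partial coverage is acceptable.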

How long does OCR take, and does the page count matter?

Processing time has two distinct phases: a one-time initialisation cost and a per-page recognition cost.

The initialisation phase happens only on your first OCR job in a given language. The browser must download the Tesseract WebAssembly engine (approximately 3 MB), the MuPDF rasteriser, and the language training data file for your selected language (10 to 50 MB depending on the language; English is around 10 MB, German around 40 MB). On a typical home broadband connection this download takes 10 to 30 seconds. Once downloaded, the browser caches these files in IndexedDB storage, so every subsequent OCR job in the same language begins in under one second with no network request.

Per-page recognition runs at approximately 2 to 5 seconds per A4-sized page on a modern laptop, varying with the density of text, the font complexity, and the CPU resources available to the browser. A 10-page document therefore takes roughly 30 to 60 seconds end-to-end including the first-run warm-up. A 100-page scan runs for approximately 4 to 8 minutes. Total time scales roughly linearly with page count because each page is processed sequentially through the same pipeline. Mobile devices run WebAssembly 2 to 3 times slower than modern laptop CPUs, so a 20-page document that takes one minute on a laptop may take 2 to 3 minutes on a mid-range smartphone.

The browser tab remains responsive during OCR because recognition runs in a Web Worker thread that is isolated from the main UI thread; you can switch tabs, scroll the page, and continue other browser tasks without disrupting the recognition job.
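The timing figures above can be combined into a rough estimator. This is a minimal sketch that only multiplies out the quoted ranges (2 to 5 seconds per page, a 10 to 30 second first-run download, a 2 to 3 times mobile slowdown); `estimateOcrTime` and its option names are hypothetical.

```typescript
interface TimeRange {
  minSeconds: number;
  maxSeconds: number;
}

interface EstimateOptions {
  firstRun?: boolean; // include the one-time 10 to 30 s asset download
  mobile?: boolean;   // apply the 2 to 3 times mobile slowdown to recognition
}

function estimateOcrTime(pages: number, opts: EstimateOptions = {}): TimeRange {
  if (!Number.isInteger(pages) || pages < 0) {
    throw new RangeError("pages must be a non-negative integer");
  }
  // Best case uses the fast end of every range, worst case the slow end.
  const slowdownMin = opts.mobile ? 2 : 1;
  const slowdownMax = opts.mobile ? 3 : 1;
  const downloadMin = opts.firstRun ? 10 : 0;
  const downloadMax = opts.firstRun ? 30 : 0;
  return {
    minSeconds: pages * 2 * slowdownMin + downloadMin,
    maxSeconds: pages * 5 * slowdownMax + downloadMax,
  };
}
```

For example, `estimateOcrTime(10, { firstRun: true })` yields 30 to 80 seconds and `estimateOcrTime(100)` yields 200 to 500 seconds. Because the sketch stacks best cases and worst cases, its ranges are somewhat wider than the rounded figures quoted above.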

What should I do if the recognised text is garbled or wrong?

Garbled OCR output almost always traces to one of three root causes, each with a clear remedy.

The first and most common cause is a language mismatch. Tesseract's neural network models are optimised for the character and word frequency patterns of specific languages, and running a German document through the English model, for example, produces confident but incorrect substitutions for umlauted characters such as ä, ö, and ü. Check the language selector, choose the language that matches the document's script and language, and re-run. The fix takes seconds.

The second cause is poor scan quality. Tesseract achieves its highest accuracy on scans captured at 300 DPI or above in black-and-white or greyscale mode with good contrast. JPEG compression artefacts, skewed page angles above roughly 10 degrees, shadows from book bindings, bleed-through from reverse-side printing, and heavy background textures all degrade character recognition. Re-scanning at 300 DPI with a flatbed scanner rather than a phone camera typically restores accuracy to the 97-plus percent range.

The third cause is that the PDF already contains an embedded digital text layer and does not actually require OCR. Some PDFs appear to be scans, perhaps because they were exported from image editing software, but already have a searchable text layer beneath the visual image. In those cases the OCR engine is doing redundant and less accurate work compared to simply extracting the existing layer. Use the PDF to Text tool first; if it returns readable content, you have a digital text PDF and OCR is unnecessary.

For historical documents with unusual typefaces, heavy ornamentation, or degraded inks, specialist services with historical-text training data such as Transkribus produce better results than a general-purpose OCR engine.
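The three causes above can be read as a short decision procedure. The sketch below is one possible encoding, not the tool's logic: the `ScanInfo` fields, the `Remedy` labels, and `suggestRemedy` are all hypothetical, and the 300 DPI and 10 degree thresholds are taken from the text.

```typescript
// Hypothetical description of a problem document.
interface ScanInfo {
  hasTextLayer: boolean;    // PDF already contains searchable text
  documentLanguage: string; // language the document is written in
  selectedLanguage: string; // language chosen in the selector
  dpi: number;              // scan resolution
  skewDegrees: number;      // page rotation relative to upright
}

type Remedy = "use-pdf-to-text" | "switch-language" | "rescan" | "none";

function suggestRemedy(scan: ScanInfo): Remedy {
  // Checked first because an existing text layer makes OCR unnecessary.
  if (scan.hasTextLayer) return "use-pdf-to-text";
  // Language mismatch is the most common cause and the quickest to fix.
  if (scan.documentLanguage !== scan.selectedLanguage) return "switch-language";
  // Below 300 DPI or skewed beyond roughly 10 degrees: re-scan.
  if (scan.dpi < 300 || Math.abs(scan.skewDegrees) > 10) return "rescan";
  return "none";
}
```

The text-layer check comes first here as a design choice, since it avoids running OCR at all; the other two checks follow the order in which the causes are listed.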

The content of this page is available under CC BY 4.0.