PDF OCR: Scanned PDF to Text (Free Online Tool)

Author: Sergio Robles

What is PDF OCR (Scanned PDF to Text)?

PDF OCR reads the text in image-based PDFs (scans, photos of pages, faxed documents, or image-only exports) so that you can copy, search, and edit the words instead of just looking at them. Tesseract.js runs entirely in the browser and supports nine languages, including English, Spanish, German, French, Polish, and Portuguese. Lawyers extract clauses from scanned contracts to quote them in a brief. Researchers unlock quotations from digitised books that libraries publish only as page images. Accountants copy figures from scanned invoices and bank statements into a spreadsheet without retyping them. Immigration applicants turn scanned birth certificates into editable text for a visa application. The PDF never leaves your device: pages are rendered locally as bitmaps and passed straight to the OCR engine. Recognition accuracy depends on scan quality: a 300 DPI page with good contrast reaches 97–99% on Latin alphabets, while blurry phone photos of crumpled paper hover around 85%.

When should I use this tool?

Extract text from scanned legal contracts, affidavits, and court filings so you can copy and paste it into a report.
Digitise printed book pages or archive scans so you can search them, highlight quotations, and reference passages.
Recover invoice numbers, dates, and totals from scanned PDFs that arrived without a text layer.
Unlock old academic papers or official forms distributed as image-based PDFs so you can analyse them.

How do I run OCR on a PDF online?

1. Drop a scanned PDF onto the upload area, or click to browse for the file.
2. Choose the language that matches the document from the dropdown.
3. Click Extract text; the browser loads the OCR engine and recognition starts.
4. Watch the progress bar as each page is rendered and its text is read.
5. Copy the recognised text to the clipboard or download it as a plain-text .txt file.
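
The steps above can be sketched as a small page loop. This is a minimal sketch, not the tool's actual source: `ocrDocument`, `PageImage`, `RecognizePage`, and `OnProgress` are hypothetical names, and the recogniser is injected as a function so the loop can be run without a browser or an OCR engine.

```typescript
// Hypothetical types for illustration; they are not taken from the tool's source.
type PageImage = { pageNumber: number; bitmap: Uint8Array };
type RecognizePage = (page: PageImage, language: string) => Promise<string>;
type OnProgress = (done: number, total: number) => void;

// Recognise pages one at a time, report progress after each page,
// and join the per-page results into one plain-text string.
async function ocrDocument(
  pages: PageImage[],
  language: string,
  recognize: RecognizePage,
  onProgress: OnProgress = () => {},
): Promise<string> {
  const texts: string[] = [];
  for (const page of pages) {
    const text = await recognize(page, language);
    texts.push(text.trim());
    onProgress(texts.length, pages.length);
  }
  // A blank line separates the text of consecutive pages.
  return texts.join("\n\n");
}
```

With Tesseract.js, the injected recogniser would usually wrap a worker obtained from `createWorker` and call its `recognize` method. The exact setup calls differ between Tesseract.js versions, so check the documentation for the version you use.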

Frequently asked questions

Is the PDF uploaded to a server for OCR?

No. The complete OCR pipeline runs inside your browser tab without transmitting a single byte to any remote server. The tool uses two WebAssembly libraries that execute locally: MuPDF, an open-source PDF engine compiled to WebAssembly, rasterises each page of the PDF into a bitmap image entirely within the browser's sandboxed memory. Tesseract.js, a WebAssembly port of the widely used Tesseract OCR engine, then receives those bitmap images and performs character recognition against its trained language model, also in the same browser sandbox. The recognised text is written into the browser's DOM and offered as a plain-text download, all without leaving your device.

You can verify this concretely by opening your browser's DevTools Network panel before dropping the PDF: during the entire OCR job, the only outbound requests you will observe are the one-time downloads of the Tesseract engine (~3 MB), the MuPDF library, and the language training data file for whichever language you selected. After those assets are fetched and cached in the browser's storage, the Network tab shows complete silence for the remainder of the job, including all page rendering, all OCR recognition, and the final text output.

This architecture matters for sensitive document types. Scanned contracts awaiting signature, medical examination reports, passports and identity documents, immigration paperwork, and internal corporate filings all contain information that most organisations and individuals have strong reasons not to transmit to third-party servers. Because every computation happens inside your browser, WikiPlus receives no copy of your document, no copy of the recognised text, and no metadata about the file you processed.

Which languages does the OCR support, and how accurate is it?

The tool ships with nine languages: English, Spanish, German, French, Italian, Dutch, Polish, Portuguese, and Russian. These cover the major Latin-script and Cyrillic-script languages used in business, legal, and academic contexts across Europe and the Americas. Each language uses Tesseract's LSTM-based neural network model trained on large corpora of printed text in that language, which provides substantially better accuracy on modern typefaces than older pattern-matching approaches.

Accuracy depends primarily on scan quality, font clarity, and page layout. A clean 300 DPI black-and-white scan of a standard book page or typed letter in any of the supported Latin-script languages typically achieves 97 to 99 percent character accuracy. Scanning below 200 DPI, using heavily compressed JPEG scans with visible artefacts, capturing documents at an angle with a phone camera, or processing pages with low contrast between ink and paper all reduce accuracy into the 85 to 93 percent range. Handwritten text is not reliably recognised; Tesseract's models are trained on printed typefaces and perform poorly on cursive or informal handwriting regardless of language.

Multi-column layouts are processed column by column, with the tool concatenating columns into a single text stream in reading order. Tables survive OCR as space-aligned plain text but lose their formal row-and-column structure.

If your document is in a language not listed, the English model can give partial results for other languages written in the Latin alphabet, because the unaccented letters are shared, but accented and language-specific characters will often be misread. For Arabic, Chinese, Japanese, Korean, and other unsupported scripts, a dedicated offline tool or commercial OCR service with appropriate training data is necessary.
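
The nine languages map onto Tesseract's standard three-letter traineddata codes. The sketch below shows one way a language dropdown value could be resolved to such a code; the codes are the standard Tesseract ones, but the lookup table and the `resolveLanguageCode` helper are hypothetical, and whether the tool uses a table like this internally is an assumption.

```typescript
// Standard Tesseract traineddata codes for the nine languages listed above.
// The table and helper are illustrative, not the tool's actual source.
const LANGUAGE_CODES = new Map<string, string>([
  ["english", "eng"],
  ["spanish", "spa"],
  ["german", "deu"],
  ["french", "fra"],
  ["italian", "ita"],
  ["dutch", "nld"],
  ["polish", "pol"],
  ["portuguese", "por"],
  ["russian", "rus"],
]);

// Returns the traineddata code, or undefined for an unsupported language.
function resolveLanguageCode(name: string): string | undefined {
  return LANGUAGE_CODES.get(name.trim().toLowerCase());
}
```

Returning `undefined` for an unlisted language lets the caller show an explicit "not supported" message instead of silently falling back to the English model.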

How long does OCR take, and does the page count matter?

The first page of a new session is always the slowest, because the browser has to download the Tesseract engine (~3 MB) and the language training data (10–50 MB depending on the language). That one-time download usually takes 10 to 30 seconds on a home broadband connection. Afterwards the data stays in the browser's IndexedDB, so any later OCR job in the same language starts in under a second. Recognition itself runs at about 2–5 seconds per page on a modern laptop for a standard A4 page. A 10-page PDF finishes in about 30–45 seconds including warm-up. A 100-page scan can take 4 to 8 minutes. Time scales roughly linearly with the number of pages. Mobile devices are 2 to 3 times slower, so long documents are best processed on a laptop. The browser does not freeze while OCR runs; the tool uses a Web Worker so that the main page stays responsive and you can switch tabs during the job.
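
The timing figures above can be combined into a rough estimate. This is a minimal sketch, assuming the quoted ranges (2 to 5 seconds per page, 10 to 30 seconds for the first-run download) and strictly linear scaling; `estimateOcrSeconds`, `SecondsRange`, and the `slowdown` parameter are hypothetical names, and real timings depend on the device and the scan.

```typescript
type SecondsRange = { low: number; high: number };

// Ranges quoted in the text above; treat them as rough guides, not guarantees.
const SECONDS_PER_PAGE: SecondsRange = { low: 2, high: 5 };
const FIRST_RUN_DOWNLOAD_SECONDS: SecondsRange = { low: 10, high: 30 };

// Estimate total OCR time. `slowdown` scales recognition only
// (for example 2 to 3 for a mobile device); the download time is left as-is.
function estimateOcrSeconds(
  pageCount: number,
  firstRun: boolean,
  slowdown = 1,
): SecondsRange {
  if (pageCount < 0 || Math.floor(pageCount) !== pageCount) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  const warmup = firstRun ? FIRST_RUN_DOWNLOAD_SECONDS : { low: 0, high: 0 };
  return {
    low: warmup.low + pageCount * SECONDS_PER_PAGE.low * slowdown,
    high: warmup.high + pageCount * SECONDS_PER_PAGE.high * slowdown,
  };
}
```

For a 10-page file on a first run this gives 30 to 80 seconds, a wider range than the 30 to 45 seconds quoted above, which sits at the fast end. For 100 pages with cached data it gives 200 to 500 seconds, roughly 3 to 8 minutes.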

What should I do if the recognised text comes out garbled or unreadable?

Garbled OCR output almost always traces to one of three root causes, each with a clear remedy.

The first and most common cause is a language mismatch. Tesseract's neural network models are trained on the character shapes and character sequences of specific languages, so running a German document through the English model, for example, produces confident but incorrect substitutions for umlauted characters such as ä, ö, and ü. Check the language selector, choose the language the document is written in, and re-run. The fix takes seconds.

The second cause is poor scan quality. Tesseract achieves its highest accuracy on scans captured at 300 DPI or above in black-and-white or greyscale mode with good contrast. JPEG compression artefacts, skewed page angles above roughly 10 degrees, shadows from book bindings, bleed-through from reverse-side printing, and heavy background textures all degrade character recognition. Re-scanning at 300 DPI with a flatbed scanner rather than a phone camera typically restores accuracy to the 97-plus percent range.

The third cause is that the PDF already contains an embedded digital text layer and does not actually require OCR. Some PDFs appear to be scans, perhaps because they were exported from image editing software, but already have a searchable text layer beneath the visual image. In those cases the OCR engine is doing redundant and less accurate work compared to simply extracting the existing layer. Use the PDF to Text tool first; if it returns readable content, you have a digital text PDF and OCR is unnecessary.

For historical documents with unusual typefaces, heavy ornamentation, or degraded inks, specialist services with historical-text training data such as Transkribus produce better results than a general-purpose OCR engine.
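
The "does this PDF already have a text layer?" check can be approximated in code. This is a minimal sketch, assuming you already hold the string returned by a text-extraction step (for example, the output of the PDF to Text tool); the function name and both thresholds are hypothetical and would need tuning on real documents.

```typescript
// Hypothetical heuristic: decide whether already-extracted text looks like a
// real text layer, in which case OCR can be skipped.
function hasUsableTextLayer(
  extractedText: string,
  minChars = 40,
  minPrintableRatio = 0.9,
): boolean {
  // Ignore whitespace so that blank pages do not count as content.
  const compact = extractedText.replace(/\s+/g, "");
  if (compact.length < minChars) return false;

  let printable = 0;
  for (let i = 0; i < compact.length; i++) {
    const code = compact.charCodeAt(i);
    // Count everything except control characters and U+FFFD,
    // the replacement character that signals undecodable text.
    if (code >= 0x20 && code !== 0x7f && code !== 0xfffd) printable++;
  }
  return printable / compact.length >= minPrintableRatio;
}
```

If the function returns `true`, extracting the existing layer is likely the better path; if it returns `false`, the page is probably image-only and OCR is worth running.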

Created and maintained by Sergio Robles, founder of WikiPlus. 8+ years in digital products; see About WikiPlus for the methodology and privacy model.

Updated on 2026-05-24

The content of this page is available under CC BY 4.0.