What is the Video/Audio Transcriber?
Video Transcriptor turns the speech in any video or audio file into clean, searchable text. A speech recognition model runs entirely in your browser: no uploads, no sign-up, and no monthly quota. Drop in a file, wait for the local model to finish, then copy or download the result. The tool is built for content creators, journalists, researchers, marketers, students, and support teams, and for anyone else who needs speech converted into editable text quickly. No sensitive recording is sent to an external service. It handles short two-minute clips as well as hour-long interviews. The model runs locally in WebAssembly. Choose anything from tiny for a quick draft up to large-v3 for near-human accuracy. Supported languages include English, Spanish, German, French, Polish, Portuguese, and more than 90 others.
When should I use this tool?
- Turning an interview recorded on Zoom into a searchable text transcript
- Generating subtitles for a short YouTube tutorial you produced
- Indexing webinar recordings by topic so your team can find them easily
- Extracting direct quotes from a recorded keynote talk
How do I transcribe a video file to text?
1. Drop your MP4 or MOV video file into the upload area.
2. Select the spoken language or leave it on automatic detection.
3. Click Transcribe to start local speech recognition.
4. Wait while the AI model processes the audio in your browser.
5. Copy the transcript or download it as TXT or as SRT subtitles.
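The last step offers both a plain-text and an SRT download. As a rough illustration of how timed segments map onto those two formats, here is a minimal sketch. The `Segment` shape and the function names are hypothetical; the tool's actual internal data format is not documented on this page.

```typescript
// Hypothetical segment shape: one recognized phrase with its time span.
interface Segment {
  startSec: number;
  endSec: number;
  text: string;
}

// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const pad = (n: number, width: number): string => {
    let out = String(n);
    while (out.length < width) out = "0" + out;
    return out;
  };
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTimestamp(seg.startSec)} --> ${toSrtTimestamp(seg.endSec)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}

// Build the plain-text (TXT) version: segment texts joined by spaces.
function toPlainText(segments: Segment[]): string {
  return segments.map((seg) => seg.text.trim()).join(" ");
}
```

The SRT output keeps timing so a video player can display each cue at the right moment, while the TXT output drops timing and is easier to edit or quote from.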
Frequently asked questions
Which AI model does the transcription?
The transcription is powered by OpenAI's Whisper model, specifically a quantized version compiled to WebAssembly that runs directly inside your browser. Whisper is an open-source automatic speech recognition (ASR) system trained by OpenAI on 680,000 hours of multilingual and multitask supervised data, making it one of the most accurate open ASR models publicly available. The WebAssembly build used here is derived from the whisper.cpp project by Georgi Gerganov, which ports the original PyTorch model to C/C++ and compiles it to WASM so it can execute in any modern browser without server infrastructure. The model variant served is typically the whisper-tiny or whisper-base quantized checkpoint, selected to balance transcription accuracy with download size and inference speed within browser memory limits. Larger variants such as medium or large require several gigabytes of RAM and are often impractical for in-browser use. The model weights are fetched once from a CDN and then cached in the browser's Cache Storage (accessed from a Service Worker or directly from the page), so subsequent transcriptions do not re-download the model. Inference uses WebAssembly SIMD instructions where available for a significant speedup. Because the model runs locally, no audio or video data is sent to OpenAI or any external service. As a practical tip, keep the browser tab open and in the foreground during transcription: some browsers throttle CPU time for background tabs, which can significantly slow down WebAssembly inference.
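The fetch-once-then-cache behavior described above can be sketched roughly as follows. This is only an illustration of the pattern, not the page's actual code: `loadModelBytes`, `CacheLike`, `ResponseLike`, and `Fetcher` are hypothetical names, and the small interfaces stand in for the browser's `Cache` and `Response` objects so the sketch stays self-contained.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
// They mirror the parts of the browser's Response and Cache used here.
interface ResponseLike {
  ok: boolean;
  clone(): ResponseLike;
  arrayBuffer(): Promise<ArrayBuffer>;
}

interface CacheLike {
  match(url: string): Promise<ResponseLike | undefined>;
  put(url: string, response: ResponseLike): Promise<void>;
}

type Fetcher = (url: string) => Promise<ResponseLike>;

// Return the model bytes, downloading only when the cache has no copy.
async function loadModelBytes(
  url: string,
  cache: CacheLike,
  fetcher: Fetcher
): Promise<ArrayBuffer> {
  const cached = await cache.match(url);
  if (cached) {
    return cached.arrayBuffer(); // cache hit: no network request
  }
  const response = await fetcher(url);
  if (!response.ok) {
    throw new Error(`Model download failed: ${url}`);
  }
  // Store a clone, because a response body can only be read once.
  await cache.put(url, response.clone());
  return response.arrayBuffer();
}
```

In a browser, the cache argument would typically come from `caches.open(...)` with a cache name of your choosing, and the fetcher would wrap the global `fetch`. Passing both in as parameters keeps the function easy to exercise with an in-memory stand-in.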
Is my video uploaded to a remote server?
No, your video file never leaves your device. The video transcriptor is built entirely on browser-local technologies. When you select a video file, it is opened with the File API, which gives the browser direct access to the file on your disk without any network transfer. The audio track is extracted by feeding the file through the browser's built-in media decoder, via an HTML video element or the Web Audio API's AudioContext.decodeAudioData() method, producing a PCM audio buffer entirely within the browser process. That buffer is passed directly to the local Whisper WebAssembly model for speech recognition. The model itself was previously downloaded and cached in your browser's storage; it is not a remote API call that sends audio to OpenAI's servers. The transcription is generated in the browser and displayed as text in the output area. The only network request this page makes is the one-time download of the Whisper model weights on first use, and that request contains no user data. This local design matters particularly for video content, which often contains confidential meetings, personal conversations, proprietary presentations, or sensitive interviews that should never leave a user's device. You can verify offline operation by loading the page while connected, waiting for the model to download and cache, and then disconnecting from the internet: transcription will continue to work without any network. As a practical tip, for maximum privacy with sensitive recordings, switch to airplane mode once the model has been cached and only then select your video file.
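Whisper models take mono audio sampled at 16 kHz, so the decoded PCM buffer usually needs to be mixed down to one channel before inference. The sketch below shows only that mixing step, under the assumption that the per-channel sample arrays come from `AudioBuffer.getChannelData()` after `decodeAudioData()`. The function name `downmixToMono` is hypothetical, and the page's actual preprocessing is not documented here.

```typescript
// Average all channels into a single mono channel.
// Each input array holds the samples of one channel (values in [-1, 1]).
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) {
    return new Float32Array(0);
  }
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (let c = 0; c < channels.length; c++) {
    const channel = channels[c];
    for (let i = 0; i < length; i++) {
      // Divide per sample so the sum stays within the original range.
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}
```

One common way to handle the sample rate in the browser is to create the `AudioContext` with `{ sampleRate: 16000 }`, so that `decodeAudioData()` resamples during decoding. Whether this page does that is an assumption. Either way, every step operates on in-memory arrays and involves no network transfer.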
How long does transcription take for a long video?
Transcription time depends primarily on the length of the audio, the WebAssembly runtime performance of your device, and which Whisper model variant is loaded. As a practical benchmark, on a modern laptop with a mid-range CPU running the whisper-base.en model compiled to WASM with SIMD enabled, a 10-minute audio track typically transcribes in roughly 2–5 minutes in the browser. Inference speed is usually described as a real-time factor (RTF): the seconds of processing required per second of audio, so lower is better. The benchmark above corresponds to an RTF of roughly 0.2–0.5, which puts a 60-minute video at about 12–30 minutes of processing. Browser WASM Whisper implementations are also often quoted at around 3–10× faster than real time (an RTF of roughly 0.1–0.33), which would mean one minute of audio takes about 6–20 seconds and a 60-minute video about 6–20 minutes. Factors that slow transcription include: older CPUs that lack SIMD support and fall back to scalar WASM execution (which can be 4–8× slower); a backgrounded browser tab, since browsers throttle CPU for background tabs; and a larger model variant, if available, which is more accurate but proportionally slower. There is no server-side acceleration because all inference runs locally. A progress indicator should update during processing to confirm that the transcription is advancing. As a practical tip, for recordings longer than 30 minutes, consider splitting the video into 10–15 minute segments with a free tool before transcribing, both to keep processing time manageable and to get usable partial results sooner than a single long job would allow.
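The arithmetic behind these estimates is simple enough to sketch. Taking RTF as seconds of processing per second of audio, the 10-minute benchmark above (2–5 minutes of processing) works out to an RTF of 0.2–0.5. The function names below are hypothetical, and the numbers are estimates rather than measurements of this tool.

```typescript
// Estimated processing time = audio duration x real-time factor (RTF),
// where RTF is seconds of processing per second of audio.
function estimateProcessingSeconds(audioSeconds: number, rtf: number): number {
  return audioSeconds * rtf;
}

interface TimeRange {
  startSec: number;
  endSec: number;
}

// Plan fixed-length segments for a long recording; the last one may be shorter.
function splitIntoSegments(totalSeconds: number, segmentSeconds: number): TimeRange[] {
  if (segmentSeconds <= 0) {
    throw new Error("segmentSeconds must be positive");
  }
  const ranges: TimeRange[] = [];
  for (let start = 0; start < totalSeconds; start += segmentSeconds) {
    ranges.push({
      startSec: start,
      endSec: Math.min(start + segmentSeconds, totalSeconds),
    });
  }
  return ranges;
}
```

For example, `estimateProcessingSeconds(3600, 0.5)` gives 1800 seconds, or 30 minutes for an hour of audio at the slow end of the benchmark, and `splitIntoSegments(3600, 900)` plans four 15-minute segments so the first partial result arrives after about a quarter of that time.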
Which languages does the transcriber support?
The Whisper model used by this tool was trained on audio spanning 99 languages, and it supports transcription across all of them to varying degrees of accuracy. Languages with the highest accuracy, owing to their large share of Whisper's training data, include English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese (Mandarin), Japanese, Korean, Arabic, Turkish, Polish, and Swedish. For these major languages, word error rates are comparable to commercial transcription services. Less-resourced languages, those with smaller amounts of training data, transcribe with lower accuracy, and Whisper may occasionally hallucinate words or switch to a closely related language with more training data when it encounters unfamiliar phonemes. By default the model performs automatic language detection, analyzing the first 30 seconds of audio to identify the language before beginning full transcription. Whisper also supports translation: in addition to transcribing in the source language, it can translate to English in a single inference pass. Code-switching, where the audio changes language mid-sentence, is partially supported, but accuracy drops in the transition regions. Accented speech within a supported language is generally handled robustly. As a practical tip, if you know the language in advance, select it explicitly in the tool settings rather than relying on auto-detection. This improves both accuracy and processing speed, especially for non-English audio and for short clips where the 30-second detection window may not contain enough signal.
The content of this page is available under CC BY 4.0.