Best FREE Speech to Text AI - Whisper AI

Kevin Stratvert
18 Jan 2023 | 08:21

TLDR: In this video, Kevin introduces Whisper AI, a speech-to-text tool developed by OpenAI, the creators of ChatGPT and DALL-E 2. Whisper transcribes speech into text with remarkable accuracy, even in noisy environments or with thick accents. It supports 97 languages and is both free and open source. The video demonstrates how to use Whisper with Google Colaboratory, which runs code in a web browser without the need for a high-end computer. The process includes installing Whisper and ffmpeg, uploading an audio or video file, and choosing a Whisper model for transcription. The result is a high-quality transcript with proper capitalization and punctuation, available in several formats including SRT and TXT. Kevin also highlights Whisper's efficiency and accuracy for captioning YouTube videos, which he finds superior to Google's auto-generated captions.

Takeaways

  • Whisper AI is a free and open-source speech-to-text tool developed by OpenAI.
  • It supports transcription in English and 96 other languages.
  • Whisper works effectively even in noisy environments and with thick accents.
  • You can use Whisper directly on your computer or via Google Colaboratory, which doesn't require a powerful PC.
  • Google Colaboratory allows you to run code in your web browser and is accessible through Google Drive.
  • To get started with Google Colaboratory, you need to install it from the Google Workspace Marketplace.
  • Once installed, you can create a new file and name it for future reference.
  • Select a GPU (graphics card) as the hardware accelerator for optimal performance.
  • Install Whisper from GitHub, along with ffmpeg, within Google Colaboratory so that audio and video files can be processed.
  • You can upload an audio or video file to transcribe, and Whisper will generate TXT, SRT, and VTT files.
  • Whisper offers different model sizes for transcription, ranging from tiny for speed to large for accuracy.
  • The transcription includes capitalization and punctuation for a high-quality output.
  • To transcribe another file, simply upload a new audio or video file and update the file name in the code.
  • Whisper's advanced parameters allow you to customize the output, including the save location, translation, and language selection.
  • Remember to download your transcription files before leaving Google Colaboratory, as the runtime and files will be removed afterward.
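The takeaways about updating the file name and using advanced parameters can be sketched as a small helper that assembles the `whisper` command line. This is only an illustration: the helper name `build_whisper_command` is hypothetical, the exact command typed in the video is not reproduced in this summary, and the flags used (`--model`, `--task`, `--language`, `--output_dir`) are those of the open-source `whisper` command-line tool.

```python
import shlex


def build_whisper_command(audio_file, model="medium", task="transcribe",
                          language=None, output_dir=None):
    """Return the whisper CLI invocation as a list of arguments.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    command = ["whisper", audio_file, "--model", model, "--task", task]
    if language is not None:
        # Skip this flag to let Whisper detect the language itself.
        command += ["--language", language]
    if output_dir is not None:
        # Folder where the TXT, SRT, and VTT files are written.
        command += ["--output_dir", output_dir]
    return command


# To transcribe another file, only the file name needs to change.
command = build_whisper_command("cookies.mp3", model="medium")
print(shlex.join(command))
# prints: whisper cookies.mp3 --model medium --task transcribe
```

In a Google Colaboratory cell, the printed line would be run with a leading `!` so that it executes as a shell command.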

Q & A

  • What is the purpose of the AI tool Whisper?

    -Whisper is an AI tool designed to convert speech into text. It's capable of handling multiple languages and can work effectively even in noisy environments or with speakers having thick accents.

  • Who created Whisper AI?

    -Whisper AI was created by OpenAI, the same company known for developing ChatGPT and DALL-E 2.

  • How many languages does Whisper support for speech to text conversion?

    -Whisper supports speech to text conversion in English and 96 other languages.

  • What is Google Colaboratory and how is it used in the context of Whisper AI?

    -Google Colaboratory is a service that allows users to run code directly in their web browser. It's used with Whisper AI to run the AI model without needing a powerful personal computer.

  • What is required to use Google Colaboratory for Whisper AI?

    -To use Google Colaboratory, you need a Google account, and you must connect Google Colaboratory to Google Drive.

  • How does one install Whisper AI on Google Colaboratory?

    -Installation of Whisper AI on Google Colaboratory involves entering specific code into the Colaboratory environment to install Whisper from GitHub and ffmpeg for handling audio and video files.

  • What hardware accelerator is recommended when using Whisper AI on Google Colaboratory?

    -A GPU or graphics card is recommended as a hardware accelerator for running Whisper AI models efficiently on Google Colaboratory.

  • What types of files can be transcribed using Whisper AI?

    -Whisper AI can transcribe both audio and video files.

  • What are the different models available in Whisper AI and what is their main difference?

    -Whisper AI offers five different models ranging from tiny to large. The main difference is the balance between accuracy and resource usage, with the tiny model being the least resource-intensive and the large model offering the highest accuracy but requiring more resources and time.

  • What formats does Whisper AI provide for the transcribed text?

    -Whisper AI provides the transcribed text in SRT, TXT, and VTT formats. The TXT file contains plain text, while SRT and VTT include timestamps for the transcription.

  • How does one download the transcribed files after using Whisper AI on Google Colaboratory?

    -To download the transcribed files, click on the ellipsis or three-dot icon next to the file in Google Colaboratory and select 'Download'.

  • What additional parameters can be used with Whisper AI for transcription?

    -Additional parameters with Whisper AI include specifying the output save location, choosing to transcribe or translate a file, and setting the language, among others.

  • Why is it important to download transcribed files before leaving Google Colaboratory?

    -It's important to download transcribed files before leaving Google Colaboratory because the runtime will end and all files will be automatically removed once you leave the environment.
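Because the runtime's files disappear when the session ends, it can help to gather all transcripts into a single archive before downloading. The following is a minimal sketch using only the Python standard library; the function name `bundle_transcripts` and the archive name are hypothetical, and it assumes the transcript files sit directly in one folder.

```python
import zipfile
from pathlib import Path

# Output formats named in the video summary.
TRANSCRIPT_SUFFIXES = {".txt", ".srt", ".vtt"}


def bundle_transcripts(folder, archive_name="transcripts.zip"):
    """Zip every TXT, SRT, and VTT file found directly in `folder`.

    Returns the archive path and the list of file names that were added.
    """
    folder = Path(folder)
    archive_path = folder / archive_name
    transcripts = sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in TRANSCRIPT_SUFFIXES
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in transcripts:
            # Store only the file name, not the full folder path.
            archive.write(path, arcname=path.name)
    return archive_path, [path.name for path in transcripts]
```

In Google Colaboratory, the resulting archive could then be downloaded from the file browser as described above, or with `files.download` from the `google.colab` module.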

Outlines

00:00

Introduction to AI Speech-to-Text with Whisper

Kevin introduces the topic of converting speech to text using AI, specifically the Whisper tool developed by OpenAI. He says that Whisper transcribes better than most humans and remains effective in difficult conditions, including noisy environments and thick accents. The tool is free and open source, and it supports 97 languages. Kevin guides viewers through using Whisper via Google Colaboratory, which runs code in a web browser without the need for a high-spec computer. He provides a step-by-step process for installing Whisper and its dependencies, setting up the runtime environment, and preparing the AI for transcription tasks.
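The installation step can be sketched as follows. The exact commands entered in the video are not reproduced in this summary, so the two commands below are assumptions: the `pip` line is the GitHub install documented by the Whisper project, and the `apt-get` line is one common way to install ffmpeg on a Linux runtime. The helper `run_setup` is hypothetical and defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumed setup commands: Whisper from its GitHub repository, ffmpeg from apt.
INSTALL_COMMANDS = [
    ["pip", "install", "git+https://github.com/openai/whisper.git"],
    ["apt-get", "install", "-y", "ffmpeg"],
]


def run_setup(commands=None, dry_run=True):
    """Print each setup command and run it only when dry_run is False."""
    commands = INSTALL_COMMANDS if commands is None else commands
    handled = []
    for command in commands:
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(command, check=True)
        handled.append(command)
    return handled
```

Calling `run_setup(dry_run=False)` inside the runtime would perform the installation; in a notebook cell the same commands are usually typed directly with a leading `!`.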

05:01

Using Whisper for High-Quality Transcripts

The second part of the video demonstrates how to use Whisper to transcribe an audio file. Kevin explains how to upload an audio file into Google Colaboratory, start the transcription with a specific command, and choose a model size that balances accuracy and processing time. He highlights the medium model as a good compromise. After transcription, viewers can download the text file along with the SRT (subtitle) and VTT files, which include timestamps. Kevin also shows additional parameters available for customization, such as output location, translation options, and language specification. He emphasizes Whisper's advantages over Google's auto-captions for his YouTube videos, including accuracy, capitalization, and punctuation. Before concluding, Kevin advises viewers to download their transcriptions before exiting Google Colaboratory to avoid losing work.
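The video runs Whisper from a command, but the same transcription can also be driven from Python. The sketch below assumes the open-source `whisper` package is installed in the runtime; `load_whisper_model` and `transcribe_file` are hypothetical wrapper names, while `whisper.load_model` and the model's `transcribe` method come from that package.

```python
def load_whisper_model(name="medium"):
    """Load a Whisper model by size name.

    The import is done lazily so this module can be loaded even
    where the whisper package is not installed.
    """
    import whisper
    return whisper.load_model(name)


def transcribe_file(model, audio_path, language=None, task="transcribe"):
    """Transcribe (or translate) one audio or video file.

    Returns the full text and the list of timestamped segments.
    language=None leaves language detection to the model.
    """
    result = model.transcribe(audio_path, language=language, task=task)
    return result["text"].strip(), result["segments"]


# Example use in a runtime where whisper is installed:
# model = load_whisper_model("medium")
# text, segments = transcribe_file(model, "cookies.mp3")
```

Each segment is a dictionary with `start`, `end`, and `text` entries, which is the information the SRT and VTT files carry.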

Keywords

Speech to Text AI

Speech to Text AI refers to artificial intelligence technology that can convert spoken language into written text. In the video, this technology is demonstrated through the use of Whisper AI, which is capable of transcribing speech with high accuracy, even in noisy environments or when dealing with heavy accents.

Whisper AI

Whisper AI is an AI tool developed by OpenAI that specializes in speech recognition and transcription. It is highlighted in the video for its ability to transcribe speech into text effectively, supporting multiple languages and working well under various conditions such as background noise or thick accents.

OpenAI

OpenAI is a company known for developing advanced AI technologies. The video mentions OpenAI as the creator of Whisper AI as well as other popular AI models such as ChatGPT and DALL-E 2. OpenAI's work is central to the discussion because it showcases what AI can do across different applications.

Google Colaboratory

Google Colaboratory, often abbreviated as Colab, is a cloud-based platform that allows users to run code in their web browsers. In the context of the video, it is used to run Whisper AI without the need for a high-performance computer, making the transcription process accessible to a wider audience.

GPU

GPU stands for Graphics Processing Unit, which is a type of hardware accelerator used in computers for rendering images, animations, and video. In the video, the speaker instructs the audience to select a GPU when setting up the runtime environment in Google Colab to optimize the performance of Whisper AI.

ffmpeg

ffmpeg is a free and open-source software project that can handle multimedia data, including audio and video files. In the video, it is mentioned as a tool that needs to be installed in Google Colab to facilitate the processing of audio and video files for transcription.

Transcribe

Transcribe refers to the process of converting spoken language into written form. In the video, the main purpose of using Whisper AI is to transcribe audio files, which is demonstrated through the uploading and processing of an MP3 file named 'cookies.mp3'.

SRT file

An SRT file is a SubRip subtitle file format that includes timestamps to synchronize the text with the audio or video playback. The video shows how Whisper AI can generate an SRT file, which is useful for creating captions for videos.
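To make the SRT layout concrete, here is a minimal sketch that turns timestamped segments into SRT text. It assumes each segment is a dictionary with `start` and `end` times in seconds plus a `text` entry; the function names are hypothetical, and this is not Whisper's own writer.

```python
def format_srt_timestamp(seconds):
    """Convert seconds to the SRT form HH:MM:SS,mmm."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def segments_to_srt(segments):
    """Build SRT text: a counter, a timing line, then the caption text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive captions.
    return "\n".join(blocks)


print(format_srt_timestamp(3661.5))
# prints: 01:01:01,500
```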

TXT file

A TXT file is a plain text file format used for storing written documents. In the context of the video, a TXT file is one of the output formats provided by Whisper AI, containing the transcribed text without timestamps, suitable for general text documentation.

VTT file

VTT refers to WebVTT (Web Video Text Tracks), a format used for displaying captions and subtitles on the web. Like SRT, a VTT file includes timestamps, and the video mentions it as another output option for Whisper AI.
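The two caption formats are close enough that a simple conversion illustrates the difference: a VTT file begins with a `WEBVTT` header and writes the milliseconds after a period instead of a comma. The sketch below is a hypothetical helper for well-formed SRT input, not a full subtitle parser.

```python
def srt_to_vtt(srt_text):
    """Convert SRT caption text to WebVTT text."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Timing lines: 00:00:02,500 becomes 00:00:02.500.
            line = line.replace(",", ".")
        lines.append(line)
    # The WEBVTT header must be the first line, followed by a blank line.
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

Only timing lines are changed, so commas inside the caption text are preserved.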

Model Selection

Model Selection refers to the process of choosing the appropriate AI model for a specific task. The video discusses different Whisper AI models ranging from tiny to large, each with varying levels of accuracy, processing speed, and space requirements, allowing users to select a model that best fits their needs.
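One way to reason about the trade-off is to pick the largest model that fits the available GPU memory. The sketch below is an assumption-laden illustration: the video summary only says there are five models from tiny to large, so the intermediate names (`base`, `small`) and the approximate memory figures are taken from the Whisper project's README at the time and may have changed. The function `pick_model` is hypothetical.

```python
# (model name, approximate GPU memory needed in GB), smallest to largest.
# Figures are approximate and assumed from the Whisper README.
WHISPER_MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_vram_gb):
    """Return the largest model whose memory need fits the given budget."""
    chosen = None
    for name, required_gb in WHISPER_MODELS:
        if required_gb <= available_vram_gb:
            chosen = name  # later entries are larger and more accurate
    if chosen is None:
        raise ValueError("not enough GPU memory for any model")
    return chosen


print(pick_model(6))
# prints: medium
```

Memory is only one constraint; larger models also take longer to run, which is why the video suggests the medium model as a compromise.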

Highlights

Whisper AI is a speech-to-text tool that performs better than most humans in transcribing speech.

It supports English and 96 other languages.

Whisper AI can work effectively in noisy environments and with thick accents.

The tool is completely free and open source.

Whisper was developed by OpenAI, the same company behind ChatGPT and DALL-E 2.

Whisper can be installed directly on a computer or used via Google Colaboratory.

Google Colaboratory allows running code in a web browser without specific hardware requirements.

To use Whisper via Google Colaboratory, one must connect it to Google Drive and install the tool.

Whisper runs best with a GPU selected as the hardware accelerator.

In the video's demonstration, installing Whisper and ffmpeg takes approximately 23 seconds.

Users can drag and drop audio or video files into Google Colaboratory for transcription.

Whisper AI provides transcription in multiple formats: TXT, SRT, and VTT.

The SRT and VTT formats include timestamps for each segment of transcribed text.

Whisper offers five different models with varying levels of accuracy and processing time.

The medium model is recommended for a balance between accuracy and processing speed.

Transcription results include capitalization and punctuation for a high-quality output.

Additional parameters can be used with Whisper for customized transcription and translation tasks.

Files transcribed with Whisper should be downloaded before exiting Google Colaboratory, because the runtime and its files are removed afterward.

The presenter uses Whisper to caption all of his YouTube videos.

Whisper outperforms Google's auto-generated captions in accuracy and detail.