Speech-to-Text (STT)
Speech-to-Text (STT) functionality for fmus-vox.
This module transcribes speech to text using interchangeable models such as Whisper and wav2vec.
- fmus_vox.stt.transcribe(audio: str | Audio, model: str = 'whisper', language: str | None = None, **kwargs) -> str
Transcribe audio to text using a specified model.
This is a simple functional API for quick transcriptions. For more control, use the Transcriber class directly.
- Parameters:
audio – Audio to transcribe (file path or Audio object)
model – Model to use for transcription (whisper, wav2vec, etc.)
language – Language code (if None, auto-detect)
**kwargs – Additional model-specific parameters
- Returns:
Transcribed text
- Raises:
TranscriptionError – If transcription fails
Examples
>>> # Transcribe an audio file
>>> text = transcribe("recording.wav")
>>> print(text)

>>> # Transcribe with a specific model and language
>>> text = transcribe("recording.wav", model="whisper-large", language="en")
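Because transcribe raises TranscriptionError on failure, a batch job usually wants to catch the error per file rather than abort. The following is a minimal sketch of that pattern, not part of fmus-vox: transcribe_many and fake_transcribe are hypothetical names, and the callable and the exception type are passed in as parameters because the import path of TranscriptionError is not stated here.

```python
from typing import Callable, Dict, Iterable, Optional, Type


def transcribe_many(
    paths: Iterable[str],
    transcribe_fn: Callable[..., str],
    error_type: Type[Exception] = Exception,
    **options,
) -> Dict[str, Optional[str]]:
    """Transcribe each path, mapping failures to None instead of aborting."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = transcribe_fn(path, **options)
        except error_type:
            # A failed file should not stop the rest of the batch.
            results[path] = None
    return results


# Hypothetical stand-in for fmus_vox.stt.transcribe so the sketch runs
# without the library installed.
def fake_transcribe(audio: str, model: str = "whisper", language=None, **kwargs) -> str:
    if not audio.endswith(".wav"):
        raise ValueError(f"unsupported file: {audio}")
    return f"[{model}] transcript of {audio}"


batch = transcribe_many(["a.wav", "notes.txt"], fake_transcribe, error_type=ValueError)
```

With the real library, one would presumably pass transcribe as transcribe_fn and TranscriptionError as error_type.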
Transcriber Class
- class fmus_vox.stt.transcriber.Transcriber(model: str = 'whisper', **kwargs)
Bases: object
Base class for speech-to-text transcription.
This class provides the common interface for all transcription models and handles model loading, caching, and transcription.
- Parameters:
model – Name of the model to use (whisper, wav2vec, etc.)
device – Computation device (cpu, cuda, auto)
**kwargs – Additional model-specific parameters
- classmethod register_model(name: str, implementation: type) -> None
Register a model implementation.
- Parameters:
name – Model name
implementation – Model implementation class
- static __new__(cls, model: str = 'whisper', **kwargs) -> Transcriber
Create a new Transcriber instance of the appropriate subclass.
- Parameters:
model – Name of the model to use
**kwargs – Additional model-specific parameters
- Returns:
Transcriber instance
- Raises:
ModelError – If the model is not supported
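The combination of register_model and a __new__ that returns "the appropriate subclass" suggests a registry-backed factory. The sketch below shows one way such a pattern can be written. It is an illustration under assumptions, not the fmus-vox source: SketchTranscriber, SketchWhisper, and the local ModelError are hypothetical stand-ins.

```python
from typing import Dict, Type


class ModelError(Exception):
    """Stand-in for the library's ModelError."""


class SketchTranscriber:
    # Maps a model name to the class that implements it.
    _registry: Dict[str, Type["SketchTranscriber"]] = {}

    @classmethod
    def register_model(cls, name: str, implementation: type) -> None:
        cls._registry[name] = implementation

    def __new__(cls, model: str = "whisper", **kwargs):
        target = cls
        # Dispatch only when the base class itself is instantiated;
        # subclasses constructed directly keep their own type.
        if cls is SketchTranscriber:
            try:
                target = SketchTranscriber._registry[model]
            except KeyError:
                raise ModelError(f"Unsupported model: {model}") from None
        return super().__new__(target)

    def __init__(self, model: str = "whisper", **kwargs):
        self.model = model


class SketchWhisper(SketchTranscriber):
    pass


SketchTranscriber.register_model("whisper", SketchWhisper)
instance = SketchTranscriber("whisper")  # an instance of SketchWhisper
```

Since the returned object is an instance of a subclass of the class being called, Python still runs __init__ on it, so the caller sees an ordinary constructor call.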
- __init__(model: str = 'whisper', device: str | None = None, **kwargs)
Initialize the transcriber.
- Parameters:
model – Name of the model to use
device – Computation device (cpu, cuda, auto)
**kwargs – Additional model-specific parameters
- transcribe(**kwargs)
- transcribe_with_metadata(**kwargs)
- async transcribe_async(audio: str | Audio, language: str | None = None) -> str
Transcribe audio to text asynchronously.
- Parameters:
audio – Audio to transcribe (file path or Audio object)
language – Language code (if None, auto-detect)
- Returns:
Transcribed text
- Raises:
TranscriptionError – If transcription fails
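An async method like this is most useful when several files are transcribed concurrently. The sketch below shows the usual asyncio.gather pattern. FakeAsyncTranscriber is a hypothetical stand-in that mirrors only the documented transcribe_async signature, so the example runs without fmus-vox; how much real concurrency the library achieves is not stated here.

```python
import asyncio
from typing import List, Optional


class FakeAsyncTranscriber:
    """Stand-in exposing the documented transcribe_async signature."""

    async def transcribe_async(self, audio: str, language: Optional[str] = None) -> str:
        await asyncio.sleep(0)  # yield control, as real I/O or inference would
        return f"transcript of {audio}"


async def transcribe_all(transcriber, paths: List[str]) -> List[str]:
    # gather() schedules the coroutines concurrently and returns the
    # results in the same order as the inputs.
    return await asyncio.gather(*(transcriber.transcribe_async(p) for p in paths))


texts = asyncio.run(transcribe_all(FakeAsyncTranscriber(), ["a.wav", "b.wav"]))
```

By default gather propagates the first exception raised by any task; passing return_exceptions=True collects failures alongside successful results instead.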
- async transcribe_with_metadata_async(audio: str | Audio, language: str | None = None) -> TranscriptionResult
Transcribe audio to text asynchronously with additional metadata.
- Parameters:
audio – Audio to transcribe (file path or Audio object)
language – Language code (if None, auto-detect)
- Returns:
TranscriptionResult object
- Raises:
TranscriptionError – If transcription fails
- stream(audio_stream: Generator[Audio, None, None], language: str | None = None) -> Generator[TranscriptionResult, None, None]
Stream transcription for incoming audio chunks.
- Parameters:
audio_stream – Generator yielding Audio objects
language – Language code (if None, auto-detect)
- Yields:
TranscriptionResult for each processed chunk
- Raises:
TranscriptionError – If transcription fails
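The generator-in, generator-out shape of stream can be illustrated with plain Python. In this sketch, lists of samples stand in for Audio objects and strings stand in for TranscriptionResult; chunked and fake_stream are hypothetical helpers, not fmus-vox functions.

```python
from typing import Generator, Iterable, List


def chunked(samples: List[float], size: int) -> Generator[List[float], None, None]:
    """Yield fixed-size chunks; the last chunk may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def fake_stream(audio_stream: Iterable[List[float]]) -> Generator[str, None, None]:
    # Stand-in for Transcriber.stream: one result per incoming chunk.
    for index, chunk in enumerate(audio_stream):
        yield f"chunk {index}: {len(chunk)} samples"


# Results are produced lazily, one per chunk, as the consumer iterates.
results = list(fake_stream(chunked([0.0] * 10, size=4)))
```

Because both sides are generators, no chunk is read until the consumer asks for the next result, which keeps memory use bounded for long recordings.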
Whisper Transcriber
Whisper model implementation for speech-to-text.
This module provides the WhisperTranscriber class, which uses OpenAI’s Whisper model for transcription.
- class fmus_vox.stt.whisper.WhisperTranscriber(model: str = 'whisper', **kwargs)
Bases: Transcriber
Transcriber using OpenAI’s Whisper model.
Whisper is a general-purpose speech recognition model that can transcribe speech in multiple languages and translate it to English.
- Parameters:
model – Whisper model size/variant (tiny, base, small, medium, large)
device – Computation device (cpu, cuda, auto)
download_root – Directory to download and store models
**kwargs – Additional model-specific parameters
- __init__(model: str = 'whisper-base', device: str | None = None, download_root: str | None = None, **kwargs)
Initialize the Whisper transcriber.
- Parameters:
model – Whisper model size/variant (tiny, base, small, medium, large)
device – Computation device (cpu, cuda, auto)
download_root – Directory to download and store models
**kwargs – Additional model-specific parameters
- transcribe_with_metadata(audio: str | Audio, language: str | None = None) -> TranscriptionResult
Transcribe audio to text with additional metadata.
- Parameters:
audio – Audio to transcribe (file path or Audio object)
language – Language code (if None, auto-detect)
- Returns:
TranscriptionResult object
- Raises:
TranscriptionError – If transcription fails
- stream(audio_stream: Generator[Audio, None, None], language: str | None = None) -> Generator[TranscriptionResult, None, None]
Stream transcription for incoming audio chunks.
- Parameters:
audio_stream – Generator yielding Audio objects
language – Language code (if None, auto-detect)
- Yields:
TranscriptionResult for each processed chunk
- Raises:
TranscriptionError – If transcription fails