Audio transcription support (#18398)

* install new packages for transcription support

* add config options

* audio maintainer modifications to support transcription

* pass main config to audio process

* embeddings support

* api and transcription post processor

* embeddings maintainer support for post processor

* live audio transcription with sherpa and faster-whisper

* update dispatcher with live transcription topic

* frontend websocket

* frontend live transcription

* frontend changes for speech events

* i18n changes

* docs

* mqtt docs

* fix linter

* use float16 and small model on gpu for real-time

* fix return value and use requestor to embed description instead of passing embeddings

* run real-time transcription in its own thread

* tweaks

* publish live transcriptions on their own topic instead of tracked_object_update

* config validator and docs

* clarify docs
Author: Josh Hawkins, 2025-05-27 10:26:00 -05:00
Committed by: Blake Blackshear
parent 2385c403ee
commit 6dc36fcbb4
29 changed files with 2322 additions and 51 deletions


@@ -72,3 +72,77 @@ audio:
    - speech
    - yell
```
### Audio Transcription
Frigate supports fully local audio transcription using either `sherpa-onnx` or OpenAI's open-source Whisper models via `faster-whisper`. It is recommended to configure the feature only at the global level and then enable it for individual cameras:
```yaml
audio_transcription:
  enabled: False
  device: ...
  model_size: ...
```
Enable audio transcription for select cameras at the camera level:
```yaml
cameras:
  back_yard:
    ...
    audio_transcription:
      enabled: True
:::note
Audio detection must be enabled and configured as described above in order to use audio transcription features.
:::
The optional config parameters that can be set at the global level include:

- **`enabled`**: Enable or disable the audio transcription feature.
  - Default: `False`
  - It is recommended to configure the feature only at the global level and enable it at the individual camera level.
- **`device`**: The device used to run the transcription and translation models.
  - Default: `CPU`
  - This can be `CPU` or `GPU`. The `sherpa-onnx` models are lightweight and run on the CPU only. The `whisper` models can run on GPU, but only CUDA hardware is supported.
- **`model_size`**: The size of the model used for live transcription.
  - Default: `small`
  - This can be `small` or `large`. The `small` setting uses `sherpa-onnx` models that are fast, lightweight, and always run on the CPU, but they are not as accurate as the `whisper` model.
  - The `large` setting uses a `whisper` model, which is far more accurate but generally requires CUDA hardware to keep up with real-time audio.
  - This config option applies to **live transcription only**. Recorded `speech` events will always use a different `whisper` model (and can be accelerated on CUDA hardware, if available, with `device: GPU`).
- **`language`**: Defines the language used by `whisper` to translate `speech` audio events (and live audio, only if using the `large` model).
  - Default: `en`
  - You must use a valid [language code](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
  - Transcriptions for `speech` events are translated.
  - Live audio is translated only if you are using the `large` model. The `small` `sherpa-onnx` model is English-only.

The only field that is valid at the camera level is `enabled`.
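Putting the global options together with a per-camera enable, a combined config might look like the sketch below (the camera name and the non-default values are illustrative, not defaults):

```yaml
audio_transcription:
  enabled: False
  device: GPU        # whisper on GPU is supported on CUDA hardware only
  model_size: large  # more accurate live transcription; needs CUDA to keep up
  language: en

cameras:
  back_yard:
    audio_transcription:
      enabled: True  # the only option valid at the camera level
```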
#### Live transcription
The single-camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role. Use the Enable/Disable Live Audio Transcription button/switch to toggle transcription processing. When speech is heard, the UI displays a black overlay box with the transcribed text on top of the camera stream. The MQTT topic `frigate/<camera_name>/audio/transcription` is also updated in real time with the transcribed text.
Results can be error-prone due to a number of factors, including:
- Poor quality camera microphone
- Distance of the audio source to the camera microphone
- Low audio bitrate setting in the camera
- Background noise
- Using the `small` model, which is fast but not accurate with poor-quality audio
For speech sources close to the camera with minimal background noise, use the `small` model.
If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate. Using the `large` model with CPU will likely be too slow for real-time transcription.
#### Transcription and translation of `speech` audio events
Any `speech` events in Explore can be transcribed and/or translated through the Transcribe button in the Tracked Object Details pane.
In order to use transcription and translation for past events, you must enable audio detection and define `speech` as an audio type to listen for in your config. To have `speech` events translated into the language of your choice, set the `language` config parameter with the correct [language code](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
The transcribed/translated speech will appear in the description box in the Tracked Object Details pane. If Semantic Search is enabled, embeddings are generated for the transcription text and are fully searchable using the description search type.
Recorded `speech` events will always use a `whisper` model, regardless of the `model_size` config setting. Without a GPU, generating transcriptions for longer `speech` events may take a fair amount of time, so be patient.
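As a minimal sketch of the pieces needed for transcribable `speech` events (the camera name is a placeholder, and `language: es` is just a hypothetical example of translating events into Spanish):

```yaml
audio:
  enabled: True
  listen:
    - speech

audio_transcription:
  language: es # hypothetical target language code for translated speech events

cameras:
  back_yard:
    audio_transcription:
      enabled: True
```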


@@ -620,6 +620,19 @@ genai:
  object_prompts:
    person: "My special person prompt."
# Optional: Configuration for audio transcription
# NOTE: only the enabled option can be overridden at the camera level
audio_transcription:
  # Optional: Enable audio transcription (default: shown below)
  enabled: False
  # Optional: The device to run the models on (default: shown below)
  device: CPU
  # Optional: Set the model size used for transcription (default: shown below)
  model_size: small
  # Optional: Set the language used for transcription translation (default: shown below)
  # List of language codes: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10
  language: en
# Optional: Restream configuration
# Uses https://github.com/AlexxIT/go2rtc (v1.9.9)
# NOTE: The default go2rtc API port (1984) must be used,


@@ -139,7 +139,7 @@ Message published for updates to tracked object metadata, for example:
"name": "John",
"score": 0.95,
"camera": "front_door_cam",
"timestamp": 1607123958.748393,
"timestamp": 1607123958.748393
}
```
@@ -153,7 +153,7 @@ Message published for updates to tracked object metadata, for example:
"plate": "123ABC",
"score": 0.95,
"camera": "driveway_cam",
"timestamp": 1607123958.748393,
"timestamp": 1607123958.748393
}
```
@@ -269,6 +269,12 @@ Publishes the rms value for audio detected on this camera.
**NOTE:** Requires audio detection to be enabled
### `frigate/<camera_name>/audio/transcription`
Publishes transcribed text for audio detected on this camera.
**NOTE:** Requires audio detection and transcription to be enabled
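As an illustrative sketch (not part of Frigate itself), this topic could be consumed with the `paho-mqtt` Python client (2.x API); the broker address and camera name below are assumptions:

```python
import paho.mqtt.client as mqtt

CAMERA = "back_yard"  # hypothetical camera name
TOPIC = f"frigate/{CAMERA}/audio/transcription"

def on_connect(client, userdata, flags, reason_code, properties):
    # Subscribe once the connection to the broker is established
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    # Print each transcription update as it arrives
    print(f"[{msg.topic}] {msg.payload.decode('utf-8')}")

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)  # assumed broker address/port
client.loop_forever()
```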
### `frigate/<camera_name>/enabled/set`
Topic to turn Frigate's processing of a camera on and off. Expected values are `ON` and `OFF`.