Add ability to use Jina CLIP V2 for semantic search (#16826)

* add wheels

* move extra index url to bottom

* config model option

* add postprocess

* fix config

* jina v2 embedding class

* use jina v2 in embeddings

* fix ov inference

* frontend

* update reference config

* revert device

* fix truncation

* return np tensors

* use correct embeddings from inference

* manual preprocess

* clean up

* docs

* lower batch size for v2 only

* docs clarity

* wording
Author: Josh Hawkins
Date: 2025-02-26 08:58:25 -06:00
Committed by: GitHub
Parent: 447f26e1b9
Commit: d0e9bcbfdc
10 changed files with 380 additions and 49 deletions


```diff
@@ -536,6 +536,8 @@ semantic_search:
   enabled: False
   # Optional: Re-index embeddings database from historical tracked objects (default: shown below)
   reindex: False
+  # Optional: Set the model used for embeddings. (default: shown below)
+  model: "jinav1"
   # Optional: Set the model size used for embeddings. (default: shown below)
   # NOTE: small model runs on CPU and large model runs on GPU
   model_size: "small"
```


````diff
@@ -5,7 +5,7 @@ title: Semantic Search
 
 Semantic Search in Frigate allows you to find tracked objects within your review items using either the image itself, a user-defined text description, or an automatically generated one. This feature works by creating _embeddings_ — numerical vector representations — for both the images and text descriptions of your tracked objects. By comparing these embeddings, Frigate assesses their similarities to deliver relevant search results.
 
-Frigate uses [Jina AI's CLIP model](https://huggingface.co/jinaai/jina-clip-v1) to create and save embeddings to Frigate's database. All of this runs locally.
+Frigate uses models from [Jina AI](https://huggingface.co/jinaai) to create and save embeddings to Frigate's database. All of this runs locally.
 
 Semantic Search is accessed via the _Explore_ view in the Frigate UI.
 
@@ -35,23 +35,47 @@ If you are enabling Semantic Search for the first time, be advised that Frigate
 
 :::
 
-### Jina AI CLIP
+### Jina AI CLIP (version 1)
 
-The vision model is able to embed both images and text into the same vector space, which allows `image -> image` and `text -> image` similarity searches. Frigate uses this model on tracked objects to encode the thumbnail image and store it in the database. When searching for tracked objects via text in the search box, Frigate will perform a `text -> image` similarity search against this embedding. When clicking "Find Similar" in the tracked object detail pane, Frigate will perform an `image -> image` similarity search to retrieve the closest matching thumbnails.
+The [V1 model from Jina](https://huggingface.co/jinaai/jina-clip-v1) has a vision model which is able to embed both images and text into the same vector space, which allows `image -> image` and `text -> image` similarity searches. Frigate uses this model on tracked objects to encode the thumbnail image and store it in the database. When searching for tracked objects via text in the search box, Frigate will perform a `text -> image` similarity search against this embedding. When clicking "Find Similar" in the tracked object detail pane, Frigate will perform an `image -> image` similarity search to retrieve the closest matching thumbnails.
 
-The text model is used to embed tracked object descriptions and perform searches against them. Descriptions can be created, viewed, and modified on the Explore page when clicking on thumbnail of a tracked object. See [the Generative AI docs](/configuration/genai.md) for more information on how to automatically generate tracked object descriptions.
+The V1 text model is used to embed tracked object descriptions and perform searches against them. Descriptions can be created, viewed, and modified on the Explore page when clicking on thumbnail of a tracked object. See [the Generative AI docs](/configuration/genai.md) for more information on how to automatically generate tracked object descriptions.
 
-Differently weighted versions of the Jina model are available and can be selected by setting the `model_size` config option as `small` or `large`:
+Differently weighted versions of the Jina models are available and can be selected by setting the `model_size` config option as `small` or `large`:
 
 ```yaml
 semantic_search:
   enabled: True
+  model: "jinav1"
   model_size: small
 ```
 
 - Configuring the `large` model employs the full Jina model and will automatically run on the GPU if applicable.
 - Configuring the `small` model employs a quantized version of the Jina model that uses less RAM and runs on CPU with a very negligible difference in embedding quality.
 
+### Jina AI CLIP (version 2)
+
+Frigate also supports the [V2 model from Jina](https://huggingface.co/jinaai/jina-clip-v2), which introduces multilingual support (89 languages). In contrast, the V1 model only supports English.
+
+V2 offers only a 3% performance improvement over V1 in both text-image and text-text retrieval tasks, an upgrade that is unlikely to yield noticeable real-world benefits. Additionally, V2 has _significantly_ higher RAM and GPU requirements, leading to increased inference time and memory usage. If you plan to use V2, ensure your system has ample RAM and a discrete GPU. CPU inference (with the `small` model) using V2 is not recommended.
+
+To use the V2 model, update the `model` parameter in your config:
+
+```yaml
+semantic_search:
+  enabled: True
+  model: "jinav2"
+  model_size: large
+```
+
+For most users, especially native English speakers, the V1 model remains the recommended choice.
+
+:::note
+
+Switching between V1 and V2 requires reindexing your embeddings. To do this, set `reindex: True` in your Semantic Search configuration and restart Frigate. The embeddings from V1 and V2 are incompatible, and failing to reindex will result in incorrect search results.
+
+:::
+
 ### GPU Acceleration
 
 The CLIP models are downloaded in ONNX format, and the `large` model can be accelerated using GPU hardware, when available. This depends on the Docker build that is used.
````
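
As a worked sketch tying the doc changes together (illustrative only, not part of the diff): switching an existing install from V1 to V2 on a machine with a discrete GPU would look like the config below, with `reindex` set to `True` for the first restart after the switch so the incompatible V1 embeddings are rebuilt.

```yaml
semantic_search:
  enabled: True
  # required once when switching between "jinav1" and "jinav2"
  reindex: True
  model: "jinav2"
  # the large model runs on the GPU when available
  model_size: large
```

Once the reindex has completed, `reindex` can typically be set back to `False` so the embeddings database is not rebuilt on every restart.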