VidVecRep Operator

Description

The VidVecRep operator extracts vector representations from videos using the CLIP-ViT-B-32 model. It works by extracting I-frames (keyframes) from a video file using FFmpeg, then generating a 512-dimensional feature vector for each frame using the CLIP model. The operator yields both the average vector for the video and vectors for each I-frame.
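Each yielded record pairs a vector with a flag marking whether it is the video-level average. A schematic sketch of the output (field names as documented under gendata() below):

# One record carries the average vector over all I-frames:
{"vid_vec": [...], "is_avg": True}    # list of 512 floats
# The remaining records carry one vector per I-frame:
{"vid_vec": [...], "is_avg": False}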

Model Information

  • Model: CLIP ViT-B/32

  • Source: OpenAI, via HuggingFace Transformers

  • Vector Size: 512

  • Usage: The model is used to generate embeddings for video frames, enabling downstream tasks such as video similarity, clustering, and zero-shot classification.
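For orientation, here is a minimal sketch of how a single frame can be embedded with this model through the HuggingFace Transformers API (standalone code, not the operator's internals; the frame path is a placeholder):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame.png")  # placeholder path to an extracted video frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)

print(features.shape)  # torch.Size([1, 512])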

System Dependencies

  • Python >= 3.10

  • FFmpeg

    • On Windows, there are two options:

      1. Download a build from ffmpeg.org and add it to your PATH.

      2. Run winget install ffmpeg from an elevated PowerShell session (make sure winget is installed first).

    • On Linux/macOS, install via your package manager (e.g., sudo apt install ffmpeg).
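
    • To verify the installation, run ffmpeg -version in a terminal; it should print the installed version and build configuration.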

Operator Dependencies

  • PyTorch >= 2.6.0

  • Torchvision >= 0.21.0

  • Transformers >= 4.51.1

  • Pillow >= 11.1.0

How to Run the Tests

  1. Ensure that you are in the root directory of the feluda project.

  2. Install dependencies (in your virtual environment):

    uv pip install "./operators/vid_vec_rep"
    uv pip install "feluda[dev]"
    
  3. Ensure FFmpeg is installed and available in your PATH.

  4. Run the tests:

    pytest operators/vid_vec_rep/test.py
    

Usage

from feluda.factory import VideoFactory
from feluda.operators import VidVecRep

# Initialize the operator
operator = VidVecRep()

# Load a video
video = VideoFactory.make_from_file_on_disk("example.mp4")

# Extract features (run() yields one dict per vector)
vectors = operator.run(video, remove_after_processing=False)

for vector in vectors:
    print(vector.keys())  # dict_keys(['vid_vec', 'is_avg'])

# Cleanup
operator.cleanup()
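
Because run() returns a generator, the results can be collected and then split into the average vector and the per-frame vectors. A minimal sketch using the vid_vec and is_avg fields documented below:

results = list(operator.run(video, remove_after_processing=False))

avg_vec = next(r["vid_vec"] for r in results if r["is_avg"])
frame_vecs = [r["vid_vec"] for r in results if not r["is_avg"]]

print(len(avg_vec))     # 512
print(len(frame_vecs))  # one vector per extracted I-frame
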
class operators.vid_vec_rep.vid_vec_rep.VidVecRep

Bases: Operator

Operator to extract video vector representations using CLIP-ViT-B-32.

__init__() → None

Initialize the VidVecRep class.

load_model() → None

Load the CLIP model and processor onto the specified device.

static validate_system() → None

Validate that required system dependencies are available.

Checks if FFmpeg is installed and accessible in the system PATH.
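
A conceptually equivalent check, shown as an illustrative sketch rather than the operator's actual implementation:

import shutil

if shutil.which("ffmpeg") is None:
    raise RuntimeError("FFmpeg is not installed or not available in PATH")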

get_mean_feature() → torch.Tensor

Compute the mean feature vector from the feature matrix.

Returns:

Mean feature vector

Return type:

torch.Tensor
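
Given a feature matrix of shape (n_frames, 512), the mean vector is the element-wise average across frames; the PyTorch equivalent is a one-liner (the random matrix below is a stand-in):

import torch

features = torch.randn(10, 512)  # stand-in for a (n_frames, 512) feature matrix
mean_vec = features.mean(dim=0)  # shape: (512,)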

analyze(fname: str) → None

Analyze the video file and extract features.

Parameters:

fname (str) – Path to the video file

static extract_frames(fname: str) → list[PIL.Image.Image]

Extract I-frames from the video file using ffmpeg.

Parameters:

fname (str) – Path to the video file

Returns:

List of PIL Images

Return type:

list
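
For reference, I-frame selection in FFmpeg is commonly done with the select filter; below is a standalone sketch of the general technique (the helper name and temporary-file layout are illustrative, not the operator's actual code):

import pathlib
import subprocess
import tempfile

from PIL import Image


def extract_iframes(fname: str) -> list[Image.Image]:
    """Illustrative helper: dump I-frames to disk with ffmpeg, then load them."""
    with tempfile.TemporaryDirectory() as tmp:
        pattern = str(pathlib.Path(tmp) / "frame_%04d.png")
        subprocess.run(
            ["ffmpeg", "-i", fname, "-vf", r"select=eq(pict_type\,I)",
             "-vsync", "vfr", pattern],
            check=True,
            capture_output=True,
        )
        frames = []
        for path in sorted(pathlib.Path(tmp).glob("frame_*.png")):
            with Image.open(path) as img:
                frames.append(img.copy())  # copy before the temp dir is removed
        return frames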

extract_features(images: list) → torch.Tensor

Extract features from a list of images using pre-trained CLIP-ViT-B-32.

Parameters:

images (list) – List of PIL Images

Returns:

Feature matrix of shape (batch, 512)

Return type:

torch.Tensor

gendata() → Generator[dict, None, None]

Yield video vector representations for the analyzed video.

Yields:

dict

A dictionary containing:
  • vid_vec (list): Vector representation

  • is_avg (bool): A flag indicating whether the vector is the average vector or an I-frame vector

run(file: VideoFactory, remove_after_processing: bool | None = False) → Generator[dict, None, None]

Run the operator.

Parameters:
  • file (VideoFactory) – VideoFactory file object

  • remove_after_processing (bool) – Whether to remove the file after processing

Returns:

Yields video and I-frame vector representations

Return type:

generator

cleanup() → None

Clean up resources used by the operator.

state() → dict

Return the current state of the operator.

Returns:

State of the operator

Return type:

dict