VidVecRep Operator

Description

The VidVecRep operator extracts vector representations from videos using the CLIP-ViT-B-32 model. It works by extracting I-frames (keyframes) from a video file using FFmpeg, then generating a 512-dimensional feature vector for each frame using the CLIP model. The operator yields both the average vector for the video and vectors for each I-frame.
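Each yielded record pairs a vector with a flag marking whether it is the video-level average. A schematic sketch of the output (field names as documented under gendata() below):

# One record carries the average vector over all I-frames:
{"vid_vec": [...], "is_avg": True}    # list of 512 floats
# The remaining records carry one vector per I-frame:
{"vid_vec": [...], "is_avg": False}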

Model Information

  • Model: CLIP ViT-B/32

  • Source: OpenAI, via HuggingFace Transformers

  • Vector Size: 512

  • Usage: The model is used to generate embeddings for video frames, enabling downstream tasks such as video similarity, clustering, and zero-shot classification.
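For orientation, here is a minimal sketch of how a single frame can be embedded with this model through the HuggingFace Transformers API (standalone code, not the operator's internals; the frame path is a placeholder):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame.png")  # placeholder path to an extracted video frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)

print(features.shape)  # torch.Size([1, 512])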

System Dependencies

  • Python >= 3.10

  • FFmpeg

    • On Windows, there are two options:

      1. Download a build from ffmpeg.org and add it to your PATH.

      2. Run winget install ffmpeg from an elevated PowerShell session (make sure winget is installed first).

    • On Linux/macOS, install via your package manager (e.g., sudo apt install ffmpeg).
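
    • To verify the installation, run ffmpeg -version in a terminal; it should print the installed version and build configuration.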

Operator Dependencies

  • PyTorch >= 2.6.0

  • Torchvision >= 0.21.0

  • Transformers >= 4.51.1

  • Pillow >= 11.1.0

How to Run the Tests

  1. Ensure that you are in the root directory of the feluda project.

  2. Install dependencies (in your virtual environment):

    uv pip install "./operators/vid_vec_rep"
    uv pip install "feluda[dev]"
    
  3. Ensure FFmpeg is installed and available in your PATH.

  4. Run the tests:

    pytest operators/vid_vec_rep/test.py
    

Usage

from feluda.factory import VideoFactory
from feluda.operators import VidVecRep

# Initialize the operator
operator = VidVecRep()

# Load a video
video = VideoFactory.make_from_file_on_disk("example.mp4")

# Extract features (run() yields one dict per vector)
vectors = operator.run(video, remove_after_processing=False)

for vector in vectors:
    print(vector.keys())  # dict_keys(['vid_vec', 'is_avg'])

# Cleanup
operator.cleanup()
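
Because run() returns a generator, the results can be collected and then split into the average vector and the per-frame vectors. A minimal sketch using the vid_vec and is_avg fields documented below:

results = list(operator.run(video, remove_after_processing=False))

avg_vec = next(r["vid_vec"] for r in results if r["is_avg"])
frame_vecs = [r["vid_vec"] for r in results if not r["is_avg"]]

print(len(avg_vec))     # 512
print(len(frame_vecs))  # one vector per extracted I-frame
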
class operators.vid_vec_rep.vid_vec_rep.VidVecRep

Bases: Operator

Operator to extract video vector representations using CLIP-ViT-B-32.

__init__() → None

Initialize the VidVecRep class.

load_model() → None

Load the CLIP model and processor onto the specified device.

static validate_system() → None

Validate that required system dependencies are available.

Checks if FFmpeg is installed and accessible in the system PATH.
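
A conceptually equivalent check, shown as an illustrative sketch rather than the operator's actual implementation:

import shutil

if shutil.which("ffmpeg") is None:
    raise RuntimeError("FFmpeg is not installed or not available in PATH")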

get_mean_feature() → torch.Tensor

Compute the mean feature vector from the feature matrix.

Returns:

Mean feature vector

Return type:

torch.Tensor
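
Given a feature matrix of shape (n_frames, 512), the mean vector is the element-wise average across frames; the PyTorch equivalent is a one-liner (the random matrix below is a stand-in):

import torch

features = torch.randn(10, 512)  # stand-in for a (n_frames, 512) feature matrix
mean_vec = features.mean(dim=0)  # shape: (512,)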

analyze(fname: str) → None

Analyze the video file and extract features.

Parameters:

fname (str) – Path to the video file

static extract_frames(fname: str) → list[PIL.Image.Image]

Extract I-frames from the video file using ffmpeg.

Parameters:

fname (str) – Path to the video file

Returns:

List of PIL Images

Return type:

list
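
For reference, I-frame selection in FFmpeg is commonly done with the select filter; below is a standalone sketch of the general technique (the helper name and temporary-file layout are illustrative, not the operator's actual code):

import pathlib
import subprocess
import tempfile

from PIL import Image


def extract_iframes(fname: str) -> list[Image.Image]:
    """Illustrative helper: dump I-frames to disk with ffmpeg, then load them."""
    with tempfile.TemporaryDirectory() as tmp:
        pattern = str(pathlib.Path(tmp) / "frame_%04d.png")
        subprocess.run(
            ["ffmpeg", "-i", fname, "-vf", r"select=eq(pict_type\,I)",
             "-vsync", "vfr", pattern],
            check=True,
            capture_output=True,
        )
        frames = []
        for path in sorted(pathlib.Path(tmp).glob("frame_*.png")):
            with Image.open(path) as img:
                frames.append(img.copy())  # copy before the temp dir is removed
        return frames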

extract_features(images: list) → torch.Tensor

Extract features from a list of images using pre-trained CLIP-ViT-B-32.

Parameters:

images (list) – List of PIL Images

Returns:

Feature matrix of shape (batch, 512)

Return type:

torch.Tensor

gendata() → Generator[dict, None, None]

Yield video vector representations for the analyzed video.

Yields:

dict

A dictionary containing:
  • vid_vec (list): Vector representation

  • is_avg (bool): A flag indicating whether the vector is the average vector or an I-frame vector

run(file: VideoFactory, remove_after_processing: bool | None = False) → Generator[dict, None, None]

Run the operator.

Parameters:
  • file (VideoFactory) – VideoFactory file object

  • remove_after_processing (bool) – Whether to remove the file after processing

Returns:

Yields video and I-frame vector representations

Return type:

generator

cleanup() → None

Clean up resources used by the operator.

state() → dict

Return the current state of the operator.

Returns:

State of the operator

Return type:

dict