Classify Video Zero Shot Operator

Description

The Classify Video Zero Shot operator classifies a video into user-provided labels using the CLIP ViT-B/32 model in a zero-shot fashion. It extracts I-frames from the video using FFmpeg, then uses the CLIP model to predict the most likely label for the video content.

Model Information

  • Model: CLIP ViT-B/32

  • Source: OpenAI, via HuggingFace Transformers

  • Usage: Zero-shot classification of video content by comparing extracted frame features to text label embeddings.
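
The snippet below is a minimal sketch of how this checkpoint is typically loaded through HuggingFace Transformers. The checkpoint id openai/clip-vit-base-patch32 is the standard public ViT-B/32 release; the operator's exact loading code may differ.

from transformers import CLIPModel, CLIPProcessor

# "openai/clip-vit-base-patch32" is the public ViT-B/32 checkpoint on the
# HuggingFace Hub; the operator loads an equivalent model and processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")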

System Dependencies

  • FFmpeg

    • On Windows: Download from ffmpeg.org and add to PATH, or use winget install ffmpeg from an elevated PowerShell.

    • On Linux/macOS: Install via your package manager (e.g., sudo apt install ffmpeg).
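
A quick way to confirm FFmpeg is reachable before using the operator, roughly what validate_system() has to establish. This is an illustrative check, not the operator's actual implementation.

import shutil
import subprocess

# Illustrative check: the operator needs the ffmpeg binary on PATH.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; install it before using this operator")

# Print the version banner to confirm the binary actually runs.
banner = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
print(banner.stdout.splitlines()[0])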

Operator Dependencies

  • feluda[video]

  • torch >= 2.6.0

  • transformers >= 4.51.1

  • pillow >= 11.1.0
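
To confirm that the minimum versions listed above are satisfied in your environment, a quick check looks like this (illustrative; note that Pillow is imported as PIL):

import PIL
import torch
import transformers

# Compare against the minimums listed above:
# torch >= 2.6.0, transformers >= 4.51.1, pillow >= 11.1.0
print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("pillow", PIL.__version__)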

How to Run the Tests

  1. Ensure you are in the root directory of the feluda project.

  2. Install dependencies (in your virtual environment):

    uv pip install "./operators/classify_video_zero_shot"
    uv pip install "feluda[dev]"
    
  3. Ensure FFmpeg is installed and available in your PATH.

  4. Run the tests:

    pytest operators/classify_video_zero_shot/test.py
    

Usage Example

from feluda.factory import VideoFactory
from feluda.operators import ClassifyVideoZeroShot

# Initialize the operator
operator = ClassifyVideoZeroShot()

# Load a video
video_url = (
    "https://tattle-media.s3.amazonaws.com/test-data/tattle-search/cat_vid_2mb.mp4"
)
file = VideoFactory.make_from_url(video_url)

# Classify the video
labels = ["cat", "dog"]
result = operator.run(file, labels)
print(result)

Output

{"prediction": "cat", "probs": [0.9849101901054382, 0.015089876018464565]}

API Reference

class operators.classify_video_zero_shot.classify_video_zero_shot.ClassifyVideoZeroShot

Bases: Operator

Operator to classify a video into given labels using CLIP-ViT-B-32 and a zero-shot approach.

__init__() → None

Initialize the ClassifyVideoZeroShot operator, load the CLIP model and processor, and validate system dependencies.

static validate_system() → None

Validate that the required system dependency (FFmpeg) is available.

gen_data() → dict[str, Any]

Generate output dict with prediction and probabilities.

Returns:

A dictionary containing:
  • prediction (str): Predicted label

  • probs (list): Label probabilities

Return type:

dict

analyze() → None

Analyze the video file and generate predictions.

Parameters:

fname (str) – Path to the video file

extract_frames() → list[PIL.Image.Image]

Extract I-frames from the video file using ffmpeg.

Parameters:

fname (str) – Path to the video file

Returns:

List of PIL Images

Return type:

list
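
For reference, extracting I-frames with FFmpeg can be done along the following lines. This is an illustrative sketch, not the operator's actual implementation; the select filter arguments and the temporary-directory handling are assumptions.

import glob
import os
import subprocess
import tempfile

from PIL import Image


def extract_iframes(fname: str) -> list[Image.Image]:
    # Illustrative only: dump I-frames to a temp dir, then load them as PIL Images.
    with tempfile.TemporaryDirectory() as tmpdir:
        pattern = os.path.join(tmpdir, "frame_%04d.png")
        # select=eq(pict_type\,I) keeps only I-frames; -vsync vfr drops duplicates.
        subprocess.run(
            ["ffmpeg", "-i", fname, "-vf", "select=eq(pict_type\\,I)",
             "-vsync", "vfr", pattern],
            check=True,
            capture_output=True,
        )
        frames = []
        for path in sorted(glob.glob(os.path.join(tmpdir, "frame_*.png"))):
            with Image.open(path) as img:
                frames.append(img.copy())  # copy so the files can go away with tmpdir
        return frames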

predict(images: list[PIL.Image.Image], labels: list[str]) → torch.Tensor

Run inference and get label probabilities using the pre-trained CLIP ViT-B/32 model.

Parameters:
  • images (list) – List of PIL Images

  • labels (list) – List of labels

Returns:

Probability distribution across labels

Return type:

torch.Tensor
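
A hedged sketch of the zero-shot scoring that predict() performs: encode frames and labels with CLIP, take logits_per_image, softmax per frame, and average across frames. The exact aggregation used by the operator is an assumption.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_label_probs(images: list[Image.Image], labels: list[str]) -> torch.Tensor:
    # CLIP scores every (frame, label) pair in a single forward pass.
    inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, num_labels)
    # Softmax per frame, then average across frames for one video-level distribution.
    return logits.softmax(dim=1).mean(dim=0)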

run(file: VideoFactory, labels: list[str], remove_after_processing: bool = False) → dict[str, Any]

Run the operator.

Parameters:
  • file (dict) – VideoFactory file object (must have a ‘path’ key)

  • labels (list) – List of labels

  • remove_after_processing (bool) – Whether to remove the file after processing

Returns:

A dictionary containing prediction and probabilities

Return type:

dict
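
For example, setting remove_after_processing=True deletes the file once classification finishes, which is handy when working from URLs. Hypothetical call, reusing operator and file from the usage example above:

result = operator.run(file, ["cat", "dog"], remove_after_processing=True)
print(result["prediction"], result["probs"])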

cleanup() → None

Clean up resources used by the operator.

state() → dict[str, Any]

Return the current state of the operator.

Returns:

State of the operator

Return type:

dict