Classify Video Zero Shot Operator

Description

The Classify Video Zero Shot operator classifies a video into user-provided labels using the CLIP ViT-B/32 model in a zero-shot fashion. It extracts I-frames from the video using FFmpeg, then uses the CLIP model to predict the most likely label for the video content.

Model Information

  • Model: CLIP ViT-B/32

  • Source: OpenAI, via HuggingFace Transformers

  • Usage: Zero-shot classification of video content by comparing extracted frame features to text label embeddings.
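
The snippet below is a minimal sketch of how this checkpoint is typically loaded through HuggingFace Transformers. The checkpoint id openai/clip-vit-base-patch32 is the standard public ViT-B/32 release; the operator's exact loading code may differ.

from transformers import CLIPModel, CLIPProcessor

# "openai/clip-vit-base-patch32" is the public ViT-B/32 checkpoint on the
# HuggingFace Hub; the operator loads an equivalent model and processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")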

System Dependencies

  • FFmpeg

    • On Windows: Download from ffmpeg.org and add to PATH, or use winget install ffmpeg from an elevated PowerShell.

    • On Linux/macOS: Install via your package manager (e.g., sudo apt install ffmpeg).
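
A quick way to confirm FFmpeg is reachable before using the operator, roughly what validate_system() has to establish. This is an illustrative check, not the operator's actual implementation.

import shutil
import subprocess

# Illustrative check: the operator needs the ffmpeg binary on PATH.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; install it before using this operator")

# Print the version banner to confirm the binary actually runs.
banner = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
print(banner.stdout.splitlines()[0])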

Operator Dependencies

  • feluda[video]

  • torch >= 2.6.0

  • transformers >= 4.51.1

  • pillow >= 11.1.0
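
To confirm that the minimum versions listed above are satisfied in your environment, a quick check looks like this (illustrative; note that Pillow is imported as PIL):

import PIL
import torch
import transformers

# Compare against the minimums listed above:
# torch >= 2.6.0, transformers >= 4.51.1, pillow >= 11.1.0
print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("pillow", PIL.__version__)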

How to Run the Tests

  1. Ensure you are in the root directory of the feluda project.

  2. Install dependencies (in your virtual environment):

    uv pip install "./operators/classify_video_zero_shot"
    uv pip install "feluda[dev]"
    
  3. Ensure FFmpeg is installed and available in your PATH.

  4. Run the tests:

    pytest operators/classify_video_zero_shot/test.py
    

Usage Example

from feluda.factory import VideoFactory
from feluda.operators import ClassifyVideoZeroShot

# Initialize the operator
operator = ClassifyVideoZeroShot()

# Load a video
video_url = (
    "https://tattle-media.s3.amazonaws.com/test-data/tattle-search/cat_vid_2mb.mp4"
)
file = VideoFactory.make_from_url(video_url)

# Classify the video
labels = ["cat", "dog"]
result = operator.run(file, labels)
print(result)

Output

{"prediction": "cat", "probs": [0.9849101901054382, 0.015089876018464565]}

API Reference

class operators.classify_video_zero_shot.classify_video_zero_shot.ClassifyVideoZeroShot

Bases: Operator

Operator to classify a video into given labels using CLIP-ViT-B-32 and a zero-shot approach.

__init__() → None

Initialize the ClassifyVideoZeroShot operator, load the CLIP model and processor, and validate system dependencies.

static validate_system() → None

Validate that the required system dependency (FFmpeg) is available.

gen_data() → dict[str, Any]

Generate output dict with prediction and probabilities.

Returns:

A dictionary containing:
  • prediction (str): Predicted label

  • probs (list): Label probabilities

Return type:

dict

analyze() → None

Analyze the video file and generate predictions.

Parameters:

fname (str) – Path to the video file

extract_frames() → list[PIL.Image.Image]

Extract I-frames from the video file using ffmpeg.

Parameters:

fname (str) – Path to the video file

Returns:

List of PIL Images

Return type:

list
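
For reference, extracting I-frames with FFmpeg can be done along the following lines. This is an illustrative sketch, not the operator's actual implementation; the select filter arguments and the temporary-directory handling are assumptions.

import glob
import os
import subprocess
import tempfile

from PIL import Image


def extract_iframes(fname: str) -> list[Image.Image]:
    # Illustrative only: dump I-frames to a temp dir, then load them as PIL Images.
    with tempfile.TemporaryDirectory() as tmpdir:
        pattern = os.path.join(tmpdir, "frame_%04d.png")
        # select=eq(pict_type\,I) keeps only I-frames; -vsync vfr drops duplicates.
        subprocess.run(
            ["ffmpeg", "-i", fname, "-vf", "select=eq(pict_type\\,I)",
             "-vsync", "vfr", pattern],
            check=True,
            capture_output=True,
        )
        frames = []
        for path in sorted(glob.glob(os.path.join(tmpdir, "frame_*.png"))):
            with Image.open(path) as img:
                frames.append(img.copy())  # copy so the files can go away with tmpdir
        return frames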

predict(images: list[PIL.Image.Image], labels: list[str]) → torch.Tensor

Run inference and get label probabilities using the pre-trained CLIP ViT-B/32 model.

Parameters:
  • images (list) – List of PIL Images

  • labels (list) – List of labels

Returns:

Probability distribution across labels

Return type:

torch.Tensor
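
A hedged sketch of the zero-shot scoring that predict() performs: encode frames and labels with CLIP, take logits_per_image, softmax per frame, and average across frames. The exact aggregation used by the operator is an assumption.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_label_probs(images: list[Image.Image], labels: list[str]) -> torch.Tensor:
    # CLIP scores every (frame, label) pair in a single forward pass.
    inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, num_labels)
    # Softmax per frame, then average across frames for one video-level distribution.
    return logits.softmax(dim=1).mean(dim=0)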

run(file: VideoFactory, labels: list[str], remove_after_processing: bool = False) → dict[str, Any]

Run the operator.

Parameters:
  • file (dict) – VideoFactory file object (must have a ‘path’ key)

  • labels (list) – List of labels

  • remove_after_processing (bool) – Whether to remove the file after processing

Returns:

A dictionary containing prediction and probabilities

Return type:

dict
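
For example, setting remove_after_processing=True deletes the file once classification finishes, which is handy when working from URLs. Hypothetical call, reusing operator and file from the usage example above:

result = operator.run(file, ["cat", "dog"], remove_after_processing=True)
print(result["prediction"], result["probs"])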

cleanup() → None

Clean up resources used by the operator.

state() → dict[str, Any]

Return the current state of the operator.

Returns:

State of the operator

Return type:

dict