Classify Video Zero Shot Operator
Description
The Classify Video Zero Shot operator classifies a video into user-provided labels using the CLIP ViT-B/32 model in a zero-shot fashion. It extracts I-frames from the video using FFmpeg, then uses the CLIP model to predict the most likely label for the video content.
Model Information
Model: CLIP ViT-B/32
Source: OpenAI, via HuggingFace Transformers
Usage: Zero-shot classification of video content by comparing extracted frame features to text label embeddings.
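As a rough sketch (independent of this operator) of what that comparison looks like with the Transformers API, where the frame path and labels are illustrative:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("frame.png")  # one extracted video frame (illustrative path)
labels = ["cat", "dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax().item()])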
System Dependencies
FFmpeg
- On Windows: download FFmpeg from ffmpeg.org and add it to your PATH, or run winget install ffmpeg from an elevated PowerShell.
- On Linux/macOS: install it via your package manager (e.g., sudo apt install ffmpeg).
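For reference, the kind of FFmpeg invocation used for I-frame extraction can be sketched from Python as follows (the operator's exact arguments may differ, and the paths are illustrative):
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "video.mp4",
        "-vf", "select='eq(pict_type,I)'",  # keep only intra-coded (I) frames
        "-vsync", "vfr",  # emit frames at a variable rate so none are duplicated
        "frames/frame_%04d.png",
    ],
    check=True,
)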
Operator Dependencies
feluda[video]
torch >= 2.6.0
transformers >= 4.51.1
pillow >= 11.1.0
How to Run the Tests
1. Ensure you are in the root directory of the feluda project.
2. Install dependencies (in your virtual environment):
   uv pip install "./operators/classify_video_zero_shot"
   uv pip install "feluda[dev]"
3. Ensure FFmpeg is installed and available in your PATH.
4. Run the tests:
   pytest operators/classify_video_zero_shot/test.py
Usage Example
from feluda.factory import VideoFactory
from feluda.operators import ClassifyVideoZeroShot
# Initialize the operator
operator = ClassifyVideoZeroShot()
# Load a video
video_url = (
"https://tattle-media.s3.amazonaws.com/test-data/tattle-search/cat_vid_2mb.mp4"
)
file = VideoFactory.make_from_url(video_url)
# Classify the video
labels = ["cat", "dog"]
result = operator.run(file, labels)
print(result)
Output
{"prediction": "cat", "probs": [0.9849101901054382, 0.015089876018464565]}
- class operators.classify_video_zero_shot.classify_video_zero_shot.ClassifyVideoZeroShot
Bases: Operator
Operator to classify a video into given labels using CLIP-ViT-B-32 and a zero-shot approach.
- __init__() → None
Initializes the ClassifyVideoZeroShot operator: loads the CLIP model and processor and validates system dependencies.
- static validate_system() → None
Validates that required system dependencies (FFmpeg) are available.
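One plausible way such a check can be implemented, shown as a sketch rather than the operator's actual code:
import shutil

def validate_system() -> None:
    # A sketch: fail fast if the ffmpeg binary is not discoverable on PATH
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH; install FFmpeg first")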
- gen_data() → dict[str, Any]
Generates the output dict with the prediction and label probabilities.
- Returns:
- A dictionary containing:
prediction (str): Predicted label
probs (list): Label probabilities
- Return type:
- dict[str, Any]
- analyze(fname: str) → None
Analyzes the video file and generates predictions.
- Parameters:
fname (str) – Path to the video file
- predict(images: list[PIL.Image.Image], labels: list[str]) → torch.Tensor
Runs inference and returns label probabilities using the pre-trained CLIP-ViT-B-32 model.
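How per-frame scores are pooled into a single video-level prediction is an implementation detail; a minimal sketch that mean-pools the per-frame probabilities (the function shape here is assumed, not the operator's actual code):
import torch

def predict_sketch(model, processor, images, labels):
    # Score every extracted frame against every label in one batch
    inputs = processor(text=labels, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_frames, num_labels)
    # Average the per-frame probabilities into one distribution for the video
    return logits.softmax(dim=-1).mean(dim=0)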