# Text-to-Tracking: One Prompt, Full Video Analysis
Track any object in video with a single text prompt. No training, no bounding box annotation—just describe what you want to find.
What if you could analyze any video by just describing what you want to track?
"Player in white jersey." That's it. From that single prompt, we get:
- Every player tracked across hundreds of frames
- Persistent IDs that follow each person through the entire clip
- Heat maps showing where they spent time
- Speed and distance metrics
- Trajectories and motion trails
No training data. No manual bounding box annotation. No fine-tuning.
## The Demo
Here's a football play, raw vs. tracked:
The left side is the original broadcast footage. The right side shows what happens when you point SAM 3 at it with a text prompt.
Every colored box is a tracked player. The trails show their movement over the last 15 frames. The IDs persist—player #1 stays #1 from first frame to last.
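Those motion trails are just a bounded history of box centers per track ID. A minimal sketch in plain Python (the names are illustrative, not our actual renderer, which draws these points onto each frame):

```python
from collections import defaultdict, deque

TRAIL_LEN = 15  # frames of history shown per player

# One bounded buffer of (x, y) centers per tracker ID. Appending past
# maxlen silently drops the oldest point, so each trail is at most
# TRAIL_LEN frames long -- no manual pruning needed.
trails = defaultdict(lambda: deque(maxlen=TRAIL_LEN))

def update_trails(detections):
    """detections: iterable of (tracker_id, x, y) for the current frame."""
    for tracker_id, x, y in detections:
        trails[tracker_id].append((x, y))

# Feed 20 frames of one player drifting right; only the last 15 survive.
for frame in range(20):
    update_trails([(1, frame * 10, 100)])
```

A renderer then iterates each deque and draws a fading polyline per player.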
## How It Works
The magic comes from Segment Anything Model 3 (SAM 3), Meta's latest vision foundation model. It's the first model to unify image and video segmentation with text prompts: unlike traditional object detectors, which must be trained on fixed categories, SAM 3 understands natural-language descriptions.
The pipeline:
- Text prompt → SAM 3 segments matching objects in each frame
- ByteTrack → Maintains consistent IDs across frames
- Track stitching → Reconnects IDs when players temporarily disappear
- Interpolation → Smooths gaps for clean visualization
We're using fal.ai to run SAM 3—no GPU setup required, just an API call.
> **Tip:** fal.ai handles the heavy lifting. A 30-second clip processes in about 2 minutes via their API, costing roughly $0.15-0.30 depending on resolution.
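The track stitching and interpolation steps are plain bookkeeping once per-frame detections exist. A rough sketch under assumed thresholds (`max_gap` and `max_dist` are made-up values, and `Det` is a stand-in record, not ByteTrack's actual output type):

```python
from dataclasses import dataclass

@dataclass
class Det:
    frame: int
    track_id: int
    x: float
    y: float

def stitch(tracks, max_gap=30, max_dist=80.0):
    """Merge track fragments: if fragment B starts shortly after track A
    ends, near A's last position, relabel B with A's ID.
    tracks: dict of track_id -> list[Det] sorted by frame."""
    ids = sorted(tracks, key=lambda i: tracks[i][0].frame)
    merged = {}
    for tid in ids:
        frag = tracks[tid]
        target = None
        for mid, dets in merged.items():
            last = dets[-1]
            gap = frag[0].frame - last.frame
            dist = ((frag[0].x - last.x) ** 2 + (frag[0].y - last.y) ** 2) ** 0.5
            if 0 < gap <= max_gap and dist <= max_dist:
                target = mid
                break
        if target is None:
            merged[tid] = list(frag)
        else:
            for d in frag:
                d.track_id = target
            merged[target].extend(frag)
    return merged

def interpolate(dets):
    """Fill frame gaps in one track with linearly interpolated positions."""
    out = []
    for a, b in zip(dets, dets[1:]):
        out.append(a)
        for f in range(a.frame + 1, b.frame):
            t = (f - a.frame) / (b.frame - a.frame)
            out.append(Det(f, a.track_id,
                           a.x + t * (b.x - a.x),
                           a.y + t * (b.y - a.y)))
    out.append(dets[-1])
    return out
```

In practice the thresholds need tuning per sport and camera: too large a `max_dist` and two different players get fused into one ID; too small and a brief occlusion spawns a new ID.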
## What You Get
Beyond the tracked video, the system outputs structured data:
```json
{
  "frame_idx": 0,
  "tracker_id": 7,
  "team": "NE",
  "x": 427,
  "y": 201,
  "confidence": 0.71
}
```

From this, we can compute:
- Distance covered: Total yards each player ran
- Speed: Instantaneous and average, in MPH
- Heat maps: Position density over time
- Trajectories: Full movement paths
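Distance and speed fall out of frame-to-frame deltas once the records are grouped by `tracker_id`. A sketch, assuming 30 fps footage and a rough `yards_per_px` constant (as the limitations section notes, precise metrics need real field calibration):

```python
import math
from collections import defaultdict

def group_by_track(records):
    """records: dicts like {"frame_idx": 0, "tracker_id": 7, "x": 427, "y": 201}.
    Returns track_id -> sorted list of (frame_idx, x, y)."""
    tracks = defaultdict(list)
    for r in records:
        tracks[r["tracker_id"]].append((r["frame_idx"], r["x"], r["y"]))
    for pts in tracks.values():
        pts.sort()
    return tracks

def distance_and_speed(points, fps=30.0, yards_per_px=0.05):
    """points: sorted (frame_idx, x, y) for one player.
    Returns (total yards covered, instantaneous speeds in mph)."""
    total = 0.0
    speeds = []
    for (f0, x0, y0), (f1, x1, y1) in zip(points, points[1:]):
        d = math.hypot(x1 - x0, y1 - y0) * yards_per_px
        total += d
        dt = (f1 - f0) / fps
        speeds.append(d / dt * 3600 / 1760)  # yards/sec -> mph (1760 yd/mile)
    return total, speeds
```

Heat maps come from the same grouped positions: bin the (x, y) points into a 2D histogram and render it as a density overlay.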
## The Prompt Matters
Different prompts give different results:
| Prompt | What you get |
|---|---|
| "football player" | All players on field |
| "player in white jersey" | Just the home/away team |
| "player number 7" | Specific jersey number |
| "football" | The ball (harder—small and fast) |
| "helmet" | Every helmet visible |
The more specific, the more targeted the tracking.
## Limitations (Honest Take)
> **Not production-ready.** This is a weekend prototype. The approach is solid, but these caveats matter if you're building something real.
This isn't magic. Some things are hard:
Camera angle matters. Broadcast footage shows a slice of the field. Players enter and exit frame constantly. The tracker handles this well, but you can't analyze what you can't see.
Formation detection is tricky. We built rule-based formation classification (shotgun vs. I-formation, 4-3 vs. nickel), but it works best with All-22 overhead footage, not sideline broadcast angles. That's a research direction, not a solved problem.
Small objects are hard. Tracking the football specifically is challenging—it's small, fast, and often occluded. Player tracking is more reliable.
Speed/distance needs calibration. Converting pixel movement to yards requires knowing the camera's view of the field. We estimate, but precise metrics need proper field registration.
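Proper field registration usually means estimating a homography from image pixels to field coordinates using a few known landmarks (yard-line intersections, hash marks). A numpy sketch of the standard direct linear transform, not our actual calibration code:

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 homography mapping src -> dst from four (or more)
    point correspondences via the direct linear transform (DLT).
    src: pixel coordinates; dst: field coordinates (e.g. yards)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(rows, dtype=float)
    # The homography is the null vector of A: last row of V^T from the SVD.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so H[2,2] == 1

def pixels_to_yards(H, pts):
    """Apply the homography to a list of (x, y) pixel points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]  # divide out the projective scale
```

With a homography in hand, the pixel-space distance/speed estimates above become true on-field yards, independent of where the player is in the frame.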
## What's Next
This is a foundation. Some directions we're exploring:
- Route classification: "Show me every slant route from this game"
- Coverage recognition: What defense was the opponent running?
- Real-time processing: Live analysis during broadcasts
- Multi-sport: Soccer, basketball, hockey—same approach
The interesting part isn't the tracking itself—it's what becomes possible when tracking is this easy.
Built with SAM 3, ByteTrack, and fal.ai. Weekend project that actually worked.
Want to see this in action?
We build systems like this for clients. Let's talk.