# Text-to-Tracking: One Prompt, Full Video Analysis
Track any object in video with a single text prompt. No training, no bounding box annotation—just describe what you want to find.
What if you could analyze any video by just describing what you want to track?
"Player in white jersey." That's it. From that single prompt, we get:
- Every player tracked across hundreds of frames
- Persistent IDs that follow each person through the entire clip
- Heat maps showing where they spent time
- Speed and distance metrics
- Trajectories and motion trails
No training data. No manual bounding box annotation. No fine-tuning.
## The Demo
Here's a football play, raw vs. tracked:
The left side is the original broadcast footage. The right side shows what happens when you point SAM 3 at it with a text prompt.
Every colored box is a tracked player. The trails show their movement over the last 15 frames. The IDs persist—player #1 stays #1 from first frame to last.
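Those motion trails are just a bounded history of box centers per track ID. A minimal sketch in plain Python (the names are illustrative, not our actual renderer, which draws these points onto each frame):

```python
from collections import defaultdict, deque

TRAIL_LEN = 15  # frames of history shown per player

# One bounded buffer of (x, y) centers per tracker ID. Appending past
# maxlen silently drops the oldest point, so each trail is at most
# TRAIL_LEN frames long -- no manual pruning needed.
trails = defaultdict(lambda: deque(maxlen=TRAIL_LEN))

def update_trails(detections):
    """detections: iterable of (tracker_id, x, y) for the current frame."""
    for tracker_id, x, y in detections:
        trails[tracker_id].append((x, y))

# Feed 20 frames of one player drifting right; only the last 15 survive.
for frame in range(20):
    update_trails([(1, frame * 10, 100)])
```

A renderer then iterates each deque and draws a fading polyline per player.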
## How It Works
The magic comes from Segment Anything Model 3 (SAM 3), Meta's latest vision foundation model. It's the first model to unify image and video segmentation with text prompts: unlike traditional object detectors, which must be trained on fixed categories, SAM 3 understands natural-language descriptions.
The pipeline:
- Text prompt → SAM 3 segments matching objects in each frame
- ByteTrack → Maintains consistent IDs across frames
- Track stitching → Reconnects IDs when players temporarily disappear
- Interpolation → Smooths gaps for clean visualization
We're using fal.ai to run SAM 3—no GPU setup required, just an API call.
> **Tip:** fal.ai handles the heavy lifting. A 30-second clip processes in about 2 minutes via their API, costing roughly $0.15-0.30 depending on resolution.
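The track stitching and interpolation steps are plain bookkeeping once per-frame detections exist. A rough sketch under assumed thresholds (`max_gap` and `max_dist` are made-up values, and `Det` is a stand-in record, not ByteTrack's actual output type):

```python
from dataclasses import dataclass

@dataclass
class Det:
    frame: int
    track_id: int
    x: float
    y: float

def stitch(tracks, max_gap=30, max_dist=80.0):
    """Merge track fragments: if fragment B starts shortly after track A
    ends, near A's last position, relabel B with A's ID.
    tracks: dict of track_id -> list[Det] sorted by frame."""
    ids = sorted(tracks, key=lambda i: tracks[i][0].frame)
    merged = {}
    for tid in ids:
        frag = tracks[tid]
        target = None
        for mid, dets in merged.items():
            last = dets[-1]
            gap = frag[0].frame - last.frame
            dist = ((frag[0].x - last.x) ** 2 + (frag[0].y - last.y) ** 2) ** 0.5
            if 0 < gap <= max_gap and dist <= max_dist:
                target = mid
                break
        if target is None:
            merged[tid] = list(frag)
        else:
            for d in frag:
                d.track_id = target
            merged[target].extend(frag)
    return merged

def interpolate(dets):
    """Fill frame gaps in one track with linearly interpolated positions."""
    out = []
    for a, b in zip(dets, dets[1:]):
        out.append(a)
        for f in range(a.frame + 1, b.frame):
            t = (f - a.frame) / (b.frame - a.frame)
            out.append(Det(f, a.track_id,
                           a.x + t * (b.x - a.x),
                           a.y + t * (b.y - a.y)))
    out.append(dets[-1])
    return out
```

In practice the thresholds need tuning per sport and camera: too large a `max_dist` and two different players get fused into one ID; too small and a brief occlusion spawns a new ID.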
## What You Get
Beyond the tracked video, the system outputs structured data:
```json
{
  "frame_idx": 0,
  "tracker_id": 7,
  "team": "NE",
  "x": 427,
  "y": 201,
  "confidence": 0.71
}
```

From this, we can compute:
- Distance covered: Total yards each player ran
- Speed: Instantaneous and average, in MPH
- Heat maps: Position density over time
- Trajectories: Full movement paths
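Distance and speed fall out of frame-to-frame deltas once the records are grouped by `tracker_id`. A sketch, assuming 30 fps footage and a rough `yards_per_px` constant (as the limitations section notes, precise metrics need real field calibration):

```python
import math
from collections import defaultdict

def group_by_track(records):
    """records: dicts like {"frame_idx": 0, "tracker_id": 7, "x": 427, "y": 201}.
    Returns track_id -> sorted list of (frame_idx, x, y)."""
    tracks = defaultdict(list)
    for r in records:
        tracks[r["tracker_id"]].append((r["frame_idx"], r["x"], r["y"]))
    for pts in tracks.values():
        pts.sort()
    return tracks

def distance_and_speed(points, fps=30.0, yards_per_px=0.05):
    """points: sorted (frame_idx, x, y) for one player.
    Returns (total yards covered, instantaneous speeds in mph)."""
    total = 0.0
    speeds = []
    for (f0, x0, y0), (f1, x1, y1) in zip(points, points[1:]):
        d = math.hypot(x1 - x0, y1 - y0) * yards_per_px
        total += d
        dt = (f1 - f0) / fps
        speeds.append(d / dt * 3600 / 1760)  # yards/sec -> mph (1760 yd/mile)
    return total, speeds
```

Heat maps come from the same grouped positions: bin the (x, y) points into a 2D histogram and render it as a density overlay.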
## The Prompt Matters
Different prompts give different results:
| Prompt | What you get |
|---|---|
| "football player" | All players on field |
| "player in white jersey" | Just the home/away team |
| "player number 7" | Specific jersey number |
| "football" | The ball (harder—small and fast) |
| "helmet" | Every helmet visible |
The more specific, the more targeted the tracking.
## Limitations (Honest Take)
> **Not production-ready.** This is a weekend prototype. The approach is solid, but these caveats matter if you're building something real.
This isn't magic. Some things are hard:
Camera angle matters. Broadcast footage shows a slice of the field. Players enter and exit frame constantly. The tracker handles this well, but you can't analyze what you can't see.
Formation detection is tricky. We built rule-based formation classification (shotgun vs. I-formation, 4-3 vs. nickel), but it works best with All-22 overhead footage, not sideline broadcast angles. That's a research direction, not a solved problem.
Small objects are hard. Tracking the football specifically is challenging—it's small, fast, and often occluded. Player tracking is more reliable.
Speed/distance needs calibration. Converting pixel movement to yards requires knowing the camera's view of the field. We estimate, but precise metrics need proper field registration.
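Proper field registration usually means estimating a homography from image pixels to field coordinates using a few known landmarks (yard-line intersections, hash marks). A numpy sketch of the standard direct linear transform, not our actual calibration code:

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 homography mapping src -> dst from four (or more)
    point correspondences via the direct linear transform (DLT).
    src: pixel coordinates; dst: field coordinates (e.g. yards)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(rows, dtype=float)
    # The homography is the null vector of A: last row of V^T from the SVD.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so H[2,2] == 1

def pixels_to_yards(H, pts):
    """Apply the homography to a list of (x, y) pixel points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]  # divide out the projective scale
```

With a homography in hand, the pixel-space distance/speed estimates above become true on-field yards, independent of where the player is in the frame.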
## What's Next
This is a foundation. Some directions we're exploring:
- Route classification: "Show me every slant route from this game"
- Coverage recognition: What defense was the opponent running?
- Real-time processing: Live analysis during broadcasts
- Multi-sport: Soccer, basketball, hockey—same approach
The interesting part isn't the tracking itself—it's what becomes possible when tracking is this easy.
Built with SAM 3, ByteTrack, and fal.ai. Weekend project that actually worked.
Want to see this in action?
We build systems like this for clients. Let's talk.