Vorp Labs//Vision
November 25, 2025

Technical

Zero-shot player tracking with SAM3

How we built a vision system that tracks players across frames with persistent IDs, even through occlusions and camera cuts.

Phil Glazer, Founder
8 min read

In this piece, we walk through how we built a player tracking system using SAM3 and ByteTrack that can take a clip of NFL broadcast footage and track players through it.

Results

To give a sense of what the system generates, here's sample output showing the raw broadcast footage versus the tracking overlay:

Left: original broadcast footage. Right: SAM3-tracked output with persistent player IDs and motion trails.

Every colored box is a tracked player, with the trails showing each player's movement over the last 15 frames. Though there are improvements to be made, player IDs mostly persist coherently across the video (e.g., player #1 stays #1 from first frame to last).

  • 0 training images
  • 1 text prompt
  • ~85% ID persistence
  • <1¢ per second

Methodology Overview

The core of the system is built around SAM3 (Segment Anything Model 3)—Meta's foundation model for zero-shot image segmentation—accessed via fal.ai. Unlike traditional object detection that requires training on specific categories, SAM3 accepts natural language descriptions and segments matching objects in each frame.

Our pipeline:

  1. Text prompt → SAM3 segments matching objects in each frame
  2. ByteTrack → Helps maintain consistent IDs across frames
  3. Team classification → K-means clustering separates teams by jersey color
  4. Track stitching → Reconnects IDs when players temporarily disappear
  5. Interpolation → Smooths gaps for clean visualization
Video → Frame Extraction → SAM3 API → ByteTrack → Team Colors → Track Stitching → Interpolation → Tracked Video + JSON
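
To make the flow concrete, here's a minimal per-frame skeleton using the supervision library. segment_with_sam3 is a placeholder for our fal.ai wrapper, assumed to return an sv.Detections with boxes and confidences:

import supervision as sv

tracker = sv.ByteTrack()
for frame in sv.get_video_frames_generator("clip.mp4"):
    # Placeholder: wraps the SAM3 API call and converts masks to detections
    detections = segment_with_sam3(frame, prompt="football player")
    detections = tracker.update_with_detections(detections)
    # Downstream steps: team classification, stitching, interpolation, rendering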

Tip

SAM3 is open source—you can run it locally on a GPU or use it via API. We use fal.ai for the heavy lifting: a 30-second clip processes in about 2 minutes, costing roughly $0.15-0.30 depending on resolution.
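
As a rough sketch, a per-frame call through the fal.ai Python client might look like the following. The model ID, argument names, and response shape are assumptions for illustration; check fal.ai's SAM3 documentation for the exact schema.

import fal_client

result = fal_client.subscribe(
    "fal-ai/sam3",  # assumed model ID
    arguments={
        "image_url": frame_url,       # assumed argument name
        "prompt": "football player",  # the single text prompt
    },
)
# Assumed response shape: segmented instances with masks and boxes
detections = result["detections"]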

Implementation Details

While making calls to SAM3 to segment objects in a single frame is straightforward, real broadcast footage complicates things: the camera angle moves over time, players occlude each other, and players leave and re-enter the frame.

ByteTrack Tuning

Out-of-the-box ByteTrack parameters are tuned for surveillance footage. Sports broadcast footage is different: as mentioned above, players leaving and re-entering the frame (as well as disappearing behind other players) make consistent tracking difficult. After testing across several clips, we settled on these settings:

import supervision as sv

tracker = sv.ByteTrack(
    track_activation_threshold=0.30,  # lower than default, to catch partial occlusions
    lost_track_buffer=90,             # ~3 seconds at 30fps before dropping a track
    minimum_matching_threshold=0.55,  # more forgiving for fast motion
    frame_rate=effective_fps,
    minimum_consecutive_frames=3,     # prevent 1-frame junk IDs
)

The lost_track_buffer=90 is the key setting. Players regularly disappear for 1-2 seconds (behind other players, during camera cuts, or when leaving the frame), and the default 30-frame buffer can drop a player's ID too quickly.

Track Stitching

Even with a long buffer, ByteTrack occasionally fragments a single player into multiple IDs. Our track stitching pass merges them back:

import math

# If track A ends and track B starts nearby within a few frames,
# and the distance is physically plausible (a player can't teleport),
# they're the same person, so merge B into A.
def should_merge(track_a, track_b, max_px_per_frame=50):
    frame_gap = track_b.start_frame - track_a.end_frame
    distance = math.dist(track_a.end_pos, track_b.start_pos)
    max_plausible = max_px_per_frame * (frame_gap + 1)  # ~15 yards/sec at 1080p
    return distance <= max_plausible

This recovered about 15% of fragmented tracks in our test footage.
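
For completeness, here's a sketch of how a greedy stitching pass could apply that check across all tracks; the Track objects and merge helper are placeholders rather than our exact implementation:

# Sort by start frame, then merge the first plausible continuation of each track
tracks.sort(key=lambda t: t.start_frame)
for a in tracks:
    for b in tracks:
        gap = b.start_frame - a.end_frame
        if 0 < gap <= 30 and should_merge(a, b):
            merge(b, into=a)  # placeholder: reassigns B's detections to A's ID
            break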

Team Classification

We automatically separate teams by jersey color using K-means clustering on HSV color values from each player's jersey region:

from sklearn.cluster import KMeans
import numpy as np

# Extract jersey colors and cluster into 2 teams
# (extract_jersey_hsv is our helper that samples HSV pixels from the torso region)
hsv_colors = np.array([extract_jersey_hsv(frame, player) for player in detections])
kmeans = KMeans(n_clusters=2, n_init=10)
team_labels = kmeans.fit_predict(hsv_colors)

# Assign "light" to the cluster with the brighter (higher V) center
light_cluster = int(np.argmax(kmeans.cluster_centers_[:, 2]))
teams = ["light" if label == light_cluster else "dark" for label in team_labels]

This works reliably for any two-team matchup (white vs. green, red vs. blue, etc.).

Output Data

Beyond the tracked video, we output structured JSON for each detection:

{
  "frame_idx": 0,
  "tracker_id": 7,
  "team": "light",
  "x": 427,
  "y": 201,
  "confidence": 0.71
}

From this data, we can compute:

  • Distance covered — total yards each player ran
  • Speed — instantaneous and average, in MPH
  • Heat maps — position density over time
  • Trajectories — full movement paths
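
As an illustration, here's a minimal sketch that computes per-player distance in pixels from those records, assuming one JSON object per line in a tracks.jsonl file (converting pixels to yards needs the calibration discussed under Limitations):

import json
import math
from collections import defaultdict

# Group (x, y) positions by tracker_id in frame order
with open("tracks.jsonl") as f:
    records = sorted((json.loads(line) for line in f), key=lambda r: r["frame_idx"])
paths = defaultdict(list)
for r in records:
    paths[r["tracker_id"]].append((r["x"], r["y"]))

# Sum consecutive step lengths per player (in pixels)
for player_id, points in paths.items():
    pixels = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    print(player_id, round(pixels))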

Limitations

Early-stage

This is a working prototype. While the initial outputs are promising, moving to a production system will require further adjustments for reliability.

The approach has clear limitations. Some are inherent to the technology, others are specific to sports footage:

Camera angle matters. Broadcast footage shows only a slice of the field, and players enter and exit the frame constantly. The tracker handles this well, but it can't analyze what it can't see. Depending on the goal of the analysis, this can be a significant limitation.

Small objects are hard. We experimented with clips where the football itself is in frame, but tracking it is challenging: the ball is small, moves quickly, and is often occluded. Player tracking is more reliable.

Speed/distance needs calibration. Converting pixel movement to yards requires knowing the camera's view of the field. We've made attempts to approximate this, but more reliable results will require additional approaches and probably manual calibration per clip.
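
To illustrate what per-clip manual calibration could look like, here's a sketch using OpenCV's perspective transform: pick four pixel points whose field positions are known (yard-line intersections, say) and map tracked positions into yard space. The coordinates below are made up for illustration.

import numpy as np
import cv2

# Four pixel points with known field positions (illustrative values only)
px = np.float32([[312, 540], [1610, 560], [1480, 230], [420, 215]])
field_yd = np.float32([[20, 0], [40, 0], [40, 53.3], [20, 53.3]])
H = cv2.getPerspectiveTransform(px, field_yd)

# Map a tracked player's pixel position into field coordinates (yards)
player_px = np.float32([[[427, 201]]])
player_yd = cv2.perspectiveTransform(player_px, H)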

Future Outlook

While this is still a prototype, we're actively building on the foundation:

  • Multi-sport expansion: Many of the same foundational approaches appear to apply to other sports like soccer, basketball, and hockey. We're testing across sports now.
  • Footage to analytics: The real value isn't the tracked video; it's turning any game film into structured player data. Distance, speed, positioning, all exportable.
  • Looking for partners: If you have footage and want to explore what's possible, we'd love to work together.

Have footage you want to analyze? Send us a 30-second clip and we'll show you what we can extract: player positions, movement trails, team assignments, all as structured JSON. On us.

Interested in exploring this further?

We're looking for early partners to push these ideas forward.