THE SCIENCE · 22 FEB 2026

Why Machines Can't Track Moving Objects

The Cocktail Party in Your Visual Field

You are looking at 8 identical dots on a screen. 4 of them flash briefly. Then all 8 start moving. Your job: keep track of the 4 that flashed.

This is Multiple Object Tracking (MOT), a paradigm developed by cognitive scientist Zenon Pylyshyn in the late 1980s. It sounds simple. It is not. And it exposes one of the deepest gaps between human cognition and artificial intelligence.

4-5: objects humans reliably track simultaneously

How Your Brain Does It

Pylyshyn proposed a theory called FINST, short for FINgers of INSTantiation. The idea: your visual system has a small number of spatial indexes, like invisible fingers pointing at objects in the world. These indexes stick to objects as they move, operating below conscious awareness.

You do not track by remembering positions. You do not track by predicting trajectories. Your visual system assigns a kind of proto-identity to each target — a "this one" tag that moves with the object through space and time.

This is not a serial process. You are not rapidly switching attention between targets like a spotlight. The indexes operate in parallel, each one bound to its target, each one updating continuously.

Your brain does not compute trajectories. It assigns identity. That is the difference.

How Machines Try (and Fail)

Computer vision approaches MOT as a frame-by-frame detection and association problem. The pipeline looks like this:

  1. Detect all objects in frame N
  2. Detect all objects in frame N+1
  3. Match detections across frames using position, velocity, appearance
  4. Repeat

This works reasonably well when objects are visually distinct. It breaks down catastrophically when objects are identical — which is exactly the condition that matters.
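The four steps above can be sketched in a few lines. This is a minimal illustration, not any particular library's tracker: it assumes detections arrive as (x, y) points, that every frame contains the same number of dots, and that matching uses position only (greedy nearest neighbour). The names `associate` and `run_tracker` are hypothetical.

```python
import math

def associate(tracks, detections):
    """Step 3: match existing tracks to new detections by position only.

    Greedy nearest-neighbour: each track, in turn, claims the closest
    detection that no earlier track has claimed.
    """
    assignment = {}
    unused = set(range(len(detections)))
    for track_id, position in tracks.items():
        best = min(unused, key=lambda i: math.dist(position, detections[i]))
        assignment[track_id] = best
        unused.remove(best)
    return assignment


def run_tracker(frames):
    """Steps 1-4: frames is a list of per-frame detection lists [(x, y), ...]."""
    tracks = dict(enumerate(frames[0]))   # frame 0 defines the identities
    history = [dict(tracks)]
    for detections in frames[1:]:
        assignment = associate(tracks, detections)
        tracks = {tid: detections[i] for tid, i in assignment.items()}
        history.append(dict(tracks))
    return history


# Two identical dots pass each other between frames.
# Index 0 is always the true dot A, index 1 the true dot B.
frames = [
    [(1, 0), (3, 1)],
    [(3, 0), (1, 1)],
]
history = run_tracker(frames)
print(history[-1])  # {0: (1, 1), 1: (3, 0)}: track 0 now follows dot B
```

In this toy case each dot lands closer to the other dot's old position than to its own, so the position-only matcher swaps their identities. With visually distinct objects an appearance cue could break the tie; with identical dots there is nothing else to match on.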

When 2 identical dots pass close to each other, the system faces an assignment ambiguity. Which dot in frame N+1 corresponds to which dot in frame N? With 2 dots crossing, there are 2 possible assignments. With 3 dots in proximity, there are 6. With 4, there are 24. For n dots the count is n!, and the combinatorics explode.
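The counts 2, 6, and 24 are the number of one-to-one mappings between the dots in frame N and the dots in frame N+1. A short sketch that enumerates them by brute force, purely for illustration (the helper name `candidate_assignments` is hypothetical):

```python
from itertools import permutations
from math import factorial

def candidate_assignments(n):
    """All one-to-one mappings from n dots in frame N to n dots in frame N+1.

    Each tuple p means: dot i in frame N is matched to dot p[i] in frame N+1.
    """
    return list(permutations(range(n)))


for n in (2, 3, 4):
    count = len(candidate_assignments(n))
    print(n, count)  # prints "2 2", "3 6", "4 24"
    assert count == factorial(n)
```

When the dots are identical and close together, nothing in a single pair of frames marks one of these mappings as the right one.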

2-3: objects AI reliably tracks through occlusion

Humans handle these crossings effortlessly. Your FINST indexes do not care about frame boundaries. They ride with the objects, maintaining identity through proximity events, partial occlusions, even brief disappearances. The tracking is continuous, not reconstructed.

The Biological Advantage

The gap is architectural. Computer vision systems process discrete frames. The human visual system processes a continuous stream. There is no "frame N" and "frame N+1" — there is only the unbroken flow of photons hitting your retina and the neural machinery that maintains object files across time.

This difference matters beyond dot-tracking. It explains why a goalkeeper can track a ball through a crowd of players. Why a parent can follow their child across a chaotic playground. Why a driver can maintain awareness of 4 nearby vehicles simultaneously.

These are not computational feats that require more processing power. They require a fundamentally different kind of processing — one that biological systems evolved over millions of years and that silicon has not yet replicated.

Where the Limit Lives

The human system is not unlimited. Performance degrades predictably:

  • Up to 4 targets: near-perfect accuracy
  • At 5 targets: accuracy drops to ~85%
  • At 6-7 targets: accuracy falls below 70%
  • Beyond 8: essentially random

The limit appears to be structural — not a matter of practice or training, but a constraint of the underlying neural architecture. Some researchers link it to the number of available spatial indexes. Others connect it to visual working memory capacity. The debate continues.

What is not debated: within that 4-5 object window, humans operate with a reliability and robustness that no artificial system has matched under equivalent conditions.

Test Your Own Limits

Motion Bind is THE VOID's implementation of the MOT paradigm. It starts with 1 target among distractors and scales to 7. Environmental mechanics — occlusion zones where dots disappear, merge events where dots temporarily converge — test the resilience of your tracking under stress.

Your FINST indexes are real. The question is how many you have and how well they hold under pressure. The void is watching.