Why Machines Can't Track Moving Objects
The Cocktail Party in Your Visual Field
You are looking at 8 identical dots on a screen. 4 of them flash briefly. Then all 8 start moving. Your job: keep track of the 4 that flashed.
This is Multiple Object Tracking (MOT), a paradigm developed by cognitive scientist Zenon Pylyshyn in the late 1980s. It sounds simple. It is not. And it exposes one of the deepest gaps between human cognition and artificial intelligence.
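To make the task concrete, here is a minimal sketch of how one trial could be generated and scored. Everything in it is an assumption for illustration: the function names `make_trial` and `score`, the unit-square screen, and the random-jitter motion (real experiments typically use smoother paths) are hypothetical and not taken from Pylyshyn's setup.

```python
import random


def make_trial(n_dots=8, n_targets=4, n_frames=100, speed=0.01, seed=0):
    """Build one hypothetical MOT trial on a unit-square screen.

    Returns the set of target indices and a list of frames, where each
    frame is a list of (x, y) positions, one per dot.
    """
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random()) for _ in range(n_dots)]
    # The dots that "flash" at the start are the ones to be tracked.
    targets = set(rng.sample(range(n_dots), n_targets))
    frames = [positions]
    for _ in range(n_frames):
        # Assumed motion model: small random jitter, clamped to the screen.
        positions = [
            (min(1.0, max(0.0, x + rng.uniform(-speed, speed))),
             min(1.0, max(0.0, y + rng.uniform(-speed, speed))))
            for x, y in positions
        ]
        frames.append(positions)
    return targets, frames


def score(targets, response):
    """Fraction of true targets that the observer picked at the end."""
    return len(targets & set(response)) / len(targets)


targets, frames = make_trial()
print(len(frames), "frames,", len(frames[0]), "dots,", len(targets), "targets")
```

Note that the dots carry no label in `frames` other than their list position, which is exactly the information a human observer does not get: on screen, all 8 look the same.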
How Your Brain Does It
Pylyshyn proposed a theory called FINST — Fingers of INSTantiation. The idea: your visual system has a small number of spatial indexes, like invisible fingers pointing at objects in the world. These indexes stick to objects as they move, operating below conscious awareness.
You do not track by remembering positions. You do not track by predicting trajectories. Your visual system assigns a kind of proto-identity to each target — a "this one" tag that moves with the object through space and time.
This is not a serial process. You are not rapidly switching attention between targets like a spotlight. The indexes operate in parallel, each one bound to its target, each one updating continuously.
Your brain does not compute trajectories. It assigns identity. That is the difference.
How Machines Try (and Fail)
Computer vision approaches MOT as a frame-by-frame detection and association problem. The pipeline looks like this:
- Detect all objects in frame N
- Detect all objects in frame N+1
- Match detections across frames using position, velocity, appearance
- Repeat
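The loop above can be sketched in a few lines. This is a simplified illustration, not any particular tracker's implementation: it assumes detections are already available as (x, y) points, uses a constant-velocity prediction, and matches greedily by distance. The names `associate` and `step` are hypothetical.

```python
from math import dist


def associate(tracks, detections):
    """Greedily match each track's predicted position to a detection.

    tracks: dict of track id -> (position, velocity), both (x, y) tuples.
    detections: list of (x, y) positions found in the next frame.
    Returns a dict of track id -> index into detections.
    """
    # Constant-velocity prediction: where each track should be next frame.
    predicted = {
        tid: (pos[0] + vel[0], pos[1] + vel[1])
        for tid, (pos, vel) in tracks.items()
    }
    # Every (track, detection) pair, closest pairs first.
    pairs = sorted(
        (dist(pred, det), tid, j)
        for tid, pred in predicted.items()
        for j, det in enumerate(detections)
    )
    assignment, used = {}, set()
    for _, tid, j in pairs:
        if tid not in assignment and j not in used:
            assignment[tid] = j
            used.add(j)
    return assignment


def step(tracks, detections):
    """Advance all tracks by one frame using the matched detections."""
    new_tracks = {}
    for tid, j in associate(tracks, detections).items():
        old_pos, _ = tracks[tid]
        new_pos = detections[j]
        velocity = (new_pos[0] - old_pos[0], new_pos[1] - old_pos[1])
        new_tracks[tid] = (new_pos, velocity)
    return new_tracks


# Two dots moving toward each other; detections arrive in arbitrary order.
tracks = {"a": ((0.0, 0.0), (1.0, 0.0)), "b": ((10.0, 0.0), (-1.0, 0.0))}
tracks = step(tracks, [(9.0, 0.0), (1.0, 0.0)])
print(tracks)
```

Identity here is never carried by the object itself. It is re-derived at every frame from position and velocity, which is why the scheme is only as good as those cues.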
This works reasonably well when objects are visually distinct. It breaks down catastrophically when objects are identical — which is exactly the condition that matters.
When 2 identical dots pass close to each other, the system faces an assignment ambiguity: which dot in frame N+1 corresponds to which dot in frame N? With 2 dots crossing, there are 2 possible assignments. With 3 dots in proximity, there are 6. With 4, there are 24. The count grows as n!, and because the dots are identical, appearance offers nothing to break the tie.
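A small sketch makes both points concrete: the factorial count, and the fact that position alone can leave two matchings exactly tied. The helper `candidate_assignments` and the example coordinates are hypothetical, chosen so that the two dots swing through a symmetric near-crossing.

```python
from itertools import permutations
from math import dist, factorial


def candidate_assignments(prev, curr):
    """List every one-to-one matching of prev dots to curr dots.

    Each entry is (perm, cost), where perm[i] is the index in curr
    assigned to prev[i] and cost is the total distance moved.
    Assumes prev and curr have the same length.
    """
    results = []
    for perm in permutations(range(len(curr))):
        cost = sum(dist(prev[i], curr[j]) for i, j in enumerate(perm))
        results.append((perm, cost))
    return results


# Number of candidate matchings for 2, 3, and 4 dots in proximity.
counts = {n: factorial(n) for n in (2, 3, 4)}
print(counts)  # {2: 2, 3: 6, 4: 24}

# Two identical dots, symmetric about the origin, one frame apart.
prev = [(-1.0, 0.0), (1.0, 0.0)]
curr = [(0.0, 1.0), (0.0, -1.0)]
for perm, cost in candidate_assignments(prev, curr):
    print(perm, round(cost, 3))
```

Both matchings come out with the same total distance, so a position-only matcher has no principled way to choose between them. Velocity or appearance cues can help in many cases, but identical dots remove the appearance cue by construction.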
Humans handle these crossings effortlessly. Your FINST indexes do not care about frame boundaries. They ride with the objects, maintaining identity through proximity events, partial occlusions, even brief disappearances. The tracking is continuous, not reconstructed.
The Biological Advantage
The gap is architectural. Computer vision systems process discrete frames. The human visual system processes a continuous stream. There is no "frame N" and "frame N+1" — there is only the unbroken flow of photons hitting your retina and the neural machinery that maintains object files across time.
This difference matters beyond dot-tracking. It explains why a goalkeeper can track a ball through a crowd of players. Why a parent can follow their child across a chaotic playground. Why a driver can maintain awareness of 4 nearby vehicles simultaneously.
These are not computational feats that require more processing power. They require a fundamentally different kind of processing — one that biological systems evolved over millions of years and that silicon has not yet replicated.
Where the Limit Lives
The human system is not unlimited. Performance degrades predictably:
- Up to 4 targets: near-perfect accuracy
- At 5 targets: accuracy drops to ~85%
- At 6-7 targets: accuracy falls below 70%
- Beyond 8: essentially random
The limit appears to be structural — not a matter of practice or training, but a constraint of the underlying neural architecture. Some researchers link it to the number of available spatial indexes. Others connect it to visual working memory capacity. The debate continues.
What is not debated: within that 4-5 object window, humans operate with a reliability and robustness that no artificial system has matched under equivalent conditions.
Test Your Own Limits
Motion Bind is THE VOID's implementation of the MOT paradigm. It starts with 1 target among distractors and scales to 7. Environmental mechanics (occlusion zones where dots disappear, merge events where dots temporarily converge) test the resilience of your tracking under stress.
Your FINST indexes are real. The question is how many you have and how well they hold under pressure. The void is watching.