Playing 5D Chess with AI
Mar 23, 2026
4 min

The lesson from chess that security missed
In 1997, IBM's Deep Blue defeated Garry Kasparov in a six-game match. The moment was treated as a milestone in artificial intelligence, debated in academic circles, and then largely forgotten by everyone outside of computer science departments.
What happened next is more instructive than the match itself.
Today, Stockfish, a free, open-source chess engine, runs on commodity hardware and plays at a level that would have been considered impossible in 1997. A $50 Android phone running Stockfish defeats every human grandmaster, every time, without exception. Magnus Carlsen, the highest-rated player in history, would lose 100 games out of 100.
The interesting question is why.
It's not raw intelligence. Carlsen can evaluate positions, recognise patterns, and develop strategic plans across a 40-move game. His intuition, developed over decades, allows him to discard bad lines of play almost instantly. By any reasonable definition, he is brilliant.
But Stockfish evaluates millions of positions per second. It doesn't rely on intuition because it doesn't need to. It calculates every variation, assigns numerical evaluations, and selects optimal moves with mechanical consistency. It doesn't get fatigued in the endgame. It doesn't make psychological errors under pressure. It doesn't satisfice.
The gap between Carlsen and Stockfish isn't talent. It's operational persistence at computational scale.
How penetration testing actually works
The security industry has spent two decades refining penetration testing methodology.
PTES, OWASP, CREST, NIST. These frameworks represent genuine institutional knowledge about how to systematically evaluate an organisation's defensive posture. The practitioners who execute these engagements are often deeply skilled. Many have backgrounds in software development, network engineering, or intelligence work. Some have spent years studying specific vulnerability classes or exploitation techniques.
None of that is the problem.
The problem is operational economics.
A typical web application penetration test runs for one to two weeks. Within that window, a consultant must understand the application's business logic, map its attack surface, identify potential vulnerabilities, develop exploitation chains, document findings, and produce a report suitable for both technical and executive audiences.
Time constraints force prioritisation. Experienced testers develop heuristics: mental shortcuts that direct attention toward the most probable vulnerability classes given the technology stack, industry vertical, and apparent development maturity. These heuristics are effective. They're why good pentesters consistently find real issues.
But heuristics are, by definition, incomplete. They're approximations that trade comprehensiveness for speed. A consultant might check the 100 most likely attack paths based on their experience and the engagement timeline. They will not check all 10,000 possible paths. They cannot.
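To make the coverage gap concrete, here is a toy calculation, not a measurement: take the 100-of-10,000 figures from the example above and ask how likely a tester is to miss every exploitable path if path selection were uniform at random (real heuristics are better than random, but the path space they skip is the same size). The counts of exploitable paths are hypothetical.

```python
from math import comb

# Toy model (illustrative assumption, not data from any engagement):
# N possible attack paths, n examined within the engagement budget,
# k of them actually exploitable. If selection were uniform at random,
# the chance of missing all k paths is a hypergeometric tail.
N, n = 10_000, 100  # path space and engagement budget, from the text

for k in (1, 5, 20):  # hypothetical counts of exploitable paths
    p_miss = comb(N - k, n) / comb(N, n)
    print(f"k={k:>2}: P(miss every exploitable path) = {p_miss:.3f}")
```

With k=1 the miss probability is exactly (N - n) / N = 0.99; even with 20 exploitable paths in the environment, a random 1% sample misses all of them more often than not. Good heuristics shift where that 1% lands, but they cannot make it larger.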
When the engagement ends and the report ships, the client receives an accurate assessment of vulnerabilities that were discoverable within the constraints of human attention and billable hours.
That's the actual product: security validation at human scale.
The compounding problem of low-severity findings
Vulnerability scanners and manual assessments both produce findings classified by severity. Critical and high-severity issues get immediate attention. They're clear, exploitable, and carry obvious business risk. Medium and low-severity findings enter a backlog. Some get remediated. Most don't.
This is rational behaviour given limited resources. If you have 40 engineering hours allocated to security remediation this quarter, you spend them on the findings with the highest individual impact.
The problem is that severity classifications evaluate findings in isolation.
A verbose error message that discloses internal path information is low severity. A misconfigured S3 bucket with overly permissive read access is medium severity. An API endpoint with weak rate limiting is low severity. A session token implementation that doesn't properly expire on logout is low severity.
Individually, none of these findings would concern a reasonable security team. Collectively, chained in sequence by an attacker who has the time to explore their interactions, they constitute a path to domain compromise.
These chains exist in most environments. They remain undiscovered because discovering them requires the kind of exhaustive, systematic exploration that doesn't fit within engagement timelines or human attention spans. Finding one chain is possible. Finding all of them, consistently, across every environment, is not.
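The chain above can be sketched as a graph problem. This is a hedged illustration, not Aether's implementation: every access state and finding below is invented to mirror the four findings listed earlier, with each low- or medium-severity finding modelled as an edge between access states, so that discovering a chain becomes a path search.

```python
from collections import deque

# Hypothetical finding graph: each edge is (from_state, to_state, finding).
# States and findings are invented for illustration only.
EDGES = [
    ("external",       "internal_paths", "verbose error discloses internal paths"),
    ("internal_paths", "s3_read",        "misconfigured S3 bucket readable"),
    ("s3_read",        "valid_session",  "weak rate limiting permits credential testing"),
    ("valid_session",  "domain_admin",   "session token not expired on logout"),
]

def find_chain(start, goal):
    """Breadth-first search for a sequence of findings linking start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, chain = queue.popleft()
        if state == goal:
            return chain
        for src, dst, finding in EDGES:
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [finding]))
    return None  # no chain exists in this graph

chain = find_chain("external", "domain_admin")
for step in chain:
    print("->", step)
```

No single edge in this graph would rate above medium severity, yet the search returns a four-step path from unauthenticated access to domain compromise. The hard part in practice is not the search but enumerating the edges, which is exactly the exhaustive exploration that engagement timelines cut short.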
The asymmetry problem
Offensive security has always been characterised by asymmetry. Defenders must protect every asset, every endpoint, every authentication flow, every API, every third-party integration, every legacy system that somehow became load-bearing. Attackers need to find one viable path.
This asymmetry has historically been balanced by attacker resource constraints. Sophisticated attacks required sophisticated attackers: individuals or groups with technical depth, operational patience, and sustained motivation. The barrier to entry limited the threat population.
That balance is shifting.
The same foundation models and computational infrastructure available to security researchers are available to threat actors. The barrier to conducting systematic, automated reconnaissance against target organisations has dropped substantially. The ability to generate and test exploitation hypotheses at scale is no longer limited to nation-state intelligence services.
Organisations are defending against adversaries whose operational tempo is increasingly determined by compute availability rather than human attention.
The operational shift
We built Aether because we spent years conducting penetration tests and red team engagements, and we understood what we were leaving on the table.
Every engagement ended the same way. The report contained real findings. The client received genuine value. And somewhere in the environment, unexplored because we ran out of hours, were chains we didn't have time to find.
Aether is designed to operate at a different scale. Full-spectrum assessment across web applications, APIs, mobile applications, cloud infrastructure, wireless networks, internal network segments, and identity systems. Continuous operation rather than point-in-time snapshots. Systematic exploration of finding combinations rather than isolated severity classifications.
This doesn't eliminate the need for human expertise. Contextual judgment, business logic understanding, novel research, and adversary emulation all require human practitioners. But the exhaustive, systematic work of exploring every path and testing every chain benefits from computational persistence in the same way chess benefits from Stockfish.
The practitioners who adapt to this shift will operate differently. Less time on mechanical reconnaissance. More time on the strategic and creative work that humans do well. Findings that arrive pre-validated with exploitation proof rather than theoretical risk ratings.
The defensive side changes too. Security teams will operate with visibility into how their environment actually fails under sustained offensive pressure, not how it performs against a time-constrained sample.
What comes next
The comparison to chess is imperfect in one important respect. Chess is a closed system with fixed rules. Security is an open system where the rules change constantly. New technologies, new vulnerability classes, new attacker techniques, new defensive controls.
That openness means human expertise remains essential. Machines are very good at exhaustive search within defined parameters. Humans are very good at recognising when the parameters themselves need to change.
The future of offensive and defensive security operations will involve both. Computational persistence handling the scale problem. Human judgment handling the novelty problem.
Aether is our contribution to that future.
Your previous pentest wasn't bad. It was conducted by humans operating within human constraints. Those constraints no longer define the threat environment.
The question is whether they should continue to define your defensive posture.