Teaching Chess with Language Models

I recently built a chess learning platform as part of my journey toward an Expert-level FIDE rating. The tool (entirely open-source and local, meant for personal use) is fairly straightforward: sync your games from Chess.com, run them through an engine, and identify your recurring mistakes. While working on this project I came across an interesting problem: getting an LLM to explain why a particular chess move is good or bad.

Modern engines (Stockfish, AlphaZero, Leela, etc.) have essentially solved chess evaluation; they will tell you the ~best move in any position. As a semi-lazy but decent human player (~2000 rating), I find engines useful, but not as useful as a conceptual explanation. An eval line alone doesn't teach a student which strategy suits a given situation or why a move is favorable. Chess books offer some of the guidance I seek, but they aren't personalized and are hard to apply without dedicated study, and I'm not interested in finding a coach at this time. On the surface this looks like a pure language problem, which LLMs should be able to tackle.

The wrong tool for the job

After doing some digging, I found broad agreement that LLMs cannot play chess well. Frontier models routinely make illegal moves and fail to correctly assess even basic positions. Benchmarks from 2025 found o3 making illegal moves in 87.5% of games and o4-mini in 94.3% — worse than GPT-3.5-turbo-instruct on raw legality. Larger thinking budgets don't close the gap, suggesting the limitation is not simply a matter of compute.

The failure doesn't appear to be architectural either. Research by Adam Karvonen shows that a 50M-parameter GPT trained on human games (from Lichess) can reach ~1500 Elo and develop a coherent internal representation of the board state. A transformer can learn chess when it is trained for it; general-purpose models simply aren't.

But using an LLM to play chess strikes me as the wrong framing to begin with, given that evaluation is already "solved". The interesting question isn't whether LLMs can compete with purpose-built engines (they can't) but whether they can do something that engines are incapable of: explaining a position in terms which a student can learn from.

Why it's still hard

If LLMs are bad at playing chess, the obvious solution is to pair the model with an engine. In the simplest form, you provide the engine's top line to the model and ask it to explain. This fails in a specific way: the model receives the what but not the how. It may know that e4 is +0.3 and which moves follow, but it has no visibility into the branches the engine explored to reach that conclusion. It also has no sense of what state each continuation leads to, and no framework for translating an eval score into a concept a student can act on.

A naive input representation can compound this. Raw PGN and FEN notation are not natural inputs for a language model. They force the model to do significant inferential work to reconstruct what's on the board (e.g. whether a piece is pinned or a mating net is being set up) before it can actually reason about the position. This is a different failure mode from "the model doesn't know chess."

Bridging the gap

A chess-trained model learns to build a board representation on its own, whereas an off-the-shelf model needs us to hand it one. In my own exploration, these techniques noticeably improved how accurately LLMs reasoned about positions:

Supplement FEN with explicit piece lists. Rather than asking the model to parse FEN notation, I generated a plaintext piece list (White: King(e1), Queen(d1)...), annotated each move with from→to squares, and labeled captures with the piece taken (e.g. cxd3 (×Rook)). The model never has to infer what's on a square.
Pass multiple evaluations as ground truth. The LLM explains what the engine found rather than exploring or evaluating for itself. Providing additional lines can also help the model understand some of the tensions in the position.
Inject strategic principles. I drew on Karpov's Find the Right Plan to give the model some specific concepts to seek and call out.

Each of these either translates the position into a language-native form, or offloads the part the model can't do reliably, before asking the model to reason about it.
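
To make the first technique concrete, here is a minimal sketch in plain Python. It is not the tool's actual code: the function names (board_from_fen, piece_list, annotate_move) and the exact output format are assumptions made for illustration, and it deliberately ignores en passant, where the captured pawn is not on the destination square.

```python
PIECE_NAMES = {"k": "King", "q": "Queen", "r": "Rook",
               "b": "Bishop", "n": "Knight", "p": "Pawn"}
ORDER = "kqrbnp"  # list the most important pieces first


def board_from_fen(fen):
    """Map each occupied square (e.g. 'e1') to its FEN piece letter."""
    placement = fen.split()[0]          # first FEN field is piece placement
    board = {}
    for rank_index, row in enumerate(placement.split("/")):
        rank = 8 - rank_index           # FEN lists rank 8 first
        file_index = 0
        for ch in row:
            if ch.isdigit():
                file_index += int(ch)   # a digit is a run of empty squares
            else:
                board["abcdefgh"[file_index] + str(rank)] = ch
                file_index += 1
    return board


def piece_list(board):
    """Render the board as 'White: King(e1), ...' / 'Black: King(e8), ...'."""
    lines = []
    for side, is_white in (("White", True), ("Black", False)):
        own = [(sq, p) for sq, p in board.items() if p.isupper() == is_white]
        own.sort(key=lambda item: (ORDER.index(item[1].lower()), item[0]))
        pieces = ", ".join(f"{PIECE_NAMES[p.lower()]}({sq})" for sq, p in own)
        lines.append(f"{side}: {pieces}")
    return "\n".join(lines)


def annotate_move(board, san, uci):
    """Add from→to squares and the captured piece (if any) to a SAN move.

    `board` is the position BEFORE the move; `uci` is e.g. 'c4d3'.
    """
    src, dst = uci[:2], uci[2:4]
    note = f"{san} [{src}→{dst}]"
    captured = board.get(dst)
    if captured:
        note += f" (×{PIECE_NAMES[captured.lower()]})"
    return note


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(piece_list(board_from_fen(START)))
```

The point of the sketch is that everything the model would otherwise have to infer from the FEN string (which piece sits where, what a capture removed) is computed deterministically and written out in words before the prompt is assembled.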

Taking it further

The most relevant work I came across is Tang et al. (2026), which formalizes a similar approach into a training pipeline. Their core insight matches what prompt engineering approximates: Stockfish knows the correct answer but can't articulate it; LLMs can articulate but can't evaluate. Their solution combines the two: Stockfish provides ground-truth principal variations, and an LLM generates natural-language explanations using Feigned Discovery Prompting (reasoning as if the answer were unknown, rather than justifying it post hoc). Feigned discovery forces the model to reason through its options until it arrives at the optimal move, which yields a thorough, grounded chain of thought instead of one that works backwards from the solution and rationalizes incorrectly.
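
As a rough illustration of the difference between post-hoc justification and feigned discovery, here is a sketch of two prompt templates built over the same engine output. The wording of both templates is my own guess based on the description above, not the prompts used by Tang et al., and the helper names (format_lines, build_prompt) and the example evaluations are hypothetical.

```python
def format_lines(lines):
    """Render engine lines as text.

    `lines` is a list of (eval_in_pawns, [SAN moves]), best line first.
    """
    out = []
    for rank, (score, moves) in enumerate(lines, start=1):
        out.append(f"Line {rank} ({score:+.2f}): {' '.join(moves)}")
    return "\n".join(out)


# Post-hoc justification: the answer is stated up front and the model
# is asked to rationalize it.
POST_HOC = (
    "Position:\n{position}\n\n"
    "Engine lines:\n{lines}\n\n"
    "The best move is {best}. Explain why it is best."
)

# Feigned discovery (hypothetical wording): the engine lines are still the
# ground truth, but the explanation must read as a search that ends at the
# best move.
FEIGNED_DISCOVERY = (
    "Position:\n{position}\n\n"
    "Reference analysis (ground truth, do not contradict it):\n{lines}\n\n"
    "Write your reasoning as if you had not seen the reference analysis: "
    "list the candidate moves a strong player would consider, work through "
    "what each one leads to, and only then settle on {best}. "
    "Do not mention the engine."
)


def build_prompt(template, position, lines):
    """Fill a template; the best move is the first move of the top line."""
    best = lines[0][1][0]
    return template.format(position=position,
                           lines=format_lines(lines),
                           best=best)


# Illustrative values only, not real engine output.
example_lines = [(0.3, ["e4", "e5", "Nf3"]), (0.2, ["d4", "d5", "c4"])]
prompt = build_prompt(FEIGNED_DISCOVERY, "White: King(e1), ...", example_lines)
print(prompt)
```

Both templates pass multiple engine lines as ground truth; only the instruction about how to narrate the reasoning changes.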

Their results were impressive but limited to puzzles, which have a single correct continuation. Real game positions remain a harder problem: there may be no immediate tactical objective to pursue, and the optimal strategy may not yet be clear.

I plan on incorporating some of these ideas in the next iteration of the tool. If you're interested in collaborating, feel free to reach out.