In Praise of LLM Ping Pong

Adam Sturdee
May 17

There is a tidy line doing the rounds at the moment that large language models are making us stupid. Outsource your thinking, atrophy your mind, and so on. It is a comfortable thesis if you have not actually changed how you work in the last eighteen months. It also happens to be wrong.

The evidence I see, in classrooms and in my own work, points to something quite different. Used carelessly, LLMs flatten the output. Used well, they sharpen it. The variable is not the model. It is the operator. And the operators who are pulling away from everyone else are doing something specific. They are running models against each other.

I have come to call this LLM Ping Pong.

Two modes worth distinguishing

There are really two flavours of this, and it is useful to keep them separate.

Ping Pong is adversarial. You ask Model A for a draft. You hand it to Model B and ask it, in effect, to take the draft apart. Where is the reasoning thin? What did this miss? What would a sceptical reader say? You then return the critique to Model A, or to a third model, and ask for a revision. Two or three volleys in, you have something that is structurally better than any single model would have produced on the first attempt. The friction is the point.
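For readers who prefer to see the loop written down, here is a minimal sketch of Ping Pong in Python. It rests on assumptions of mine: a "model" is simply any function that takes a prompt and returns text, and the names ping_pong, drafter, critic and review are invented for illustration, not taken from any provider's API. The review hook is where the human sits, deciding which parts of the critique are fair before they go back for revision.

```python
from typing import Callable

# Assumption: a "model" is any function from prompt text to reply text.
# Swap in whichever client you actually use; nothing here is a real API.
Model = Callable[[str], str]


def ping_pong(
    brief: str,
    drafter: Model,
    critic: Model,
    volleys: int = 2,
    review: Callable[[str], str] = lambda critique: critique,
) -> str:
    """Draft with one model, critique with another, revise, and repeat."""
    draft = drafter(f"Write a draft for this brief:\n{brief}")
    for _ in range(volleys):
        critique = critic(
            "Take this draft apart. Where is the reasoning thin? "
            "What did it miss? What would a sceptical reader say?\n\n" + draft
        )
        # The human edits or trims the critique here (hypothetical hook).
        critique = review(critique)
        draft = drafter(
            "Revise the draft in light of the critique.\n\n"
            f"Draft:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft
```

With the default of two volleys, the drafter is called three times and the critic twice. Passing a third model as the reviser would need only a small change to the signature.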

Relay is collaborative. Model A drafts the strategy. Model B writes the prose. Model C tightens the structure. Each model contributes its strongest pass and hands the baton on. Less friction, more compounding. You use each model for what it is genuinely good at, rather than asking one to do everything.
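Relay is simpler still. The sketch below, again in Python and again under my own assumptions, treats each stage as an instruction paired with a model, where a model is any function from prompt text to reply text. The names relay and stages are illustrative only.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]


def relay(brief: str, stages: Sequence[Tuple[str, Model]]) -> str:
    """Pass the work through each (instruction, model) stage in order."""
    work = brief
    for instruction, model in stages:
        # Each model sees its own instruction plus the previous stage's output.
        work = model(f"{instruction}\n\n{work}")
    return work
```

A three-stage run might pair "Draft the strategy" with Model A, "Write the prose" with Model B, and "Tighten the structure" with Model C. Nothing stops you from reading the work between stages, and in practice that is where the judgement goes.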

Both work. The choice depends on whether you want a piece of thinking stress-tested or simply refined.

Why this is not cheating

The pushback I hear is that this is all just outsourcing. If three models did the work, did you really do anything?

Yes, and the work you did is the most important part. You set the brief. You judged each output. You decided which critique was fair and which was the model being pedantic. You chose what to keep, what to cut, what to send back for another pass.

This is taste. It is editorial judgement, applied at speed. It is the same skill that distinguishes a good editor from a typist, or a good headteacher from a passable one. The models did not give you taste. They gave you more raw material against which to exercise it.

And taste, it turns out, is not getting less valuable. It is getting more valuable. When everyone has access to fluent prose at zero marginal cost, the differentiator is the person who can tell which version of fluent prose is actually any good.

What this means for schools

There are three audiences in a school I would point this at.

Senior leaders drafting policy, strategy, or parental communications should be running their work through more than one model. Ask one to draft. Ask another to challenge it as a governor, a union rep, or a sceptical parent might. The result is calmer, sharper, and far less likely to need a clarifying email the next morning.

Teachers preparing resources, schemes of work, or feedback on pupil writing benefit from the same discipline. The first draft is the cheap part. The second pass, where you ask a different model to find the holes, is where the quality lifts.

Students are the group we owe the clearest teaching on this. The instinct in schools has been to ask whether they used AI. That is the wrong question and it cannot be policed. The right question is whether they exercised judgement. Did they read the output critically? Did they ask a second model to check the first? Did they form a view of which version was better, and why? That is a teachable skill, and it is the skill that will distinguish them in any field they enter.

We are not training pupils to compete with the models. We are training them to direct the models.

Ping Pong is the discipline we should be teaching.

A note for builders

There is also a quieter point in here for anyone building a product on top of frontier models.

If your entire offer is a thin wrapper around one provider, you are one product launch away from being bulldozed. The day that provider ships the feature you charge for, your moat is gone.

But if your product orchestrates multiple frontier models, putting them in Ping Pong or Relay against each other on behalf of the user, the picture changes. You are no longer competing with the models. You are using their competition for your customer’s benefit. That is a far harder thing for any single provider to replicate, because the value is not in the model. It is in the choreography.

This shapes how I think about the product I am building at STAR21, though it applies just as well to anyone building in this space. The interesting work is not in picking the best model. It is in arranging several so that the user gets the best of each.

The discipline underneath

None of this works without the human at the centre. The point of Ping Pong is not that the models do the thinking for you. It is that they give you more drafts, more angles, and more critiques to think with. Your job is to choose well.

That is a craft. It rewards practice. It rewards taste. And it is, quite obviously, not the death of human thought.

It is the most interesting time to be thinking in a very long time.

Adam Sturdee is a senior leader and co-founder of Starlight, the UK’s teacher-first AI-powered transcript-based coaching platform for educators.

His work sits at the intersection of dialogic practice, instructional leadership and responsible AI strategy for schools and trusts.

He will be presenting his research on AI-supported coaching at the BERA TEAN Conference 2026: https://www.bera.ac.uk/conference/bera-tean-conference-2026
