Single best answer (SBA) questions are the backbone of written medical examinations. They are reliable to mark, hard to game, and well-suited to testing clinical reasoning across a broad curriculum. They are also expensive and slow to produce. Writing a good SBA requires a subject matter expert, familiarity with the examination format, careful calibration of the distractors, and usually several rounds of review before anything goes near a candidate.

The primary FRCA MCQ paper sits at the harder end of this problem. The Royal College of Anaesthetists (RCoA) releases only a small bank of reference questions, and a cottage industry of question bank subscriptions has grown up in the gap between those and the thousands of practice questions trainees actually need.

This seemed like an obvious place to ask whether large language models could help — not to answer FRCA questions, which others have already looked at,1 but to write them.


Minerva is a command-line tool I wrote in Python to explore this.2 It uses a combination of retrieval-augmented generation (RAG), prompt engineering, and few-shot learning against a corpus of anaesthetic reference material to produce SBAs in the style and format of the primary FRCA MCQ. The RAG component pulls relevant facts from the knowledge base before each question is generated, grounding the output in the actual curriculum rather than the model’s general training data. Few-shot examples of RCoA-style questions help shape the format.
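To make the moving parts concrete, here is a minimal sketch of how a RAG plus few-shot generation loop can be wired together. It is illustrative only: the TF-IDF retriever, the prompt text, and the model choice are stand-ins for whatever Minerva actually uses, not its implementation.

```python
# Illustrative sketch only: a TF-IDF retriever stands in for whatever retrieval
# Minerva actually uses, and the prompt text and model choice are assumptions.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Knowledge base: one string per chunk of anaesthetic reference material.
CORPUS = [
    "Sevoflurane has a blood:gas partition coefficient of approximately 0.65 ...",
    "The context-sensitive half-time of remifentanil is around 3-4 minutes ...",
]

FEW_SHOT_EXAMPLES = """\
Example SBA (RCoA style):
Question: ...
A) ...  B) ...  C) ...  D) ...  E) ...
Correct answer: ...
"""

vectoriser = TfidfVectorizer().fit(CORPUS)
corpus_matrix = vectoriser.transform(CORPUS)

def retrieve_context(topic: str, k: int = 3) -> str:
    """Return the k corpus chunks most similar to the topic."""
    scores = cosine_similarity(vectoriser.transform([topic]), corpus_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return "\n\n".join(CORPUS[i] for i in top)

def generate_sba(topic: str) -> str:
    """Ask the model for one SBA, grounded in retrieved curriculum text."""
    prompt = (
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Reference material:\n{retrieve_context(topic)}\n\n"
        f"Write one single best answer question on '{topic}' in the primary "
        f"FRCA style, with five options, one correct answer, and a short explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption, not Minerva's documented setting
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_sba("pharmacokinetics of volatile agents"))
```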

The system was evaluated with a simple blinded study. Ten reference questions from the RCoA sample bank were combined with twenty topic-matched questions generated by Minerva and assembled into a single quiz, which was delivered to eight anaesthetists with a range of seniority. Participants rated each question on clarity, relevance, and difficulty using five-point Likert scales, and were asked to identify whether they thought each question had been written by a human or an LLM.
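The assembly step is simple enough to show in full. The sketch below interleaves the two sources at random and keeps the provenance in a separate answer key for unblinding; the field names and file layout are illustrative assumptions, not the actual study materials.

```python
# Sketch of blinded quiz assembly: shuffle the pooled questions and keep the
# source labels in a separate key file. Structure and names are illustrative.
import csv
import random

rcoa_questions = [{"id": f"R{i}", "text": "..."} for i in range(10)]     # reference items
minerva_questions = [{"id": f"M{i}", "text": "..."} for i in range(20)]  # generated items

pool = [(q, "RCoA") for q in rcoa_questions] + [(q, "LLM") for q in minerva_questions]
random.shuffle(pool)

with open("quiz.csv", "w", newline="") as quiz, open("key.csv", "w", newline="") as key:
    quiz_writer, key_writer = csv.writer(quiz), csv.writer(key)
    quiz_writer.writerow(["position", "question"])
    key_writer.writerow(["position", "id", "source"])
    for position, (question, source) in enumerate(pool, start=1):
        quiz_writer.writerow([position, question["text"]])       # what participants see
        key_writer.writerow([position, question["id"], source])  # held back for analysis
```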

Poster presented at the Association of Anaesthetists Resident Doctor Conference, June 2025

The results were more definitive than I expected.

On every metric — clarity, relevance, and difficulty — the LLM-generated questions demonstrated statistically significant equivalence to the RCoA reference items. Clarity scores were 4.26 ± 0.79 for the RCoA questions and 4.06 ± 0.88 for the Minerva questions (p < 0.01 for equivalence). Relevance and difficulty followed the same pattern. The correct answer rate was 63.8% for the RCoA questions and 56.9% for the LLM questions, which is roughly where you want a good SBA to sit.
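The abstract does not spell out which equivalence procedure was used; purely as an illustration, a two one-sided tests (TOST) comparison of clarity ratings might look like the sketch below, where the simulated responses and the ±0.5-point equivalence margin are assumptions rather than the study's actual analysis.

```python
# Illustration only: TOST equivalence test on simulated Likert clarity ratings.
# The ±0.5-point margin and the simulated data are assumptions; they do not
# reproduce the study's raw responses or its actual statistical method.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)
rcoa_clarity = np.clip(rng.normal(4.26, 0.79, 10 * 8), 1, 5)     # 10 questions x 8 raters
minerva_clarity = np.clip(rng.normal(4.06, 0.88, 20 * 8), 1, 5)  # 20 questions x 8 raters

# Null hypothesis: the true mean difference lies outside ±0.5 points,
# so a small p-value supports equivalence within that margin.
p_value, lower, upper = ttost_ind(rcoa_clarity, minerva_clarity, -0.5, 0.5)
print(f"TOST p-value for equivalence within ±0.5 points: {p_value:.4f}")
```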

More striking was the source identification data. Participants were no better than chance at identifying which questions were LLM-generated. When assessors cannot reliably tell the two sets apart, it is hard to argue there is a meaningful quality difference between them.
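Testing that claim amounts to a one-line binomial test against a 50% chance rate; the counts below are hypothetical stand-ins, since the raw identification data are not reproduced here.

```python
# Hedged illustration: is source identification better than chance?
# The counts are invented for the example, not the study's actual figures.
from scipy.stats import binomtest

correct_identifications = 126   # hypothetical number of correct source calls
total_judgements = 240          # 30 questions x 8 participants

result = binomtest(correct_identifications, total_judgements, p=0.5)
print(f"Observed accuracy: {correct_identifications / total_judgements:.1%}, "
      f"p = {result.pvalue:.3f} vs. chance")
```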

These results were presented as a poster at the Association of Anaesthetists Resident Doctor Conference in June 2025, and subsequently published as an abstract in the Anaesthesia supplement.3


The implications sit at two levels.

For trainees, this suggests that high-quality, curriculum-aligned revision questions can now be generated at essentially zero marginal cost. Whether that undermines the business case for commercial question banks is an open question — though it is an interesting one.

For the RCoA and other examining bodies, the question is more uncomfortable. If LLM-generated questions are indistinguishable from expert-written ones, what does that mean for the security and integrity of future examinations? The answer is probably that it doesn’t change very much in the short term, since the harder problem is not writing plausible questions but writing valid ones that perform consistently across candidate cohorts — something that would require much more rigorous psychometric validation than this pilot provided.

The code for Minerva is on GitHub.

  1. Birkett L, Fowler T, Pullen S. Performance of ChatGPT on a primary FRCA multiple choice question bank. British Journal of Anaesthesia. 2023;131(2):e34–e35.

  2. Harris G. Minerva [Software]. 2024. https://github.com/glfharris/minerva

  3. Harris GLF. Evaluation of large language models in writing single best answer questions for the primary FRCA. Anaesthesia. 2025;80(S3):9–101. Abstract 094. doi:10.1111/anae.16654