One of the Holy Grail items of quiz bowl is something that shifts the balance of power: moving us from a constant demand for questions for events to a state where the supply of questions exceeds the demand for them. Whenever there's an advance in text-generating AI, someone has a flash of insight that if we could get a tireless, ever-working AI to start creating questions, supply might finally catch up to demand. It's a bit of a fallacy, in that the existing supply of questions usable for practice already meets 95% of people's demand; what the other 5% wants is to never have to worry about a distinction between practice material and tournament material. Still, the idea of a machine able to keep asking you questions is appealing.
Recently, when ChatGPT was opened up for public use, it caused a great deal of concern in my social media feeds about whether it could destroy the essay as an examination technique. Since some of the people asking were quiz bowl alumni and adjacent folks, the question was inevitably asked: "Could we use ChatGPT to write questions?" The first couple of attempts were pretty wonky.
I've been considering the question of writing automation in quiz bowl for over 20 years. And while I think automation with an AI and natural language processing could be useful, it's overengineering the solution to a problem that doesn't require it.
One of the things we built early in NAQT's history was a mechanism to create simple questions via scripts and templates. We called this project Sampo, after the magical mill that produced gold, flour, and salt. Over the course of five years, the scripts generated questions credited to Sampo's writer number on the order of around 3,000 questions.
Our reasoning was that the majority of complaints about questions have always stemmed not from questions being repetitive in style, but from being oddly styled. Questions that imitate previously successful styles are both forgettable and uncontroversial. So if you have something created once by a human, you have something that can be turned into a template and used to generate a series of questions.
This was used to great effect for computation questions in high school competition. Once a writer uncovered a truly pyramidal computation tossup, we created a template so that the idea could be reused over multiple years with different initial conditions for the problem. There is, however, a problem of scale. Questions from the same template need to be separated by a certain number of questions, or tournaments, to avoid repeating the concept; ideally, a template should appear no more than once a year. That is something only large annual producers, or those dumping everything into a question bank, can manage; small operators can't produce enough events to make templates worth their while.
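As a rough illustration of how such a template works, here is a minimal sketch in Python. The function, field names, and wording are my own invention, not an actual Sampo script: one human-written computation idea is re-instantiated with fresh initial conditions each time it is used.

```python
import random

# Hypothetical Sampo-style template: the wording is fixed once by a human,
# and the script only varies the numbers while keeping the answer clean.
def hypotenuse_tossup(seed=None):
    rng = random.Random(seed)
    # Pick a Pythagorean triple and scale it so the arithmetic stays tidy.
    a, b, c = rng.choice([(3, 4, 5), (5, 12, 13), (8, 15, 17)])
    k = rng.randint(2, 9)
    question = (
        f"A right triangle has legs of length {a * k} and {b * k}. "
        f"For ten points, what is the length of its hypotenuse?"
    )
    return {"template": "hypotenuse", "question": question, "answer": str(c * k)}

if __name__ == "__main__":
    print(hypotenuse_tossup(seed=2024))
```

A real template would carry the full pyramidal wording the original writer discovered; the script's only job is to swap in new initial conditions and keep the arithmetic manageable.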
Why did we stop with Sampo? There's a fundamental limit to the number of questions and subjects one can write with a script and tabular data. There's also the fundamental problem that it didn't write evenly across subjects: it wasn't so much filling our needs as increasing backlogs in categories that weren't demand stresses. Also, once you saw the pattern enough, you recognized that what you needed to study was just a small set of data, and we didn't like that. Sampo also needed an editor, both for the templates and for simple fact checking. Because it got the same scrutiny as other writers, it took time and resources from editors. The choice for us was either to invest a lot of time developing many more templates to fill the holes, or simply not to use it. Since we never got it to the point of being economically useful beyond increasing the computation supply, we abandoned it around 2006.
A second effort I tried (initially as a way to learn Python's tkinter interface) was an attempt to create a writing assistant. It took the form of an application with three panes. At the top left was a notepad, and at the bottom left was a single-line entry. On the right was a list widget. When an answer was typed into the bottom-left entry, the list on the right would populate with common clues, and once the clues were present, you could click them into the notepad and build up the clues in the question.
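A minimal sketch of that layout in tkinter, for concreteness. The clue data and event wiring here are stand-ins I made up for illustration, not the original application's code.

```python
import tkinter as tk

# Tiny hypothetical clue store keyed by answer; the real system's clue data
# would live in a much larger backing file or database.
CLUES = {
    "the little prince": ["asteroid B-612", "a tamed fox", "a rose under a glass globe"],
}

root = tk.Tk()
root.title("Co-writer sketch")

notepad = tk.Text(root, width=50, height=12)        # top left: question draft
answer_entry = tk.Entry(root, width=50)             # bottom left: answer line
clue_list = tk.Listbox(root, width=35, height=16)   # right: suggested clues

notepad.grid(row=0, column=0, padx=4, pady=4)
answer_entry.grid(row=1, column=0, padx=4, pady=4, sticky="we")
clue_list.grid(row=0, column=1, rowspan=2, padx=4, pady=4, sticky="ns")

def populate_clues(event):
    """When an answer is entered, fill the right-hand list with its known clues."""
    clue_list.delete(0, tk.END)
    for clue in CLUES.get(answer_entry.get().strip().lower(), []):
        clue_list.insert(tk.END, clue)

def insert_clue(event):
    """Double-clicking a clue appends it to the draft in the notepad."""
    selection = clue_list.curselection()
    if selection:
        notepad.insert(tk.END, clue_list.get(selection[0]) + " ")

answer_entry.bind("<Return>", populate_clues)
clue_list.bind("<Double-Button-1>", insert_clue)

root.mainloop()
```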
What killed this process was that I simply couldn't build enough clues about enough answers into the system. I consciously avoided putting uniquely identifying first clues in the system, because once those were populated, they would become a crutch and be overused.
The third idea I had was that a template could be used as a starting point, with the writer filling in pieces like a uniquely identifying first clue or a third part based on specific detail. When I made the note in high school practice about how books and authors form a block in writing, I was thinking of this. If I could pull a unique third part about Le Petit Prince, could I tell ChatGPT that I need the rest of the question, and could it write the two parts that are book-author identifications? That seems like an achievable goal.
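A sketch of what that workflow could look like: the writer supplies the uniquely identifying hard part, and a prompt template asks a chat model for the two easier book-author parts. The template wording and function name are hypothetical, and the actual model call is deliberately left out.

```python
# Hypothetical "fill in the rest of the bonus" prompt builder.
BONUS_TEMPLATE = """You are helping write a quiz bowl bonus on {work} by {author}.
The hard third part has already been written by a human:

{hard_part}

Write the first two parts. Each should be a single-sentence book-author
identification clue, noticeably easier than the part above, and neither may
repeat a clue already used in it."""

def build_prompt(work, author, hard_part):
    return BONUS_TEMPLATE.format(work=work, author=author, hard_part=hard_part)

prompt = build_prompt(
    work="Le Petit Prince",
    author="Antoine de Saint-Exupéry",
    hard_part="(the writer's uniquely identifying third part goes here)",
)
# `prompt` would then be sent to whichever chat model is on hand, and the
# response reviewed by an editor exactly as with any other writer's copy.
```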
This is why I don't think AI will get us where we want to go. While we could have the program create a single question on a subject, it's going to be far easier to task it with writing many questions on the same subject. Your economy of scale still has the same base problem as scripting, and scripting is still cheaper and delivers a partial solution faster.
What an AI needs to be taught before it can work in quiz bowl:
- Pyramidality, and more generally the importance of a clue's position within the question.
- You would have to train it to identify "interesting" clues. A teaching dataset for that would be a nightmare to produce.
- Identification of which facts at its disposal are uniquely identifying, versus clues that merely eliminate some possible answers, given what is still possible at each point in the question.
- Over the course of questions on the same answer, it needs to recognize the clues a question must go through to make sense to the player, and its selection of clues must resemble human choices. If it preferentially includes an obscure clue, or tries to use every clue an equal number of times, it will bend the distribution.
- It needs to be trained on quiz bowl questions, on the difference between quiz bowl questions and the other forms of communication it has been trained on, and on the differences between different forms of quiz bowl questions. And as it's doing that, it almost has to be untrained in those other forms of communication to prevent it from doing things like including the answer in the question, or placing the easy clues before the hard ones.
- Because it is pulling from a much larger set of training data, it needs to give the editor all of the sources used to create its work. The editor has to understand not only what it is saying (meaning the question has to be a logical and grammatically correct construction), but also has to be able to open its sources and follow the inferences it made in writing the question, to see if it has missed alternate answers and possibilities within the question. It can't be allowed to be its own editor and fact-checker. While it could note how it found each clue and the context, to be a competent writer it has to consider at each clue of the question whether alternate answers are possible. That requires running its logic against a much larger database and seeing where ambiguity could exist (a toy version of that check is sketched below). That problem is much harder than simply writing a question from facts, and I don't think an AI can earn enough trust in its methods that an editor will not examine every statement it makes.
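Here is the toy version of that ambiguity check. The two-entry clue database is a hand-made stand-in; a real check would need vastly more data and much fuzzier matching than exact string membership.

```python
# Walk through a question's clues in order and, against a clue database, list
# every answer still consistent with what has been read so far.
CLUE_DB = {
    "Le Petit Prince": {"a pilot stranded in the Sahara", "asteroid B-612", "a tamed fox"},
    "Wind, Sand and Stars": {"a pilot stranded in the Sahara", "Antoine de Saint-Exupéry"},
}

def surviving_answers(clues_so_far):
    """Answers whose known clue set contains every clue read so far."""
    return [ans for ans, known in CLUE_DB.items() if set(clues_so_far) <= known]

question_clues = ["a pilot stranded in the Sahara", "asteroid B-612", "a tamed fox"]
for i in range(1, len(question_clues) + 1):
    candidates = surviving_answers(question_clues[:i])
    status = "unique" if len(candidates) == 1 else f"still ambiguous: {candidates}"
    print(f"after clue {i}: {status}")
```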
I do believe an AI could do some of the clue-suggesting job that my co-writing application would have offered, but I still see the editor having difficulty trusting even an AI co-writer's research. And there we see the problem. Even if the writer's time is reduced, the editor's time is probably increased, and because there are fewer editors than writers in most organizations, this worsens the very bottleneck it's trying to remove. Editors will be the biggest bottleneck to any adoption of scripting, AI co-writing, or full automation.
Scripting and automation serve a purpose and can be used to extend the supply of questions, and because they rely on a simpler list of sources than an AI does with its training set, they have less to prove in creating utility. It's possible for an AI to reach that utility, but I don't know if there are enough editors willing to invest their time in improving a process that may not be faster than what they have now.