https://www.theverge.com/2024/5/15/24154808/ai-chatgpt-google-gemini-microsoft-copilot-hallucination-wrong
The starting point for this thought was the hallucination problem in AI. I personally don't think the hallucination problem is completely solvable, but I do think there is a set of problems and creative tasks for which hallucinations can be constrained enough to be acceptable, or at least to match a human's failure rate. I don't think quiz bowl writing falls into that space.
If I have a resistance to AI, it's this: all forms of AI are biased towards producing a solution to the prompt they are given. Essentially, they are too eager to please the user. The AI is designed to produce an answer. If an AI doesn't produce answers, it's going to be reprogrammed on the next iteration, or it's going to be given a bigger language model.
It's essentially the same thing as an episode of House: when a resident in a lecture presents an answer to a pathology, House informs them that their solution would have killed the patient.
“I'm sure that this goes against everything you've been taught, but right and wrong do exist. Just because you don't know what the right answer is, maybe there's even no way you could know what the right answer is, doesn't make your answer right or even okay. It's much simpler than that. It's just plain wrong.”
―Gregory House, "Three Stories"
Strip the moral application of "right and wrong" and replace it with factual "right and wrong" and you see where I stand. And in the construction of questions for quiz bowl, a little wrong is the same as a lot wrong.
Now I tie this to our discussion of where quiz bowl will be in 20 years, because at some point the economic pressures are going to push an AI solution into quiz bowl. And I am forced to consider a number of questions in sequence:
1) Are quiz bowl players willing to accept questions generated by AI for play?
Probably not initially. Given what I just said above, and knowing how little tolerance for mistakes the circuit has, an AI solution will produce errors in the questions it creates at an unacceptable rate for a long time.
2) Are quiz bowl editors willing to do the same and edit the questions created by AI?
If I were an editor who knew that I had questions from an AI in the set, I'd spend far more time on the edits than I would otherwise. That's not necessarily an emotional statement on my part. I recognize the types of errors people make when typing, or researching, or choosing their answer line. I recognize when people have a brain fart and attach the wrong first name to a last name, or fail to type a word they're thinking of, or when the auto-correct takes their typo in a new and exciting direction of insanity. There's a fingerprint to those sorts of wrong, and I know where those are likely to show up. I don't have any feel for how an AI is going to create errors, and I really have no interest in spending my time watching it fail.
So I tend to think that if an editor is editing AI questions, it won't be by choice; it will be by necessity. Something like needing an extra packet for a tournament the night before, or having to replace a packet. It will be an attempt to use something that will not be pleasant for the editor, and I suspect that because the first few attempts will be bad, they won't let the players know what they're getting. So I don't think anyone on the production side or the consumption side will do this if they have any other option.
3) Are people who purchase questions for events going to be willing to accept questions from AI?
This is where I think the question changes, because these are the people furthest from the competition while still having decision-making power. Purchasers might go for the novelty of an AI-created set, or they might simply go for the price. And that will be because of the next question.
4) Will anyone be able to distinguish AI questions from human-produced questions?
This is where pyramidal questions will show a difference. The hardest thing an AI will have to do in mimicking a human question writer is not the choice of clues, but arranging those clues in descending order of obscurity rather than in whatever order makes the most natural prose. This will be obvious to a player, obvious to a writer, and completely missed by someone who isn't exposed to the corpus of questions as regularly as a player is. Players have internalized a large amount of knowledge about questions from a relatively small set of text. There are probably fewer than 10 million quiz bowl questions in all formats that could be slurped up by a language model, and the rest of the corpus that makes up that language model is not going to follow the same rules about order. An AI model would have to extract the rules of "What's the right order of clues for this answer?", "What's an appropriate difficulty level for use?", and "Are we creating a correct balance of categories?" from a very small sample, while not letting the rest of the language corpus it uses for knowledge dictate how it assembles the question.
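To make the "order of clues" rule concrete, here is a minimal sketch of the kind of check such a model would have to internalize. It uses corpus word frequency as a crude stand-in for obscurity; the frequency table, the clue phrases, and both helper functions are invented for illustration, and real clue difficulty obviously needs far richer signals than word counts.

```python
from collections import Counter

def clue_familiarity(clue, doc_freq):
    """Average corpus frequency of the clue's words; higher means easier."""
    words = [w.strip(".,;:'\"").lower() for w in clue.split()]
    words = [w for w in words if w]
    return sum(doc_freq[w] for w in words) / len(words) if words else 0.0

def looks_pyramidal(clues, doc_freq):
    """True when estimated familiarity never decreases from clue to clue,
    i.e. the question leads with its rarest material."""
    scores = [clue_familiarity(c, doc_freq) for c in clues]
    return all(a <= b for a, b in zip(scores, scores[1:]))

# Toy frequency table and clue phrases, invented purely for demonstration.
freq = Counter({"ketoacidosis": 2, "langerhans": 5, "islets": 8,
                "insulin": 120, "hormone": 200,
                "pancreas": 350, "blood": 700, "sugar": 900})
clues = ["islets langerhans ketoacidosis",  # rare terms: lead-in clue
         "insulin hormone",                 # medium difficulty
         "pancreas blood sugar"]            # common terms: the giveaway
print(looks_pyramidal(clues, freq))  # True
```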
Would it eventually reach a solution for a successful imitation of quiz bowl style and format and factuality and clue order? Maybe. Would it be given enough tries to get there reliably without getting dismissed as not a possible solution? Probably not.
Now I'm going to make a split here between the circuit and shorter question formats, including television. If length and arrangement of clues is the distinguishing characteristic that makes AI-created questions unacceptable, then there might be shorter questions which an AI could create to a satisfactory level. The issue here is that such questions might also be created by non-AI methods (simple templating of tabular data, or other scripting), as sketched below. A shorter length also limits the amount of editing required to make questions acceptable, or to decide to reject them and create replacements.
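For contrast, here is a minimal sketch of that non-AI alternative. The CSV columns, the template wording, and the questions_from_table helper are all hypothetical, but they show how little machinery a short-format question actually requires.

```python
import csv
from io import StringIO

# Hypothetical one-line question template; each placeholder maps to a column.
TEMPLATE = ("Born in {birth_year}, this {category} figure is best known "
            "for {known_for}. Name this person.")

# Stand-in for a real data file of answer lines and facts.
DATA = """name,category,birth_year,known_for
Marie Curie,scientific,1867,research on radioactivity
Jane Austen,literary,1775,writing Pride and Prejudice
"""

def questions_from_table(table):
    """Yield (question, answer) pairs by filling the template per row."""
    for row in csv.DictReader(StringIO(table)):
        answer = row.pop("name")
        yield TEMPLATE.format(**row), answer

for question, answer in questions_from_table(DATA):
    print(question, "ANSWER:", answer)
```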
In the realization that this was the 25th High School National Championship Tournament, I found myself looking at my role in this year’s festivities and realizing that I’ve developed a very reliable set of skills for this tournament. None of them are skills I’m the best at any more, but had I not done each task before, nobody else would have known how to do it better. And sometimes the rest of the team uses me to see if a task can be optimized. Last year, I was testing the third-official position in the online question reader, and sent a few bug reports to R, which made some of this year’s use of the material better.
This year’s case was the cleanup tasks. On Sunday, after your room is done with playoffs or consolation, you are supposed to drop your buzzer and clock off in one room, then go across the hall for one cleanup task, usually inspecting some room which closed the round after yours. After you’ve done that task, you’re given your Sunday food money and are free to go. Normally this gets accomplished in dribs and drabs: some rooms are done after round 22, then another few after 23, and a roughly even pace of people flows through the cleanup room. This year, because of the huge number of consolation round requests, we had a problem: over half the rooms were done after round 25. So before 25 we had a few people checking a number of rooms, and after 25 we had a mass of people waiting for the next rooms to close down and be checked, with very few rooms left to check. Pressed for a solution to the mass of people, I resorted to triage, and then handshakes. We put people who were flying out that night on first priority for tasks, and then filled out our list, giving them room keys and instructions to wait until the people in each room were done cleaning up before going in. That still left us with 40 or so people to check the remaining 10 rooms for round 27. I had them form a line around the room, and when I ran out of rooms to check, I made the rest promise me that they’d be back for the finals, and after the finals they’d clean up the big rooms. And then I paid them up front for their cleanup task.
Shortly before I went up to work on the press releases, I heard that there were fifty-some people who came back to help out with disassembly and schlepping equipment to headquarters. I’m always gratified to find my faith in the tournament staffers correctly placed, and slightly amazed that they do these things for me.
I came back from HSNCT on Monday morning as the storms came into Atlanta. While I was delayed an hour in flying out, I had time to examine how our press releases made it out into the world. I had gotten all of the press releases launched by 2am Sunday night, after the 9pm conclusion of the event, and even with that time crunch I was able to include photographs from the tournament in about half of them. That has helped boost the pickup rate of the press releases and has gotten us a much better response from papers, AND television. We gave the top 12 teams a full distribution to their metro area’s television stations in addition to their papers. The only thing we didn’t have was a full set of finals and award ceremony pictures, but those came out tonight, and I’m finishing the last follow-ups to the papers and stations now.
When the papers and television stations publish the story, I see it come up in my daily digest of newsfeeds. That has also gotten me started on searching for senior information right in the thick of the graduation articles. I’m usually a couple of weeks late on that sort of thing this time of year, but I’ve already located over 30 seniors for the list. Normally that’s a number from the middle of June, so we’re way ahead of schedule this year. Hopefully we’ll see a marked increase in senior participation over last year.