# one tossup question and its answer line
P = {}
q = "Budd Schulberg fictionalized this writer in The Disenchanted, describing his time as a Hollywood screenwriter. Edmund Wilson edited this writer's own account of his dissolution, The Crack-Up. His wife told the story of their troubled marriage in Save Me the Waltz. For ten points--name this author who told of his Hollywood time in stories of Pat Hobby, of his marriage in Tender is the Night, and of life among the smart set in The Great Gatsby."
ans = "F(rancis) Scott (Key) Fitzgerald"
# strip punctuation and the power mark, then split into words
words = q.replace(",", "").replace("?", "").replace(".", "").replace("(*)", "").split(" ")
for i in range(len(words)):
    # each mention records the answer and where in the question the word sits,
    # 0.0 being the first word and values near 1.0 being the last
    F = {"answer": ans,
         "position": float(i) / float(len(words))}
    word = words[i].strip()
    if word in P:
        P[word].append(F)
    else:
        P[word] = [F]
This is a little project I call the crystal radio of quiz bowl. It's based on the old kits that were popular back when my father was getting involved with fixing televisions, and it's designed to be an educational lab that starts a young person on the path toward understanding a technology. A crystal radio is a simple device: it doesn't have much power, and it doesn't rely on much technology to improve it. It's a proof of concept, one we can use to show we understand the process of receiving radio signals, and then add features to.
Thinking through this, I recalled that I got a crystal radio myself as part of a larger kit.
https://www.reddit.com/r/GenX/comments/139wrak/science_fair_electronic_project_kit_150_in_1/ And this particular kit, while I never got the radio to work right, got me through a couple of the longer Christmas breaks of grade school. (It's the sort of thing that's useful after the high of the cool gifts peters out on the 26th, has just enough depth to get you through the snowstorm days, and is forgotten on the first day back at school.) I'm hoping this crystal radio will do the same job for some of your players, or for you.
In our crystal radio kit, the technology is data mining and the format is taking a set of questions and converting it into a lexical data set. The little snippet of Python above takes a single tossup question and its answer, and encodes the two pieces of information we want for each word in the tossup: where it sits in the question, and the answer it points to. From that data set, we want to be able to identify certain fingerprints in the data. If you want to understand the process of data mining, a lexical database, which can be formatted, clustered, and then have associations drawn from it, gives you a fairly simple model of how the whole process works. It's like the 150-in-1 kit: you use it to learn the pieces and the skills as you construct a complete project.
In Python, the script above assembles a dictionary (hash table) of keys and values. Each key in the dictionary is a word that appears at least once in the questions you slurped into the program. Each value is a list of small dictionaries with two keys, position and answer: where in the question the word appeared (0 being the first word and 1 being the last), and the correct answer for that question.
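To make the structure concrete, here's roughly what P holds after the snippet above runs on the single Fitzgerald tossup. The position values depend on the exact word count, so treat the numbers as illustrative:

P["Schulberg"]
# -> [{'answer': 'F(rancis) Scott (Key) Fitzgerald', 'position': 0.01}]  (a word near the start)
P["his"]
# -> a list with one entry per mention of "his", each with its own position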
This encoder portion won't be useful on a single question. For this to work, we're going to have to collect a corpus of thousands of questions (probably all tossups of the same format and relevant level), extract the text of each question and the answer from the answer line, encode each word of each question into the words database, and then develop a set of routines that evaluate each item in the database to find words matching some criteria. These two portions are what someone would need to build if they undertook this as a winter break project.
[An aside: please note that nowhere in this are we touching large language models or artificial intelligence. But data mining is an important first step toward understanding the knowledge that such models process. This isn't how it's done for that purpose, but understanding how the data is captured shows how the process can be improved.]
If you collect this data on enough questions, certain patterns will emerge that, with your guidance, the computer can extract. If a word appears in a number of questions, and each time it refers to a particular answer, you are likely to see the same thing happen in new questions. In other words, you've gotten the computer to read through thousands of questions to find words that lead to an answer. We do this as coaches all the time: "I hear the musical term 'glissando', and nine times out of ten it's going to refer to some part of Rhapsody in Blue." All this script does is slurp up the words of a question and tell you where each instance of a word is. And if those instances are all in questions with a single answer? You have intelligence you can relay to your team.
When you or a player read questions to the team, you often see a word in a question that you recognize as an important clue. Your instinct may be to stop the reading and point out "this word's important, it always goes to this answer!", or to write down the word and the answer for the team to review and remember later. These are important clues when you notice them, but this way of recognizing them and getting the information to the team is scattershot. You can be wrong: there could be another answer out there whose questions use that word. It could also be useless for the team to know the word appears in questions on that answer, because that answer only repeats once every few years; you may have observed it correctly, but it won't be observed again in time to help your team. By letting the machine review hundreds of packets at a time and compile the results, you get a more complete picture than you can build from your own observations.
This is basically taking what I have done in a lot of my writing, finding a pattern among writers and focusing on it as something you can teach. With The 99 Critical Shots, I partially automated the compilation, and I also used the compilation to proof my observations. The difference with the crystal radio is that it does the reading for you, at the cost of your formatting the data and performing the clustering and association pattern mining. I think a lot of you would make that trade, or know a student who would, in exchange for a project that teaches them about data mining.
There are two remaining pieces you'll need to make this useful. First, you need a way to slurp up files of tossup questions as if they contained only tossups and answers on alternating lines. This replaces the simple assignment statements for q and ans with a loop that takes one question at a time and encodes all the words in that question. Second, you need to come up with a query you want to run against the pattern. This can be done with a Python list comprehension once you've processed the values. For example:
[x for x in P if all(P[x][y]['answer'] == P[x][0]['answer'] for y in range(len(P[x])))]
would check, for each word x in the dictionary P, whether every entry in its list of mentions carries the same answer, and select the words where that holds. The output is a list of the words that appear only in questions whose answers are all identical.
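For the first piece, here's a minimal sketch of the slurp loop. It assumes you've already boiled your packets down to a plain-text file, named tossups.txt here purely for illustration, with tossup text and answer line alternating:

def encode_question(q, ans, P):
    # same punctuation stripping as the single-question snippet at the top
    words = q.replace(",", "").replace("?", "").replace(".", "").replace("(*)", "").split(" ")
    for i in range(len(words)):
        F = {"answer": ans, "position": float(i) / float(len(words))}
        P.setdefault(words[i].strip(), []).append(F)

P = {}
with open("tossups.txt") as f:  # hypothetical file: question, answer, question, answer, ...
    lines = [line.strip() for line in f if line.strip()]
for q, ans in zip(lines[0::2], lines[1::2]):
    encode_question(q, ans, P)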
This database as constructed answers only one question: if a word X is mentioned in tossup questions, what are the answers to those questions and where does X sit in them? It does not answer the question people believe to be most important: what words are always mentioned in questions on answer Y? That's an important distinction, since you can always have questions on Y that choose not to use a clue mentioning X. And if a word leads to an answer often enough, writers tend to notice before the players do, and work to avoid a clue they consider trite.
I also think this particular query will produce a small number of hits on any corpus of questions, because it doesn't account for things like questions that reference a fact with a particularly indicative word but change the identifier. I first thought of solving this problem this way when the name "Enjolras" appeared in a novice set question about Les Miserables, where Enjolras is a non-major character. I immediately recognized it as a clue, but it could refer to Les Miserables the book, the musical, or it could be turned into a question about Victor Hugo. While highly indicative, "Enjolras" wouldn't pass the test condition above, because mentions pointing to several different answer lines could appear in the data.
That's what makes this crystal radio a crystal radio: it's limited in its utility. To cover cases like this, which require clustering of data, you'd need additional preprocessing. You'd need to capture the identifiers of the answer in the text of the question, the words that follow "this" or "these". You'd need to group together answer lines like "Les Miserables", "Les Mis", and "Les Miserables (accept: Les Mis)", so data that would be split by the evaluating function gets pooled.
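A first pass at that grouping could be a normalizer run over every answer line before it's stored. This is only a sketch, under the assumption that stripping "(accept: ...)" clauses, case, and punctuation gets you most of the way; true aliases like "Les Mis" on its own would still need a hand-built lookup table:

import re

def normalize_answer(ans):
    # hypothetical normalizer: drop "(accept: ...)" clauses, lowercase,
    # and strip punctuation, so "Les Miserables (accept: Les Mis)" and
    # "Les Miserables" collapse to the same key
    ans = re.sub(r"\(accept:[^)]*\)", "", ans, flags=re.IGNORECASE)
    ans = re.sub(r"[^a-z0-9 ]", "", ans.lower())
    return " ".join(ans.split())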
So what sort of patterns could we pull? And what questions could those patterns answer?
If the answer fields of all entries for a particular word refer to the same answer, there's a high likelihood that future questions mentioning that word will have that answer too. It's not quite as valuable as knowing that every question with answer X uses clue Y, but it's still pretty valuable.
If you combine that sort of data with the position data, you can figure out whether a word is a clue used mostly at the beginning of tossups (nice to know if you're aiming to power questions) or at the end (nice to know if you want your team to have an answer in all situations).
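A sketch of that combination, building on the list comprehension above: take the words that always point to one answer and sort them by the average position of their mentions, so early-question clues float to the top.

def avg_position(mentions):
    return sum(m['position'] for m in mentions) / len(mentions)

# words whose mentions all share a single answer
unique_words = [x for x in P if all(m['answer'] == P[x][0]['answer'] for m in P[x])]
# earliest-appearing (power-relevant) clue words first
for x in sorted(unique_words, key=lambda w: avg_position(P[w])):
    print(x, P[x][0]['answer'], round(avg_position(P[x]), 2), sep="\t")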
The problem I'd like to answer, and may spend some of the winter break expanding on, is what kinds of words are most likely to point to a single answer. I imagine proper nouns like character names and geographic names would be the most likely to produce singular references, possibly followed by names of chemical compounds. I have no basis for that hypothesis other than having Enjolras on the mind, but I want to think it's correct, and there may be other categories I could identify if I had a large enough list to look at all at once.
The other use of this, the one that could become a useful product, would be a way to highlight an existing PDF file or web page with all the relevant clue words. If you could highlight every indicative clue that leads to the answer as the question is being read, anyone reading the question would learn what's important in every question. It could make reading questions in practice as valuable to a player as playing them, which would lead ambitious players to read more packets, in practice and out of it.
Problems you will encounter:
The slurp of this data is going to require a lot of work. Building something to trim out packet headers is easy; dropping the bonus questions could be a little harder. Accounting for all the formats of writing tossups, how the answer line and question number are typed, and the differences in whitespace are things you'll have to work through by trial and error, and it will be frustrating, long work as you discover new ways people typo a question. I apologize, because I know this is where the largest amount of your time will be spent, and I've only touched on it here. But the biggest headaches in any programming project you start for yourself are the ones that come from having other people's work as involuntary collaboration.
Even after slurping this up, it's noisy data. Just considering how answer lines are created, there's no consistent case, underlining, or capitalization to filter on. Any strictly lexical test of equality will treat those variants as different answers, and will mislead you into thinking a word has multiple answers, when in fact every one of them is "John Quincy Adams". The answer normalizer sketched earlier is a start on this problem.
The dictionary will become huge. The easiest way to shrink it is to remove words that show up in lots of questions with lots of different answers. "The", "a", and "an" will be the most common, followed by "this". You can skip words that are already known to go nowhere, because more data won't suddenly make them indicative.
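A sketch of that pruning pass; the stopword list and the answer-count threshold here are guesses you'd tune against your own corpus:

# drop common stopwords, plus any word whose mentions spread across many answers
STOPWORDS = {"the", "a", "an", "this", "these", "of", "and"}
P = {w: mentions for w, mentions in P.items()
     if w.lower() not in STOPWORDS
     and len({m['answer'] for m in mentions}) <= 5}  # threshold is a guess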
This doesn't gather enough information to answer every question you may have. Creating a separate database keyed to the answers, with the words linked in as values, may be necessary to solve some of these problems. It would probably also be useful to associate with each answer all the indicator words that point to it.
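A sketch of that second database, inverting P so the answers are the keys:

# for each answer, the set of words that have appeared in questions pointing to it
A = {}
for word, mentions in P.items():
    for m in mentions:
        A.setdefault(m['answer'], set()).add(word)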
Because of the choice to look at single words only, there are lots of clues that could be useful but won't be explored, because of a space in the middle. You could add a pass that expands a key to include the next word when the current word points 100% to an answer across many questions. But that's extra work.
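A simpler variant than that conditional expansion, and still extra work, is to record every two-word phrase unconditionally, mirroring the encoder sketched earlier. The dictionary gets even bigger, but clues like "Pat Hobby" stop falling through the cracks:

def encode_bigrams(q, ans, P):
    # also record two-word phrases so multi-word clues survive the space
    words = q.replace(",", "").replace("?", "").replace(".", "").replace("(*)", "").split(" ")
    for i in range(len(words) - 1):
        bigram = words[i].strip() + " " + words[i + 1].strip()
        F = {"answer": ans, "position": float(i) / float(len(words))}
        P.setdefault(bigram, []).append(F)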