Not Much Progress. W3
I suspect there’s a mistake in my scoring. We do some timing. We struggle for an idea. We resort to some bit twiddling. Do not read, that’s my suggestion.
I’ll begin by trying to write a test that gets the wrong score. I have in mind a misplaced letter and the same letter in the proper place. It should score 1020 and I suspect it might score 1010. Maybe not, but we’ll see.
def test_score_more(self):
guess = Word("xaxax")
solution = Word("ayyay")
score = guess.score(solution)
assert score == 1020
Passes. I’ll try another.
def test_score_even_more(self):
guess = Word("xaxax")
solution = Word("yaayy")
score = guess.score(solution)
assert score == 2010
Also passes. I might be smarter than I look. Let’s review score:
def score(self, solution):
# word: abcde
# solution: ecbdx
# score: 01121
score = 0
available_letters = list(solution.word) # cannot cache this, we destroy it
for i, c in enumerate(self.word):
if c in available_letters:
if available_letters[i] == c:
score = 10*score + 2
available_letters[i] = 0
else:
score = 10*score + 1
available_letters[available_letters.index(c)] = 0
else:
score = 10*score
return score
Ah. Tricky but possibly OK. If we have an exact match first, we remove that one. If we have an inexact match first, we remove the first occurrence of the letter. But I have one more test to try. And it will fail:
def test_score_even_worse(self):
guess = Word("xaxax")
solution = Word("yyyaa")
score = guess.score(solution)
assert score == 1020
And it does, returning 1010 because the first a in the guess sweeps the first a in the pattern which would otherwise have been matched as a 2.
I think we have to use two loops, one for exacts and then one for inexacts. I’m glad we had this little chat.
Here’s a version of score that passes all my tests:
def score(self, solution):
# word: abcde
# solution: ecbdx
# score: 01121
answer = [0, 0, 0,0, 0]
available_letters = list(solution.word) # cannot cache this, we destroy it
for i, w, s in zip(range(5), self.word, solution.word):
if w == s:
answer[i] = 2
available_letters[i] = 0
for i, w in zip(range(5), self.word):
if answer[i] != 2:
if w in available_letters:
answer[i] = 1
available_letters[available_letters.index(w)] = 0
return reduce(lambda product, factor: 10 * product + factor, answer)
This isn’t lovely, nor does it seem likely to be fast. If we are to do any serious analysis of the data we have, we’ll be running score on many pairs of words: more than 12000 X 2000. As such, fast might be of value. Thinking of fast, I wonder how long such a loop would take all on its own. Let’s find out.
def test_nested_loops(self):
sols = WordCollection.from_file("valid_solutions.txt")
guesses = WordCollection.from_file("valid_guesses.txt")
t0 = time.time()
guess_count = 0
sol_count = 0
for guess in guesses.words:
guess_count += 1
for sol in sols.words:
sol_count += 1
t1 = time.time()
assert guess_count == len(guesses)
assert sol_count == len(guesses) * len(sols)
assert t1 - t0 < 0.5
The actual time is about 0.4 seconds. Now I’m wondering about how long score
takes to run. This test gives me the answer:
@pytest.mark.skip("about 1.2 seconds per million")
def test_score_speed(self):
w1 = Word("abcde")
w2 = Word("edcba")
assert w1.score(w2) == 11211
n = 1000000
t0 = time.time()
for i in range(n):
t = w1.score(w2)
t1 = time.time()
assert t1 - t0 < 1.5
So that starts sounding pretty good until you reflect that the sort of thing we will want to do is to consider, for each word we know, versus each possible answer, providing some score, how may of the words in the solutions list would be eliminated by that score. So we would want about 12000 x 2000 x 2000 scores, or, to be exact, 12972* x 2315 x 2315, or 69,519,866,700 checks, which, if I’ve done it correctly, would take about 24 hours of computer time.
Clearly this will not do. We’re not likely to get score
to be 100 times faster, to get the time required down to 15 minutes. Could we possibly reach 10 times faster, to get down to a couple of hours?
Python is not famous for speed. It is, in fact, famous for being slow, and while Kotlin or C++ might be faster, perhaps we can actually come up with a better idea.
Some Speculation …
What might we want to know? The first thing would be to know a good starting word to guess. There are a number of famous suggestions, like CRATE and ORATE and SALET. (No, I don’t know what that is, either.) To know that, we’d want to know, for every word in our candidate list of first guesses, which one, given all possible solutions, provides the most information. (We should, at some point, actually look at some information theory, so as to say exactly what that means.)
To some degree, it is that list of 12972 possible words to guess that is causing us trouble. Here are a few slices from that list:
aahed aalii aargh aarti abaca abaci abacs abaft abaka abamp aband abash abask abaya abbas
bobac bobak bobas bobol bobos bocca bocce bocci boche bocks boded bodes bodge
rache racks racon radge radix radon raffs rafts ragas ragde raged ragee rager rages
zouks zowee zowie zulus zupan zupas zuppa zurfs zuzim zygal zygon zymes zymic
Even Scrabble people don’t know these words. “Zygon”? Really? Doctor Who’s made-up monsters? Zouks! How many zurfs are there in a zygal? Clearly what the New York Times is trying to do is to cater to as many legal words as possible, because the game rules require you to enter a legal word, and they’d not like to be caught out by some Fermentation Wizard complaining that the game doesn’t recognize “zymic”, a Very Common Word which he uses Every Single Day and This Is What Is Wrong With Journalism Today.
We are interested in averages. Our first question might be to select a good first word. How shall we define good? I think we’re going to go down the path of information theory, as described in this excellent Sanderson video.
But to do that, we will absolutely need to compute a lot of scores. So, this morning, I’m going to look at speeding up the score
function.
I think this is called “retreat to the known”. I might do better to do almost anything else, but this will be fun even if unproductive.
Efficient Score
Let’s set up a test to compare two versions of score.
def test_compare_score(self):
sols = WordCollection.from_file("valid_solutions.txt")
guesses = WordCollection.from_file("valid_guesses.txt")
guess = Word("crate")
solution = Word("prone")
assert guess.score(solution) == guess.score1(solution)
n = 1000
loop_0 = time.time()
for i in range(n):
pass
loop_delta = time.time() - loop_0
current_0 = time.time()
for i in range(n):
sc = guess.score(solution)
current_delta = time.time() - current_0 - loop_delta
new_0 = time.time()
for i in range(n):
sc = guess.score1(solution)
new_delta = time.time() - new_0 - loop_delta
print(current_delta, new_delta)
assert False
I plan to have two score methods, score
and score1
. Starting with them the same, I get this output from my print:
0.0011250972747802734 0.001142740249633789
I think a little rounding would be nice. To three places, and with n
set to 100,000:
0.113 0.112
Close enough. Now let’s see how to improve things. Here, again, is the current function:
def score1(self, solution):
# word: abcde
# solution: ecbdx
# score: 01121
answer = [0, 0, 0,0, 0]
available_letters = list(solution.word) # cannot cache this, we destroy it
for i, w, s in zip(range(5), self.word, solution.word):
if w == s:
answer[i] = 2
available_letters[i] = 0
for i, w in zip(range(5), self.word):
if answer[i] != 2:
if w in available_letters:
answer[i] = 1
available_letters[available_letters.index(w)] = 0
return reduce(lambda product, factor: 10 * product + factor, answer)
Let’s see what happens if we do the loop more openly. I think it will make things worse.
This version scores 0.084 vs 0.112:
def score1(self, solution):
answer = [0, 0, 0,0, 0]
available_letters = list(solution.word) # cannot cache this, we destroy it
for i in range(5):
# for i, w, s in zip(range(5), self.word, solution.word):
if self.word[i] == solution.word[i]:
answer[i] = 2
available_letters[i] = 0
for i in range(5):
# for i, w in zip(range(5), self.word):
if answer[i] != 2:
if (w := self.word[i]) in available_letters:
answer[i] = 1
available_letters[available_letters.index(w)] = 0
return answer[4] + 10*(answer[3] + 10*(answer[2] + 10*(answer[1] + 10*answer[0])))
# return reduce(lambda product, factor: 10 * product + factor, answer)
Each of those changes made a noticeable improvement. I am somewhat surprised, but that’s often the case with optimizing. That’s why we measure. With a bit more formatting in my test’s print I improve the report:
score1 0.083, score 0.112 = 1.349
Beyond this, I wonder if we can get rid of some of the array access. I’ll move this version into score
after a quick commit: optimizing score1 vs score.
I try this slightly different version, for a very small gain:
def score1(self, solution):
answer = [0, 0, 0, 0, 0]
guess = self.word
solution = solution.word
available_letters = list(solution) # cannot cache this, we destroy it
for i in range(5):
if guess[i] == solution[i]:
answer[i] = 2
available_letters[i] = 0
for i in range(5):
if answer[i] != 2:
if (w := guess[i]) in available_letters:
answer[i] = 1
available_letters[available_letters.index(w)] = 0
return answer[4] + 10*(answer[3] + 10*(answer[2] + 10*(answer[1] + 10*answer[0])))
This one just caches guess and solution, saving a couple of lookups.
I think that to do better, we need a different data representation. How about if we were to pack our five characters into an integer? We can represent them by a 5-bit code easily enough. Let’s write a little test for that.
def test_five_bit_code(self):
assert Word.decode(Word.encode("zebra")) == "zebra"
I figure I’ll just have those two functions as static methods, like this:
@staticmethod
def encode(string):
code = 0
for c in string:
code = (code << 5) + ord(c) - ord("a")
return code
@staticmethod
def decode(number):
string = ""
for i in range(5):
string = chr((number & 0x1F) + ord("a")) + string
number = number >> 5
return string
Test runs, zebra == zebra. Who knew?
Now that we have these methods, let’s use encode in Word, to give us a packed word.
class Word:
def __init__(self, word):
assert isinstance(word, str)
self.word = word
self.packed = self.encode(word)
Now what might we do with these bits? Maybe I should have thought about that first.
Well it seems to me that
zebra xor
zelda ==
00??0
That’s because a number xor’d with itself is zero. So that should be good for something. For every zero we have in the xor of the two words, we want a 2 in that position of the score. of course, right now our score is an integer base ten, but let’s not sweat that yet.
- A Long Delay Ensues …
- I spent some time twiddling bits and then observed that Python has 64 bit integers if you need them, so I have converted my packing to pack by bytes. I have this now:
class Word:
def score1(self, solution):
answer = [0, 0, 0, 0, 0]
guess = self.word
soln = solution.word
available_letters = list(soln) # cannot cache this, we destroy it
xor = self.packed ^ solution.packed
mask = 0x1F << 32
for i in range(5):
if not (xor & mask):
answer[i] = 2
available_letters[i] = 0
mask = mask >> 8
for i in range(5):
if answer[i] != 2:
if (w := guess[i]) in available_letters:
answer[i] = 1
available_letters[available_letters.index(w)] = 0
return answer[4] + 10*(answer[3] + 10*(answer[2] + 10*(answer[1] + 10*answer[0])))
@staticmethod
def encode(string):
code = 0
for c in string:
code = (code << 8) + ord(c)
return code
@staticmethod
def decode(number):
string = ""
for i in range(5):
string = chr(number & 0xFF) + string
number = number >> 8
return string
And it turns out that, so far, the bit packing is slower:
score1 0.086, score 0.079 = 0.919
This actually rather surprises me. Python must be really good at indexing strings.
Let’s see, could we push some bit work into the available letters thing. If we did would it be better?
What if the second loop checked our xor in the right position using a mask? Also, shouldn’t I be using 0xFF for my mask now? Yes, I should.
This doesn’t help at all:
def score1(self, solution):
answer = [0, 0, 0, 0, 0]
guess = self.word
soln = solution.word
available_letters = list(soln) # cannot cache this, we destroy it
xor = self.packed ^ solution.packed
mask = 0xFF << 32
for i in range(5):
if not (xor & mask):
answer[i] = 2
available_letters[i] = 0
mask = mask >> 8
mask = 0xFF << 32
for i in range(5):
if xor & mask:
if (w := guess[i]) in available_letters:
answer[i] = 1
available_letters[available_letters.index(w)] = 0
mask = mask >> 8
return answer[4] + 10*(answer[3] + 10*(answer[2] + 10*(answer[1] + 10*answer[0])))
However … I think we could go all the way to bit twiddling. That should be interesting. Here’s a sketch of the idea, but I need a break badly.
We’ll create the packed word, and a copy of it for the available characters. We’ll create score as zero. We’ll do our existing xor trick to find matches and if we get a match, we’ll or a 2 into the byte where our letter was, using our mask against 0x0202020202 to put the 2 only into the correct byte. Similarly for the other check. And then we can do our decimal conversion.
Alternatively, or later, we could actually create an entirely different kind of score. We only need two bits per character …
For now, I need a rest. And so do you.
See you next time!