One more go at concurrent. W11.
I’ve thought of a streamlined way to get the expected information values from the file data, and want to try another shot at parallelism.
The SolutionDictionary is a huge data structure. Its pickled form is 186.8 meg, and the in-memory form is probably larger than that. But the pinnacle of our analysis so far is a list of the 13,000 words in the allowed words list, each associated with a single real number, the expected information if you use that word as your first guess.
So, supposing that that’s what we want to know, it seems that we can get that information more directly. In essence, instead of creating and saving the SolutionDictionary, we can stream the information and just build up the pairs we need, word: info. And that process might be done to advantage on multiple processors.
Let’s find out.
Along the way, I have a p-baked idea, for 0 <= p <= 1, about an approach to infrequently used code bits that might be better than setting them up as tests and then skipping them.
- Admission
- I didn’t try my idea yet. Got too involved in the concurrency. Soon, maybe. Then again, maybe never.
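For the record, the pattern I mean by “setting them up as tests and then skipping them” is roughly this. A sketch only, with a made-up name, not code from this project:

@pytest.mark.skip(reason="only run by hand, when we want fresh numbers")
def test_occasional_timing_run(self):
    # infrequently used code lives here; remove the skip marker to run it
    ...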
Begin with a Test.
I have two small tests for expected information now. The smaller is this one:
def test_known_values(self):
    solutions = WordCollection.from_strings("azzzz", "zbzzz", "zzczz", "zzzdz")
    guess = Word("dcbay")
    gd = GuessDescription(guess, solutions)
    keys = set(gd.score_descriptions.keys())
    expected = {10, 100, 1000, 10000}
    assert keys == expected
    info = gd.expected_information()
    assert info == 2
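That expected value of 2 is easy to check by hand: each of the four solutions lands in its own score bucket, so we have four equally likely buckets, and the expected information is 4 * (1/4) * log2(4) = 2 bits.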
We’ll make another version of that test, calling a function to get the information.
def test_known_values_function(self):
    solutions = ["azzzz", "zbzzz", "zzczz", "zzzdz"]
    guess = "dcbay"
    info = expected_information(guess, solutions)
    assert info == 2
Note that I’m just working with strings and lists now, as part of making this as lean and efficient as I reasonably can.
We need the function. We basically edit existing methods to get this:
from math import log2

def expected_information(guess, solutions):
    buckets = {}
    total = len(solutions)
    for solution in solutions:
        score = compute_score(guess, solution)
        try:
            buckets[score] += 1
        except KeyError:
            buckets[score] = 1
    info = 0
    for count in buckets.values():
        probability = count / total
        info += probability * log2(1 / probability)
    return info
def compute_score(guess, solution):
    answer = [0, 0, 0, 0, 0]
    available_letters = list(solution)  # cannot cache this, we destroy it
    for i in range(5):
        if guess[i] == solution[i]:
            answer[i] = 2
            available_letters[i] = 0
    for i in range(5):
        if answer[i] != 2:
            if (w := guess[i]) in available_letters:
                answer[i] = 1
                available_letters[available_letters.index(w)] = 0
    return answer[4] + 10*(answer[3] + 10*(answer[2] + 10*(answer[1] + 10*answer[0])))
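As a quick check on the encoding: compute_score("dcbay", "azzzz") comes out to 10, because only the 'a' is present and out of position, and its 1 lands in the tens digit. The other three little solutions score 100, 1000, and 10000, which is where the expected key set in the first test comes from.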
I nearly trust this one test, but not quite, because I made a fatal mistake and the test still passed: I had buckets[solution] instead of buckets[score], and the test passed because there are four solutions that create four scores, so it came out the same. So let’s adapt the other test as well.
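If I wanted a bit more assurance, a test where two solutions share a score would have caught that mistake, because the bucket counts would no longer be one per solution. Here’s a sketch of such a test, with the expected number worked out by hand:

def test_shared_score(self):
    # "azzzq" and "azzzx" both score 10 against "dcbay", so they share a bucket
    solutions = ["azzzq", "azzzx", "zbzzz"]
    info = expected_information("dcbay", solutions)
    # buckets are {10: 2, 100: 1}: (2/3)*log2(3/2) + (1/3)*log2(3), about 0.918
    assert info == pytest.approx(0.918, abs=0.001)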
def test_first_info_function(self):
    collection_of_words = WordCollection.from_file("valid_solutions.txt").words
    solutions = [word.word for word in collection_of_words]
    word = "abate"
    expected_info = expected_information(word, solutions)
    assert expected_info == pytest.approx(4.63, abs=0.05)
That passes as well. I am satisfied that my two new functions will compute expected information from a guess and a set of solutions.
Now let’s see how long it takes to do all the words we have.
def test_all_words(self):
    collection_of_solution_words = WordCollection.from_file("valid_solutions.txt").words
    solutions = [word.word for word in collection_of_solution_words]
    collection_of_all_guess_words = WordCollection.from_file("valid_combined.txt").words
    guesses = [guess.word for guess in collection_of_all_guess_words]
    t0 = time.time()
    unordered = map(expected_information, guesses, repeat(solutions))
    # pair each guess with its info, since expected_information returns just the number
    ordered = sorted(zip(guesses, unordered), key=lambda pair: pair[1], reverse=True)
    print(time.time() - t0)
    print(ordered[0:10])
    assert False
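(The assert False at the end isn’t testing anything: it just makes the test fail so that pytest displays the printed timing and results.)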
About 28 seconds, and the same answers as we got before. So that’s good. We’re really only interested in the timing.
Parallelizing Idea
What I thought might be more efficient would be to set things up so that each parallel process would do 1/8 of the words, and that each process would read in the words and solutions, on the grounds that that’s faster than pickling and unpickling them.
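Concretely, I mean something like the sketch below, with each worker reading the files itself and taking every eighth guess. The names here (score_chunk, CHUNKS, all_expected_information_chunked) are made up for illustration; I haven’t built this.

from concurrent.futures import ProcessPoolExecutor

CHUNKS = 8

def score_chunk(chunk_index):
    # each worker re-reads the files rather than receiving pickled lists
    solutions = [w.word for w in WordCollection.from_file("valid_solutions.txt").words]
    guesses = [g.word for g in WordCollection.from_file("valid_combined.txt").words]
    my_guesses = guesses[chunk_index::CHUNKS]
    return [(guess, expected_information(guess, solutions)) for guess in my_guesses]

def all_expected_information_chunked():
    results = []
    with ProcessPoolExecutor(CHUNKS) as executor:
        for chunk in executor.map(score_chunk, range(CHUNKS)):
            results.extend(chunk)
    return sorted(results, key=lambda pair: pair[1], reverse=True)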
First, though, let’s just try the thing with the straightforward conversion to concurrent:
def test_all_words_concurrent(self):
    collection_of_solution_words = WordCollection.from_file("valid_solutions.txt").words
    solutions = [word.word for word in collection_of_solution_words]
    collection_of_all_guess_words = WordCollection.from_file("valid_combined.txt").words
    guesses = [guess.word for guess in collection_of_all_guess_words]
    t0 = time.time()
    with ProcessPoolExecutor(8) as exec:
        unordered = exec.map(expected_information, guesses, repeat(solutions))
        ordered = sorted(zip(guesses, unordered), key=lambda pair: pair[1], reverse=True)
    print(time.time() - t0)
    print(ordered[0:10])
    assert False
That runs in 6.5 seconds as opposed to 28. The processors are all pegged for just about the whole six and a half seconds.
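One knob worth knowing about, though I haven’t measured it here: the executor’s map takes a chunksize argument, default 1, which batches the submissions so that each trip to a worker carries a batch of guesses instead of one:

    unordered = exec.map(expected_information, guesses, repeat(solutions), chunksize=100)

The 100 is arbitrary; the point is just that it cuts the number of round trips between processes.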
Rather than go to the effort of developing an object to do the reading on the parallel side, let’s first explore the time needed to do the pickling. If it’s small, this is about as good as it gets.
def test_pickling(self):
    collection_of_solution_words = WordCollection.from_file("valid_solutions.txt").words
    solutions = [word.word for word in collection_of_solution_words]
    collection_of_all_guess_words = WordCollection.from_file("valid_combined.txt").words
    guesses = [guess.word for guess in collection_of_all_guess_words]
    unordered = map(expected_information, guesses, repeat(solutions))
    ordered = sorted(zip(guesses, unordered), key=lambda pair: pair[1], reverse=True)
    t0 = time.time()
    dump = pickle.dumps((guesses, solutions))
    pickle.loads(dump)
    dump = pickle.dumps(ordered)
    pickle.loads(dump)
    print(time.time() - t0)
    assert False
The time printed is 0.004 seconds.
That’s definitely not the bottleneck in our 6.5 seconds, even if we count the fact that concurrent will have to do the packing and unpacking 8 times: that’s still only 0.032 seconds.
For fun, I tried just using 4 cores, and the time goes up to 7.8 seconds, so the additional 4 “efficiency” cores are doing a bit of the work.
Summary
So, an interesting conclusion, and we didn’t go quite where I thought we would. I see no likely advantage to avoiding the pickling of this simpler data: pickling looks very unlikely to be the bottleneck. If we really wanted more speed, I suspect we would have to look at the efficiency of the scoring code itself. However, neither I nor my colleagues have any ideas that look better than what we have already.
We must be about tapped out on Wordle. Next time I’ll probably summarize learnings and either go back to the game or move on to something new, if some squirrel catches my eye. Ideas welcome.
I hope you’ll join me.