RUBY - algorithm for generating abstract keys


#1

Hello,
I recently finished Ruby and Ruby on rails courses and now I'm working on my personal project.
The biggest struggle for me so far was optimization of my code, because i need it to be as fast as possible

As an example, let user input be
input = "cg?"

i also have an array which contains letters of the alphabet
alphabet = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]

In the input "?" acts like blank tile in Scrabble - it could be any letter.
What i need from my program is to generate keys, which in this case would be
final_keys = ["2ac", "2bc", "2cc", "2ag", "2bg", "2cg", "2dg", "2eg", "2fg", "2gg", "3acg", "3bcg", "3ccg", "3cdg", "3ceg", "3cfg", "3cgg", "2c", "2g", "3cg"]
each key consists of number of letters and letters in alphabetical order. As you can see last three keys are different - this is becase in my code "2c" acts exactly as ["2cd", "2ce", "2cf" .... ] would act. I want it as fast as possible that's why im generating only "2c" instead of ["2cd", "2ce", "2cf" .... ].

My code For generating these keys is here:

        def choose_letters(letters, key)
          letters = letters.select{ |l| l <= key.chars.last }
        end
        
        letters = input.gsub(/\s+/, "").chars.sort
        keys = (2..letters.length).flat_map{ |n| letters.combination(n).map(&:join).map{|n| n = n.length.to_s + n.chars.sort.join } }.uniq
        keys_with_blanks = keys.select{ |k| k.include? "?" }.map{ |k| k.gsub("?","") }
        final_keys = keys_with_blanks.map do |k|
          current_letters = choose_letters(all_letters, k)
          k = current_letters.map{ |l| l.gsub(l,l+k) } 
        end.flatten.map{ |k| k.chars.sort.join }.uniq
        final_keys += keys_with_blanks

This is the slowest part

         final_keys = keys_with_blanks.map do |k|
           current_letters = choose_letters(all_letters, k)
           k = current_letters.map{ |l| l.gsub(l,l+k) } 
         end.flatten.map{ |k| k.chars.sort.join }.uniq

My input sometimes reaches up to 15 characters, for example input = "abcdefghijklmn?"
For this input my program needs about 3-4 seconds which is to long.

QUESTION

How can i refactor my code to make it even faster? Up to 13 characters it's very fast. I need it to be faster for up to 15 characters.


#2

Assuming you have a word list, why are you generating combinations?
Search in the list for words that match the pattern instead, you can probably afford to search through the whole list, but you can split it up by length first and after that it might just not matter


#3

My dictionary includes 2_700_000 ~ words. First it was a database - couple of seconds to find words with given rack of letters. Now the words are stored as a Trie.
Given a rack "tile" i sort the input to get "eilt". Then i generate keys, ex. "4eilt" and call trie.children("4eilt") which give me words "tile" and "lite". With other keys i get any possible word made from the rack using 2 up to input.length letters. This is very fast way of getting all the possible words. Problem started now when i want to implement blank tile functionality. Combinations get big and it slows me down. I'd call it efficient up to 13 characters. I need it to be very fast up to 15, hence I'm looking to improve my code somehow. Or maybe i need different approach with walking my Trie without using so many combinations. (searching is fast even when i want to check 500_000 keys or so.. Generating these keys can be slow - with 15 character input it takes 3-4 seconds :frowning:


#4

This is probably over my head... I'm just trying to reason what it is you are after.

alphabet = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
            "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
c = 13
a = alphabet[0...c].push('?').combination(3).map(&:join)
b = a.select{|x| x.include? '?'}

a is an ordered array, as is b.


#5

And why would searching through 500k words be slow when checking 500k keys is fast?

1 million 20-char random words in an array
Takes 0.4s to go through with == on one core
You probably have at most 500k words per length, so that's down to 0.2s
Split that up on 4 cores and you're at 50ms
Don't run it on a laptop and and it'll be a bit less still

Yeah, you can do some clever stuff.

But you're using something complicated that's several times slower than just walking through the list to argue that you need to do something clever


#6

Oh. You don't know the length of the result?

Well.. You could buy some time by doing the heavy lifting in C++ and use unix pipes to keep that program alive with the word list loaded in memory, ready to start searching as soon as it gets input (or a native ruby extension though I imagine that's a little more to figure out how to do). You can also do some initial pruning like (contains a, contains b, contains c, ..) and go for the list with the smallest match, that's no guarantee for worst case though, and takes quite a bit of memory

As for scrabble algorithms that beats me, and it's kind of unclear what you want to do as well. I'd have me a good long googling session


#7

I'm making scrabble solver app, with given the rack of letters, the app returns all words that could be made from that rack (no board). I'm storing my dictionary in a Trie structure - i'm using fast_trie gem. Maybe i don't understand how should i search for possible words - that's why i stored my words in a way described earlier in this topic.
Now i returned to Trie with words added exactly like they are - no weird keys. Still looking for answer how to solve my problem - search the trie over and over to find all the possible answers.
Here is the app - current version uses Trie with words stored the weird way - and i have to generate the keys. These are Polish words if anyone is curious. If you use 15 characters with 1 "?" - tile it gets slow, because it's generating the keys. Looking forward to find some effective way to search the regular-word Trie.

Edit: Sorry for previous overcomplicating the problem :slight_smile:


#8

Could you follow only the paths in the trie that would match? That would avoid generating all possibilities and instead only follow where there looks like there are words

But again you might need other constraints to narrow down the search as soon as possible. If you have different trees for "contains a" "contains b" "contains c" and so on to z, and then pick whichever one is smallest and start searching from there (how many words contain a? e? t? how big would the largest such tree be? would that be small enough? It would be trading memory for speed (assuming 50MB per tree, you'd need upwards to 1GB to have mostly-duplicated trees one for each letter, but perhaps that's enough to get reasonable-sized trees)


#9

Tree of all words is about 35 MB which is nice. Splitting it into many trees would be inefficient for sure - it would take much more space since many duplicate nodes would be created in each Trie.
About the first part of your post.. I guess your right - my idea of creating so many combinations must be the "brute-force" way - i guess i need to find a way to recursively check the nodes for existing children and then whole words.


#10

But does it matter if it takes a lot of space? If that gets the size down then it may be a very good tradeoff


#11

Actually it does matter. Searching 1 30mb Trie is as fast as searching 5 mb trie (one of many). Having one Trie has another advantage - user only needs to load this once - when visiting main page. If i had words split into A-Trie, B-Trie, C-Trie etc then if user input was "abcdefg" then it would have to load 7 different tries and it would vary from search to search. So having one trie makes is easy, because it loads once and is available until user is on the site :).
If I come up with an idea how to walk down the Trie and find words that can be made from the rack I won't have to make hundreds of thousands of combinations when input gets big. I'll let you know!


#12

Yeah, it takes logarithmic time to look up a word if you know which word it is

However, if you don't know, and need to look down several paths, then the size matters a whole lot

Can I see the list?


#13

Here you can download polish dictionary i use - these are over 2_700_000 words viable in word games, such as scrabble.
Here is fast_trie gem i use. You can download it initialize a trie like this

require 'trie'
trie = Trie.new
IO.readlines("path/to/file.txt").each { |w| trie.add(w.chomp) }
trie.save("any_name_you'd_like")

Then you can load it trie = Trie.read("path/to/trie") and play with it trie.has_key? "komputer".


#14

Apparently everything in that list has an a and an i in it.
so that doesn't really help cut things down

But if you know which letters you're allowed to use, then you could go down all paths in the trie that can be made from them, would that be sufficient?


#15

Yes, I guess if I do that it will be incredibly fast.
I'm working on a recursive function to make that happen.


#16

I think i need something like:

search_word(node, rack)
  $words.push(node.full_state) if node.terminal? # add words to list if it exists
  return if rack.empty? && used_last_letter
  search_word(next_letter(node), take_from(rack)) if rack.empty? #when run out of tiles, switch first letter and search again
  search_word(append(node), take_from(rack)) # go to next node if it has children 
end

Oversimplification for input "hello"

    search_word("h", "ello") -> not a word but has children
    search_word("he", "llo") -> "he" is a word, make note
    search_word("hel", "lo") -> not a word but has children
    search_word("hell", "o") -> "hell" is a word
    search_word("hello", "") -> "hello" is a word, rack is empty so call yourself with next letter
    search_word("e","hllo") -> not a word but has children
    search_word("eh","llo") -> "eh" is a word, no children for "l", "l" or "o", call with next letter...

And so until it reaches return . Did i miss something?


#17

Couldn't you just follow all children for which you still have the corresponding tile?

If you're at he and you have an 'l' tile and an 'a' tile then you can search both 'hel' and 'hea' but not 'her' because you don't have an 'r' tile


#18

append() part will take care of it. It will call search_word only for letters having children(from the rack passed to function) - which is what you've just said. Now need to code that though

After I make this recursive search i'm sure searching will be super fast and I won't have to generate any combinations - which was time consuming. Then I need to take care of wildcards


#19

wildcard tiles are about the same as a letter

if have the letter, search child
else if have wildcard, search child


#21

require 'trie'
# load full word trie
$trie = Trie.read("full_word_trie")
puts "TRIE LOADED"
#methods
def substract(arr1, arr2) # ["a", "a", "b"] - ["a"] = ["a", "b"], instead of ["b"]
  arr1 if arr2.empty?
  str = arr1.sort.join
  arr2.each { |l| str.sub!(l,"") }
  str.chars
end
#search method
def search_word(keys)
  children = []
  return if keys.empty?
  if keys[0].length == $max_length
    keys.each do |key|
      $words.push(key) if $trie.has_key? key
    end
  else
    keys.each do |key|
      rack = substract($letters, key.chars)
      rack.each do |tile|
        my_key = key + tile
        $words.push(my_key) if $trie.has_key? my_key
        children.push(my_key) if $trie.has_children? my_key
      end
    end
  end
  search_word(children)
end
#
# variables
$letters = %w[a b c d e f g h i j k l m n o]
$max_length = $letters.size
$words = []
keys = $letters.uniq

search_word(keys)
$words.uniq!
puts "Found words: #{$words.size}"

This is my take on searching the tree for words made from given letters. For input.length = 15 it's still much less than a second to find all the possibilities. Now need to painlessly handle wildcards.

Edit: after benchmarking it, it takes about 310ms per 15 letter input without wildcards so far.