If there are N words in the dictionary, and the longest word has length w, this is O(N*(w**2)), which I think is pretty good if N is large and w is small.
Define a "club" as a set of words which are all directly connected to one another.
"fat", "hat", "sat"
all belong to the club named '.at', because they can all be converted by a substitution
at the position of the '.'.
For the intermediate challenge, "at" is also added to the ".at" club, because "at" can be converted to any
of those words by an insertion at the '.', and any of those words can be converted to "at" by a deletion.
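Here's a minimal sketch of building those clubs (the name make_clubs is mine, not from my actual code; this is the substitution-only version, with the intermediate extension left as comments):

```python
from collections import defaultdict

def make_clubs(words):
    """Map each club name (a pattern with one '.') to its member words."""
    clubs = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            # "fat" joins '.at', 'f.t', 'fa.'
            clubs[word[:i] + '.' + word[i+1:]].add(word)
            # For intermediate: also join the clubs reached by inserting
            # a '.' -- "at" joins '.at', 'a.t':
            # clubs[word[:i] + '.' + word[i:]].add(word)
        # For intermediate: insertion at the end -- "at" joins 'at.':
        # clubs[word + '.'].add(word)
    return clubs
```

With the example above, make_clubs(["fat", "hat", "sat"]) maps '.at' to {'fat', 'hat', 'sat'}; with the commented lines enabled and "at" in the word list, "at" would join that club too.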
Breadth-first search involves moving from a word to all of its clubs, and from clubs to all of their words.
Once a club has been expanded, it is emptied (there is no more need to use it; all of its members have been visited).
This helps avoid the O(N**2) behavior of dense graphs (where N nodes may have O(N**2) edges).
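A sketch of that search, assuming the defaultdict of clubs built by the make_clubs sketch above (again, illustrative names rather than my exact code):

```python
from collections import deque

def ladder_length(begin, end, clubs):
    """BFS alternating word -> clubs -> words. Returns the number of
    words on the shortest path (begin and end included), or None."""
    queue = deque([(begin, 1)])
    seen = {begin}
    while queue:
        word, count = queue.popleft()
        if word == end:
            return count
        for i in range(len(word)):
            club = word[:i] + '.' + word[i+1:]
            for member in clubs[club]:
                if member not in seen:
                    seen.add(member)
                    queue.append((member, count + 1))
            # Empty the club: all members are visited, never scan it again.
            clubs[club] = set()
    return None
```

Each club is touched once, so the work is proportional to total club membership rather than to the number of word-to-word edges.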
Another optimization (good for dense graphs) is to search simultaneously from both "begin" and "end", and stop when the two searches first meet.
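A sketch of that bidirectional idea (same assumed clubs; note the clubs are not emptied here, since both directions need them):

```python
def bidirectional_length(begin, end, clubs):
    """Meet-in-the-middle BFS. Counts words on the path, like ladder_length."""
    if begin == end:
        return 1
    dist_f, dist_b = {begin: 1}, {end: 1}   # distance maps per direction
    front, back = {begin}, {end}            # current frontiers
    while front and back:
        # Always expand the smaller frontier (swap so `front` is expanded).
        if len(front) > len(back):
            front, back = back, front
            dist_f, dist_b = dist_b, dist_f
        nxt = set()
        for word in front:
            d = dist_f[word]
            for i in range(len(word)):
                for member in clubs[word[:i] + '.' + word[i+1:]]:
                    if member in dist_b:
                        return d + dist_b[member]  # the searches met
                    if member not in dist_f:
                        dist_f[member] = d + 1
                        nxt.add(member)
        front = nxt
    return None
```

Expanding whichever frontier is currently smaller keeps the two search balls roughly the same size, which is where the savings on dense graphs come from.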
Edit: If words are long (longer than the alphabet size), you could do something similar. Any time you "find" a word by doing a substitution at position x, tell that word that when it searches for its own neighbors, it doesn't need to try substitutions at position x; any such words have already been found. Similar logic could probably be used for the intermediate code (I need to think about it). This probably gives a complexity (N words, word size w, alphabet size a) of O(N*w*a).
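Roughly, for a search that enumerates letter substitutions directly (strategy (b) in Edit3 below), the idea would look like this (a sketch I haven't benchmarked, assuming lowercase words):

```python
from collections import deque
from string import ascii_lowercase

def bfs_with_position_skipping(begin, end, words):
    """BFS that enumerates substitutions, skipping positions already spent.
    When `cand` is found from `word` at position i, every word reachable
    from `cand` by substituting at i was also reachable from `word` at i,
    so `cand` never needs to try position i itself."""
    words = set(words)
    queue = deque([(begin, 1, frozenset())])  # (word, count, spent positions)
    seen = {begin}
    while queue:
        word, count, skip = queue.popleft()
        if word == end:
            return count
        for i in range(len(word)):
            if i in skip:
                continue
            for c in ascii_lowercase:
                cand = word[:i] + c + word[i+1:]
                if cand in words and cand not in seen:
                    seen.add(cand)
                    queue.append((cand, count + 1, frozenset([i])))
    return None
```

Each discovered word inherits one "spent" position; the words it would have found there were all enqueued by its parent in the same pass, so no distances are missed.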
Edit2: I downloaded an online list of English words (more than 300k words) and modified the code a bit so that once the set of clubs was made, it could be reused (instead of emptying a club when visited, just mark it as visited). It turns out that connections are pretty sparse. With repeated testing of random pairs of words, the search rarely hit more than 10k words before either finding a path or proving there was no path. That means that preprocessing the entire dictionary is not efficient unless you are going to search for paths between multiple pairs of words. It takes my code about 100 seconds to preprocess the dictionary, and typically about 0.15 seconds to run the challenge on a pair of words. When it finds a long path (the longest I saw was 18 steps), that takes about 3 seconds.
Edit3: A bit more on my thought process for this challenge. To solve this problem, there are two issues:
1) Find the single-step neighbors of a word.
2) Using (1), find the path (or perhaps just the cost of the path) from beginWord to endWord.
For item (2), a simple breadth-first search seems appropriate. You could do some optimistic searching (prefer to expand words with a lower edit distance to the target), but with so many dead ends, the extra work doesn't seem worth it.
For finding a word's neighbors (item (1)), the potential search strategies seem to be:
a) Look at every word in the dictionary to see whether it is a neighbor of the current word: O(dictionarySize*wordSize) to find a single word's neighbors.
b) Look at every "transformation" of the current word, and see if that transformation is in the dictionary (after putting the dictionary into a set). So for "cat", a possible transformation is "cgt", so we'd check whether "cgt" is in the dictionary. Preprocessing the dictionary is O(dictionarySize*wordSize), but finding a word's neighbors is O(wordSize*alphabetSize). If the dictionary is bigger than the alphabet, and you end up doing very many "neighbor" searches, this is a win. (The sketch under the first Edit above enumerates transformations this way.)
c) Put words into groupings that make things more efficient. Possible groupings to put neighbors together:
- Group words by length, since even for the intermediate challenge, neighbors differ in length by at most one.
- Group words by "first half" and "second half". Any neighbor of mine (with the same length) has either the same first half, or the same second half, and that search is easy to extend to slightly longer or shorter neighbors. Consider looking for neighbors of "football". The dictionary I downloaded has over 49k eight-letter words, but only 88 of them either begin with "foot" or end with "ball". This costs more preprocessing (each word is now in two hash tables), but a single hash-table lookup gives a fairly short list of candidate matches.
- Lots of other potential groupings (e.g. group words by their count of each vowel). For all such counting strategies, two words can only be neighbors if each of their counts differs by no more than one.
- I ended up taking the grouping a bit further, so that two words are neighbors if and only if they share a group (this is exactly the "club" scheme described at the top). The disadvantage is that it takes a lot of preprocessing, O(dictionarySize*wordSize*wordSize) in both time and space. One advantage was some simplicity (never having to check for partial matches). Also, when several words are in the same group (cat, hat, fat, mat, pat, sat, bat, rat, vat), there are O(N^2) edges between the words, but the search only ever has to spend time on O(N) of those edges (there are O(N) edges between the "group" and its words). In hindsight, large groups like this are probably uncommon for longer words.
- Other potential optimizations (not done):
++ Group all words by length (fast in Python), but don't put them into other groups (hash tables) unless/until a search actually reaches their word length.
++ Use different techniques based on word length and the number of words of a particular length. The dictionary I downloaded had fewer than four hundred words of length 20. Any non-trivial processing of a list that short may be wasted.
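To make the "half" grouping concrete, here is a rough sketch of the same-length case (names are mine; extending to length±1 neighbors means also probing the tables for adjacent lengths, as noted above):

```python
from collections import defaultdict

def make_half_tables(words):
    """For each word length, index words by their first and second halves."""
    by_first, by_second = defaultdict(list), defaultdict(list)
    for word in words:
        n = len(word)
        by_first[(n, word[:n // 2])].append(word)
        by_second[(n, word[n // 2:])].append(word)
    return by_first, by_second

def neighbors(word, by_first, by_second):
    """A same-length neighbor differs in exactly one position, so it keeps
    either my first half or my second half intact; check both short lists."""
    n = len(word)
    candidates = set(by_first[(n, word[:n // 2])]) | set(by_second[(n, word[n // 2:])])
    return [c for c in candidates
            if c != word and sum(a != b for a, b in zip(word, c)) == 1]
```

For "football" this reduces the candidate list from 49k eight-letter words to the 88 that share a half, and the final one-position check over that short list is cheap.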