Letter Frequency Analysis?


#1

Hi all,
I'm a relatively new coder. I took a few classes in high school, but it's been years, so a lot has slipped my mind. I'm trying to write a program where you enter text into a text box and click submit, and the program then highlights or changes the font color of the most common one-, two-, three-, and four-letter combinations. For example, if the input text reads "The thick thesaurus thumped around", I want the program to highlight all the "TH"s and show a readout at the bottom that says something like "TH=4". Is something like that even possible? I'm sure this is a really dumb question, but hopefully you guys can help me out. Thanks so much.


#2

@johobus28 ,

In order to solve this problem, you need to define it in sufficient detail to eliminate ambiguities. For example, in this text ...

The thick thesaurus thumped around

... the letter combination "th" does indeed occur four times. However, the combination "the" occurs twice, and both of those occurrences include the "th" combination. If you intend to highlight or color code the most common combinations, how will you handle such overlaps?
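
To make the overlap concrete, here is a minimal Python sketch that records where each combination occurs and reports which occurrences share characters. The function names find_spans and overlapping are made up for this illustration, and the matching is assumed to be case-insensitive; it only shows the problem and does not decide how overlaps should be highlighted.

```python
def find_spans(text, combo):
    """Return (start, end) index pairs for every occurrence of combo in text."""
    text = text.lower()
    combo = combo.lower()
    spans = []
    start = text.find(combo)
    while start != -1:
        spans.append((start, start + len(combo)))
        # step forward by one character so overlapping repeats are also found
        start = text.find(combo, start + 1)
    return spans


def overlapping(spans_a, spans_b):
    """Return the pairs of spans, one from each list, that share a character."""
    return [(a, b) for a in spans_a for b in spans_b
            if a[0] < b[1] and b[0] < a[1]]


sample = "The thick thesaurus thumped around"
th_spans = find_spans(sample, "th")
the_spans = find_spans(sample, "the")
print(len(th_spans), len(the_spans))      # 4 2
print(overlapping(th_spans, the_spans))   # [((0, 2), (0, 3)), ((10, 12), (10, 13))]
```

Each reported pair is a place where highlighting "th" and highlighting "the" would compete for the same letters, so the design has to choose one, nest them, or use different colors.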


#3

Hello, again, @johobus28 ,

Is this for an assignment? If so, you will need to use whatever programming languages and tools the instructor prescribes to implement a solution. One good direction may be to use HTML, CSS, and JavaScript to create an input text area and perhaps a button on a web page, and then an output area for the results.

Following is some Python 3.x code that illustrates an algorithm that might work for you. Obviously, it does not create a web page, but it might help you think about the problem. You can translate it to JavaScript or another language of your choosing, and then adapt it for producing friendly output, as part of your solution.

# LetterCombinations.py
# March 18, 2016

def count_combinations(text, n):
    """Return a dictionary of n-letter combination frequencies in text."""
    counts = dict()
    # slide a window of n characters across the text, one position at a time
    for i in range(len(text) - n + 1):
        combo = text[i: i + n] # named combo so the built-in slice is not shadowed
        if combo.isalpha(): # skip windows containing spaces, digits, or punctuation
            counts[combo] = counts.get(combo, 0) + 1
    return counts

text = input("Enter the text: ")
text = text.lower() # make this case-insensitive
one = count_combinations(text, 1) # dictionary of one-letter combination frequencies
two = count_combinations(text, 2) # dictionary of two-letter combination frequencies
three = count_combinations(text, 3) # dictionary of three-letter combination frequencies
four = count_combinations(text, 4) # dictionary of four-letter combination frequencies
# display dictionary contents
print(one)
print(two)
print(three)
print(four)
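
To get from raw frequencies to the "TH=4" readout and some kind of highlighting, something like the following might work. It is only a sketch and stands on its own: the names ngram_counts, most_common_combo, and highlight are hypothetical, and the square brackets stand in for whatever markup or color the final web page would use.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive letters in text, ignoring case."""
    text = text.lower()
    windows = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(w for w in windows if w.isalpha())


def most_common_combo(text, n):
    """Return (combo, count) for the most frequent n-letter combination, or None."""
    counts = ngram_counts(text, n)
    if not counts:
        return None  # no run of n letters exists in the text
    return counts.most_common(1)[0]


def highlight(text, combo):
    """Wrap each non-overlapping, case-insensitive match of combo in brackets."""
    return re.sub(re.escape(combo), lambda m: "[" + m.group(0) + "]",
                  text, flags=re.IGNORECASE)


sample = "The thick thesaurus thumped around"
combo, count = most_common_combo(sample, 2)
print(f"{combo.upper()}={count}")   # TH=4
print(highlight(sample, combo))     # [Th]e [th]ick [th]esaurus [th]umped around
```

Note that the counting step includes overlapping occurrences, while re.sub only marks non-overlapping matches, so the two can disagree for a combination that overlaps itself, such as "aa" in "aaa".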

From here, I'll leave it to you and other users and Moderators who code in HTML, CSS, and JavaScript more regularly than I do to discuss the details of a complete solution.