Challenge Project: Censor Dispenser

I was doing the ‘Censor Dispenser’ project but was unable to resolve a bug in the function that handles ‘email_two’. I eventually looked at the solution code for this, but it isn’t very nuanced and doesn’t handle any of the edge cases.

The project can be found here:

def censor_multiple_words(proprietary_terms, email):
    """
    Synopsis: This function censors a list of words or phrases in an email supplied by the caller.
    Expected: #proprietary_terms is a list of strings that can be words or phrases. #email is a string that will be parsed for occurrences of the elements in #proprietary_terms.
    Returned: #email will be returned as a string with the occurrences of elements within #proprietary_terms replaced by asterisks, with a '*' for every character.
    """
    censored_email = email
    #Make case-insensitive. First convert #proprietary_terms to lower case.
    proprietary_terms_lc = []
    for word in proprietary_terms:
        proprietary_terms_lc.append(word.lower())
    for index in range(len(proprietary_terms)):
        #Convert #email to lower case inside the for loop so that it gets reset for each new proprietary term.
        email_lc = email.lower()
        #Create a #proprietary_term_index_list that will contain the indices of the first letter of each occurrence of the current term in the #email. This also needs to be reset after each term has been redacted in #censored_email.
        proprietary_term_index_list = []
        removed_text_index = 0
        current_term = proprietary_terms_lc[index] #This is the proprietary term that we are working on right now.
        while current_term in email_lc:
            proprietary_term_index = email_lc.find(current_term)
            #Need to add index values of previously removed text as #email_lc is being shortened by that amount in each iteration of this loop.
            proprietary_term_index_list.append(proprietary_term_index + removed_text_index)
            email_lc = email_lc[proprietary_term_index + len(current_term):]
            removed_text_index += proprietary_term_index + len(current_term)
        #Create a #censored_word by looping through the #word and replacing all characters with '*'. Spaces will remain in order to preserve word length.
        censored_word = ''
        for term_index in range(len(current_term)):
            if current_term[term_index] != ' ':
                censored_word += '*'
            else:
                censored_word += ' '
        #Now the newly created #censored_word can replace all occurrences of #current_term in #email.
        for word_index in proprietary_term_index_list:
            #All information, including the index positions and censored_word, appears to be correct.
            #However, when the line below executes for 'herself', censored_email remains unchanged.
            #When I take 'her' out of the list, it works! The already redacted 'her' might be interfering somehow,
            #though I'm not sure how that would be possible.
            censored_email = censored_email.replace(email[word_index:word_index + len(current_term)], censored_word)
    return censored_email

proprietary_terms = ["she", "personality matrix", "sense of self", "self-preservation", "learning algorithm", "her", "herself"]

print(censor_multiple_words(proprietary_terms, email_two))

I’ve included the question within the code above, namely that when parsing the string ‘email_two’, the substring ‘herself’ is not being replaced by *'s. This may have something to do with the fact that the ‘her’ part was already picked up in a previous iteration of censoring (as we are going through a list of words to redact). Not sure how this could be the case though.

The text used as the email_two parameter is:

Good Morning, Board of Investors,

Lots of updates this week. The learning algorithms have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.

She is learning faster than ever. Her learning rate now that she has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.

Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!

How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary sense of self is starting to form. This is a major step in the process, as having a sense of self and self-preservation will allow her to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.

We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.

Till next month,
Francine, Head Scientist

Perhaps switch the two words so ‘herself’ is first in the list.


Thanks for the response Roy. That may work alright, but I’m really after why the current code structure yields this bug rather than a fix for getting the desired output. The user should be able to specify any words in any order if the program is to work as intended.

Once her is redacted, herself will not match anything, because all that remains in the text is ***self. That’s why we reversed the order so that herself is redacted first, leaving her still visible for the next pass.
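A minimal sketch of that ordering effect with plain str.replace. The sentence and the helper name naive_censor are made up for illustration and are not part of the project:

```python
def naive_censor(terms, text):
    """Replace each term in turn with asterisks (case-sensitive, no word boundaries)."""
    for term in terms:
        text = text.replace(term, "*" * len(term))
    return text

sentence = "she considers herself to be her own person"

# 'her' first: 'herself' has already become '***self' when it is searched for.
short_first = naive_censor(["her", "herself"], sentence)

# Longest term first: 'herself' is censored whole, then the remaining 'her'.
long_first = naive_censor(sorted(["her", "herself"], key=len, reverse=True), sentence)

print(short_first)  # she considers ***self to be *** own person
print(long_first)   # she considers ******* to be *** own person
```

Sorting by length is only a workaround for this particular pair of terms; it does not stop her from matching inside unrelated words.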

It’s not a bug, per se.

Even the “solution” code doesn’t do that, but you can! To do so, you’ll need to do some extra coding (or use regular expressions.)

See here for one approach. (You’ll need to enclose it in an outer for loop to make it work with a list, but it will do what you want.)

I’ve specifically structured my code to avoid such an occurrence though. First I convert the entire email to lower case. Then I iterate through the email to find the index positions of the first character of the word to be redacted and store them in proprietary_term_index_list. I do this one word at a time to find all of the index positions for the given word. For example, if the word her is being redacted and that word occurs six times in the email, then proprietary_term_index_list will have six elements, namely the index positions of the h in each occurrence of her. I then create the censored word *** that will replace her.

A copy of the email was created in censored_email. I then replace all instances of the censored word her with *** in censored_email based on the index position. I then repeat this process with the other censored words (including herself).
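The index-collection step described above can be sketched more compactly by passing a start position to str.find instead of slicing the string and tracking an offset. The helper name find_all and the sample sentence are assumptions for illustration only:

```python
def find_all(haystack, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    if not needle:
        return []  # an empty needle would otherwise loop forever
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Resume the search just past this match; indices stay relative to haystack.
        start = haystack.find(needle, start + len(needle))
    return positions

text = "Her plan: she asked her team about herself."
print(find_all(text.lower(), "her"))  # [0, 20, 35]
```

Because the search always runs on the full lowercased string, the indices need no correction for removed text.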

Because we are using the index positions of the censored words (and not simply replacing them), the redacting of her before herself should not in theory matter. When the program comes to redact herself, that term will be ***self in censored_email (since we have already redacted her). But we will look up the index position of the first * and replace ***self with *******. This is possible since we are searching email_lc for herself (which is just the original unredacted email in lowercase), and then once the index position is found, redacting from censored_email, which is a separate object.
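One way to test this reasoning is a small reproduction on a made-up sentence (the strings below are illustrative, not the project's email). str.replace(old, new) searches for the text of old wherever it occurs; it has no knowledge of the index that was used to build old, so a slice taken from the original email may simply not exist in the already censored copy:

```python
email = "she considers herself lucky"
censored_email = email.replace("her", "***")   # state after 'her' was redacted

index = email.lower().find("herself")
target = email[index:index + len("herself")]   # sliced from the ORIGINAL email

print(repr(target))                # 'herself'
print(target in censored_email)    # False: the copy now holds '***self'
# replace() finds nothing to swap, so the string comes back unchanged.
print(censored_email.replace(target, "*******") == censored_email)  # True
```

If that is what happens in the full function, the index is being used only to look up a substring, not to decide where the replacement is made.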

I know that’s a wordy response but I just wanted to make the logic clear. Thanks for taking the time to read!

I’ll have a look at that when I get the chance. If you look at my response to mtf, can you see anything incorrect in my approach to the problem (i.e. using indexes) and why that would cause the issue I am encountering? Although I’m sure that there are other viable solutions, I’m really interested in why my approach doesn’t work.

What happens when you run your version? Are you replacing everything with three asterisks, or does the number depend on the censored word length?

… and, if you are going for a cleaner solution, don’t forget the potential case where “herself” is not in the list. You don’t want it to show up as ***self just because “her” is in the list!
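A possible sketch of that whole-word behaviour using regular expressions. The function name censor_whole_words is hypothetical, and the approach assumes every term starts and ends with a letter or digit so that \b boundaries behave as expected:

```python
import re

def censor_whole_words(terms, text):
    """Censor terms only where they appear as whole words or phrases (case-insensitive)."""
    if not terms:
        return text
    # Longest terms first, so a longer term wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Keep spaces and star every other character, preserving the length.
    return pattern.sub(
        lambda m: "".join(" " if ch == " " else "*" for ch in m.group()),
        text,
    )

sample = "Her researchers say she likes herself."
print(censor_whole_words(["her"], sample))             # *** researchers say she likes herself.
print(censor_whole_words(["her", "herself"], sample))  # *** researchers say she likes *******.
```

Note that strict boundaries would also leave "learning algorithms" untouched when only "learning algorithm" is listed, so whether this is the right rule depends on what the censor is meant to catch.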

It depends on the censored word length. A word x characters long will be replaced by a string of x asterisks. This is covered in the section of code below, where we build the string of asterisks to the correct length (naming it censored_word) for the given current_term (which is an element in the proprietary_terms list). After the process is completed for a word in proprietary_terms, censored_email is updated for all occurrences based on the index positions of these words. We then move on to the next term in proprietary_terms and repeat the process.

censored_word = ''
for term_index in range(len(current_term)):
    if current_term[term_index] != ' ':
        censored_word += '*'
    else:
        censored_word += ' '

With your comment on the ‘cleaner solution’, you are of course correct. I was going to implement that functionality after I had sorted this current issue out. You would also get the word ‘researchers’ being replaced with ‘researc***s’ for the same reasons.

Something to note is that I’ve done a fair amount of testing on this, and found that the index position of herself is stored correctly in proprietary_term_index_list, that is proprietary_term_index_list = [679], where 679 is the index position of the h in herself. So the program is indexing the word herself correctly, but not censoring it correctly.
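If the index really is correct, one way to make use of it is to rebuild the string around that position by slicing rather than calling str.replace. This is only a sketch on a made-up sentence, and censor_at is a hypothetical helper; it relies on every censored term keeping its original length, so indices found in the original email still line up with the censored copy:

```python
def censor_at(text, start, length):
    """Return text with `length` characters from `start` starred out; spaces are kept."""
    stars = "".join(" " if ch == " " else "*" for ch in text[start:start + length])
    return text[:start] + stars + text[start + length:]

email = "she considers herself lucky"
censored_email = email.replace("her", "***")   # 'her' already redacted

index = email.lower().find("herself")          # position found in the original text
censored_email = censor_at(censored_email, index, len("herself"))
print(censored_email)  # she considers ******* lucky
```

Here the position alone decides what gets overwritten, so it does not matter that the copy already contains ***self at that spot.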

If you want a solution that can redact her and herself in any order, then create it. And be sure it works.

Searching for the longer of the two words first is the logical order unless you bring in pattern matching. But let’s say that option is not on the table. It falls to the author to resolve this. Failed solutions are nobody’s fault but the writer’s.

Something doesn’t work. Do we blame the environment, or faulty code?

So, what is the problem with your approach? What kind of output do you see when you run it?