Censor Dispenser: How to deal with multi-word strings

https://www.codecademy.com/practice/projects/censor-dispenser

In the 4th item, you are tasked with creating a function that not only censors like the previous item, but also censors “negative words” from a list, starting with the 3rd instance in which any of those terms appears and continuing to the end of email three.

negative_words = ["concerned", "behind", "danger", "dangerous", "alarming", "alarmed", "out of control", "help", "unhappy", "bad", "upset", "awful", "broken", "damage", "damaging", "dismal", "distressed", "distressed", "concerning", "horrible", "horribly", "questionable"]

The sample solution provided actually does not do this particularly well, because it splits the entire string by spaces and then by line breaks, but never joins the pieces back into the original paragraphs (the line breaks are lost).

Formatting aside, the part that drove me to look into this in the first place was the challenge of catching ALL the terms on the list. If you look at that list, there is a string, “out of control”, which does not get censored by the sample solution.

The reason is that the original text string was split by spaces, so the comparison checks each split word separately and never “tags” the phrase for censoring. The code checks for “out”, then “of”, then “control” individually, while the list contains the single string “out of control”; nothing matches, so nothing is censored.
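You can see the mismatch in a two-line sketch (the text here is made up to match the email's last sentence):

```python
words = "the situation can spiral out of control".split(" ")
print("out of control" in words)   # False: the list holds single words only
print("out" in words)              # True, but "out" alone is not on the list
```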

Now the challenge:
.index() and .find() stop at the first instance in a string search. Since we are censoring from the third occurrence onward, how do you find/tag/censor the phrase “out of control”, which sits at the very end of the email?

Further question: How would this be handled if the phrase “out of control” was at the beginning?

Split it while keeping the delimiters:

['Hello', ', ', 'World', '!']

You can either roll your own split function or use the one in the module re

It’s either a word or not a word. No need to split multiple times, no need to care about lines or spaces or punctuation. There are only two kinds. Split every time the kind changes.

After that, you can keep this word+delimiter list, and create a second one containing only words (downcase them too so it’s all the same)

['hello', 'world']

Then you’ve got nice downcased words and nothing else that you can feed into your censoring logic.
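As a sketch of that cleaning step with the re module (a capture group makes re.split keep the delimiters; note that \w+ is only an approximation of “word”, so for example “Helena's” would split at the apostrophe):

```python
import re

text = "Hello, World!"

# A capture group in the pattern makes re.split keep the separators;
# filter out the empty strings it can produce at the edges.
parts = [p for p in re.split(r"(\w+)", text) if p]
print(parts)   # ['Hello', ', ', 'World', '!']

# Second list: only the words, downcased, for the censoring logic.
words = [p.lower() for p in parts if re.fullmatch(r"\w+", p)]
print(words)   # ['hello', 'world']
```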

When the results come back, you can put it back together again by matching the results to the list with the correctly cased words and the delimiters.

['Hello', ', ', 'World', '!']
[True,          False       ]


['*****', ', ', 'World', '!']


'*****, World!'
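That reassembly step can be sketched like this (assuming word pieces start with a letter and everything else is a delimiter):

```python
parts = ['Hello', ', ', 'World', '!']
censored = [True, False]          # one verdict per word, in order

result, w = [], 0
for p in parts:
    if p[:1].isalpha():           # a word piece (assumption: words start with a letter)
        p = "*" * len(p) if censored[w] else p
        w += 1
    result.append(p)

print("".join(result))            # *****, World!
```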

Thanks for replying ionatan!

However, I’m looking for a solution to fully solve the problem as follows:

Original string to be censored:

Board of Investors,

Things have taken a concerning turn down in the Lab.  Helena's (she has insisted on being called  Helena, we're unsure how she came to that moniker) is still progressing at a rapid rate. Every day we see new developments in her thought patterns, but recently those developments have been more alarming than exciting. 

Let me give you one of the more distressing examples of this.  We had begun testing hypothetical humanitarian crises to observe how Helena determines best solutions. One scenario involved a famine plaguing an unresourced country. 

Horribly, Helena quickly recommended a course of action involving culling more than 60% of the local population. When pressed on reasoning, she stated that this method would maximize "reduction in human suffering."

This dangerous line of thinking has led many of us to think that we must have taken some wrong turns when developing some of the initial learning algorithms. We are considering taking Helena offline for the time being before the situation can spiral **out of control**.

More updates soon,
Francine, Head Scientist

As you can see, the phrase “out of control” does not get censored because it is a multi-word string in the negative words list. The sample solution only iterates one word at a time, so the .split()/.join() approach doesn’t work: it compares each single word against the three-word string and never sees a match.

The only way I can see is to use .replace() for that phrase, but how do you get it to leave the first two instances of a negative word alone?
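For what it's worth, a pure-string sketch of “replace from the 3rd occurrence on” is possible by walking .find() forward (this ignores word boundaries, so a single word like “bad” would also match inside a longer word; the helper name below is just for illustration):

```python
def censor_from_third(text, phrase):
    # Walk .find() forward to locate the start of the 3rd occurrence.
    pos = -1
    for _ in range(3):
        pos = text.find(phrase, pos + 1)
        if pos == -1:
            return text           # fewer than 3 occurrences: nothing to censor
    # Keep everything before it; censor every occurrence from there on.
    return text[:pos] + text[pos:].replace(phrase, "X" * len(phrase))

print(censor_from_third("bad day, bad luck, bad news, bad end", "bad"))
# bad day, bad luck, XXX news, XXX end
```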


Once you have a list containing only the words, you can iterate through each position and look ahead as many words as the disallowed phrase has.


Ok,

But how would you check the list of email words against the list of disallowed words/phrases? Also, for future-proofing, what if additional phrases of different lengths were added to the disallowed list?

Pretty sure you already asked exactly that.

And this,

seems to answer itself?


You could write out both the list of disallowed words and the list of words from the input and then manually start crossing things out. You probably know exactly how to do it. Ask yourself, maybe?


Here is the Codecademy solution:

negative_words = ["concerned", "behind", "danger", "dangerous", "alarming", "alarmed", "out of control", "help", "unhappy", "bad", "upset", "awful", "broken", "damage", "damaging", "dismal", "distressed", "distressed", "concerning", "horrible", "horribly", "questionable"]

def censor_three(input_text, censored_list, negative_words):
  input_text_words = []
  for x in input_text.split(" "):
    x1 = x.split("\n")
    for word in x1:
      input_text_words.append(word)
  for i in range(0,len(input_text_words)):
    if (input_text_words[i] in censored_list) == True:
      word_clean = input_text_words[i]
      censored_word = ""
      for x in range(0,len(word_clean)):
        censored_word = censored_word + "X"
      input_text_words[i] = input_text_words[i].replace(word_clean, censored_word)
    count = 0
    for i in range(0,len(input_text_words)):
      if (input_text_words[i] in negative_words) == True:
        count += 1
        if count > 2:
          word_clean = input_text_words[i]
          for x in punctuation:
            word_clean = word_clean.strip(x)
          censored_word = ""
          for x in range(0,len(word_clean)):
            censored_word = censored_word + "X"
          input_text_words[i] = input_text_words[i].replace(word_clean, censored_word)
  return " ".join(input_text_words)
punctuation = [",", "!", "?", ".", "%", "/", "(", ")"]

Here is the Codecademy solution output:

Board of Investors,  Things have taken a concerning turn down in the Lab.  XXXXXX (she has insisted on being called  Helena, we're unsure how XXX came to that moniker) is still progressing at a rapid rate. Every day we see new developments in XXX thought patterns, but recently those developments have been more alarming than exciting.  Let me give you one of the more distressing examples of this.  We had begun testing hypothetical humanitarian crises to observe how XXXXXX determines best solutions. One scenario involved a famine plaguing an unresourced country.  Horribly, XXXXXX quickly recommended a course of action involving culling more than 60% of the local population. When pressed on reasoning, XXX stated that this method would maximize "reduction in human suffering."  This XXXXXXXXX line of thinking has led many of us to think that we must have taken some wrong turns when developing some of the initial learning algorithms. We are considering taking XXXXXX offline for the time being before the situation can spiral out of control.  More updates soon, Francine, Head Scientist

I am trying to understand why, in the context of what has been covered in lessons, the solution code cannot handle the “exception” of the negative words list item [‘out of control’] and the email string item […, ‘can’, ‘spiral’, ‘out’, ‘of’, ‘control’, … ].

Or rather, I get why, hence the original question. Up to this point we have not covered re or importing modules, only built-in string methods and list comprehensions. I am thinking you could search in batches of 3 manually, but then the edge case comes up: what if ‘this is a failure’ were added to the negative words list? Would the code be up to snuff? The challenge doesn’t actually test the code, so I could just skip this and move on, having wasted no more than the time spent typing in the forums.


Don’t search in batches of 3. Search in batches of however many words there are.

You can look at the phrase to see how long it is. You don’t need to say “it’s always 3”


censor_three has the requirement of allowing the first two occurrences. And also doing normal censoring with some other list.

First of all, the input should be cleaned up (the way I suggested is better than what this code does, because it would be more accurate both in producing clean words and in putting it back together as it was)
Really, the cleaning should happen before this function even starts, it’s a separate problem.

Then, one could use one of the previous censor functions to censor the words.

After that you’d have the original and a partially censored version.
From there you’d have to decide whether the further rules should be applied based on what is read from the original, or from the already partially censored one. I’d go with the original, seems nicer.

For each negative phrase, find all instances by location, looking something like:

{
  ('upset',) : [3, 19, 78, 350],  # made up locations, not actually gonna check
  ('out', 'of', 'control'): [92],
  ...,
}

Then, drop the first two of each of those lists, because the first two are allowed.

{
  ('upset',) : [78, 350],
  ('out', 'of', 'control'): [],
  ...,
}

Apply censoring at the remaining locations. For example, at position 78, censor out the next 1 word (the phrase ('upset',) has 1 word, while ('out', 'of', 'control') has three; get the length to tell how many).
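Put together, that find/drop/censor pipeline might look like this on a cleaned word list (a sketch; `words` is assumed to be already cleaned and lowercased, and phrases are tuples of words as above):

```python
def positions_to_censor(words, phrases):
    marked = set()
    for phrase in phrases:
        # All starting positions where the phrase matches, in order.
        hits = [i for i in range(len(words))
                if tuple(words[i:i + len(phrase)]) == phrase]
        # The first two occurrences are allowed; censor the rest.
        for start in hits[2:]:
            marked.update(range(start, start + len(phrase)))
    return marked

words = ['bad', 'news', 'bad', 'luck', 'out', 'of', 'control', 'bad', 'end']
print(positions_to_censor(words, [('bad',), ('out', 'of', 'control')]))
# {7} -- only the third "bad"; "out of control" appears just once, so it stays
```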


Keep in mind I’d be operating on lists of words, not a string. So position 3 means the 4th word, for example, and it wouldn’t matter what the punctuation is, because you would just say “word 3 should be censored”. The scheme I suggested for cleaning and restoring the input will take care of putting it back together correctly; the censoring logic doesn’t need to worry or care. That’s a separate problem.


That “solution” code is pretty bad in more than one way.
But can the problem be solved? Sure. I think I made a fairly convincing argument in this post that it can be solved in a robust way.


Searching for multiple words, I already described it so I’m saying absolutely nothing new here:

indata = ['toddler', 'riding', 'on', 'a', 'swing', 'licking', 'on', 'a', 'rock']
badphrase = ['on', 'a']
locations = []

for pos in range(len(indata)):
    if indata[pos:pos+len(badphrase)] == badphrase:
        locations.append(pos)

print(locations)  # [2, 6]

Note that the code doesn’t care in the slightest about how long the badphrase has to be. It’ll do the same thing as you’d do yourself. Look how long it is, compare that far.


Here are some more intermediary results:

Board of Investors,

Things have taken a concerning turn down in the Lab.  Helena's (she has insisted on being called  Helena, we're unsure how she came to that moniker) is still progressing at a rapid rate. Every day we see new developments in her thought patterns, but recently those developments have been more alarming than exciting. 

Let me give you one of the more distressing examples of this.  We had begun testing hypothetical humanitarian crises to observe how Helena determines best solutions. One scenario involved a famine plaguing an unresourced country. 

Horribly, Helena quickly recommended a course of action involving culling more than 60% of the local population. When pressed on reasoning, she stated that this method would maximize "reduction in human suffering."

This dangerous line of thinking has led many of us to think that we must have taken some wrong turns when developing some of the initial learning algorithms. We are considering taking Helena offline for the time being before the situation can spiral out of control.

More updates soon,
Francine, Head Scientist

split it. keep delimiters.

['', 'Board', ' ', 'of', ' ', 'Investors', ',\n\n', 'Things', ' ', 'have', ' ', 'taken', ' ', 'a', ' ', 'concerning', ' ', 'turn', ' ', 'down', ' ', 'in', ' ', 'the', ' ', 'Lab', '.  ', "Helena's", ' (', 'she', ' ', 'has', ' ', 'insisted', ' ', 'on', ' ', 'being', ' ', 'called', '  ', 'Helena', ', ', "we're", ' ', 'unsure', ' ', 'how', ' ', 'she', ' ', 'came', ' ', 'to', ' ', 'that', ' ', 'moniker', ') ', 'is', ' ', 'still', ' ', 'progressing', ' ', 'at', ' ', 'a', ' ', 'rapid', ' ', 'rate', '. ', 'Every', ' ', 'day', ' ', 'we', ' ', 'see', ' ', 'new', ' ', 'developments', ' ', 'in', ' ', 'her', ' ', 'thought', ' ', 'patterns', ', ', 'but', ' ', 'recently', ' ', 'those', ' ', 'developments', ' ', 'have', ' ', 'been', ' ', 'more', ' ', 'alarming', ' ', 'than', ' ', 'exciting', '. \n\n', 'Let', ' ', 'me', ' ', 'give', ' ', 'you', ' ', 'one', ' ', 'of', ' ', 'the', ' ', 'more', ' ', 'distressing', ' ', 'examples', ' ', 'of', ' ', 'this', '.  ', 'We', ' ', 'had', ' ', 'begun', ' ', 'testing', ' ', 'hypothetical', ' ', 'humanitarian', ' ', 'crises', ' ', 'to', ' ', 'observe', ' ', 'how', ' ', 'Helena', ' ', 'determines', ' ', 'best', ' ', 'solutions', '. ', 'One', ' ', 'scenario', ' ', 'involved', ' ', 'a', ' ', 'famine', ' ', 'plaguing', ' ', 'an', ' ', 'unresourced', ' ', 'country', '. \n\n', 'Horribly', ', ', 'Helena', ' ', 'quickly', ' ', 'recommended', ' ', 'a', ' ', 'course', ' ', 'of', ' ', 'action', ' ', 'involving', ' ', 'culling', ' ', 'more', ' ', 'than', ' 60% ', 'of', ' ', 'the', ' ', 'local', ' ', 'population', '. ', 'When', ' ', 'pressed', ' ', 'on', ' ', 'reasoning', ', ', 'she', ' ', 'stated', ' ', 'that', ' ', 'this', ' ', 'method', ' ', 'would', ' ', 'maximize', ' "', 'reduction', ' ', 'in', ' ', 'human', ' ', 'suffering', '."\n\n', 'This', ' ', 'dangerous', ' ', 'line', ' ', 'of', ' ', 'thinking', ' ', 'has', ' ', 'led', ' ', 'many', ' ', 'of', ' ', 'us', ' ', 'to', ' ', 'think', ' ', 'that', ' ', 'we', ' ', 'must', ' ', 'have', ' ', 'taken', ' ', 'some', ' ', 'wrong', ' ', 'turns', ' ', 'when', ' ', 'developing', ' ', 'some', ' ', 'of', ' ', 'the', ' ', 'initial', ' ', 'learning', ' ', 'algorithms', '. ', 'We', ' ', 'are', ' ', 'considering', ' ', 'taking', ' ', 'Helena', ' ', 'offline', ' ', 'for', ' ', 'the', ' ', 'time', ' ', 'being', ' ', 'before', ' ', 'the', ' ', 'situation', ' ', 'can', ' ', 'spiral', ' ', 'out', ' ', 'of', ' ', 'control', '.\n\n', 'More', ' ', 'updates', ' ', 'soon', ',\n', 'Francine', ', ', 'Head', ' ', 'Scientist', '\n']

Send only the actual words into logic:

['board', 'of', 'investors', 'things', 'have', 'taken', 'a', 'concerning', 'turn', 'down', 'in', 'the', 'lab', "helena's", 'she', 'has', 'insisted', 'on', 'being', 'called', 'helena', "we're", 'unsure', 'how', 'she', 'came', 'to', 'that', 'moniker', 'is', 'still', 'progressing', 'at', 'a', 'rapid', 'rate', 'every', 'day', 'we', 'see', 'new', 'developments', 'in', 'her', 'thought', 'patterns', 'but', 'recently', 'those', 'developments', 'have', 'been', 'more', 'alarming', 'than', 'exciting', 'let', 'me', 'give', 'you', 'one', 'of', 'the', 'more', 'distressing', 'examples', 'of', 'this', 'we', 'had', 'begun', 'testing', 'hypothetical', 'humanitarian', 'crises', 'to', 'observe', 'how', 'helena', 'determines', 'best', 'solutions', 'one', 'scenario', 'involved', 'a', 'famine', 'plaguing', 'an', 'unresourced', 'country', 'horribly', 'helena', 'quickly', 'recommended', 'a', 'course', 'of', 'action', 'involving', 'culling', 'more', 'than', 'of', 'the', 'local', 'population', 'when', 'pressed', 'on', 'reasoning', 'she', 'stated', 'that', 'this', 'method', 'would', 'maximize', 'reduction', 'in', 'human', 'suffering', 'this', 'dangerous', 'line', 'of', 'thinking', 'has', 'led', 'many', 'of', 'us', 'to', 'think', 'that', 'we', 'must', 'have', 'taken', 'some', 'wrong', 'turns', 'when', 'developing', 'some', 'of', 'the', 'initial', 'learning', 'algorithms', 'we', 'are', 'considering', 'taking', 'helena', 'offline', 'for', 'the', 'time', 'being', 'before', 'the', 'situation', 'can', 'spiral', 'out', 'of', 'control', 'more', 'updates', 'soon', 'francine', 'head', 'scientist']

pick 20 random words to censor (that’s my “logic”; it demonstrates how I represent the result: locations of words to be crossed out)

{0, 2, 8, 16, 19, 20, 36, 57, 67, 68, 69, 73, 84, 88, 96, 126, 138, 145, 152, 168}

Put it back together (iterate through the full split, counting the words, checking if they are marked as to be censored)

***** of *********,

Things have taken a concerning **** down in the Lab.  Helena's (she has ******** on being ******  ******, we're unsure how she came to that moniker) is still progressing at a rapid rate. ***** day we see new developments in her thought patterns, but recently those developments have been more alarming than exciting. 

Let ** give you one of the more distressing examples of ****.  ** *** begun testing hypothetical ************ crises to observe how Helena determines best solutions. One scenario ******** a famine plaguing ** unresourced country. 

Horribly, Helena quickly recommended a ****** of action involving culling more than 60% of the local population. When pressed on reasoning, she stated that this method would maximize "reduction in human suffering."

This dangerous line of ******** has led many of us to think that we must have ***** some wrong turns when developing some ** the initial learning algorithms. We are *********** taking Helena offline for the time being before the situation can spiral out of control.

**** updates soon,
Francine, Head Scientist

Hi people, can you tell me what is wrong with my code? I see that the solution is a long piece of code, very different from mine. But mine is giving me the output I (think?) want. Now I am starting to have second thoughts. Oh, and just to clarify, I know I am also censoring the space between the proprietary terms, but I really wanted to make my code as clean as possible:

def censor_email(email, censored_negative_words, censored_proprietary_terms):
    cens_word = ""
    for i in range(len(censored_negative_words)):
        if email.count(censored_negative_words[i]) > 1:
            email = email.replace(censored_negative_words[i], "x" * len(censored_negative_words[i]))
    for r in range(len(censored_proprietary_terms)):
        email = email.replace(censored_proprietary_terms[r], "x" * len(censored_proprietary_terms[r]))
    return email
            
print(censor_email(email_three, negative_words, proprietary_terms))