Hey Badger, perhaps you can help me with #3? I’m having some issues:
Sure! I’ll be more than happy to do that!
To be frank there’s no quick and easy way to do this step correctly.
Even the Codecademy dev’s code doesn’t REALLY work as it should! Yeah, it’s short and simple, but it replaces parts of words it shouldn’t touch! For example, it changes “researchers” to “researcXXXs”.
There’s “her” in there, but that’s definitely not that word!!!
If you want to be precise about what your code does, you’ll definitely need to handle punctuation before you look for and replace the word. And if you don’t want to bother with singular vs. plural forms, just change “learning algorithm” to “learning algorithms” in the proprietary_terms list, since it only appears in the plural form. Then I’d just look for whole phrases in the original string and use `.replace()`, since it’s harder for a whole phrase to match some random sequence of words and create a mismatch.
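For illustration, here’s a minimal sketch of that phrase-first `.replace()` idea — note the `proprietary_terms` values below are made-up stand-ins, not the exercise’s actual list:

```python
# Hypothetical stand-in for the exercise's proprietary_terms list
proprietary_terms = ["personality matrix", "learning algorithms"]

def censor_phrases(email):
    for phrase in proprietary_terms:
        # A whole phrase rarely matches by accident inside another word,
        # so a plain replace is usually safe here
        email = email.replace(phrase, "X" * len(phrase))
    return email

print(censor_phrases("the personality matrix uses learning algorithms"))
```

So e.g. `censor_phrases("the personality matrix is ready")` masks the whole phrase and leaves the rest of the string untouched.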
But back to your code
`e = email.split()` splits the string at whitespace, so:
- you won’t be able to put the string back together with the original line breaks (break the string into lines first)
- your code can’t pick up on phrases, since for example “personality matrix” != “personality”, “matrix”
- I can see that your output is way too long. That’s because your `n.append()` is nested inside a few loops - so each time `for terms in proprietary_terms` and then `for word in e` runs (notice that the second one runs many times inside the first loop), it appends a new word to `n`. If my math is correct - 1519 times, to be exact.
- here’s the version that does the same thing and has clean output:
```python
def censor2(email):
    e = email.split()
    for terms in proprietary_terms:
        for word in e:
            if terms == word:
                word_index = e.index(word)  # finds index of a word to replace in the e list
                e[word_index] = "*" * len(word)  # replaces that word at its index; each time the loop runs it has a "new" e list with the previous word already replaced
    return " ".join(e)  # joins the e list back into a str
```
I’ve spent like 2 hours trying to find a way to make your code work without making too many changes, but I couldn’t find a solution.
I really hope you’ll find one! I can learn a lot from you too!
Your use of `"*" * len(word)` is way more efficient than my loops that iterate through each character in the word to replace it.
If, unfortunately, you won’t be able to find one, I’m afraid you’ll have to write your function from scratch.
I’d rather you come up with your own solution than give you one, so feel free to go over my code, and if you’d like some more specific help, feel free to ask.
I just want to warn you - I’m only learning Python, and most of the time I don’t really know what I’m doing.
No worries at all, I feel like I don’t know what I’m doing half the time either hahaha. Thanks for your help, your use of index is great! Also, omg, 2 hours - I didn’t expect that, but it’s much appreciated! Definitely fixed some issues with my code, I’ll let you know if I figure out the rest.
I don’t understand how your `censor2` function would properly detect a multiword phrase in `proprietary_terms`. Since `e = email.split()` splits on whitespace, each element in the resulting list will be a string that is one word long (some will have punctuation too). Therefore, if an element of the list `proprietary_terms` is two words long, e.g. “personality matrix”, your code won’t detect a match - `if terms == word` (your fourth line of code) will not be True.
I also don’t see how this accounts for case sensitivity or substrings within words. (You don’t want your function to turn “researchers” into “researc***s” just because “her” is an element in `proprietary_terms`.)
In my code, instead of using `.split()`, I just made a list with one element, which is the original string:
```python
def censor_list(emailstring):
    emaillist = [emailstring]
    for i in range(len(proprietary_terms)):
        if list[i] in emaillist:
            fixedemailstring = emaillist[0].replace(list[i], '-')
            emaillist = [fixedemailstring]
    return emaillist
```
I’m not very satisfied with it, because something tells me it’s not the most efficient. It also doesn’t deal with case sensitivity. I can imagine basically repeating the `if` statement to include all the case variants of `list[i]`, but that doesn’t feel like a very elegant solution.
Additionally, I’m really stuck on how to solve the substring issue. Doing some research on Stack Exchange, I’ve come across a lot of stuff about regex, but we haven’t covered that in the CS path in Codecademy yet, and it seems a bit complex for this stage in the path. I thought about adding a space before and after the word (i.e. changing line 4 to `if " " + list[i] + " " in emaillist:`), but then you wouldn’t catch plurals. Any ideas, anyone?
I’d almost forgotten about this post! haha
Don’t worry about not understanding how it detects multi-word phrases - it just doesn’t do that.
I wrote it as a cleaner version of @bit9852919571’s code from GitHub, not necessarily a correctly working one.
The original code:
```python
def censor2(email):
    e = email.split()
    n = []
    for terms in proprietary_terms:
        for word in e:
            if terms == word:
                n.append("*" * len(word))
            else:
                n.append(word)
    return " ".join(n)
```
To respond to your solution:
Creating the list in your case is unnecessary, because first you wrap the text in a list with `emaillist = [emailstring]` and then you unwrap it with every `emaillist[0]`. There’s also no need for the `if` statement, and you can just iterate through `proprietary_terms` without indexing it!
Your function could look just like this:
```python
def censor_list(emailstring):
    for i in range(len(proprietary_terms)):
        if proprietary_terms[i] in emailstring:
            emailstring = emailstring.replace(proprietary_terms[i], '-')
    return emailstring
```
Or even like this:
```python
def censor_list(emailstring):
    for term in proprietary_terms:
        emailstring = emailstring.replace(term, '-')
    return emailstring
```
And I also found the solution!
No regex required and it does everything!
- It’s case insensitive!
- It checks if it’s the exact match!
- It replaces phrases without removing white spaces!
- It also works with plural words! - kind of.
Only regular ones / it breaks on some possessive words.
But I guess it’s even harder to make it aware of context and grammar.
You can always make it leave the “s” out and just censor everything before it, or add special cases,
but for this exercise I like it this way.
- It doesn’t care about punctuation!
- No extra Python, just string methods!
- And the cherry on top - it’s only 17 lines long! Haha
Great thinking with the white spaces! This idea led me to using:
```python
# Output censored equivalent function
def censor(term):
    words = term.split()
    for n in range(len(words)):
        words[n] = "".join(['-' for n in range(len(words[n]))])
    return " ".join(words)

# Censor provided terms in text function
def censor_list(text, terms):
    # Create second list with plural forms
    plur = [i + "s" for i in terms]
    # Combine lists
    terms = plur + terms
    for term in terms:
        # Call censor function to create replacement
        replace = censor(term)
        # Find case insensitive version of the term in the text
        for n in range(text.lower().count(term)):
            # Find starting index of matching word
            start = text.lower().find(term)
            # Create end index
            end = start + len(term)
            # Check if term is not a part of a bigger word
            if not text[start - 1].isalpha() and not text[end].isalpha():
                # Replace word with replace var in specific index range
                text = "".join((text[:start], replace, text[end:]))
    return text

print(censor_list(email_two, proprietary_terms))
```
Thanks, this was really helpful. I was on Stack Exchange and so many people were yelling “strings are immutable!” that I didn’t realize you can overwrite strings within functions. That was key. Also, yes, good catch on the superfluous `if` statement.
I really like the way you handled cases where a term shows up more than once. I did this in a much less elegant way, by creating a list of the first indices for each appearance of the term. Anyways, here’s what I wrote before I saw your reply. It kind of has a version of your `censor` function built into it, which probably isn’t best practice.
```python
def censor_two(email):
    for phrase in proprietary_terms:
        if phrase in email.lower():
            if email.lower().count(phrase) > 1:
                indices = []
                index = 0
                while True:
                    try:
                        index = email.lower().index(phrase, index)
                        indices.append(index)
                        index += 1
                    except ValueError:
                        break
                for i in indices:
                    if email[i - 1].isalpha() == False:
                        if email[i + len(phrase)].isalpha() == False or email[i + len(phrase):i + len(phrase) + 2] == 's ':
                            email = email[:i] + '*' * len(phrase) + email[i + len(phrase):]
                    else:
                        continue
            else:
                email = email.replace(phrase, len(phrase) * '*')
                email = email.replace(phrase.lower(), len(phrase) * '*')
                email = email.replace(phrase.title(), len(phrase) * '*')
    return email
```
I’m super curious to hear what you think of the advice in this post. Essentially, [ionatan](https://discuss.codecademy.com/u/ionatan) advocates for separating the text or email into a list of words and searching that. I noticed neither of us did that. We only iterated over the list of `proprietary_terms` and then updated the text/email with the censored version as appropriate. I know that the ionatan approach is not necessary, since I’ve been able to solve all the functions in the challenge, but I wonder if it’s a better way to do it. They obtained a list of words and a separate list of non-word strings, and one big question I have is how they preserve the original order/sequence when they eventually have to join the two lists back together…
It might initially seem so. But would I, after reading your code, be able to create input where your code misbehaves?
```python
proprietary_terms = ['a']
print(censor_two('ab'))
```
```python
proprietary_terms = ['a']
print(censor_two('a a'))
```
```
Traceback (most recent call last):
  File "a.py", line 36, in <module>
    print(censor_two('a a'))
  File "a.py", line 17, in censor_two
    not email[i + len(phrase)].isalpha()
IndexError: string index out of range
```
@anonymusbadger’s code fails too:
```python
print(censor_list('a b', ['b']))
```
```
Traceback (most recent call last):
  File "b.py", line 30, in <module>
    print(censor_list('a b', ['b']))
  File "b.py", line 25, in censor_list
    if not text[start - 1].isalpha() and not text[end].isalpha():
IndexError: string index out of range
```
What do you mean by better, how would that be defined,
what can be improved on? Maybe that in itself says something about what that other way would be.
nothing is ever moved out of order. if you write the data down on paper you’ll be able to put it back together again without me having to tell you how.
you could do it with one list as well if you skip separating out the words
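To make that put-back-together idea concrete, here’s a small sketch of my own (a toy example, not the code referenced above): split the text into alternating runs of letters and non-letters, and joining the chunks in the same order reproduces the original exactly:

```python
def split_chunks(text):
    # Keep runs of letters and runs of non-letters as separate chunks,
    # in their original order, so nothing is lost or reordered
    chunks = []
    current = text[0]
    for ch in text[1:]:
        if ch.isalpha() == current[-1].isalpha():
            current += ch
        else:
            chunks.append(current)
            current = ch
    chunks.append(current)
    return chunks

print(split_chunks("Hello, how are you?"))
# ['Hello', ', ', 'how', ' ', 'are', ' ', 'you', '?']
print("".join(split_chunks("Hello, how are you?")))  # Hello, how are you?
```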
Thanks @ionatan for pointing out the fail cases in my code.
Hope you can find more hahaha!
Made a quick fix:
```python
def censor(term):  # Output censored equivalent function
    words = term.split()
    for n in range(len(words)):
        words[n] = "".join(['-' for n in range(len(words[n]))])
    return " ".join(words)

def censor_list(text, terms):  # Censor provided terms in text function
    text = " " + text + " "
    plur = [i + "s" for i in terms]
    terms = plur + terms
    for term in terms:
        replace = censor(term)
        for n in range(text.lower().count(term)):
            start = text.lower().find(term)
            end = start + len(term)
            # Check if term is not a part of a bigger word
            if not text[start - 1].isalpha() and not text[end].isalpha():
                text = "".join((text[:start], replace, text[end:]))
    return text.strip()

print(censor_list(email_two, proprietary_terms))
```
I know `.strip()` would turn `' a b '` into `'a b'`, but since we are working with emails in this case, spaces at the beginning and end would be a styling error anyway.
Well, I did that with my original code (look at the first post), but I just don’t like how it works and looks. It just doesn’t seem very “pythonic” to do it that way - it’s quite complex (deconstructing, reconstructing, lists within lists), has too many special cases (punctuation, phrases, plurals, etc.), and it’s a lot of code just to do some simple stuff.
Also I timed it and it’s just slower…
List of words from input text (email_two) - 0.06 - 0.07 s
String methods (email_two) - 0.04 - 0.05 s
It’s fine, although it’s less readable (and PEP 8 loves readability haha), and I guess it doesn’t take spaces in the phrase into account.
You don’t need `if email.lower().count(phrase) > 1:` - `if phrase in email.lower():` takes care of that already. You’re unnecessarily double-checking for the same thing, although in a different way.
`if phrase in email.lower():` is also more PEP 8 compliant.
I think I understand how it works, but to be frank, your code is quite hard to follow from `while True:` onward, for me.
there’s only two things. letters and non-letters
sounds like it’s got more to do with some specifics and not the overall strategy
this would find things that are part of other words
as would that. you have some condition guarding against it, but the next iteration will find the same one again - so putting something that should be replaced behind something that is part of a word should make it misbehave, for example:
```python
censor_list('research search', ['search'])
```
if you use any of replace/find/index/count then it’s probably immediately wrong, they don’t care about words
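two quick lines make the point concrete (my own demo):

```python
# replace and find operate on substrings, not words
print("research search".replace("search", "***"))  # re*** ***
print("research".find("search"))  # 2 - a "match" inside a word
```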
here’s code that I wrote just now which splits at word boundaries. there are no special cases. split, compare, put back together.
```python
from itertools import groupby, product

def split(text):
    return ["".join(group) for _, group in groupby(text, str.isalpha)]

def clean(text):
    return list(filter(str.isalpha, split(text.lower())))

def censor(text, phrases):
    text = split(text)
    phrases = map(clean, phrases)
    marked = set()
    for i, phrase in product(range(len(text)), phrases):
        size = len(phrase)
        here = list(map(str.lower, text[i : i + 2 * size : 2]))
        if here == phrase:
            # can add condition HERE to allow X first occurrences
            marked.update(range(i, i + 2 * size, 2))
    for location in marked:
        text[location] = "*" * len(text[location])
    return "".join(text)

print(censor("a", ["a"]))
print(censor("aa", ["a"]))
print(censor("a a", ["a"]))
print(censor("a b", ["b"]))
print(censor("research search", ["search"]))
print(censor("hello, how are you", ["hello how"]))
```
Good point! @ionatan
Yeah… it breaks it.
Well, I guess it’s true. I was using quite old code that I wrote back when I didn’t really know what I was doing.
Looking at your code, I must say that it looks nice and clean, it does its job well, and I really like it. So I totally get your point.
The only thing I found is that it doesn’t pick up on plurals, but that’s a minor complaint, since you can fix it very easily by creating them from the original phrases input like I did.
However, let’s say that we want to stick as close as possible to the course. I don’t remember if importing modules was covered before this exercise (maybe it was, idk) - @dllllll didn’t use regex because of that.
Would it be as clean and simple without it?
Modules are just nicely wrapped Python, so you can write what they do on your own, but
I’m not sure if there’s a clean and simple way to do it for a beginner…
I’m trying to tie my hands because of that while writing code for this exercise, but I’m no pro, and I can see that you know your Python well haha.
there’s a lot of use of map, filter, product, groupby. nobody doing the course would write it that way.
that’s because I recognize those patterns. that’s how I write those things. most of them have nothing to do with avoiding writing code
groupby does though. that is doing some heavy lifting.
but. somebody that considers what they’re looking for will figure it out, because the concept of it is plain.
at each word boundary (mismatch in str.isalpha outcome), save the current group, start a new one
so this then, is the “missing” code:
```python
def map(f, xs):
    res = []
    for x in xs:
        res.append(f(x))
    return res

def filter(p, xs):
    res = []
    for x in xs:
        if p(x):
            res.append(x)
    return res

def product(xs, ys):
    res = []
    for x in xs:
        for y in ys:
            res.append((x, y))
    return res

def groupby(xs, f):
    res = []
    current = []
    currentf = f(xs)
    for x in xs:
        if f(x) == currentf:
            current.append(x)
        else:
            res.append(current)
            current = [x]
            currentf = f(x)
    if len(current) > 0:
        res.append(current)
    return res
```
and, again, someone doing the course would not write those functions, nor know to. but if they’re doing their due diligence and consider what they need, they will come up with the same things.
So, here we go, this all ought to be covered.
```python
def split(xs):
    res = []
    current = []
    currentf = xs.isalpha()
    for x in xs:
        if x.isalpha() == currentf:
            current.append(x)
        else:
            res.append("".join(current))
            current = [x]
            currentf = x.isalpha()
    if len(current) > 0:
        res.append("".join(current))
    return res

def clean(text):
    res = []
    segments = split(text)
    for seg in segments:
        if seg.isalpha():
            res.append(seg)
    return res

def censor(text, phrases_):
    text = split(text)
    phrases = []
    for phrase in phrases_:
        phrases.append(clean(phrase))
    marked = []
    for i in range(len(text)):
        for phrase in phrases:
            size = len(phrase)
            here = []
            for loc in range(i, i + 2 * size, 2):
                if loc >= len(text):
                    break
                here.append(text[loc])
            if here == phrase:
                # can add condition HERE to allow X first occurrences
                for mark in range(i, i + 2 * size, 2):
                    marked.append(mark)
    for location in marked:
        text[location] = "*" * len(text[location])
    return "".join(text)

print(censor("a", ["a"]))
print(censor("aa", ["a"]))
print(censor("a a", ["a"]))
print(censor("a b", ["b"]))
print(censor("research search", ["search"]))
print(censor("hello, how are you", ["hello how"]))
```
Great thinking! Great code!
Just add plural versions of the phrases and it works perfectly! (It wouldn’t pick up on “learning algorithms” in email two otherwise and would just skip it entirely!)
And I know I already expressed the problems that doing this in such a simple way brings with it!
```python
[...]

def censor(text, phrases_):
    text = split(text)
    plur = [i + "s" for i in phrases_]
    phrases_ = plur + phrases_
    phrases = []
    for phrase in phrases_:
        phrases.append(clean(phrase))

[...]
```
I think it’s thread closed!!!
I’d argue you’d add a plural version to the input and that it is therefore already handled.
You’d run into similar things with apostrophes (’) and probably a dozen other things.
Something that’s probably missing in the project description, something very important, is to establish what it should be able to handle.
But to get some kind of basic functionality that is correct … you probably need to split it, or it would be silly not to.
That’s true, assuming the input is something we can edit - and it isn’t really stated in the exercise that we can’t. I’m not sure that we are allowed to, since we are supposed to work on our code and (I guess) not on what’s provided, and if I remember correctly (maybe not), handling plurals was part of the problem.
So my assumption was - “What’s provided is not to be messed with!”
But I think that’s not really up to me or you to decide.
The creators didn’t really specify that, so I guess it’s up to personal decision how you’re going to handle it!
Couldn’t agree more!
I think it only creates confusion.
I am a beginner in Python and Codecademy and wanted some help/explanations for the Censor Dispenser challenge. I was able to figure out step 2 by myself (censoring out a phrase), but I am struggling to figure out how to censor a list of words in step 3. Please help and critique if necessary. Here is my function for step 2.
```python
email_one = open("email_one.txt", "r").read()
email_two = open("email_two.txt", "r").read()
email_three = open("email_three.txt", "r").read()
email_four = open("email_four.txt", "r").read()

def censor(phrase, text):
    for words in text:
        replaced = text.replace(phrase, "**** *****")

print(censor(phrase, email_one))
```
Sorry about my code - it didn’t format correctly. I guess I wanted to ask: is there a way to censor multiple words in the same kind of function I made for one phrase?