Censor Dispenser - Looking for help with unexpected output

I’ll definitely put into practice your thoughts on cosmetics. I can see the benefits in simplifying things.

def isalpha(value):
    return value.isalpha()
  • str.isalpha evaluates to a boolean, so there’s no point in me using an if/else block to return True or False (quick illustration below).

  • I removed checks for "'" because I want to censor things like "she" in the word "she's".
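
To illustrate that first point, the if/else version and the direct return are equivalent:

def isalpha_verbose(value):
    # spells out the branch, but the condition is already the boolean we want
    if value.isalpha():
        return True
    else:
        return False

def isalpha(value):
    return value.isalpha()  # same result, one line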

No it isn’t. I think all I accomplished there was showing that the function needs work.

I overthought my last isalpha function, and was trying to make sure it could work for strings and characters so that I could use it to filter.

But I now realise that I can just join each element in result, save that to a new variable, then iterate over the elements in the variable using the isalpha function in its current form:

joined_elements = ["".join(element) for element in groupby(isalpha, test_string)]
alpha = [element for element in joined_elements if isalpha(element)]
non_alpha = [element for element in joined_elements if not isalpha(element)]

Having had another look at your groupby function, I better understand how to add to current unconditionally, as you mentioned before.

This is more for my understanding: The function starts with nothing in current, but after one iteration, it contains a character we can compare future characters with. Plus, nothing was appended to result because current was empty on the first iteration.

Which is what your first if block is doing.
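
A short trace of that, using your groupby(function, email):

# groupby(str.isalpha, "ab,")
# 'a': current is empty, so the if is skipped; append -> current = ['a']
# 'b': isalpha('b') == isalpha('a'), same group; append -> current = ['a', 'b']
# ',': isalpha(',') != isalpha('a'); flush -> result = [['a', 'b']], current = [',']
# after the loop, current is non-empty; flush -> groups are ['ab', ',']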


I’ll have a go at getting the indexes of phrases I want to censor, then censoring the phrase.

I’ll keep in mind your comments about the output.

I attempted getting indexes using this, but didn’t have much success:

indexes = []
bad_words = ["she", "processing power"]
for word in bad_words:
    alpha_repeat = [alpha[i : i + 1 + word.count(" ")] for i in range(len(alpha))]
    for element in alpha_repeat:
        index = 0
        if word.split() == element:
            for word in element:  # careful: this rebinds the outer loop variable
                indexes.append(alpha_repeat.index(element) + index)  # list.index returns the first match only
                index += 1
print(indexes)
# [17, 17, 17, 17, 17, 28, 29]

One obvious problem is my use of list.index. Another is probably the amount of nesting I’m doing.
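
For reference, list.index always returns the position of the first equal element, which is why the same 17 repeats five times above:

>>> ["she", "is", "she"].index("she")
0  # always the first occurrence, no matter which of the equal elements we held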

that’s (mostly) equivalent to:

isalpha = str.isalpha

so, you may as well use str.isalpha directly

without reading that, just looking at the shape, I’ll immediately claim that’s too much nesting; there are probably things there that can be done in sequence instead, or by functions

you could tag the words with their indices

enumerate(words)

or perhaps you’d tag the groups instead:

[(0, ['Helena', 'is']),
 (1, ['is', 'dangerous']),
 (2, ['dangerous', 'She']),
 (3, ['She', 'is']),
 ...
]
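
a sketch of building those tagged pairs (window size 2, hypothetical word list):

words = ["Helena", "is", "dangerous", "She", "is"]
pairs = [(i, words[i : i + 2]) for i in range(len(words))]
# [(0, ['Helena', 'is']), (1, ['is', 'dangerous']), (2, ['dangerous', 'She']), ...]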

what’s more is you could mix this together with other sizes…

[(0, ['Helena']),
 (1, ['is']),
 (0, ['Helena', 'is', 'dangerous']),
 (1, ['is', 'dangerous', 'She']),

then you could iterate through them all in one go and look for matches, if you have a match then you have the starting location and the length, so you know which words should go

You could use a dict or list to mark what should be censored:

[False, False, False, ...]

your best guide is still going to be how you the human would arrange the data, I think it’s quite important that you ignore code while you do it, and instead think only about how the information should be processed

you might have noticed that I’m very quick to put everything into data instead of doing things
You could of course … do things … instead. You could iterate through positions, grab X words, compare.
Or however you like to view the problem.

this: bad_words = ["she", "processing power"]
should probably be this instead:
bad_words = [["she"], ["processing", "power"]]

should_censor = [False] * len(allwords)
bad_words = [["she"], ["processing", "power"]]
for bad in bad_words:
    size = len(bad)
    for position in range(len(allwords)):
        substr_here = allwords[position:position+size]

        # then you'd want to test ...
        # substr_here == bad
        # and if that matches, then position+0, position+1, ..., position+N
        # should all be censored
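
a possible completion of that test, with a hypothetical allwords list:

allwords = ["helena", "is", "dangerous", "she", "is"]
should_censor = [False] * len(allwords)
bad_words = [["she"], ["processing", "power"]]
for bad in bad_words:
    size = len(bad)
    for position in range(len(allwords)):
        if allwords[position : position + size] == bad:
            for offset in range(size):
                should_censor[position + offset] = True
print(should_censor)  # [False, False, False, True, False]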

And even though this might not create a list exactly like this:

[(0, ['Helena']),
 (1, ['is']),
 (0, ['Helena', 'is', 'dangerous']),
 (1, ['is', 'dangerous', 'She']),

that is still in some sense the shape of the information being processed

searching through the list is usually not what you want. this would be a case where you do care about the position, your loop should be over indices if you need this information

that’s what I’m trying to get away from! logic with strings. don’t want. clean it up, put it in a nice data structure, and use that. this is to clean up the bad words, so, cleaning up the bad words would be a separate step to do before calling this function, maybe you’d run it through groupby, and downcase it (and downcase the mail as well, restore later)

bad_words = ["she", "processing power"]
bad_words = [groupby(str.isalpha, phrase.lower()) for phrase in bad_words]
# ^ totally wrong for cleaning up but .... that, except, right
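
one way to make it right (assuming the extract_words helper that shows up later in the thread, which groups, keeps the alphabetic runs, and lowercases):

bad_words = ["she", "processing power"]
bad_words = [extract_words(phrase) for phrase in bad_words]
# [['she'], ['processing', 'power']]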

Here’s how I’m putting it back together again. It’s also a more complete example: it censors multi-word phrases, so, you know, spoilers. The only thing that is new in it is putting it all together, and, hey, that can be a bit tricky, since it involves matching the words that were taken out back to the original and the punctuation that was in between.

from itertools import groupby as _groupby, product


# call me a cheater (:
def groupby(key, seq):
    return ["".join(group) for _, group in _groupby(seq, key)]


# parse text into list of nice clean words
# (group into words/nonwords, filter words, lower case)
def extract_words(text):
    return [w for w in groupby(str.isalpha, text.lower()) if w.isalpha()]


proprietary_terms = [
    "she",
    "personality matrix",
    "sense of self",
    "self-preservation",
    "learning algorithm",
    "her",
    "herself",
]

email = "Helena is dangerous. She is completely unpredictable and cannot be allowed to escape this facility. So far she's been contained because the lab contains all of her processing power, but alarmingly she had mentioned before the lockdown that if she spread herself across billions of connected devices spanning the globe she would be able to vastly exceed the potential she has here."

# both the phrases and the email need to be parsed. same parsing applies to
# both.
bad_phrases = map(extract_words, proprietary_terms)
allwords = extract_words(email)

# the censoring logic. doesn't care in the slightest about punctuation or
# lines or even strings, could just as well be zoo animals.
# for each location, for each bad phrase, does it match? (==)
censor_locations = set()
for phrase, position in product(bad_phrases, range(len(allwords))):
    words_here = allwords[position : position + len(phrase)]
    if phrase == words_here:
        censor_locations.update(set(range(position, position + len(phrase))))

# nothing stopping us from putting other censoring logic here, already got a
# nice list of words to operate on, and a set of locations that can be toggled

# put it back together
# keep an index that is incremented only on words, use the index to look up
# the decision that the censoring logic made about that word
i = 0
res = []
for thing in groupby(str.isalpha, email):
    if thing.isalpha():
        if i in censor_locations:
            thing = len(thing) * "*"
        i += 1
    res.append(thing)
print("".join(res))

I haven’t looked at your solution, but have been able to put some functions together which censor multi-word and single-word phrases, as well as put everything back together.

bad_words = ["she", "processing power"]


def indices_to_censor(bad_words, grouped_string_alpha):
    should_censor = []

    for word in bad_words:
        size = len(word.split())
        for i in range(len(grouped_string_alpha)):
            substr_here = grouped_string_alpha[i : i + size]
            if substr_here == word.split():
                substr_index = 0
                for word in substr_here:
                    should_censor.append(i + substr_index)
                    substr_index += 1
    
    return should_censor

def censor_indices(bad_words, grouped_string_alpha):
    for index in indices_to_censor(bad_words, grouped_string_alpha):
        grouped_string_alpha[index] = '*' * len(grouped_string_alpha[index])

    return grouped_string_alpha


def combine(censored_alpha, non_alpha):
    combined = []
    for i in range(len(censored_alpha)):
        combined.append(censored_alpha[i])
        combined.append(non_alpha[i])
    
    return combined

print("".join(combine(censor_indices(bad_words, alpha), non_alpha)))

I just need to go through them again and simplify, add .lower(), and so on.

I tried to use enumerate in indices_to_censor, but it always returned a list of tuples within a list for each word in bad_words.

I prefer this to enumerate…

But I will try to re-write and make use of it.

You make an assumption that the first thing is alpha and that the last thing is non-alpha.

Interleaving would work with a few modifications…
Makes me think of zip, which takes one of each repeatedly. The overall order is right, but it’s nested. Concatenating the parts would de-nest it.

>>> from itertools import chain
>>> a = [1, 2, 3]
>>> b = [3, 4, 5]
>>> list(chain(*zip(a, b)))
[1, 3, 2, 4, 3, 5]

Though, zip stops when the shortest one is exhausted, would need to use zip_longest instead (itertools.zip_longest) … would additionally need to swap the variables so that the first one … comes first.
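
A minimal sketch of that swap, assuming the alpha/non_alpha lists from earlier and their source text:

# if the text starts with a non-alpha character, the non-alpha group must come first
if not text[:1].isalpha():
    alpha, non_alpha = non_alpha, alpha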

Yeah I’m making things complicated. But this is where your combining code takes me, and after some changes it’s nicer than what I currently have.

Good spot. Hadn’t noticed that. I also make the assumption that the length of censored_alpha and non_alpha is equal. Not ideal.

Could we make the assumption that there will be 1 element left to combine after zipping and appending everything else to combined?

Is there a built-in that can do this?


I’ve implemented str.lower and made it so I have to call only one function to do everything (censor_two)

def groupby(function, email):
    current = []
    result = []

    for char in email:
        if current and function(char) != function(current[0]):
            result.append(current)
            current = []
        current.append(char)

    if current:
        result.append(current)

    return ["".join(group) for group in result]


def extract_words(email):
    return [word.lower() for word in groupby(str.isalpha, email) if word.isalpha()]


def extract_non_words(email):
    return [word.lower() for word in groupby(str.isalpha, email) if not word.isalpha()]


def indices_to_censor(phrases, email):
    should_censor = []
    extracted_words = extract_words(email)

    for phrase in phrases:
        size = len(phrase.split())
        for i in range(len(extracted_words)):
            substr_here = extracted_words[i : i + size]
            if substr_here == phrase.split():
                substr_index = 0
                while substr_index < len(substr_here):
                    should_censor.append(i + substr_index)
                    substr_index += 1

    return should_censor


def censor_indices(phrases, email):
    extracted_words = extract_words(email)

    for index in indices_to_censor(phrases, email):
        extracted_words[index] = "*" * len(extracted_words[index])

    return extracted_words


def combine(phrases, email):
    combined = []
    censored = censor_indices(phrases, email)
    non_alpha = extract_non_words(email)
    
    # will throw index error if either censored or non_alpha is shorter than the other
    for i in range(len(censored)):
        combined.append(censored[i])
        combined.append(non_alpha[i])

    return combined


def censor_two(phrases, email):
    split_email = groupby(str.isalpha, email)
    
    for i in range(len(combine(phrases, email))):
        current = combine(phrases, email)[i]
        if "*" in current:
            split_email[i] = current

    return "".join(split_email)

Now I need to add/change something to make sure I don’t get an index error while still combining everything.

Is this what chain is doing?

Does the star mean all elements?

My current solution to the entire project:

Spoilers
import itertools

# These are the emails you will be censoring. The open() function is opening the text file that the emails are contained in and the .read() method is allowing us to save their contents to the following variables:
email_one = open("email_one.txt", "r").read()
email_two = open("email_two.txt", "r").read()
email_three = open("email_three.txt", "r").read()
email_four = open("email_four.txt", "r").read()


def groupby(function, email):
    current = []
    result = []

    for char in email:
        if current and function(char) != function(current[0]):
            result.append(current)
            current = []
        current.append(char)

    if current:
        result.append(current)

    return ["".join(group) for group in result]


def extract_words(email):
    return [word.lower() for word in groupby(str.isalpha, email) if word.isalpha()]


def extract_non_words(email):
    return [word.lower() for word in groupby(str.isalpha, email) if not word.isalpha()]


def indices_to_censor(phrases, email, count=0):
    should_censor = []
    extracted_words = extract_words(email)

    for phrase in phrases:
        size = len(phrase.split())
        for i in range(len(extracted_words)):
            substr_here = extracted_words[i : i + size]
            if substr_here == phrase.split():
                substr_index = 0
                while substr_index < len(substr_here):
                    should_censor.append(i + substr_index)
                    substr_index += 1

    return sorted(should_censor)[count:]


def surrounding_indices(phrases, email, count=0):
    to_censor = indices_to_censor(phrases, email, count)
    surrounding = []
    for index in to_censor:
        if index == 0:
            surrounding.append(index + 1)
        elif index == len(email) - 1:
            surrounding.append(index - 1)
        else:
            surrounding.append(index + 1)
            surrounding.append(index - 1)
    to_censor += surrounding
    
    return sorted(set(to_censor))


def censor_indices(function, phrases, email, count=0):
    extracted_words = extract_words(email)

    for index in function(phrases, email, count):
        extracted_words[index] = "*" * len(extracted_words[index])

    return extracted_words


def combine(function, phrases, email, count=0):
    censored = censor_indices(function, phrases, email, count)
    non_alpha = extract_non_words(email)

    return list(itertools.chain(*itertools.zip_longest(censored, non_alpha)))


def censor_all(function, phrases, email, count=0):
    if type(phrases) != list:
        phrases = [phrases]

    split_email = groupby(str.isalpha, email)

    for i in range(len(combine(function, phrases, email, count))):
        try:
            current = combine(function, phrases, email, count)[i]
            if "*" in current:
                split_email[i] = current
        except TypeError:
            continue  # skip any None's returned by itertools.zip_longest

    return "".join(split_email)

def censor_after_count(function, phrases, email, count=0):
    if type(phrases) != list:
        phrases = [phrases]

    censored_proprietary = censor_all(function, proprietary_terms, email)
    censored_phrases = censor_all(function, phrases, censored_proprietary, count)
    
    return censored_phrases


proprietary_terms = [
    "she",
    "personality matrix",
    "sense of self",
    "self-preservation",
    "learning algorithms",
    "her",
    "herself",
]

negative_words = [
    "concerned",
    "behind",
    "danger",
    "dangerous",
    "alarming",
    "alarmed",
    "out of control",
    "help",
    "unhappy",
    "bad",
    "upset",
    "awful",
    "broken",
    "damage",
    "damaging",
    "dismal",
    "distressed",
    "distressing",
    "concerning",
    "horrible",
    "horribly",
    "questionable",
]


#print(censor_all(indices_to_censor,'learning algorithms', email_one))
#print(censor_all(indices_to_censor, proprietary_terms, email_two))
#print(censor_after_count(indices_to_censor, negative_words, email_three, 2))
print(censor_all(surrounding_indices, proprietary_terms + negative_words, email_four))

I made some changes to combine and censor_two: I used itertools.chain and itertools.zip_longest in combine as you suggested, renamed censor_two to censor_all, and inserted a try/except block to skip None values.

Found another bug too:

test_string = ". Helena is dangerous. She"

print(censor_all(indices_to_censor, proprietary_terms, test_string))
# ". Helena is dangerous***She"

I’ll take another look at indices_to_censor.

No, because they could be equally long

Don’t need one, and you’ve already been doing it. You’ve been doing unpacking assignment in loops:

for i, x in enumerate( ...
i, x = (0, 3)
i, x = 0, 3  # this is still a tuple, so, same thing
a, b = b, a

You can have nested patterns on the left too:

# a = 3, b = 4, c = [5]
(a, b), c = [[3, 4], [5]]
# a = 0, rest = [1 .. 9]
a, *rest = range(10)

chain is concat, yes.

that’s…vague, it’s not like it would be half the elements otherwise. it unpacks elements into arguments. dicts can be unpacked with ** for named arguments

def sum(a, b):
    return a + b

sum(*[1, 2])

provide a filler value for when the shorter one runs out (empty string)
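
that looks something like:

from itertools import chain, zip_longest

words = ["Helena", "is", "dangerous"]
punct = [" ", " "]  # one element short
print(list(chain(*zip_longest(words, punct, fillvalue=""))))
# ['Helena', ' ', 'is', ' ', 'dangerous', '']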

single word isn’t different from many words, so when you parse the phrases this should already have become a list, you already have the code for it, treat it the same, it’s not an exception

that seems shady. (searching in a string, not so different from str.index or str.replace)
combining in a loop seems shady too. isn’t that something you’d do once? even shadier yet is that it is used to iterate over its own output
I’m looking more for…
here’s the list of punctuation…
here are the words with some of them already replaced with stars
interleave them with each other
done

might not be obvious if you’re running this in codecademy, but this code takes more than a second to run, meaning it is carrying out a LOT of work. always have to keep in mind the scale of the work being done. a reasonable amount of work is to process the whole text once for each bad phrase (pick a phrase, scan the text, repeat)

oh and note that zip/chain aren’t doing anything difficult for you. they’re simply iteration concepts, just like map/filter/reduce…and yes, groupby - so they’re not giving you any new ability, they’re only there so that you don’t repeatedly implement them, and so that you can combine them with smaller pieces of code
(sidenote: reduce might seem magical when you get introduced to it. it is a very simple for-loop so when you hear about it and don’t understand it… implement it… most utility functions like this are ones that you should implement to understand what they do)

Hm.
This makes me want to start out with:

[-1, 0, 1]

But it’s an offset so it should be added…

lambda pos: [pos-1, pos, pos+1]

But 0 is already in there in the original

lambda pos: [pos-1, pos+1]

this should happen for each of something so… map
map(lambda pos: [pos-1, pos+1], to_censor)

That’s an iterator of lists… nested one too deep. concat.

chain(*map(lambda pos: [pos-1, pos+1], to_censor))

Then it should be added (union) to the original

def surrounding_indices(phrases, email, count=0):
    original = set(indices_to_censor(phrases, email, count))
    updates = set(chain(*map(lambda pos: [pos-1, pos+1], original)))
    return original | updates  # | is OR, which is union (include if in this or that)

I have a feeling count shouldn’t be there at all.

Oh yeah, might want to either filter by valid index or subtract [-1, len] though the length might not be known here and … depending on how it’s consumed, it might be fine to mark locations outside the range

Could also keep +0 and skip doing union, it would already be included… Yeah, that’d be nicer

def surrounding_indices(phrases, email, count=0):
    return set(
        chain(
            *map(
                lambda pos: [pos - 1, pos, pos + 1],
                indices_to_censor(phrases, email, count),
            )
        )
    )

You know, I’m not sure I agree with black on the formatting here but oh well, it’s fine.
Still need set because there may be duplicates otherwise. And, I think set is the data structure to use anyway.
Oh. Wait.

def surrounding_indices(phrases, email):
    return set(
        neighbor
        for pos in indices_to_censor(phrases, email)
        for neighbor in [pos - 1, pos, pos + 1]
    )

There we go.
Oh you do check bounds.

def surrounding_indices(phrases, email):
    return set(
        neighbor
        for pos in indices_to_censor(phrases, email)
        for neighbor in [pos - 1, pos, pos + 1]
        if neighbor in range(len(email))
    )

(range objects have constant-time lookup, it doesn’t need to search, so they’re nice for bounds-checking… could also do: 0 <= neighbor < len(email) (the comparisons get chained, there’s really an AND in there))
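
for example:

>>> 5 in range(10)    # constant-time membership test for ints
True
>>> -1 in range(10)
False
>>> 0 <= 5 < 10       # chained: (0 <= 5) and (5 < 10)
True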

So uhm. I put your code in my editor, made whatever changes came to mind.
Problem is, how and when does one stop?

Your count variable was spreading like a virus. I know you noticed. That only needed to exist in censor_after

The repeated use of combine… turns out that was fixed by holding down backspace. Call it once, problem solved. I think you were solving some kind of phantom problem there. Or, more likely, it was part of censor_after

censor_after is almost the same as indices_to_censor, with the difference of keeping count and an extra condition. In fact, indices_to_censor is censor_after with n = 0

Then … censor_all and two other functions were always used together and using the same parameters, so I made that one function.

Negative words and proprietary terms are the same things. For each entry: extract words. And then all special single-word cases can be removed.

Then I did some silly things to define the censoring strategies as functions so that I could:

    print(strategy1(email_one))
    print(strategy2(email_two))
    print(strategy3(email_three))
    print(strategy4(email_four))

strategy3 got to reuse strategy2:
strategy3 = compose(strategy3b, strategy2)

stuff
from itertools import product, zip_longest
from functools import partial


def groupby(function, email):
    current = []
    result = []

    for char in email:
        if current and function(char) != function(current[0]):
            result.append(current)
            current = []
        current.append(char)

    if current:
        result.append(current)

    return ["".join(group) for group in result]


def compose(f, g):
    return lambda x: f(g(x))


def words(email):
    return [
        word.lower() for word in groupby(str.isalpha, email) if word.isalpha()
    ]


def non_words(email):
    return [
        word.lower()
        for word in groupby(str.isalpha, email)
        if not word.isalpha()
    ]


def surrounding_indices(phrases, email):
    return set(
        neighbor
        for pos in indices_to_censor(phrases, email)
        for neighbor in [pos - 1, pos, pos + 1]
        if neighbor in range(len(email))
    )


def run_censor(function, phrases, email):
    locations = function(phrases, words(email))
    censored = [
        "*" * len(w) if i in locations else w
        for i, w in enumerate(words(email))
    ]
    punct = non_words(email)
    if not email[:1].isalpha():
        censored, punct = punct, censored
    return "".join(
        cell
        for row in zip_longest(censored, punct, fillvalue="")
        for cell in row
    )


def censor_after(limit, phrases, email):
    should_censor = set()
    count = 0
    for i, phrase in product(range(len(email)), phrases):
        size = len(phrase)
        words_here = email[i : i + size]
        if words_here == phrase:
            if count >= limit:
                should_censor.update(set(range(i, i + size)))
            count += 1
    return should_censor


indices_to_censor = partial(censor_after, 0)


proprietary_terms = [
    "she",
    "personality matrix",
    "sense of self",
    "self-preservation",
    "learning algorithms",
    "her",
    "herself",
]

negative_words = [
    "concerned",
    "behind",
    "danger",
    "dangerous",
    "alarming",
    "alarmed",
    "out of control",
    "help",
    "unhappy",
    "bad",
    "upset",
    "awful",
    "broken",
    "damage",
    "damaging",
    "dismal",
    "distressed",
    "distressing",
    "concerning",
    "horrible",
    "horribly",
    "questionable",
]

proprietary_terms = list(map(words, proprietary_terms))
negative_words = list(map(words, negative_words))


def main():
    strategy1 = partial(
        run_censor, indices_to_censor, [words("learning algorithms")]
    )
    strategy2 = partial(run_censor, indices_to_censor, proprietary_terms)
    strategy3b = partial(run_censor, partial(censor_after, 2), negative_words)
    strategy3 = compose(strategy3b, strategy2)
    strategy4 = partial(
        run_censor, surrounding_indices, negative_words + proprietary_terms
    )

    email_one = open("email_one.txt").read()
    email_two = open("email_two.txt").read()
    email_three = open("email_three.txt").read()
    email_four = open("email_four.txt").read()

    print(strategy1(email_one))
    print(strategy2(email_two))
    print(strategy3(email_three))
    print(strategy4(email_four))


if __name__ == "__main__":
    main()

So I’ve made a few changes to my code based on what you’ve written. censor_all is now made up of what was censor_indices, combine, and censor_all.

And yes, I noticed how long it was taking to run (I’m using VS Code).

from itertools import chain, zip_longest


def groupby(function, email):
    current = []
    result = []

    for char in email:
        if current and function(char) != function(current[0]):
            result.append(current)
            current = []
        current.append(char)

    if current:
        result.append(current)

    return ["".join(group) for group in result]


def extract_words(email):
    return [word for word in groupby(str.isalpha, email) if word.isalpha()]


def extract_non_words(email):
    return [word for word in groupby(str.isalpha, email) if not word.isalpha()]


def indices_to_censor(phrases, email):
    should_censor = []
    words = extract_words(email.lower())

    for phrase in phrases:
        size = len(phrase.split())

        for i in range(len(words)):
            substr_here = words[i : i + size]

            if substr_here == phrase.lower().split():
                substr_index = 0

                while substr_index < len(substr_here):
                    should_censor.append(i + substr_index)
                    substr_index += 1

    return set(should_censor)


def surrounding_indices(phrases, email):
    to_censor = indices_to_censor(phrases, email)
    surrounding = []
    for index in to_censor:
        if index == 0:
            surrounding.append(index + 1)
        elif index == len(email) - 1:
            surrounding.append(index - 1)
        else:
            surrounding.append(index + 1)
            surrounding.append(index - 1)
    to_censor.update(surrounding)  # to_censor is a set here, so += with a list would raise TypeError

    return sorted(to_censor)


def censor_all(function, phrases, email):
    words = extract_words(email)
    non_words = extract_non_words(email)

    for index in function(phrases, email):
        words[index] = "*" * len(words[index])

    censored = list(chain(*zip_longest(words, non_words, fillvalue="")))

    return "".join(censored)


There are quite a few functions in your code which I’ve never seen before (lambda, partial, product, map etc), so I’ll try to understand what they do.


I’ll work on this next. My indices_to_censor and surrounding_indices functions are quite messy and difficult for me to understand.


Not sure if this is the case for you too, but black doesn’t seem to format anything within functions for me.

aside from trailing whitespace on one line, it’s all already formatted in a way black won’t touch, maybe that’s what’s happening? black will also fail to run if it can’t parse the code.

Those are two really different functions and messy for different reasons.
Like I showed, surrounding_indices is a very simple function: for each value, produce [val-1, val, val+1] and the mess there is repeated code which could be combined
indices_to_censor is doing a whole lot of quirky stuff, there’s a lot of different things going on in it. but there are some parts that can be squashed together - adding the index range is kind of … one action. using the range function to obtain the indices, and then adding it to a set, I think that works out.
and the two for-loops can be combined into one by using product to say, hey, this is the same loop, the data just happens to be nested. But. I do end up with a lot happening per line. Still, I prefer it, because to my eye that’s fewer actions and therefore fewer things to consider.

def indices_to_censor(phrases, email, after=0):
    should_censor = set()
    count = 0
    for i, phrase in product(range(len(email)), phrases):
        words_here = email[i : i + len(phrase)]
        if words_here == phrase:
            if count >= after:
                should_censor.update(set(range(i, i + len(phrase))))
            count += 1
    return should_censor

Yes indeed I’ve been throwing new things at you the whole time haha. This thread isn’t so much about censoring as it is about touching on a whole lot of things. All the better.
lambda isn’t a function, it’s a keyword; it creates functions. So it’s like def. The difference is that def is a statement, and lambda is an expression, meaning it can be used in-line. That’s also its only advantage: it does not do anything that def can’t do.
map is something you should absolutely be familiar with. It accepts values (iterable) and a function, and produces f(x) for each of those values:

def map(f, things):
    return [f(x) for x in things]

so, it lets you apply a function to the values of a list or some other iterable … which you probably want to do really often. list comprehension often turns out nicer in python though.
languages that don’t have loops use things like this a LOT, but they’re nice in python as well when you have the concept to do something once, and wish to scale it up.

worth noting that filter is very similar:

def filter(f, things):
    return [x for x in things if f(x)]

…may as well show reduce while at it:

def reduce(f, things, accumulator):
    for x in things:
        accumulator = f(accumulator, x)
    return accumulator

If you use (+) as f, and 0 as acc, then you would get sum
It collects all the values into one summary value using a function to combine each value into the summary. You can use it to search, or convert from one data structure to another … but yeah it’s just a loop.
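
for example, with the reduce above:

total = reduce(lambda a, b: a + b, [1, 2, 3, 4], 0)  # 10, same as sum([1, 2, 3, 4])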

partial is a fun one. takes a function and some arguments. the result holds on to those things until you call this result, it will then call the original function, include the arguments, and also any arguments you provide on the second call
so, it lets you partially supply arguments to create a new function

add5 = partial(add, 5)  # given some add(a, b) that returns a + b
ten = add5(5)           # -> add(5, 5) == 10

looks a bit like this: (args is positional arguments, kwargs is named arguments, so this is essentially collecting arguments twice and then passing them on to f) (and I suspect this is slightly wrong when providing same argument twice, but, close enough)

def partial(f, *args1, **kwargs1):
    def wrapper(*args2, **kwargs2):
        return f(*args1, *args2, **kwargs1, **kwargs2)
    return wrapper

product pairs each thing with each other thing. coordinates on a grid, for example:
coords = product(range(10), range(10))
If you wanna loop through coordinates, doesn’t it make sense to have one loop? for each coordinate … having two loops (x,y) to iterate through one concept isn’t nice.
indices_to_censor had a lot of nesting and I was happy to get rid of one level using this.
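
for instance:

>>> from itertools import product
>>> list(product(range(2), ["a", "b"]))
[(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]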

the python modules itertools, functools, collections have all kinds of nifty things in them

Wow, thank you for the explanations! I introduced a couple into my code, but am trying to avoid basically copying yours.

from itertools import chain, zip_longest, product
from functools import partial

email_one = open("email_one.txt").read()
email_two = open("email_two.txt").read()
email_three = open("email_three.txt").read()
email_four = open("email_four.txt").read()


def groupby(function, email):
    current = []
    result = []

    for char in email:
        if current and function(char) != function(current[0]):
            result.append(current)
            current = []
        current.append(char)

    if current:
        result.append(current)

    return ["".join(group) for group in result]


def extract_words(email):
    return [word for word in groupby(str.isalpha, email) if word.isalpha()]


def extract_non_words(email):
    return [word for word in groupby(str.isalpha, email) if not word.isalpha()]


def indices_to_censor(phrases, email, after=0):
    should_censor = set()
    count = 0
    words = extract_words(email.lower())

    for i, phrase in product(range(len(words)), phrases):
        phrase_split = phrase.lower().split()
        words_here = words[i : i + len(phrase_split)]
        if words_here == phrase_split:
            if count >= after:
                should_censor.update(set(range(i, i + len(phrase_split))))
            count += 1

    return should_censor


def surrounding_indices(phrases, email):
    should_censor = set()
    for index in indices_to_censor(phrases, email):
        surrounding = [index - 1, index, index + 1]
        for neighbour in surrounding:
            if neighbour in range(len(email)):
                should_censor.add(neighbour)

    return should_censor


def run_censor(function, phrases, email):
    words = extract_words(email)
    non_words = extract_non_words(email)

    for index in function(phrases, email):
        words[index] = "*" * len(words[index])

    censored = list(chain(*zip_longest(words, non_words, fillvalue="")))

    return "".join(censored)


proprietary_terms = [
    "she",
    "personality matrix",
    "sense of self",
    "self-preservation",
    "learning algorithms",
    "her",
    "herself",
]

negative_words = [
    "concerned",
    "behind",
    "danger",
    "dangerous",
    "alarming",
    "alarmed",
    "out of control",
    "help",
    "unhappy",
    "bad",
    "upset",
    "awful",
    "broken",
    "damage",
    "damaging",
    "dismal",
    "distressed",
    "distressing",
    "concerning",
    "horrible",
    "horribly",
    "questionable",
]

censor_after_2 = partial(indices_to_censor, after=2)

# censor single phrase
censor_one = partial(run_censor, indices_to_censor, ["learning algorithms"])
# censor list of phrases
censor_two = partial(run_censor, indices_to_censor, proprietary_terms)
# censor list of phrases after 2 occurrences
censor_three = partial(run_censor, censor_after_2, negative_words)
# censor lists of phrases as well as immediately surrounding words
censor_four = partial(
    run_censor, surrounding_indices, proprietary_terms + negative_words
)

As you can see, I’m not quite done with censoring email_three.

I’m having trouble understanding what your compose function is doing.

  • What is x?

  • What does the function do when strategy3b and strategy2 are passed in? Is it evaluating one of the functions first, then using the result as input for the next one?

If you have two functions, then you can create a third function by combining those two functions.
If you have functions like add flour break eggs etc, then you can compose those into “make pancakes”

# I have some value x and functions f g, I want to apply both f and g
# g should be applied first.
result = f(g(x))
# ^ that's not new, that's just using functions.

# but you might also want to define a function in terms of others:
h = compose(f, g)
result = h(x)

I already had strategy2, so obviously I wouldn’t want to re-define it as part of strategy3. So I define the rest. After that I’ll have two functions. What would I need to do with them to obtain strategy3? … compose the two functions. How would one implement that? Well, there are two inputs involved, and an output… This sounds like a function. Write that function.
Yes, call one and then the other on the result. This perspective is equivalent… and as far as python goes, usually preferred.

Makes sense.

  1. x is the email I want to censor.

  2. g is strategy2, which returns x with proprietary_terms censored.

  3. f is strategy3b, which takes the result of g(x) as input and censors negative_words after two occurrences.

I think what threw me off was lambda x:. But I think I understand it now.

def compose(f, g):
    return lambda x: f(g(x))

where lambda x: f(g(x)) is equivalent to

def lambda(x):
    return f(g(x))

That function doesn’t have a name and you can’t use a keyword (lambda) as the name of a function, same as you can’t do:

def def(x):
    return x

But, yeah.

We are now totally leaving what’s considered sane python, but we can write programs with functions in place of statements:

map = lambda f, xs: [f(xs[0])] + map(f, xs[1:]) if xs else []
reduce = lambda f, acc, xs: reduce(f, f(acc, xs[0]), xs[1:]) if xs else acc
add = lambda a, b: a + b
partial = lambda f, *args1: lambda *args2: f(*args1, *args2)
compose = lambda f, g: lambda x: f(g(x))
sum = partial(reduce, add, 0)
digits = compose(list, compose(partial(map, int), str))
digit_sum = compose(sum, digits)
>>> digit_sum(123)
6
>>> digit_sum(123456789)
45

That’s still relying on a whole lot of different list operations, and int and str, but functions get us the rest of the way to a useful program.

Ah, so lambda is a reserved word.

I am comfortable enough with lambda to include it in my final solution to the project:

from itertools import chain, zip_longest, product
from functools import partial


def compose(f, g):
    return lambda x: f(g(x))


def groupby(function, email):
    current = []
    result = []

    for char in email:
        if current and function(char) != function(current[0]):
            result.append(current)
            current = []
        current.append(char)

    if current:
        result.append(current)

    return ["".join(group) for group in result]


def extract_words(email):
    return [word for word in groupby(str.isalpha, email) if word.isalpha()]


def extract_non_words(email):
    return [word for word in groupby(str.isalpha, email) if not word.isalpha()]


def indices_to_censor(phrases, email, after=0):
    should_censor = set()
    count = 0
    words = extract_words(email.lower())

    for i, phrase in product(range(len(words)), phrases):
        phrase_split = extract_words(phrase)
        words_here = words[i : i + len(phrase_split)]
        if words_here == phrase_split:
            if count >= after:
                should_censor.update(set(range(i, i + len(phrase_split))))
            count += 1

    return should_censor


def surrounding_indices(phrases, email):
    should_censor = set()
    # the indices refer to positions in the extracted word list, so the
    # bounds check should use the word count, not the character count
    word_count = len(extract_words(email))
    for index in indices_to_censor(phrases, email):
        surrounding = [index - 1, index, index + 1]
        for neighbour in surrounding:
            if neighbour in range(word_count):
                should_censor.add(neighbour)

    return should_censor


def run_censor(function, phrases, email):
    words = extract_words(email)
    non_words = extract_non_words(email)

    for index in function(phrases, email):
        words[index] = "*" * len(words[index])

    # if the email starts with a non-alpha character, the non-word group
    # comes first, so swap before interleaving
    if not email[:1].isalpha():
        words, non_words = non_words, words

    censored = list(chain(*zip_longest(words, non_words, fillvalue="")))

    return "".join(censored)


censor_after_2 = partial(indices_to_censor, after=2)

proprietary_terms = [
    "she",
    "personality matrix",
    "sense of self",
    "self-preservation",
    "learning algorithms",
    "her",
    "herself",
]

negative_words = [
    "concerned",
    "behind",
    "danger",
    "dangerous",
    "alarming",
    "alarmed",
    "out of control",
    "help",
    "unhappy",
    "bad",
    "upset",
    "awful",
    "broken",
    "damage",
    "damaging",
    "dismal",
    "distressed",
    "distressing",
    "concerning",
    "horrible",
    "horribly",
    "questionable",
]


def main():

    email_one = open("email_one.txt").read()
    email_two = open("email_two.txt").read()
    email_three = open("email_three.txt").read()
    email_four = open("email_four.txt").read()

    # censor1 censors "learning algorithms"
    censor1 = partial(run_censor, indices_to_censor, ["learning algorithms"])
    # censor2 censors all phrases in proprietary_terms
    censor2 = partial(run_censor, indices_to_censor, proprietary_terms)
    # censor3b censors all negative_words after 2 occurrences
    censor3b = partial(run_censor, censor_after_2, negative_words)
    # censor3 applies censor2 to the passed in email first, then censor3b
    censor3 = compose(censor3b, censor2)
    # censor4 censors all proprietary_terms and negative_words, as well as immediately surrounding words
    censor4 = partial(
        run_censor, surrounding_indices, proprietary_terms + negative_words
    )

    print(censor1(email_one))
    print(censor2(email_two))
    print(censor3(email_three))
    print(censor4(email_four))


if __name__ == "__main__":
    main()


Though not comfortable enough (yet) to fully understand some of your uses of lambda.

I can’t thank you enough for your help with this project!

I’ll make sure to revisit this topic to get to grips with all of the new concepts and functions you introduced.
