Censor Dispenser - Looking for help with unexpected output

I’m working my way through the Censor Dispenser challenge project and have just completed step 4 (or so I thought). Here is my code so far:

def censor_one(email, phrase):
    return email.replace(phrase, '*' * len(phrase))

#print(censor_one(email_one, 'learning algorithms'))

proprietary_terms = ["she", "personality matrix", "sense of self", "self-preservation", "learning algorithms", "her", "herself"]

def censor_two(email, phrases):
    # Splits email using ' ' as the delimiter and replaces 'clean' words (with no punctuation) with '*'s equal to the word's length. This loop is designed to focus on phrases with no spaces in them.
    email_split = email.split(' ')
    for i in range(len(email_split)):
        if email_split[i].lower() in phrases:
            email_split[i] = '*' * len(email_split[i])

    censored_email = ' '.join(email_split)
    
    # Iterates through each character in censored_email and looks for each phrase in a slice of censored_email equal to the phrase's length. If found, the slice has '*'s equal to the phrase's length assigned to it. This loop is designed to censor any phrases which weren't censored by the previous loop because of punctuation and/or spaces.
    for i in range(len(censored_email)):
        for phrase in phrases:
            if phrase == censored_email[i:i+len(phrase)].lower() and not censored_email[i-1].isalpha():
                censored_email = censored_email.replace(censored_email[i:i+len(phrase)], '*' * len(phrase))

    return censored_email

#print(censor_two(email_two, proprietary_terms))

negative_words = ["concerned", "behind", "danger", "dangerous", "alarming", "alarmed", "out of control", "help", "unhappy", "bad", "upset", "awful", "broken", "damage", "damaging", "dismal", "distressed", "distressing", "concerning", "horrible", "horribly", "questionable"]

def censor_three(email, phrases):
    censored_email = censor_two(email, proprietary_terms)
    
    count = 0
    index = 0
    for i in range(len(censored_email)):
        if count < 2:
            for phrase in phrases:
                if phrase == censored_email[i:i+len(phrase)].lower() and not censored_email[i-1].isalpha():
                    count += 1
                    index = i + len(phrase)
        else:
            break
    
    new_censored = []
    if count >= 2:
        new_censored.append(censored_email[:index])
        new_censored.append(censor_two(censored_email[index:], phrases))     
    
    return ''.join(new_censored)

#print(censor_three(email_three, negative_words))
print(censor_two(email_four, negative_words + proprietary_terms))

I tested my censor functions on email_four and did not get the output I was expecting. If you run the code, you’ll notice in the console that there are words which shouldn’t be censored - such as ***e., ********ly, and ******ous.

I don’t understand why I’m getting this output in email_four when, in email_two and email_three, I get my expected output.

I expect censor_two to censor every phrase found in a passed in list, and censor_three to censor every phrase in proprietary_terms as well as every phrase in negative_terms after the second occurrence of a negative word – irrespective of case and punctuation.

These functions work as expected on email_two and email_three respectively.

I welcome any help with this.

Seems to me like you’ll want to dig a little deeper than “not what I expected”

What do you have in your code that makes you argue that it should behave differently?

And, is it the case that this part of the code is doing its job? Is the idea wrong, or the code?

If you for example expect to censor either the whole word, or not at all, but you’re ignoring word boundaries, then, that would be a problem with the idea, and that also says something about what needs to be different, namely looking at each word and then making a decision about whether to censor it.

If something got censored that never should have been matched, then which exact operation behaved incorrectly? There’s a lot of narrowing down that can be done there.

You say that this shouldn’t be censored:

********ly

But what was the text there, and is that in your list of words? It is. So… it … should… be censored? You’ll have to motivate why it should or shouldn’t, and your code would need to match that motivation.

I cleaned up your code a bit. Removed things that weren’t doing anything, otherwise it’s the same.

def censor_two(email, phrases):
    for i in range(len(email)):
        for phrase in phrases:
            substr = email[i:i + len(phrase)]
            if substr.lower() == phrase:
                email = email.replace(substr, '*' * len(phrase))

    return email


email_four = '''Helena is dangerous. She is completely unpredictable and
cannot be allowed to escape this facility. So far she's been contained because
the lab contains all of her processing power, but alarmingly she had mentioned
before the lockdown that if she spread herself across billions of connected
devices spanning the globe she would be able to vastly exceed the potential
she has here.'''
proprietary_terms = [
    "she", "personality matrix", "sense of self", "self-preservation",
    "learning algorithms", "her", "herself"
]
negative_words = [
    "concerned", "behind", "danger", "dangerous", "alarming", "alarmed",
    "out of control", "help", "unhappy", "bad", "upset", "awful", "broken",
    "damage", "damaging", "dismal", "distressed", "distressing", "concerning",
    "horrible", "horribly", "questionable"
]
print(censor_two(email_four, negative_words + proprietary_terms))

Though, it’s actually just this, if simplifying further (the comparisons made in your code are already done by str.replace, so in effect it does nothing and can be removed. Same with the first part of the function.)

def censor_two(email, phrases):
    for phrase in phrases:
        email = email.replace(phrase, len(phrase) * '*')
    return email

If I run the following I get inconsistent outputs.

print(censor_two(email_two, proprietary_terms))
# 'communication between the system and our team of researchers.'
print(censor_two(email_four, proprietary_terms))
# 'It's been four days now we've been trapped in ***e.' <-- ***e = here

The if statement in censor_two (if phrase == censored_email[i:i+len(phrase)].lower() and not censored_email[i-1].isalpha():) is the cause, but I’m at a loss on how to correct. Removing and not censored_email[i-1].isalpha() leads to the following:

print(censor_two(email_two, proprietary_terms))
# 'communication between the system and our team of researc***s.' <-- researc***s = researchers
print(censor_two(email_four, proprietary_terms))
# 'It's been four days now we've been trapped in ***e.' <-- ***e = here

Adding more logical operators to the if statement (I’ve tried multiple configurations using or, and not, and censored_email[I+1].isalpha()`) leads to nothing being censored at all.

There isn’t anything.

Because I haven’t been able to add a condition to censor_two which checks if the last character in the string which the loop is currently on is not alpha, I will continue to see output like ***e or ******rous.

Plus the and not censored_email[i-1].isalpha() only checks if a non-alpha character preceeds the phrase I’m looking to censor.

I would say the idea is wrong as the code does what I told it to do.

In censor_two, I consider word boundaries in the first for loop, but then disregard them in the second, because I couldn’t figure out how to separate punctuation or how I would check if a bad multi-word phrase was in email_split.

How would you approach separating punctuation from alpha characters if found in an element of a split string?

How would you check if a bad multi-word phrase like ‘learning algorithm’ was in the email and then censor it?

The original word is 'alarmingly`, which doesn’t appear in proprietary_terms or negative_words. I think this shouldn’t have been censored because it falls under the same banner as words like ‘researc***s’ and ‘***e’: words which contain a bad word but are not an exact match.

split at word boundaries (isalpha). you would then have words only. the separators can be put back once the censoring has finished.

if I have a list of only words with nothing else to get in the way (which I do after doing the above) then looking at two words at a time becomes easy

You’re doing text search-and-replace on it, and one of your words to replace is alarming

Thank you for trimming the fat from my code.

Your second censor function is almost identical to something I wrote initially, but I found that it replaced the her in researchers with *** and does not censor herself because her is passed into the temporary variable phrase first.

I’d start by obtaining this:

[ "Helena" , " " , "is" , " " , "dangerous" , ". " , "She" , " " , "is" , " "
, "completely" , " " , "unpredictable" , " " , "and" , " " , "cannot" , " " ,
"be" , " " , "allowed" , " " , "to" , " " , "escape" , " " , "this" , " " ,
"facility" , ". " , "So" , " " , "far" , " " , "she's" , " " , "been" , " " ,
"contained" , " " , "because" , " " , "the" , " " , "lab" , " " , "contains" ,
" " , "all" , " " , "of" , " " , "her" , " " , "processing" , " " , "power" ,
", " , "but" , " " , "alarmingly" , " " , "she" , " " , "had" , " " ,
"mentioned" , " " , "before" , " " , "the" , " " , "lockdown" , " " , "that" ,
" " , "if" , " " , "she" , " " , "spread" , " " , "herself" , " " , "across" ,
" " , "billions" , " " , "of" , " " , "connected" , " " , "devices" , " " ,
"spanning" , " " , "the" , " " , "globe" , " " , "she" , " " , "would" , " " ,
"be" , " " , "able" , " " , "to" , " " , "vastly" , " " , "exceed" , " " ,
"the" , " " , "potential" , " " , "she" , " " , "has" , " " , "here" , "." ]

The same condition as is used to do this splitting can be used to separate them into words and other.
So you’d then have…

["Helena","is","dangerous","She","is","completely","unpredictable","and","cannot","be","allowed","to","escape","this","facility","So","far","she's","been","contained","because","the","lab","contains","all","of","her","processing","power","but","alarmingly","she","had","mentioned","before","the","lockdown","that","if","she","spread","herself","across","billions","of","connected","devices","spanning","the","globe","she","would","be","able","to","vastly","exceed","the","potential","she","has","here"]
[" "," ",". "," "," "," "," "," "," "," "," "," "," "," ",". "," "," "," "," "," "," "," "," "," "," "," "," "," ",", "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," "," ","."]

Then I can create things like:

[["Helena","is"],["is","dangerous"],["dangerous","She"],["She","is"],["is","completely"],["completely","unpredictable"],["unpredictable","and"],["and","cannot"],["cannot","be"],["be","allowed"],["allowed","to"],["to","escape"],["escape","this"],["this","facility"],["facility","So"],["So","far"],["far","she's"],["she's","been"],["been","contained"],["contained","because"],["because","the"],["the","lab"],["lab","contains"],["contains","all"],["all","of"],["of","her"],["her","processing"],["processing","power"],["power","but"],["but","alarmingly"],["alarmingly","she"],["she","had"],["had","mentioned"],["mentioned","before"],["before","the"],["the","lockdown"],["lockdown","that"],["that","if"],["if","she"],["she","spread"],["spread","herself"],["herself","across"],["across","billions"],["billions","of"],["of","connected"],["connected","devices"],["devices","spanning"],["spanning","the"],["the","globe"],["globe","she"],["she","would"],["would","be"],["be","able"],["able","to"],["to","vastly"],["vastly","exceed"],["exceed","the"],["the","potential"],["potential","she"],["she","has"],["has","here"],["here"],[]]

And those can then be compared with == to phrases (somewhere one would need to deal with up/downcase, either before and restoring later, or replace == with something that is able to ignore it)

Note that the first list contains information like up/downcasing, and where puntuation/word locations are, allowing for putting everything back together later.

In particular I would avoid things like replace/find, I don’t need the searching behaviour and I don’t want them to act more than once, and I definitely don’t want them to act somewhere other than where I meant, they’re quite useless for doing this correctly.

I would quite likely want to make all comparisons against the original rather than against the updated censored data, because changing it affects what gets matched. Either by using two copies, or by marking the locations and then applying censoring after having made all the decisions.

Using a regular expression or something else? I haven’t learned about regular expressions yet.

Would this be a good opportunity to learn about them, or should I try to use what I’ve learned so far in the Codecademy Computer Science Path? All of my Python knowledge comes from this course (I have completed everything up to and including Strings: Thread Shed)

But what if you need to censor a list of phrases where elements contain a different amount of words? My first thought is to search for a split variant of the phrase in a split email. For example, looking for 'learning', 'algorithm' in ['abc', 'def', 'learning', 'algorithm', 'ghi']

regexps are useful for text but I’m not treating this like text, this is all list operations. append, lookup by index, length, that’s it.
what might be new to you is a willingness to split things up into separate functions each responsible for their own tasks, and then combining those functions in a way that creates layers of abstraction, overall quite the circus, but each individual part is simple so it’s all about learning to split things up.

Splitting into alpha/non-alpha is done by reading until it changes between those two options. at each change, cut, start a new list, keep going, repeat until having reached the end.

Creating lists of length-N-phrases is done by taking N words, creating a list for them, go to next location, repeat…

Think of it like this. You have some big problem to solve.
What grand abilities would you need to easily solve it?
Create those abilities for yourself, one at a time. Likely those too will require smaller problems to be solved, more abilities to create for yourself.

So long as these problems become smaller and smaller, you will eventually have created enough tools to solve the problem. Each part can be simple because each part relies on having useful tools for solving that problem. Nothing should be bending over backwards because of lack of tools to use.

You may also find that when you have many tools available it becomes easier to create new ones. It might seem like small things are useless, but they very quickly combine into powerful things.

I had a go at implementing this (albeit slightly differently) but cannot how to figure out how to repeat until having reached the end.

This is the best of what I’ve tried so far:

Without functions

email_split = []
alpha = ''
non_alpha = ''
i = 0

while email_four[i].isalpha():
    alpha += email_four[i]
    i += 1
email_split.append(alpha)

while not email_four[i].isalpha():
    non_alpha += email_four[i]
    i += 1
email_split.append(non_alpha)

With functions

def append_alpha(email, i):
    alpha = ''
    
    while email[i].isalpha():
        alpha += email[i]
        i += 1
    
    #append_non_alpha(email, i)
    return alpha

def append_non_alpha(email, i):
    non_alpha = ''
        
    while not email[i].isalpha():
        non_alpha += email[i]
        i += 1
    
    #append_alpha(email, i)
    return non_alpha

def split_email(email):
    email_split = []
    i = 0
    
    #for char in email:
    #for email[i] in email:
    #for i in email:

    while email[i].isalpha():
        email_split.append(append_alpha(email, i))
        i += len(append_alpha(email, i))
    while not email[i].isalpha():
        email_split.append(append_non_alpha(email, i))
        i += len(append_non_alpha(email, i))
        
    return email_split

Output

print(email_split) # ['SEND', ' ']
print(split_email(email_four)) # ['SEND', ' ']

First of all you need a function for telling the two cases apart. str.isalpha will work, though you might want to take apostrophes into account as well, and count that as a word-character.

The values you’re grouping is all the values. So you can iterate through them same as you would for anything else:

for thing in things:
    ...

That’ll deal with processing the whole thing, because you’re reading all of them.

Then you’d use the aforementioned function to make a decision. Should this thing go into the current group, or is this different, and then create a new empty group to put it in.

If you instead have code that finds one group and then stops, then you would need to do this repeatedly.

while there's more:
    extract the first group

Also note that extracting the group is the same thing regardless of whether it’s one kind of group or another.
So you should not have code for extracting a group of letters, and another piece of code for extracting a group of non-letters.
You should instead have code for extracting a group.


If you find yourself wanting to mutate the input (not needed, but it may be tempting to split off one group and then continue with the rest) then, you should be doing this from the back of the list. Lists support adding and removing things at the back, it cannot be done from the front, not without re-creating the whole thing. Alternatively, use something other than a list that does support the operation you’re looking for. (In this case, list already does support it, by working from the other end)


Keep in mind that you already know how to do this manually, so if you’re wondering what needs to happen … refer to what you’d do.

I can’t thank you enough for your help so far. You’ve really helped me understand the importance of breaking things down into small, manageable chunks.

I went back to square one and am so close to obtaining what you obtained.

The only difference between your output and mine is that mine has an empty string as the first element and the full stop is missing at the end if I don’t account for that specific situation.

['', 'Helena', ' ', 'is', ' ', 'dangerous', '. ', 'She', ' ', 'is', ' ', 'completely', ' ', 'unpredictable', ' ', 'and', ' ', 'cannot', ' ', 'be', ' ', 'allowed', ' ', 'to', ' ', 'escape', ' ', 'this', ' ', 'facility', '. ', 'So', ' ', 'far', ' ', "she's", ' ', 'been', ' ', 'contained', ' ', 'because', ' ', 'the', ' ', 'lab', ' ', 'contains', ' ', 'all', ' ', 'of', ' ', 'her', ' ', 'processing', ' ', 'power', ', ', 'but', ' ', 'alarmingly', ' ', 'she', ' ', 'had', ' ', 'mentioned', ' ', 'before', ' ', 'the', ' ', 'lockdown', ' ', 'that', ' ', 'if', ' ', 'she', ' ', 'spread', ' ', 'herself', ' ', 'across', ' ', 'billions', ' ', 'of', ' ', 'connected', ' ', 'devices', ' ', 'spanning', ' ', 'the', ' ', 'globe', ' ', 'she', ' ', 'would', ' ', 'be', ' ', 'able', ' ', 'to', ' ', 'vastly', ' ', 'exceed', ' ', 'the', ' ', 'potential', ' ', 'she', ' ', 'has', ' ', 'here']

This is what I wrote to get to this point:

test_string = 'Helena is dangerous. She is completely unpredictable and cannot be allowed to escape this facility. So far she\'s been contained because the lab contains all of her processing power, but alarmingly she had mentioned before the lockdown that if she spread herself across billions of connected devices spanning the globe she would be able to vastly exceed the potential she has here.'

# Function to tell the two cases apart
def isalpha(string):
    isalpha = []

    for char in string:
        if char.isalpha() or char == '\'':
            isalpha.append('y')
        else:
            isalpha.append('n')
    
    return isalpha
    # I now know, for each character in the string, whether it is alpha or not.

# Function to split a string into alpha/non-alpha characters
def split_string(string):
    list1 = []
    list2 = []
    
    for i in range(len(string)):
        # 1. compare char and the previous char and use isalpha to check if they are both alpha/non-alpha
        # 2. if they are, continue adding char to list1
        # 3. if they are not, stop, append the joined list1 to a new list, and make list1 empty
        # 4. repeat steps 1–3
        
        if i != len(string) - 1:
            if isalpha(string[i]) == isalpha(string[i-1]):
                list1.append(string[i])
            else:
                list2.append(''.join(list1))
                list1 = []
                list1.append(string[i])
        elif i == len(string) - 1 and isalpha(string[i]) != isalpha(string[i-1]):
            # Attempt to account for last punctuation mark in string, but will not include whole last word if there isn't a punctuation mark at the end because a non-alpha character precedes the final word.
            list2.append(''.join(list1))
            list2.append(string[i])

    
    return list2
    # I have now split the string into alpha/non-alpha characters.

print(split_string(test_string))

I’m sure the difference between our outputs relates to my comparison of the current and previous character (if isalpha(string[i]) == isalpha(string[i-1]):), but I’ve been scratching my head for a while.

  • I’m sure I get the empty string at the beginning of my list because the if statement is comparing the isalpha values of string[0] # 'H' with string[-1] #'.'.

  • I’ve tried to account for the final character being a non-alpha character, but as you can see it is very fragile.

Is there a better way for me to account for edge cases like these? Or do you have any suggestions as to where my code could be altered to not have to account for such edge cases?

Yes, the first value is special because nothing precedes it. But you do compare to something, that’s a bug.
What you would want to do is to unconditionally put it in a group, same as for all the other first values of groups.

The last group is special as well, because there isn’t a next group to terminate it, so it needs to be terminated anyway.

There’s a better way to test for when you’re done. When you’re done, the code after the loop runs. So you’d put that code after the loop.

I’d let the grouping function accept a comparison function instead of hard-coding isalpha:

>>> groupby(even, [0, 1, 2, 4, 5]
[[0], [1], [2, 4], [5]]

You only need isalpha for a single value at a time. It does not need to produce a list. Also, rather than returning strings, consider using booleans.

I’d probably prefer to avoid str.join, and treat it like a list instead. The caller can deal with string-specific things.

Depending on perspective, one can end up with quite a lot of edge cases for this. My version has the same problem:

def groupby(key, seq):
    if not seq:
        return []
    res = []
    current = []
    current_p = key(seq[0])
    for thing in seq:
        p = key(thing)
        if p != current_p:
            res.append(current)
            current = []
        current.append(thing)
        current_p = p
    res.append(current)
    return res

yikes.

Doing one group at a time helps a bit, some edge cases become … not edge cases, part of the loop.

The current group is always the first group, and a group is always non-empty meaning the first value can be used to compare against.
And each group is read in full and added, so it doesn’t end up having to add the last group in a special case after it’s done.

def groupby2(key, seq):
    res = []
    offset = 0
    while offset < len(seq):
        p = key(seq[offset])
        size = 1
        while offset + size < len(seq) and key(seq[offset + size]) == p:
            size += 1
        res.append(seq[offset:offset + size])
        offset += size
    return res

as a bonus, slicing will produce a string when used on string, no str.join required anywhere.

Even nicer yet would be if one had implementations of takeWhile and dropWhile - then one could test the first value, then take values while some condition holds, and obtain the rest by dropping by the same condition.
But. This would be removing things from the front, and lists… can’t.
From the back though?

def take_end_while(key, lst):
    res = []
    for val in reversed(lst):
        if not key(val):
            break
        res.append(val)
    return res[::-1]


def drop_end_while(key, lst):
    while lst and key(lst[-1]):
        lst.pop()


def groupby(key, seq):
    seq = list(seq)
    res = []
    while seq:
        outcome = key(seq[-1])
        def p(x): return key(x) == outcome
        res.append(take_end_while(p, seq))
        drop_end_while(p, seq)
    return res[::-1]

I think it’s an improvement. Is it weird to work from the back? Yeah a bit. Could use a double-ended queue instead.

from collections import deque


def take_while(key, deq):
    res = []
    for val in deq:
        if not key(val):
            break
        res.append(val)
    return res


def drop_while(key, deq):
    while deq and key(deq[0]):
        deq.popleft()


def groupby(key, seq):
    seq = deque(seq)
    res = []
    while seq:
        outcome = key(seq[0])
        def p(x): return key(x) == outcome
        res.append(take_while(p, seq))
        drop_while(p, seq)
    return res

Of course, what I actually did was to use a built-in groupby. I didn’t write it. But, it’s an excellent function to implement for practice because it only involves basic operations and yet can’t really be solved by accident, requires doing some programming.

I believe I’ve accounted for this in my attempt by adding an if i == 0 statement at the beginning of my loop, though as you point out, one could go on forever accounting for edge cases.

# Function to split a string into alpha/non-alpha characters
def split_string(string):
    list1 = []
    list2 = []
    
    for i in range(len(string)):
        # if == 0 needed to avoid empty string at beginning of returned list because of comparison between string[0] and string[-1] (edge case)
        if i == 0:
            list1.append(string[i])
        elif isalpha(string[i]) == isalpha(string[i-1]):
            list1.append(string[i])
        else:
            list2.append(list1)
            list1 = []
            list1.append(string[i])
    # Once loop is done, you will have one last thing to append which doesn't append normally because no more iterations (and thus appends) are going to happen. (edge case)
    list2.append(list1)
    
    return list2

joined_elements = [''.join(element) for element in split_string(test_string)]

What are the benefits of doing so? I imagine reusability is a big reason.

Silly me, of course I don’t need a list! I only care about the character I’m checking, not every character in the string!

Why should I return booleans instead of strings? Better performance, making use of something built-in rather than reinventing the wheel?

Is this to increase reusability?

I went through your groupby and groupby2 functions line-by-line and would like to make sure I understand what’s happening in groupby2:

def groupby2(key, seq):
    res = []
    offset = 0
    while offset < len(seq):
    # p = function return value at seq[offset]. Initial value == key(offset[0])
        p = key(seq[offset])
        # used in check if next key return value == current. size can never be less than 1. size grows until return value != current
        size = 1
        # while offset + size < len(string) and next key return value is equal to current key value, add 1 to size (this determines the size of the slice we are going to take, where each character within the slice returns the same key value).
        while offset + size < len(seq) and key(seq[offset + size]) == p:
            size += 1
            # when the next key return value is not equal to current, we exit the inner while loop and append a slice of string from offset to offset + size to res
        res.append(seq[offset:offset + size])
        # we then add the size of the slice we just took to offset, thus avoiding looking at those characters again
        offset += size
        # while loop continues until offset >= len(seq) otherwise we'd get an IndexError
    return res

I read about groupby() in the documentation but couldn’t figure out why I need to define two temporary variables in order to return the desired output.

groups = [''.join(list(g)) for k, g in itertools.groupby(test_string, isalpha)]
print(groups) # Desired output

groups = [''.join(list(g)) for g in itertools.groupby(test_string, isalpha)]
print(groups) # TypeError: sequence item 0: expected str instance, bool found

I’d let the grouping function accept a comparison function instead of hard-coding isalpha

What are the benefits of doing so? I imagine reusability is a big reason.

to converge with the already existing function, to see that hey, this thing I needed, it’s a common widely applicable thing and it already exists, and, hey I know what this does because I wrote it
and it is immediately useful here: you can group by isalpha and then filter by isalpha, the filtering would separate out the words and would use the same condition (actually not exactly the same, because filtering would need to apply the condition to one of the values in the group instead of to the whole group, but the same condition is part of that)

Doesn’t matter much. It’s weird, the default choice here would be boolean. Booleans have language support, “y” “n” don’t. In this case you only care about the outcomes being different, but if you were to combine this function with something else (for example filter) then to make things fit together the result would need to be boolean.

I’d probably prefer to avoid str.join

Is this to increase reusability?

What str makes us do to concatenate them is so weird I don’t think it belongs in a general algorithm. And yeah, using it means it’s only applicable to string. And yeah, part of my bias is definitely that I want to converge with itertools.groupby

Yeah it’s anything but beginner friendly. When you call it, it doesn’t do any work, but instead creates an iterator that you can tug at to get values out of. And the things that this iterator produces are… tuples and the second value of each tuple is yet another iterator.
So it has three levels of nesting and the string representations of iterators will look confusing.

You don’t need to do both of those. The reason you would use list there is to make list iterate through g and create a list, but str.join will also iterate so that too will force evaluation.
Each value produced by itertools.groupby is a tuple, and it’s the second value in the tuple that contains the values of the group. You need to dig it out.

a, b = 3, 4
print(a)  # 3
print(b)  # 4

I think the best thing you can do is to consider your perspective when doing it manually, to try to find a formulation of a solution that you can reasonably well convince yourself is correct. And if you can write your code to match this perspective, all the better.

Oh and I found a bug (because I had some test code so I plugged yours in)

empty causes your latest function to produce:
[[]] (one empty group)
should be:
[] (no groups)

also, compulsively rewrote it a bit:

def groupby5(key, seq):
    current = []
    result = []
    for val in seq:
        if current and key(val) != key(current[0]):
            result.append(current)
            current = []
        current.append(val)
    if current:
        result.append(current)
    return result

with the changes that:

  • the index isn’t important information
  • the different branches in the loop were sharing things and could be combined (all of them appended the value, so that’s unconditional, which leaves only the else-branch as different)
  • appending the current group after the loop should only happen if there is a current group

the only real change is if current, the rest is cosmetic

Definitely.

How would you describe the difference between an iterator and an iterable to a beginner like me, who has no previous knowledge of __iter__ or __next__?

From my reading, all the distinctions between the two make use of __iter__ or __next__.

So it does. There are no indexes in an empty data type, so the for loop doesn’t do anything and the function just appends an empty list1 to list2.

This explains why you used

if not seq:
    return []

in your groupby function.

Your comparison to the first element in current, if non-empty, is much nicer.

Should beginners like myself worry about cosmetic things, or is it more important to just produce code that works as I want for now?

# Function to determine if a string is alpha/non-alpha
def isalpha(string):
    for char in string:
        if char.isalpha() or char == '\'':
            return True
    return False

# Function to group each character in a string depending on the value they return when passed in to the passed in function
def split_string(function, string):
    list1 = []
    list2 = []
    
    for i in range(len(string)):
        if i == 0:
            list1.append(string[i])
        elif function(string[i]) == function(list1[0]):
            list1.append(string[i])
        else:
            list2.append(list1)
            list1 = []
            list1.append(string[i])

    if list1:
        list2.append(list1)
    
    return list2

I’m happy with string_split. isalpha, not so much.

Passing in 'hello.' returns True.

Passing in '.' returns False.

I suppose this doesn’t matter too much because when we come to filter, all of the characters in an element will be either alpha OR non-alpha; there won’t be a mix.

Or would rewriting isalpha like so be preferable? I can’t think of a case where '\'' in string would mess up the grouping or filtering.

def isalpha(string):
    if string.isalpha() or '\'' in string:
        return True
    return False

Passing in 'hello.' returns False.

Passing in '.' returns False.

I prefer this version, but welcome your thoughts on it.

An iterator is iteration in progress, each time you poke it, you get another value (or it terminates)

An iterable is something that you can request an iterator from. So, a list is iterable but not an iterator.

iterator = iter([1])
first_value = next(iterator)  # 1
second_value = next(iterator)  # stopiteration

When you only have the required things it gets easier to see patterns in it. So yeah, do simplify things.

That’s the same thing and it appears in the third one as well. So it’s only the third one that needs testing for.

And that’s creating indirection (index) but always dereferencing, so use the thing directly

Both are things that cancel out, dead code, yeah you should care about that. I even think you should be a bit aggressive about it and even make simplifications that change behaviour, maybe something better comes out.

Something you can care a bit less about is code style. Though, if things that are equal look equal, then, again, patterns, easier to see, so consistency matters but not so much whether it’s one way or another. Automatic code formatters are great, use those, they deal with what matters without having to make a fuss. https://github.com/psf/black

naming things helps. list1 and list2 are names that won’t make it apparent if you do the wrong thing with them, you could for example return the wrong one and it wouldn’t be obvious.

# names matter
x.h()
missile.fire()

Your human perspective on the problem can help cosmetics too. How do you think of it, can you write your code in those terms? We do optimize things while we’re doing them, that’s something we can use.

I would argue that those things have different shape, should almost be a type error. That function is supposed to look at one character, 'hello.' isn’t valid input for it, is it?

and,

That’s the same self-cancelling thing again.

return something
def isalpha(string):
    if string.isalpha() or '\'' in string:
        return True
    return False

That’s very wrong, by the way. Containing an apostrophe isn’t sufficient.
A nice way to go about dealing with many of something is to implement the thing for ONE and scale it up separately by combining it with some concept of iteration:

all(map(isalpha, string))

this would be nicer but isn’t how all behaves in python:

all(isalpha, string)