How to split punctuation from a string list without creating a nested list

In the personal project for the Censor Dispenser (https://www.codecademy.com/practice/projects/censor-dispenser), the exercise asks whether your censor function can handle punctuation. After doing the initial string split on spaces, I thought of iterating through each element in the resulting list to look for exclamation points, then commas, then periods. The problem is that if I call split again on an element when I find one of these, it creates a nested list, whereas I need those strings to simply remain in the parent list for my function to work.
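
A quick sketch of the nesting problem described above, using a hypothetical two-word message:

```python
words = "Send help!".split(" ")   # ['Send', 'help!']
# splitting an element and putting the result back in its slot
# nests a list inside the parent list:
words[1] = words[1].split("!")
print(words)                      # ['Send', ['help', '']]
```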

I have tried the following (at first just for exclamation marks, to see if I get the concept right):

def heavy_censor(message):
  message2 = message.split(" ")
  message3 = []
  
  for word in message2:
    index = message2.index(word)
    if word.find("!") != -1 and len(message2[index]) > 1:
      message3 = message2[index].split("!")
      message2[index : index] = message3
      message2 = message2.remove(word)

But it returns an AttributeError saying a 'NoneType' object has no attribute, and I have no idea what this means, as I’ve never seen this error before.

Other attempts to iterate, splitting and adding to the parent list, have always just ended in infinite loops.

I’ve looked through the Python documentation and googled, but I couldn’t find another method of splitting a string within a list without creating a nested list at the newly split index.

Instead of find and testing index, use the in operator: a in b
Though, if it’s not there then splitting and joining again won’t have any effect, will it? So there’s no need to check.
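
The membership test in action:

```python
word = "help!"
print("!" in word)      # True
print("!" in "hello")   # False
```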

str.split returns a list, that list is not nested, there’s only one list.

Your AttributeError is probably from trying to get an attribute that isn’t there. You mixed something up.

>>> None.derp
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'derp'
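
In the posted code, the likely culprit is the last line of the loop: list.remove mutates the list in place and returns None, so assigning its result throws the list away.

```python
message2 = ["Send", "help!", "now"]
message2 = message2.remove("help!")   # remove() mutates in place and returns None
print(message2)                       # None
# any later message2.index(...) would then raise:
# AttributeError: 'NoneType' object has no attribute 'index'
```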

Rather than repeatedly searching through your list of words and removing from the middle of it (inefficient), consider writing to a new list instead, ignoring words that you don’t wish to keep.

Are you trying to change the word or remove it? Your code seems to do both, but if you remove it then changing it seems pointless.

Removing things from a list you’re currently iterating through will affect iteration. You’ll see skips caused from things moving to lower indices while the loop always moves forwards. Again, write to a new list instead.

For most tasks I have no need of knowing the index of things. Usually it’s enough to process one value at a time. Indices are a common source of bugs, so if you don’t need to know then don’t process it either. You might for example instead create a function that acts on ONE word, then map that function to the list of words.

def maybecensor(word):
    # act on ONE word: return it censored or unchanged
    banned = {"help"}  # example banned set
    return "*" * len(word) if word.lower() in banned else word

words = [...]  # your list of words
censored_words = [maybecensor(word) for word in words]
# or:
# censored_words = list(map(maybecensor, words))

I’m trying to separate punctuation from strings, leaving the word and the punctuation mark as separate string entries in the original list, without creating a nested list inside it.

For example, one of the emails you have to censor in the exercise starts with “Send help!”. After I split it with email.split(), the second string is “help!”, and I wish to separate it into “help” and “!”.

In my attempt I was trying to separate the exclamation point, assigning just the separated elements to a new list, then inserting them into a slice at their former index and removing the original entry where the punctuation was not separated.

And then do the same with all types of punctuation and line breaks (which have also been proving problematic, but one step at a time).

So one word could become many. Writing to a new list solves that too. Note that this iteration is inherently nested; there would be no point in trying to avoid that.

res = []
words = [...]
for word in words:
    # split(word) stands for whatever turns one word into pieces,
    # e.g. word.split("!")
    for splitword in split(word):
        res.append(splitword)

There are two reasons I can immediately think of to prefer this:

  • reading and writing from/to the same thing is always error-prone (have to make sure they don’t interfere with each other). keep it one-directional - have a source and a sink. One bug you almost certainly have is that your call to remove might actually remove a different value than you expected since you haven’t specified where in the list the removal should happen.
  • adding/removing stuff from anywhere other than the end of a list is inefficient, since all values that follow have to be moved every time you do it. If you need to do that then you don’t want a list; a dict would be better, as you could use indices as keys and insert/remove would be constant time. No need for that here though, since you can simply write to a new list instead. Or write nothing, to “remove” a value. Or write multiple.
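
The remove pitfall from the first bullet, sketched:

```python
words = ["a", "help", "b", "help"]
words.remove("help")   # removes the FIRST matching value, wherever it is
print(words)           # ['a', 'b', 'help']
```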

Okay, I tried that, but then it doesn’t create a new separate string for the punctuation I split on, which I need for when I put the message back together after the censor function does its thing. Basically the exclamation point in that case disappears.

def test(message):
  message1 = message.split(" ")
  message2 = []
  
  for word1 in message1:
    for word2 in word1.split("!"):
      message2.append(word2)
  
  return message2

This was printing ['Send', 'help', '(third word)', etc…]
Where I want it to print ['Send', 'help', '!', '(third word)', etc…]

I don’t see why you’d lose anything. You’d append all the strings you got from splitting to the new list.

Oh you’d lose it while splitting. That’s … you’d lose that no matter how your other things looked.
There’s nothing stopping you from appending that too though.

And, again, if at all possible then having a function that processes just one word is quite nice.

so something like:

def test(message):
  message1 = message.split(" ")
  message2 = []
  
  for word1 in message1:
    if "!" in word1:
      for word2 in word1.split("!"):
        message2.append(word2)
        message2.append("!")
    else:
      message2.append(word1)
  
  return message2

The problem is now the aforementioned line breaks. Since there is a line break after the “help!” word, this is now printing:

['SEND', 'HELP', '!', '\n\nHelena', '!', 'has', ....]

When in fact the message has no exclamation point after “Helena”.

I’d make an argument for splitting on whitespace and re-formatting afterwards.

If you had to keep the whitespace you’re splitting on, then you could split with a capturing regex pattern, but don’t you kind of want to re-break the lines anyway?

>>> import re
>>> re.split(r'(\s+)', 'hello there \n bob')
['hello', ' ', 'there', ' \n ', 'bob']

And then… you’d probably want to use the same pattern again to determine whether or not your function should be applied. You could also include punctuation, or split on words instead.

Yeah, that’s been puzzling me a lot. I’m very new to coding; I’ve just done the Computer Science path basically up until this project.

I’m having trouble understanding how the join function will reassemble the message properly afterwards, after I’ve split on all the whitespace, instead of just putting line breaks after each string in the list, for example.

Edit: I haven’t learned import yet, so guess I’ll have to look that up. Thanks for the pointer.

Join puts the string it’s called on between every string in the list. If you call it on an empty string then it’s just concatenation.

>>> import re
>>> re.split(r'(\s+)', 'hello there \n bob')
['hello', ' ', 'there', ' \n ', 'bob']
>>> ''.join(_)  # _ just means previous value  ^^^
'hello there \n bob'

Regarding lines: you could also split by lines first. Then process one line at a time. So you might for example have a function for doing a word. And a function for doing a line. And a function for doing whole texts. They’d use each other to accomplish their … whatever they do.

def process_text(text):
    # process_line (defined elsewhere) handles a single line
    res = []
    for line in text.split('\n'):
        res.append(process_line(line))
    return '\n'.join(res)

So I’d have to make another loop to split on line breaks and append them again, like I did with exclamation points? In that case, won’t the same thing that happened with exclamation points happen, except with the roles of exclamation points and line breaks reversed?

What, nested results? Not if you concatenate them each time you finish with something. If you look at my process_text function, it splits by line and then joins it back together; the output has the same shape as the input.

No I meant with the weird printing an extra exclamation point where there was none. I am still quite confused about why python printed that.

Oh. Well. Your split and join don’t match.
There’s one fewer “hole” than there are words: "hello there" - two words, one delimiter
You’re appending one “!” for each word - so one too many.
str.join inserts the separator between each pair, not after every piece, so that’s probably what you’d want to use to get around that.
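
The off-by-one can be sketched like this, using the “help!” word from earlier (which carries the line break along with it):

```python
word = "help!\n\nHelena"   # one "word" after splitting the message on spaces
pieces = word.split("!")   # ['help', '\n\nHelena']: two pieces, one delimiter

# appending "!" after EVERY piece produces one "!" too many:
wrong = []
for p in pieces:
    wrong.append(p)
    wrong.append("!")
print(wrong)               # ['help', '!', '\n\nHelena', '!']

# inserting "!" only BETWEEN pieces keeps the count right:
flat = []
for i, p in enumerate(pieces):
    if i:
        flat.append("!")
    flat.append(p)
print(flat)                # ['help', '!', '\n\nHelena']
```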

Ah, gotcha, thanks! This whole time I was thinking it was the line break’s fault, somehow interfering.

I think a reasonable way to handle punctuation would be to have a function that accepts a possibly “dirty” word (has punctuation), isolates the word and the punctuation, sends the word to the single-word censor function, then puts it back together.

That would be called by the function that deals with a line
and that one in turn by the one that deals with whole text.

Each function splits the input into multiple simpler jobs, and each function is quite trivial by itself (thanks to dealing with only a small isolated part of the problem). You probably won’t ever have to use indices since you’re only ever dealing with one text, or one line, or one word.
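
A sketch of that idea; censor_word and its banned set are made-up placeholders here, not the exercise’s actual censor:

```python
import string

def censor_word(word):
    # hypothetical single-word censor: mask words in a banned set
    return "*" * len(word) if word.lower() in {"help"} else word

def censor_dirty_word(dirty):
    # isolate leading/trailing punctuation, censor the core, reassemble
    core = dirty.strip(string.punctuation)
    i = dirty.find(core) if core else len(dirty)
    lead, trail = dirty[:i], dirty[i + len(core):]
    return lead + censor_word(core) + trail

print(censor_dirty_word("help!"))    # ****!
print(censor_dirty_word("(send)"))   # (send)
```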

That’s an interesting suggestion.

I’ve been fiddling around with trying to isolate punctuation, though, and came up with the following function to disassemble then reassemble the whole message:

def breakdown_test(message):
  split_lines = message.split("\n")
  
  split_words = []
  
  for line in split_lines:
    split_words.append(line.split(" "))
  
  all_split = [[] for line in split_words]
  
  # enumerate instead of list.index: index() returns the first matching
  # line, which is wrong whenever two lines have identical content
  for index, line in enumerate(split_words):
    for word in line:
      if "!" in word:
        all_split[index].append(word.strip("!"))
        all_split[index].append("!")
      elif "," in word:
        all_split[index].append(word.strip(","))
        all_split[index].append(",")
      elif "." in word:
        all_split[index].append(word.strip("."))
        all_split[index].append(".")
      else:
        all_split[index].append(word)
  
  punctuation_recon = []
  
  for line in all_split:
    punctuation_recon.append(" ".join(line))
  
  return "\n".join(punctuation_recon)

The problem is that when reassembling, Python places a space before punctuation when joining, since the punctuation marks were their own separate strings. Would it even be possible to set up an if statement to correct this? I’m struggling to think of a way to do so.

Will I have to go through the whole message again with an if statement to check for spaces before punctuation and append to a new list as it goes? I feel like there must be a more efficient way.

Edit: Nvm, figured it out, I just had to use the .replace method. :smiley:
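
Presumably the fix looks something like this; the exact calls are a guess at what the poster did:

```python
recon = "Send help !\nHelena , hi ."
# collapse the space that join() put before each punctuation mark
recon = recon.replace(" !", "!").replace(" ,", ",").replace(" .", ".")
print(recon == "Send help!\nHelena, hi.")   # True
```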

What if there’s supposed to be a space there?

If you split and keep the delimiters then you’ll have words and non-words (what that is depends on how you wanna define it and is probably hugely complicated for a natural language)

>>> text = '''\
... there's a guy standing @ the
... telephone booth
... '''
>>> uhh_stuff = re.split(r'(\w+)', text)
>>> uhh_stuff
['', 'there', "'", 's', ' ', 'a', ' ', 'guy', ' ', 'standing', ' @ ', 'the', '\n', 'telephone', ' ', 'booth', '\n']
>>> censor = lambda w: len(w) * '*'
>>> censored = [censor(w) if w.isalpha() else w for w in uhh_stuff]
>>> censored
['', '*****', "'", '*', ' ', '*', ' ', '***', ' ', '********', ' @ ', '***', '\n', '*********', ' ', '*****', '\n']
>>> ''.join(censored)
"*****'* * *** ******** @ ***\n********* *****\n"

Or if you’ve got a function like I mentioned before that takes some word possibly containing punctuation, then it’ll have a much easier time putting it back together correctly, simply because that’s the only thing that function does: there’s nothing else around it to make things complicated, no deeply nested loops, no other logic.

The only reason why it would be okay to join on space is if you split on space. If you split on multiple things then you’ve lost that information.

You may as well treat space like punctuation. There would be no lines, no spaces, just punctuation. It actually simplifies things quite a bit. If you look at my pattern for splitting, it wasn’t on spaces or punctuation; it was on word characters (\w, i.e. [a-zA-Z0-9_] plus Unicode word characters). It should probably be refined to not include digits, but close enough.
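
For reference, what \w matches in a Python regex:

```python
import re

# \w matches letters, digits, the underscore, and Unicode word characters;
# '-' and '!' are not word characters
print(re.findall(r'\w', "a1_-é!"))   # ['a', '1', '_', 'é']
```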

Unless we treat digits as shorthand for their spoken/written word. As text goes, if there are no restrictions on the first character, why couldn’t it be a digit? After all it’s just a printable character at that point. I wouldn’t change a thing.

The idea of separating out all the alphanumeric data is not something I would at first have thought of. Expression is a beautiful thing.

On a side note, we know that one cannot split on an empty separator.

>>> 'something'.split('')
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    'something'.split('')
ValueError: empty separator

And yet we can count the empty strings in a string…

>>> 'something'.count('')
10
>>> 
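
str.count('') returns len(s) + 1 because an empty string is found at every position, including one past the end:

```python
s = "something"
print(s.count(""))   # 10
print(len(s) + 1)    # 10
```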

Your expression-based code was able to find that first empty string, but not the last, as .count() did above. Minor. I’m more interested in how .count does it. Is there a tweak to your pattern above that would be sure to include that last empty string?

>>> "*****'* * *** ******** @ ***\n********* *****\n".count('')
46
>>> len("*****'* * *** ******** @ ***\n********* *****\n")
45
>>>