How does `split()` work?

Question:

As per this lesson, we are trying to break apart a word to determine the number of times a sequence of characters is present. One method to solve this is using the .split() method, which I’ll explain below.

Solution:

If you check the documentation here, we can learn a bit about the .split() method. The first thing to note is that it is a string method, so it must be called on a string (either a literal or a variable that holds one).

Next, it takes two parameters, sep=None and maxsplit=-1. Because both have default values, the method can be called without any arguments. I won't cover the second parameter here; the first is the separator, which defines the delimiter the method looks for. Delimiters are sequences of one or more characters used to separate independent regions of text or other data. Outside of this lesson, a common use is splitting a sentence into its individual words.
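To make the two parameters concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration and do not come from the lesson.

```python
# Made-up sample strings, for illustration only.
sentence = "one two three"
csv_row = "a,b,c"

# No arguments: both defaults apply, and the string is split on whitespace.
words = sentence.split()
print(words)        # ['one', 'two', 'three']

# sep given: the string is split at every occurrence of that delimiter.
parts = csv_row.split(",")
print(parts)        # ['a', 'b', 'c']

# maxsplit given: at most that many splits are made, counting from the left.
head_tail = csv_row.split(",", 1)
print(head_tail)    # ['a', 'b,c']
```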

In our lesson, however, we are making use of another property of it: it returns a list of the pieces found between occurrences of the separator. See below:

'eternity'.split('er')
['et', 'nity']
'alfalfa'.split('a')
['', 'lf', 'lf', '']

It's important to note, based on our second example, that split always returns a piece for each side of a separator, even if that piece is the empty string ''. From this we can see that, in each situation, split returns one more segment than the number of times our separator appears in the string. I will leave it to you to determine how that information can solve the lesson.

I hope this explanation helps to understand what’s going on with this method. If you have further questions or wish to discuss this please do so below.


Something of note to round out this topic…

When a separator is given explicitly, str.split() and str.join() are inverse operations. That means the string we pass as the argument to split can be used as the string object on which we invoke join, with the list as the argument.

>>> string = 'ab cde fg hij'
>>> separator = ' '
>>> array = string.split(separator)
>>> re_string = separator.join(array)
>>> re_string
'ab cde fg hij'
>>> separator.join(string.split(separator))
'ab cde fg hij'
>>> 

Looking at the example in the OP…

>>> string = 'alfalfa'
>>> separator = 'a'
>>> string.split(separator)
['', 'lf', 'lf', '']
>>> separator.join(string.split(separator))
'alfalfa'
>>> 

Multiple character separator string…

>>> string = 'misssori-mississippi'
>>> separator = 'iss'
>>> array = string.split(separator)
>>> array
['m', 'sori-m', '', 'ippi']
>>> re_string = separator.join(array)
>>> re_string
'misssori-mississippi'
>>> separator.join(string.split(separator))
'misssori-mississippi'
>>> 
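One caveat worth a quick sketch: the round trip is exact only when split is given an explicit separator. Called with no arguments, split collapses runs of whitespace, so join cannot restore them. The sample string below is made up for illustration.

```python
text = "ab  cde fg"   # made-up string; note the double space

# Explicit separator: join restores the original exactly.
restored = " ".join(text.split(" "))
print(restored == text)     # True

# No separator: the double space is treated as one gap, so information is lost.
collapsed = " ".join(text.split())
print(collapsed)            # ab cde fg
print(collapsed == text)    # False
```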

I'm curious about the return(len(splits)-1) here. What is its purpose?

def count_multi_char_x(word, x):
    splits = word.split(x)
    return len(splits) - 1

Why not:
def count_multi_char_x(word, x):
    split_count = 0
    splits = word.split(x)
    if splits == x:
        split_count += 1
    return split_count

Why can't we just provide a variable to match to the argument x?

I guess it's because three lines of code are more readable than six.
Also, the offset never changes. Just as the number of pieces of paper is always one more than the number of cuts, the number of occurrences is always one less than the number of pieces, so there's no need for a separate counter variable.
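The cuts-versus-pieces idea can be checked with a small sketch. The helper name count_by_split is hypothetical, and the comparison against the built-in str.count is my own addition, not part of the lesson. It assumes x is a non-empty string, since split raises ValueError on an empty separator.

```python
def count_by_split(word, x):
    """Hypothetical helper: occurrences of x = number of pieces minus one."""
    return len(word.split(x)) - 1

# str.count also counts non-overlapping occurrences, so the two should agree.
for word, x in [("mississippi", "iss"), ("alfalfa", "a"), ("apple", "z")]:
    print(word, x, count_by_split(word, x), word.count(x))
```

For the three pairs above, both columns come out as 2, 3 and 0.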

Consider,

>>> len('mississippi'.split('iss'))
3
>>> 'mississippi'.split('iss')
['m', '', 'ippi']
>>> len('mississippi'.split('i'))
5

The only thing we can conclude is that it fudges the data to match the given expectation. Forcing data this way is counterproductive. As we can see, there is no logic in splitting the word on the characters we are attempting to identify and then returning a count.

What that code would be more useful for is removing characters from the string…

>>> def remove_x(word, x):
	return ''.join(word.split(x))

>>> remove_x('mississippi', 'i')
'msssspp'
>>> 

Bottom line, neither example is correct. Strings are iterable and don’t need to be split to solve this problem. We will need to iterate one of the input strings; the only question is which one?

Hello,

# first example
word = "mississippi"
word = word.split("iss")

print(word)  # prints ['m', '', 'ippi'], len(word) == 3

occurrences = len(word) - 1  # which is 2

# second example
word = "mapapapdrink"
word = word.split("ap")

print(word)  # prints ['m', '', '', 'drink'], len(word) == 4

occurrences = len(word) - 1  # which is 3

This page: Splitting Strings II explains that an empty string ("") can be a by-product of splitting: when I split a string on a character that it also ends with, I'll end up with an empty string at the end of the list.

But for the word "mapapapdrink", the program returns two occurrences of "" in the split list instead of one.

Have I understood the chapter wrong? Can anyone help me understand this?


Ending in the splitting substring isn’t a special case. Wherever it splits, you’ll get a string for each side whether there are characters there or not.
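A small sketch of that point, using the word from the question plus two made-up short strings. Every separator contributes a left side and a right side, and a side with no characters shows up as ''.

```python
# Three separators in a row: the gaps between neighbours are empty.
middle = "mapapapdrink".split("ap")
print(middle)   # ['m', '', '', 'drink']

# Separator at the very start: the left side is empty.
start = "apm".split("ap")
print(start)    # ['', 'm']

# Separator at the very end: the right side is empty.
end = "map".split("ap")
print(end)      # ['m', '']
```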


I see, thank you so much for explaining!

def count_multi_char_x(word, x):
    splits = word.split(x)
    return splits

print(count_multi_char_x("mississipi", "iss"))

['m', '', 'ipi']
Can someone explain why there's a '' in there? I don't understand why it would do that.

Also, why does:
"mississipi".split("i")
give us ['m', 'ss', 'ss', 'p', '']
with '' at the end? I guess my logic is not clear on this.

Called with no arguments, str.split() splits on runs of whitespace and keeps only the words in a list. When we supply a separator string (one or more characters), the method keeps only what is on either side of each separator. Above, we see that no i's are preserved in the list, only what was on either side of them.

Occasionally, as in your first example, two separator strings sit next to each other (ississ). Nothing lies between them, so the split leaves an empty string, '', in the list. When the separator is at the end of the string, the split likewise leaves an empty string, this time to its right. This is not a flaw, but the nature of the method.
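The difference between the default behaviour and an explicit separator is easy to see side by side. This is only a sketch with a made-up input string.

```python
line = "  two  spaces  "   # made-up input with leading, doubled and trailing spaces

# Default: runs of whitespace count as one gap and no empty strings are kept.
by_default = line.split()
print(by_default)    # ['two', 'spaces']

# Explicit ' ': every single space is a split point, so empty strings appear.
by_space = line.split(" ")
print(by_space)      # ['', '', 'two', '', 'spaces', '', '']
```

The second list has seven items because the input contains six spaces.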

Hi :)
I think I'm missing something. To my understanding, when you split a string on a character that it also ends with, you'll end up with an empty string at the end of the list.
The examples show different cases:

  • "mississippi", "iss": we get ['m', '', 'ippi']
  • "apple", "pp": we get ['a', 'le']

I don't understand the pattern here. If this is the case, why don't I get ['a', '', 'le'] in the 'apple' example?

Recall that there is an empty string before and after every character in a string.

>>> 'aaaaaaa'.split('a')
['', '', '', '', '', '', '', '']
>>> 

Note that when we split the string, which is seven characters long, on a, the result is eight empty strings.

There must be some reason that we get only one empty string when splitting on p in apple

>>> 'apple'.split('p')
['a', '', 'le']
>>>

and no empty strings when splitting on pp

>>> 'apple'.split('pp')
['a', 'le']
>>> 

How the algorithm sorts this out is a little beyond me, though I suspect that it has something to do with the inverse method, join. Any split string should be restorable using str.join() with the same separator string used in the str.split().

>>> 'a'.join(['', '', '', '', '', '', '', ''])
'aaaaaaa'
>>> 'p'.join(['a', '', 'le'])
'apple'
>>> 'pp'.join(['a', 'le'])
'apple'
>>> 

As to that algorithm, it will take some research to find and explore the workings behind split and join, but that would be chasing down a rabbit hole at this point in time. Keep this question in mind for extra study once you have more of the practical and rudimentary language skills under your belt.
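For the curious, here is a rough sketch of one way the behaviour could be reproduced by hand. It is not the interpreter's actual implementation, and manual_split is a hypothetical name. It assumes a non-empty separator, scans left to right with str.find, and never lets two matches overlap.

```python
def manual_split(text, sep):
    """Illustrative re-creation of text.split(sep) for a non-empty sep."""
    if not sep:
        raise ValueError("empty separator")
    pieces = []
    start = 0
    while True:
        hit = text.find(sep, start)       # next match at or after start
        if hit == -1:
            pieces.append(text[start:])   # whatever is left is the last piece
            return pieces
        pieces.append(text[start:hit])    # may be '' if the match is right at start
        start = hit + len(sep)            # resume after the match, so no overlap

print(manual_split("apple", "p"))    # ['a', '', 'le']
print(manual_split("apple", "pp"))   # ['a', 'le']
```

In this sketch, splitting on p finds two separate matches with nothing between them, while splitting on pp finds a single match, which would explain why only the first result contains an empty string.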

Note the behavior of join when the separator is an empty string…

>>> ''.join(['', '', '', '', '', '', '', ''])
''
>>> ''.join(['a', '', 'le'])
'ale'
>>> ''.join(['a', 'le'])
'ale'
>>> 

Will the following code always work this way?

splits = word.split(x)

Essentially working as a counter function for the string method?

This does not seem to work for lists.

I.E.

list1 = [1, 2, 3]

list2 = [4, 5]

for i in list2:
    count = list1.append(i)

Let’s set aside the count concept and focus on iterable. A split string will produce a list object. The only count, per se, is the length.

len(splits)

The line count = list1.append(i) assigns None to the variable on each iteration. The append method changes the list in place and returns nothing.

Make it an action of its own accord, instead.

list1.append(i)

There is an easier way than append, though. Lists may be concatenated.

list1 += list2
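A short sketch of both points, reusing the list names from the question. The value 99 and the name result are placeholders of mine.

```python
list1 = [1, 2, 3]
list2 = [4, 5]

# append changes list1 in place and returns None.
result = list1.append(99)
print(result)    # None
print(list1)     # [1, 2, 3, 99]

# += extends list1 in place with every item of list2.
list1 += list2
print(list1)     # [1, 2, 3, 99, 4, 5]
```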

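I have used the solution below:

def count_multi_char_x(word, x):
    # Length of the word once every occurrence of x has been removed
    p = len(''.join(word.split(x)))
    l = len(word)
    y = len(x)
    # Characters removed divided by len(x); // keeps the count an integer
    return (l - p) // y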

This way makes more sense and is easier to understand, I think?
def count_multi_char_x(word, x):
    count = 0
    for i in range(len(word)):
        if word[i : i + len(x)] == x:
            count += 1
    return count
# For every index i, we compare the slice of length len(x) starting at i with x, which counts how many times the multi-character string appears.
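One thing to be aware of when comparing the approaches in this thread: the split-based count and the sliding-window count can disagree when occurrences overlap. The sketch below uses hypothetical helper names and a made-up test string. Which answer is wanted depends on how the exercise defines an occurrence.

```python
def count_by_split(word, x):
    """Non-overlapping count: pieces produced by split, minus one."""
    return len(word.split(x)) - 1

def count_by_window(word, x):
    """Overlapping count: test the slice starting at every index."""
    width = len(x)
    return sum(1 for i in range(len(word) - width + 1) if word[i:i + width] == x)

print(count_by_split("mississippi", "iss"))    # 2
print(count_by_window("mississippi", "iss"))   # 2

# 'aa' fits into 'aaa' at index 0 and at index 1, and those matches overlap.
print(count_by_split("aaa", "aa"))    # 1
print(count_by_window("aaa", "aa"))   # 2
```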