Remove_Duplicates - duplicates not removing the way I think they should?


#1




So first off - I am solving the problem - just not how i am expecting to. All the print statements in my code are for debugging. This is my output.

0
[4, 5, 5, 4]
1
4
4
[5, 5, 4]
0
[5, 5, 4]
1
5
5
[5, 4]
0
[5, 4]
0
[5, 4]
[5, 4]
None

As you will see, the first if statement gets skipped in the first pass, which is expected, and it moves on to the elif statement, where count gets incremented. However, on the next pass of the inner for loop, print i should be printing 5 - instead it prints 4, which leads to the removal of the first 4 rather than the last one. Granted, it still solves the problem of removing duplicates, but why is my inner for loop not moving on to the next item? I have a list, not a dictionary, so it is ordered, therefore it should be seeing the next item in the list, which is 5, correct?


I expect to see -

0
[4, 5, 5, 4]
1
4
5
[4, 5, 5, 4]
1
4
5
[4, 5, 5, 4]
1
4
4
[4, 5, 5]

etc....


def remove_duplicates(numbers):
    new_list = []
    
    for integer in numbers:
        new_list.append(integer)

    for integer in numbers:
        count = 0
        for i in new_list:
            if (i == integer and count == 1):
                new_list.remove(i)
                print count
                print integer
                print i
                print new_list
            elif (i == integer and count == 0):
                print count
                count += 1
                print new_list 
        count = 0    
    return new_list

print remove_duplicates([4,5,5,4])


#2

Will interfere with


#3

Hmmm yes, I was expecting it to - but why is it interfering even before it sees the remove function? My issue is that in the list [4,5,5,4], i should first equal 4, and then it should equal 5, since nothing has been removed yet. It makes no sense to me that i = 4 on the first pass and then i = 4 on the second pass as well....


#4

I'm trying to make heads and tails out of your code


#5

Thanks, I know it does seem to be overly complex, but this is just how my brain wanted to solve the problem - by comparing two identical lists to begin with, but modifying one over time!


#6

If you have [1, 1, 1, 2]

You'll miss the third 1 the first time (because of removing the second 1, so it skips the third), but because there are multiple 1's in the original you'll repeat the de-duping process and eventually get them all


#7

For [4, 5, 5, 4] your i takes on the values:

4
5
5
4

(this 4 was removed when previous iteration encountered the second 4)
5
5
(skip 4 because of removal of the 5 above, but it is here in the list)

5
4
(whole list iterated through)

5
4
(whole list iterated through)

#8

So I guess what I'm not understanding is this - why is the first 4 removed when the previous iteration encounters the second 4? I should be comparing integer = 4 to the first i =4, then 5, then 5, then the second 4, so surely the second 4 should get removed, not the first?


#9

You tell [4, 5, 5, 4] to remove 4.
Which one should it remove?

https://docs.python.org/2/tutorial/datastructures.html


#10

Argh! I hate this! I keep thinking it'll try and remove the 4 at a specific index (like it would if I was iterating with range), but this form of the for loop just looks for the first instance....

Thanks, this has helped a lot with my understanding :slight_smile:


#11

Yeah you need to know what each thing does. You should pretty much just not use anything that you don't know exactly how it behaves!
That's difficult at first cause it's a bit like saying you gotta learn everything at once, and everything depends on knowing everything else.

Removing from anywhere but the end in a list means all the following values have to move over one step. (you wouldn't do this manually)

So with lists you would rather add than remove. It would be more efficient to start with an empty list and only add if it's not already there

An additional improvement is to use a hash table which is a datastructure with the ability to determine if it has a key in constant time (more keys doesn't make it take longer per key)
In python that data structure is called set (just keys) and dict (keys point to values)


#12

Yeah, I'm beginning to see that more. I did find a more elegant solution that was only 3 lines long and was appending rather than removing, I just wanted to understand what remove was doing exactly.

Isn't the hash table overkill for this kind of thing though? Or would you say it's better for something like a database?


#13

A hash table is exactly what you want for this, because it allows for removing duplicates in linear time

For small amounts of data it would be slower, yes (because for small amounts of data there aren't many operations anyway and hashtables's operations are slower. The gain with hashtables is that you can do a smaller amount of operations)
But no, it's not overkill, because it isn't any more difficult to write and the extra time is insignificant (unless a large amount of time is spent executing that code, which is seldom the case)

You could for example convert to set and back to list and you'd have no duplicates. You would lose the ordering between the elements, but if that matters then an OrderedDict can be used (there's no OrderedSet, but they're both hashtables so dict is fine, just set dummy values) .. OrderedDict would be doing the same thing as you would do with a list and a set. The list for keeping the order, the set for checking whether it has already been added


#14

Good to know, I haven't really worked with hash tables before, but I will definitely look into this - I'm learning Python to write scripts for databases anyway, so it'll be useful!


#15

Oh, but you have. Global variables, and also object attributes are implemented as dictionaries.
Functions and methods on the other hand use a stack(? or maybe it's just stored elsewhere in a known location) instead (for example, x might be the third value from the top - this is faster because you can replace each variable with its exact location in memory)

myvar = 5
^ key   ^ value

globals()['hello'] = 3  # don't do this without good reason though..
print hello  # prints 3

#16

Ah I've been wondering about global variables. Haven't come across them in the lessons yet - or if I have I sped through it too quickly :smiley:
I understand dictionaries, I just didn't realise hash tables were identical. I think I will do location based replacement in the future - it is more intuitive to me, I guess these new-fangled python methods are for easy access to data when you don't care about it's exact location!


#17

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.