Python lists of duplicated elements


#1

If I have a list, that has many things including strings and numbers.
Is there a way I can make a function that returns a list of lists, with each list being all of the elements that are the same as another element in the original list.

So say function([1,2,3,4,1,1,2,3])
would return [[1,1,1],[2,2],[3,3],[4]]


#2

You can use collections.Counter to get a count for each element (assuming that they are hashable)
Then loop through counts and values and create lists containing lists of values with the right amount in each

In CPython 3.6 (CPython is the reference implementation, that's what you get from python.org) the order gets preserved (added keys preserve order), however that is an implementation detail, not something that Python promises. If that's something you need then use collections.OrderedDict instead.

Can also sort and use itertools.groupby (doesn't require values to be hashable, but does require them to be orderable), you can also group them yourself in a loop instead of using itertools.groupby. Sorting and grouping yourself would be the simplest approach to implement (no need to figure out how other data structures behave, just list and sort)

If they are not hashable and not orderable then you'd need to loop through the list of lists and test each one for which one next value is supposed to go into


#3

After spending more than an hour, I'm just done writing a function that takes a list as argument and returns a list of sublists of duplicates.

def sublist_duplicates(lst):
    sublists = []

    for item in set(lst):
        sublists.append([item] * lst.count(item))

    return sublists

lst_of_duplicates = [1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']

print sublist_duplicates(lst_of_duplicates)

That gives you: [[1, 1, 1], [2, 2], [3, 3], [4, 4], [10], ['s', 's']]

Hope that helps.


#4

That'll grind into a near-halt if there are many different elements, for example range(1_000_000) (1 million unique values, so it checks the whole list (length 1 million) for the count, a million times, a total of 1000 billion operations -- We're looking for something on the order of 1 million operations where each value adds constant time regardless of what other values there are (or if sorting, log2 n amount of time for n elements, but log2 n of 1000000 is about 20 which is fine, hashing would be linear, but it would still have a similar constant factor and end up about the same for 1 million values, though as that number increases, using hashing gets increasingly better over sorting)

Your function can handle this with a fairly easy change - doing all counts in a single pass over the list


#5

I know it'll get slower and slower if it's millions of unique values, but for starters it does the job fairly well :slight_smile:


#6

from collections import Counter

def sublist_duplicates(lst):
    counts = Counter(lst)
    sublists = [[k] * counts[k] for k in counts]
    return sublists

_lst = list(range(1000000))

import timeit

time = timeit.timeit("sublist_duplicates(_lst)", setup = "from __main__ import sublist_duplicates, _lst", number=1)

print("Number of elements in the list: {}".format(len(_lst)))
print("Execution time of the function: {} seconds".format(round(time, 2)))

Number of elements in the list: 1000000
Execution time of the function: 1.26 seconds