
Find And Update Duplicates In A List Of Lists

I am looking for a Pythonic way to solve the following problem. I have (what I think is) a working solution but it has complicated flow controls and just isn't 'pretty'. (Basically

Solution 1:

from collections import defaultdict

lists = [['apple', 'window', 'pear', 2, 1.55, 'banana'],
['apple', 'orange', 'kiwi', 3, 1.80, 'banana'],
['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana'],
['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'],
['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']]

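dic = defaultdict(int)  # running count per (first, third) key
fts = []                # keys that occur more than once
for lst in lists:
    first_third = lst[0], lst[2]
    dic[first_third] += 1
    if dic[first_third] == 2:
        fts.append(first_third)
    lst.append(dic[first_third])

# Rows whose key never repeated were numbered 1; reset them to 0
for lst in lists:
    if (lst[0], lst[2]) not in fts:
        lst[-1] -= 1

print(lists)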

Edit: Thanks utdemir. first_third = lst[0], lst[2] is correct, not first_third = lst[0] + lst[2]

Edit2: Changed variable names for clarity.

Edit3: Changed to reflect what the original poster really wanted, and his updated list. Not pretty any more, desired changes just tacked on.
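
A tidier variant of the same idea, offered as a minimal sketch rather than the answer's own code: count every key first with Counter, then number the rows in a second pass. The function name number_duplicates and its key_indexes parameter are hypothetical, and unlike the code above it returns new rows instead of appending to the originals.

```python
from collections import Counter, defaultdict


def number_duplicates(rows, key_indexes=(0, 2)):
    """Return copies of rows with 0 appended if the key is unique,
    otherwise a 1-based running count for that key."""
    def key_of(row):
        return tuple(row[i] for i in key_indexes)

    # First pass: how many times does each key occur in total?
    totals = Counter(key_of(row) for row in rows)

    # Second pass: number only the rows whose key repeats.
    seen = defaultdict(int)
    result = []
    for row in rows:
        key = key_of(row)
        if totals[key] == 1:
            result.append(row + [0])
        else:
            seen[key] += 1
            result.append(row + [seen[key]])
    return result


rows = [['apple', 'window', 'pear', 2, 1.55, 'banana'],
        ['apple', 'orange', 'kiwi', 3, 1.80, 'banana'],
        ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana'],
        ['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'],
        ['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']]

numbered = number_duplicates(rows)
print([row[-1] for row in numbered])
```

Counting up front avoids the "append, then subtract one later" correction step, at the cost of building the key for each row twice.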

Solution 2:

Your best bet is to sort the list first, using itemgetter() to select the fields to match as the sort key. Sorting places all rows with the same key next to each other, so they can easily be compared and tagged. For example, the sort for matching the first and third fields would be:

lst.sort(key=itemgetter(0, 2))

The comparison of each item with its predecessor is straightforward.
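
As a rough sketch of that predecessor comparison (the name tag_adjacent_dups and its default arguments are made up for illustration, and it returns new rows rather than modifying the input):

```python
from operator import itemgetter


def tag_adjacent_dups(rows, key_indexes=(0, 2), tag="dup"):
    """Sort rows by the key fields, then tag every row whose key
    equals the key of its neighbour in the sorted order."""
    keygetter = itemgetter(*key_indexes)
    ordered = sorted(rows, key=keygetter)

    # flags[i] becomes True when ordered[i] shares its key with a neighbour.
    flags = [False] * len(ordered)
    for i in range(1, len(ordered)):
        if keygetter(ordered[i]) == keygetter(ordered[i - 1]):
            flags[i - 1] = flags[i] = True

    return [row + [tag] if flagged else list(row)
            for row, flagged in zip(ordered, flags)]


sample = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 7], [1, 4, 3, 5, 8],
          [4, 3, 2, 7, 5], [1, 6, 3, 7, 4]]
tagged = tag_adjacent_dups(sample)
print(tagged)
```

Both the current row and its predecessor are flagged on a match, so the first member of a run of duplicates is tagged as well.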

Okay, here is the complete solution (uses itemgetter and groupby):

from operator import itemgetter
from itertools import groupby

def tagdups(input_seq, tag, key_indexes):
    keygetter = itemgetter(*key_indexes)
    sorted_list = sorted(input_seq, key=keygetter)
    for key, group in groupby(sorted_list, keygetter):
        group_list = list(group)
        if len(group_list) <= 1:
            continue
        for item in group_list:
            item.append(tag)
    return sorted_list

And here is a sample test run to show usage:

>>> samp = [[1,2,3,4,5], [1,3,5,7,7],[1,4,3,5,8],[4,3,2,7,5],[1,6,3,7,4]]
>>> tagdups(samp, 'dup', (0,2))
[[1, 2, 3, 4, 5, 'dup'], [1, 4, 3, 5, 8, 'dup'], [1, 6, 3, 7, 4, 'dup'], [1, 3, 5, 7, 7], [4, 3, 2, 7, 5]]

Solution 3:

Here is my solution (commented code):

import itertools

l = [
        ['apple', 'window', 'pear', 2, 1.55, 'banana'],
        ['apple', 'orange', 'kiwi', 3, 1.80, 'banana'],
        ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana'],
        ['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'],
        ['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']
    ]

#Here you can select the important fields 
key = lambda i: (i[0],i[2])

l.sort(key=key)
grp = itertools.groupby(l, key=key)
#Look at itertools documentation
#Build a list, not a generator: grouped is iterated twice below
grouped = [list(j) for i,j in grp]

for i in grouped:
    if len(i) == 1:
        i[0].append(0)
    else: #You want duplicates to start from 1
        for pos, item in enumerate(i, 1):
            item.append(pos)

#Just a little loop for flattening the list
result = [] 
for i in grouped:
    for j in i:
        result.append(j)

print(result)

Output:

[['apple', 'orange', 'kiwi', 3, 1.8, 'banana', 0],
 ['apple', 'window', 'pear', 2, 1.55, 'banana', 1],
 ['apple', 'orange', 'pear', 2, 0.8, 'coffee_cup', 2],
 ['apple', 'orange', 'pear', 2, 3.8, 'coffee_cup', 3],
 ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana', 0]]
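
The flattening loop can also be written with itertools.chain.from_iterable. The snippet below is a small sketch on made-up data, assuming grouped is a list of groups as in the code above:

```python
from itertools import chain

# Hypothetical stand-in for the grouped rows: a list of groups,
# each group being a list of rows.
grouped = [[['a', 1], ['a', 2]], [['b', 3]]]

# chain.from_iterable yields the rows of each group in turn,
# which flattens exactly one level of nesting.
result = list(chain.from_iterable(grouped))
print(result)
```

Only the outer level is flattened, so each row stays a list.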
