
Merging Pre-sorted Files Without Reading Everything Into Memory

I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping time rang

Solution 1:

Why roll your own if there is heapq.merge() in the standard library? In Python 2 (and in Python 3 before 3.5) it doesn't provide a key argument, so you have to do the decorate-merge-undecorate dance yourself:

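try:
    from itertools import imap  # Python 2
except ImportError:
    imap = map  # Python 3: the built-in map() is already lazy
from operator import itemgetter
import heapq

def extract_timestamp(line):
    """Extract timestamp and convert to a form that gives the
    expected result in a comparison
    """
    return line.split()[1]  # for example

with open("log1.txt") as f1, open("log2.txt") as f2:
    sources = [f1, f2]
    with open("merged.txt", "w") as dest:
        # Decorate: pair each line with its timestamp so tuples sort by time
        decorated = [
            ((extract_timestamp(line), line) for line in f)
            for f in sources]
        # Merge the already-sorted streams lazily
        merged = heapq.merge(*decorated)
        # Undecorate: keep only the original line from each tuple
        undecorated = imap(itemgetter(-1), merged)
        dest.writelines(undecorated)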

Every step in the above is "lazy". Because I avoid file.readlines(), the lines in the files are read only as needed. The decoration step is lazy as well, since it uses generator expressions rather than list comprehensions. heapq.merge() is lazy, too: it holds only one item per input iterator at a time to do the necessary comparisons. Finally, I'm using itertools.imap(), the lazy variant of the map() built-in, to undecorate.

(In Python 3, map() has become lazy, so you can use that instead.)
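Since Python 3.5, heapq.merge() accepts a key argument, so on a recent interpreter the decoration step can be dropped. The following is a minimal sketch under that assumption. The helper name merge_logs and its parameters are made up for illustration, and extract_timestamp assumes the timestamp is the second whitespace-separated field and compares correctly as a string.

```python
import heapq
from contextlib import ExitStack


def extract_timestamp(line):
    # Assumption: the timestamp is the second whitespace-separated field
    # and sorts correctly as a plain string.
    return line.split()[1]


def merge_logs(source_paths, dest_path):
    """Lazily merge pre-sorted log files into dest_path (hypothetical helper)."""
    with ExitStack() as stack:
        # ExitStack closes every opened file, however many sources there are.
        sources = [stack.enter_context(open(path)) for path in source_paths]
        dest = stack.enter_context(open(dest_path, "w"))
        # heapq.merge() pulls one line at a time from each file object.
        dest.writelines(heapq.merge(*sources, key=extract_timestamp))
```

Calling merge_logs(["log1.txt", "log2.txt"], "merged.txt") would mirror the example above. Each input file should end with a newline, otherwise its last line runs into the next one in the output.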

Solution 2:

You want to implement the merge step of a file-based merge sort. Read a line from each file, output the older line, then read another line from the file that line came from. Once one of the files is exhausted, output all the remaining lines from the other file.
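The procedure described above can be sketched for two files as follows. This is only an illustration: merge_two and its parameters are hypothetical names, and extract_timestamp again assumes the timestamp is the second whitespace-separated field of every line (so blank lines are not handled).

```python
def extract_timestamp(line):
    # Assumption: the timestamp is the second whitespace-separated field
    # and sorts correctly as a plain string.
    return line.split()[1]


def merge_two(path_a, path_b, dest_path):
    """Merge two pre-sorted log files, holding one line per file in memory."""
    with open(path_a) as fa, open(path_b) as fb, open(dest_path, "w") as dest:
        line_a = fa.readline()
        line_b = fb.readline()
        # readline() returns "" at end of file, which ends the loop.
        while line_a and line_b:
            if extract_timestamp(line_a) <= extract_timestamp(line_b):
                dest.write(line_a)
                line_a = fa.readline()
            else:
                dest.write(line_b)
                line_b = fb.readline()
        # One file is exhausted: write the pending line from the other file,
        # then copy whatever remains of it.
        dest.write(line_a or line_b)
        dest.writelines(fa if line_a else fb)
```

Using <= for the comparison keeps lines from the first file ahead of lines from the second when timestamps are equal. For more than two files, heapq.merge() does the same job without hand-written bookkeeping.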
