
Handling Extra Newlines (Carriage Returns) in CSV Files Parsed with Python?

I have a CSV file that has fields that contain newlines, e.g.:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

(In this case the third field in the second row is 'tree\n'.) I tried t

Solution 1:

Suppose you have this Excel spreadsheet:

(Screenshot: common 'gotchas' in an Excel file)

Note:

  1. the multi-line cell in C2;
  2. the embedded commas in C1 and D3;
  3. the blank cells, and the cell containing a space in D4.

Saving that as CSV in Excel, you will get this csv file:

A1,B1,"C1,+comma",D1
,B2,"line 1
line 2",D2
,,C3,"D3,+comma"
,,,D4 space

Presumably, you will want to read that into Python with the blank cells still having meaning and the embedded commas treated correctly.

So, this:

import csv

# newline='' lets the csv module handle newlines embedded in quoted fields
with open("test.csv", newline='') as csvIN:
    for row in csv.reader(csvIN, dialect='excel'):
        print("Length: ", len(row), row)

correctly produces the 4x4 list-of-lists matrix represented in Excel:

Length:  4 ['A1', 'B1', 'C1,+comma', 'D1']
Length:  4 ['', 'B2', 'line 1\nline 2', 'D2']
Length:  4 ['', '', 'C3', 'D3,+comma']
Length:  4 ['', '', '', 'D4 space']

The example CSV file you posted lacks quotes around the field with an 'extra newline', which makes the meaning of that newline ambiguous. Is it a new row or a multi-line field?
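To see the ambiguity concretely, here is a small sketch that feeds the unquoted sample to csv.reader from an in-memory string. The io.StringIO wrapper and the variable names are only for illustration; the sample text is assumed to match the file in the question.

```python
import csv
import io

# The sample from the question, with the unquoted newline after "tree".
raw = "A, B, C, D, E, F\n123, 456, tree\n, very, bla, indigo\n"

# Without quotes, the reader can only treat the stray newline as a row boundary.
rows = list(csv.reader(io.StringIO(raw, newline="")))
row_lengths = [len(row) for row in rows]

for row in rows:
    print(len(row), row)
```

The rows come back with 6, 3 and 4 fields rather than 6 and 6, so the broken row has to be repaired after parsing.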

Therefore, you can only interpret this csv file:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

as a one-dimensional list like so:

import csv

with open("test.csv", newline='') as csvIN:
    # flatten all rows, dropping fields that are empty or whitespace only
    outCSV = [field.strip() for row in csv.reader(csvIN, delimiter=',')
              for field in row if field.strip()]

Which produces this one-dimensional list:

['A', 'B', 'C', 'D', 'E', 'F', '123', '456', 'tree', 'very', 'bla', 'indigo']

This can then be interpreted and regrouped into any sub grouping as you wish.

The idiomatic regrouping method in Python uses zip like so (in Python 3, zip returns an iterator, hence the list call):

>>> list(zip(*[iter(outCSV)]*6))
[('A', 'B', 'C', 'D', 'E', 'F'), ('123', '456', 'tree', 'very', 'bla', 'indigo')]

Or, if you want a list of lists, this is also idiomatic:

>>> [outCSV[i:i+6] for i in range(0, len(outCSV),6)]
[['A', 'B', 'C', 'D', 'E', 'F'], ['123', '456', 'tree', 'very', 'bla', 'indigo']]

If you can change how your CSV file is created, it will be less ambiguous to interpret.
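As a rough sketch of what a less ambiguous file looks like: if the file is produced with Python's csv.writer, the default quoting wraps any field containing a newline in quotes, and csv.reader then restores it intact. The in-memory buffer and the sample rows below are assumptions made for the demonstration.

```python
import csv
import io

rows = [
    ["A", "B", "C", "D", "E", "F"],
    ["123", "456", "tree\n", "very", "bla", "indigo"],
]

buffer = io.StringIO(newline="")
# The default QUOTE_MINIMAL quotes only the fields that need it, such as "tree\n".
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

buffer.seek(0)
round_tripped = list(csv.reader(buffer))
print(round_tripped == rows)
```

Because the newline sits inside quotes, the reader knows it belongs to the field and both rows keep their six columns.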


Solution 2:

This will work if none of your cells are blank:

from itertools import chain

data = [['A', ' B', ' C', ' D', ' E', ' F'],
['123', ' 456', ' tree'],
['   ', ' very', ' bla', ' indigo']]

flat_list = chain.from_iterable(data)
flat_list = [cell for cell in flat_list if cell.strip() != ''] # remove blank cells

rows = [flat_list[i:i+6] for i in range(0, len(flat_list), 6)] # chunk into groups of 6
print(rows)

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree', ' very', ' bla', ' indigo']]

If you have blank cells in the input, this will work most of the time:

data = [['A', ' B', ' C', ' D', ' E', ' F'],
['123', ' 456', ' tree'],
['   ', ' very', ' bla', ' indigo']]

clean_rows = []
saved_row = []

for row in data:
    if len(saved_row):
        row_tail = saved_row.pop()
        row[0] = row_tail + row[0]  # reconstitute field broken by newline
        row = saved_row + row       # and reassemble the row (possibly only partially)
    if len(row) >= 6:
        clean_rows.append(row)
        saved_row = []
    else:
        saved_row = row


print(clean_rows)

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree   ', ' very', ' bla', ' indigo']]

However, even the second approach will fail with input such as:

A,B,C,D,E,F\nG
1,2,3,4,5,6

In this case the input is ambiguous, and no algorithm will be able to guess whether you meant:

A,B,C,D,E,F
G\n1,2,3,4,5,6 

(or the input given above)

If this could be the case for you, you'll have to go back to the person saving the data and ask them to save it in a cleaner format (by the way, OpenOffice quotes newlines in CSV files far better than Excel).
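A minimal sketch of why no algorithm can recover the intent: two different tables produce exactly the same text once they are written without quotes. The helper naive_csv is hypothetical and only mimics an exporter that never quotes.

```python
def naive_csv(table):
    """Join fields with commas and rows with newlines, with no quoting at all."""
    return "\n".join(",".join(row) for row in table)

# Reading 1: the newline belongs to the last header field ("F\nG").
table_a = [["A", "B", "C", "D", "E", "F\nG"],
           ["1", "2", "3", "4", "5", "6"]]

# Reading 2: the newline belongs to the first data field ("G\n1").
table_b = [["A", "B", "C", "D", "E", "F"],
           ["G\n1", "2", "3", "4", "5", "6"]]

same_text = naive_csv(table_a) == naive_csv(table_b)
print(same_text)
```

Since both tables serialize to identical bytes, the information needed to tell them apart is simply not in the file.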


Solution 3:

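This should work. (Warning: brain-compiled code)

with open('test.csv') as infile:
    data = []
    for line in infile:
        temp_data = line.rstrip('\n').split(',')
        try:
            while len(temp_data) < 6:  # expected column count
                # the last field was cut by a newline: glue it to the
                # first field of the next line, then append the rest
                more = next(infile).rstrip('\n').split(',')
                temp_data[-1] += more[0]
                temp_data.extend(more[1:])
        except StopIteration:
            pass
        data.append(temp_data)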

Solution 4:

If number of fields in each row is the same and fields can't be empty:

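from itertools import zip_longest  # named izip_longest in Python 2

filename = 'test.csv'  # path to your file (assumed name)
nfields = 6
with open(filename) as f:
    # flatten every line into stripped, non-empty fields
    fields = (field.strip() for line in f for field in line.split(',') if field.strip())
    for row in zip_longest(*[iter(fields)]*nfields):  # grouper recipe*
        print(row)

* the grouper recipe from the itertools documentation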

Output:

('A', 'B', 'C', 'D', 'E', 'F')
('123', '456', 'tree', 'very', 'bla', 'indigo')
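One side effect of the grouper recipe is worth knowing: when the number of fields is not a multiple of the column count, the last tuple is padded with None. The sketch below uses that as a cheap sanity check; the helper name grouped_rows and the deliberately short field list are made up for illustration.

```python
from itertools import zip_longest

def grouped_rows(fields, nfields):
    """Regroup a flat list of fields into tuples of nfields, padding with None."""
    return list(zip_longest(*[iter(fields)] * nfields))

# Eleven fields instead of twelve: one value went missing somewhere.
fields = ["A", "B", "C", "D", "E", "F", "123", "456", "tree", "very", "bla"]
rows = grouped_rows(fields, 6)

# Any row containing None was padded, so the input did not divide evenly.
incomplete = [row for row in rows if None in row]
print(incomplete)
```

If incomplete is non-empty, the file probably contains an empty or missing field and the regrouped rows should not be trusted.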

Solution 5:

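If you know the number of columns, the simplest way is to stop treating line ends as row boundaries: turn them into delimiters, split everything into one flat list, and regroup. This assumes that no field is legitimately empty.

Something like this

filename = 'myfile.csv'
with open(filename) as fp:
    text = fp.read()

# line breaks become delimiters; the empty fields this creates are dropped
fields = [f.strip() for f in text.replace('\n', ',').split(',') if f.strip()]
for n in range(0, len(fields), 6):
    print(fields[n:n+6])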

You can convert it easily into a generator if you prefer:

def read_ugly_file(filename, delimiter=',', columns=6):
    with open(filename) as fp:
        text = fp.read()

    # line breaks become delimiters; the empty fields this creates are dropped
    fields = [f.strip() for f in text.replace('\n', delimiter).split(delimiter) if f.strip()]
    for n in range(0, len(fields), columns):
        yield fields[n:n+columns]

for row in read_ugly_file('myfile.csv'):
    print(row)
