Handling Extra Newlines (Carriage Returns) in CSV Files Parsed with Python?
Solution 1:
Suppose you have this Excel spreadsheet:
Note:
- the multi-line cell in C2;
- the embedded commas in C1 and D3;
- the blank cells, and the cell with a space in D4.
Saving that as CSV in Excel, you will get this csv file:
A1,B1,"C1,+comma",D1
,B2,"line 1
line 2",D2
,,C3,"D3,+comma"
,,,D4 space
Presumably, you will want to read that into Python with the blank cells still having meaning and the embedded commas treated correctly.
So, this:
import csv

# newline="" leaves line endings untouched so csv can handle newlines inside quoted fields
with open("test.csv", newline="") as csvIN:
    for row in csv.reader(csvIN, dialect="excel"):
        print("Length:", len(row), row)
correctly produces the 4x4 list-of-lists matrix represented in Excel:
Length: 4 ['A1', 'B1', 'C1,+comma', 'D1']
Length: 4 ['', 'B2', 'line 1\nline 2', 'D2']
Length: 4 ['', '', 'C3', 'D3,+comma']
Length: 4 ['', '', '', 'D4 space']
The example CSV file you posted lacks quotes around the field with an 'extra newline', which makes the meaning of that newline ambiguous. Is it a new row or a multi-line field?
Therefore, you can only interpret this csv file:
A, B, C, D, E, F
123, 456, tree
, very, bla, indigo
as a one-dimensional list, like so:
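import csv

with open("test.csv", newline="") as csvIN:
    # flatten every row into one list, dropping empty or whitespace-only fields
    outCSV = [field.strip() for row in csv.reader(csvIN, delimiter=',')
              for field in row if field.strip()]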
which produces this one-dimensional list:
['A', 'B', 'C', 'D', 'E', 'F', '123', '456', 'tree', 'very', 'bla', 'indigo']
This can then be regrouped into rows of whatever length you need.
The idiomatic regrouping method in Python uses zip, like so:
>>> list(zip(*[iter(outCSV)]*6))
[('A', 'B', 'C', 'D', 'E', 'F'), ('123', '456', 'tree', 'very', 'bla', 'indigo')]
Or, if you want a list of lists, this is also idiomatic:
>>> [outCSV[i:i+6] for i in range(0, len(outCSV),6)]
[['A', 'B', 'C', 'D', 'E', 'F'], ['123', '456', 'tree', 'very', 'bla', 'indigo']]
If you can change how your CSV file is created, it will be less ambiguous to interpret.
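If the file is produced by Python in the first place, one way to make it less ambiguous is to let `csv.writer` do the quoting. The following is a minimal sketch, not a definitive recipe: the sample rows are hypothetical, and the in-memory `io.StringIO` buffers stand in for a real file (which you would open with `newline=""`).

```python
import csv
import io

# Hypothetical sample data: one field contains a newline, another a comma.
rows = [
    ["A", "B", "C"],
    ["123", "tree\nhouse", "x,y"],
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # default QUOTE_MINIMAL quotes only where needed
text = buffer.getvalue()

# Reading the text back recovers the original rows, embedded newline included.
round_trip = list(csv.reader(io.StringIO(text, newline="")))
print(round_trip == rows)  # True
```

Because the field with the newline is written inside quotes, the reader can tell it apart from a row break.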
Solution 2:
This will work if none of the cells are blank:
from itertools import chain

data = [['A', ' B', ' C', ' D', ' E', ' F'],
        ['123', ' 456', ' tree'],
        [' ', ' very', ' bla', ' indigo']]
flat_list = chain.from_iterable(data)
flat_list = [cell for cell in flat_list if cell.strip() != '']  # remove blank cells
rows = [flat_list[i:i+6] for i in range(0, len(flat_list), 6)]  # chunk into groups of 6
print(rows)
Output:
[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree', ' very', ' bla', ' indigo']]
If you have blank cells in the input, this will work most of the time:
data = [['A', ' B', ' C', ' D', ' E', ' F'],
        ['123', ' 456', ' tree'],
        [' ', ' very', ' bla', ' indigo']]
clean_rows = []
saved_row = []
for row in data:
    if len(saved_row):
        row_tail = saved_row.pop()
        row[0] = row_tail + row[0]  # reconstitute field broken by newline
        row = saved_row + row  # and reassemble the row (possibly only partially)
    if len(row) >= 6:
        clean_rows.append(row)
        saved_row = []
    else:
        saved_row = row
print(clean_rows)
Output:
[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree ', ' very', ' bla', ' indigo']]
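To show why this variant copes with blank cells, here is the same loop wrapped in a function and run on a small input. This is only a sketch: the name `merge_broken_rows`, its `ncols` parameter and the sample data are made up for illustration, and, like the loop above, it glues the two halves of a broken field together without restoring the newline.

```python
def merge_broken_rows(rows, ncols):
    """Rejoin rows that a stray newline split apart (sketch, not a general parser)."""
    clean_rows = []
    pending = []  # fields of a row that is still too short
    for row in rows:
        if not row:
            continue  # skip completely empty rows (e.g. blank lines)
        row = list(row)  # copy so the caller's data is not modified
        if pending:
            # The last pending field and the first new field are two halves
            # of one cell that the newline cut in two.
            row[0] = pending.pop() + row[0]
            row = pending + row
        if len(row) >= ncols:
            clean_rows.append(row)
            pending = []
        else:
            pending = row
    return clean_rows


# Hypothetical input: the second row has a genuinely blank cell,
# and the third row was split inside the field "yz".
data = [["A", "B", "C"], ["1", "", "3"], ["x", "y"], ["z", "w"]]
print(merge_broken_rows(data, 3))
# [['A', 'B', 'C'], ['1', '', '3'], ['x', 'yz', 'w']]
```

The blank cell in the second row survives because complete rows are never flattened or filtered; only rows that come up short are merged with what follows.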
However, even the second solution will fail with input such as:
A,B,C,D,E,F\nG
1,2,3,4,5,6
In this case the input is ambiguous and no algorithm will be able to guess if you meant:
A,B,C,D,E,F
G\n1,2,3,4,5,6
(or the input given above)
If this could be the case for you, you'll have to go back to the person saving the data and make them save it in a cleaner format (btw OpenOffice quotes newlines in CSV files far better than Excel).
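The ambiguity can be checked directly. In the sketch below, the two tables mirror the two readings of the example, and `write_unquoted` and `write_quoted` are hypothetical helpers written only for this demonstration: without quoting, both tables serialize to exactly the same text, so nothing reading that text can tell them apart.

```python
import csv
import io

# Two different tables that mirror the two readings of the example.
table_a = [["A", "B", "C", "D", "E", "F\nG"], ["1", "2", "3", "4", "5", "6"]]
table_b = [["A", "B", "C", "D", "E", "F"], ["G\n1", "2", "3", "4", "5", "6"]]


def write_unquoted(table):
    # Naive serialization: join fields with commas and rows with newlines.
    return "\n".join(",".join(row) for row in table)


def write_quoted(table):
    # csv.writer quotes any field that contains a newline.
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(table)
    return out.getvalue()


print(write_unquoted(table_a) == write_unquoted(table_b))  # True: the text is identical
print(write_quoted(table_a) == write_quoted(table_b))      # False: quoting keeps them apart
```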
Solution 3:
This should work. (Warning: Brain compiled code)
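with open('test.csv') as infile:
    data = []
    for line in infile:
        temp_data = line.rstrip('\n').split(',')
        try:
            while len(temp_data) < 6:  # expected number of columns
                # row cut short by a stray newline: append the next line's fields
                temp_data.extend(next(infile).rstrip('\n').split(','))
        except StopIteration:
            pass  # ran out of lines; keep the partial row
        data.append(temp_data)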
Solution 4:
If the number of fields in each row is the same and fields can't be empty:
from itertools import zip_longest  # named izip_longest in Python 2

filename = 'test.csv'
nfields = 6
with open(filename) as f:
    fields = (field.strip() for line in f for field in line.split(',') if field.strip())
    for row in zip_longest(*[iter(fields)]*nfields):  # grouper recipe from the itertools docs
        print(row)
Output:
('A', 'B', 'C', 'D', 'E', 'F')
('123', '456', 'tree', 'very', 'bla', 'indigo')
Solution 5:
If you know the number of columns, the best way is to ignore the line endings and then split.
Something like this:
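filename = 'test.csv'
with open(filename) as fp:
    # treat every newline as just another delimiter, then split
    data = fp.read().replace('\n', ',').split(',')
# drop the empty fields left by broken lines (assumes no genuinely empty cells)
data = [field.strip() for field in data if field.strip()]
for n in range(0, len(data), 6):
    print(data[n:n+6])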
You can convert it easily into a generator if you prefer:
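def read_ugly_file(filename, delimiter=',', columns=6):
    with open(filename) as fp:
        # treat every newline as just another delimiter, then split
        data = fp.read().replace('\n', delimiter).split(delimiter)
    # drop the empty fields left by broken lines (assumes no genuinely empty cells)
    data = [field.strip() for field in data if field.strip()]
    for n in range(0, len(data), columns):
        yield data[n:n+columns]

for row in read_ugly_file('myfile.csv'):
    print(row)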