Pandas: How To Include All Columns For All Rows Although Value Is Missing In A Dataframe With A Long Format?
This may sound like a strange question at first, but I found it hard to find 'standard' terms when talking about elements of data of a long format. So I thought I'd just as well us
Solution 1:
This is different from the previous one, since here we have multiple values for the same row and column.
df['key']=df.groupby(['row','column']).cumcount()
df1 = pd.pivot_table(df,index='row',columns=['key','column'],values='value')
df1 = df1.stack(level=[0,1],dropna=False).to_frame('value').reset_index()
df1 = df1[df1.key.eq(0) | df1['value'].notna()]
df1
Out[97]:
           row  key column  value
0   21.08.2020    0      A   43.0
1   21.08.2020    0      B   36.0
2   21.08.2020    0      C   28.0
3   21.08.2020    1      A   36.0
6   22.08.2020    0      A   16.0
7   22.08.2020    0      B   40.0
8   22.08.2020    0      C    NaN
10  22.08.2020    1      B   34.0
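To see why the helper key is needed, here is a minimal sketch of what groupby(...).cumcount() produces on the sample data used in the other answers. The list-based construction of the frame and the name keys are only for illustration.

```python
import pandas as pd

# Same sample data as in the other answers, written with plain lists.
df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# Number repeated (row, column) pairs 0, 1, 2, ... in order of appearance,
# so that duplicates end up in separate pivot columns instead of being aggregated.
df["key"] = df.groupby(["row", "column"]).cumcount()
keys = df["key"].tolist()
print(keys)
```

The second occurrence of a (row, column) pair gets key 1, which is what lets the pivot keep both values apart.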
Solution 2:
I found an approach using pd.pivot_table() in combination with unstack():
import pandas as pd
df=pd.DataFrame({'row': {0: '21.08.2020',
1: '21.08.2020',
2: '21.08.2020',
3: '21.08.2020',
4: '22.08.2020',
5: '22.08.2020',
6: '22.08.2020'},
'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}})
df1 = pd.pivot_table(df,index='row',columns='column',values='value').unstack().reset_index()
print(df1)
Output
  column         row     0
0      A  21.08.2020  39.5
1      A  22.08.2020  16.0
2      B  21.08.2020  36.0
3      B  22.08.2020  37.0
4      C  21.08.2020  28.0
5      C  22.08.2020   NaN
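The order of the DataFrame columns is arguably messed up, though. Note also that pivot_table() aggregates duplicate row/column pairs with the mean by default (39.5 is the mean of 43 and 36), so the individual values are not preserved.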
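One possible way to tidy the result, sketched here under the assumption that averaging duplicates is acceptable: name the value column while resetting the index, then reorder and sort. The names wide and tidy are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# Wide table: one line per row, one column per category (duplicates averaged).
wide = pd.pivot_table(df, index="row", columns="column", values="value")

# Back to long format; name= labels the value column instead of leaving it as 0.
tidy = wide.unstack().reset_index(name="value")

# Put the columns back in the original order and sort by row, then column.
tidy = tidy[["row", "column", "value"]].sort_values(["row", "column"]).reset_index(drop=True)
print(tidy)
```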
Solution 3:
Here is a naive approach that uses a for loop.
import collections
import pandas as pd
data = {'row': {0: '21.08.2020', 1: '21.08.2020', 2: '21.08.2020',
3: '21.08.2020', 4: '22.08.2020', 5: '22.08.2020',
6: '22.08.2020'},
'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}}
df = pd.DataFrame(data)
categories = set(df.column.unique())
# set of categories present for each row
tbl = pd.pivot_table(df[['row','column']],values='column',index='row',aggfunc=set)
# categories missing from each row
missing = tbl.column.apply(categories.difference)
# keep only the rows that actually lack a category
missing = filter(lambda x:x[1],missing.items())
d = collections.defaultdict(list)
#d = {'row':[],'column':[],'value':[]}
for row,col in missing:
    for cat in col:
        d['row'].append(row)
        d['column'].append(cat)
        d['value'].append(0)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df2 = pd.concat([df, pd.DataFrame(d)]).reset_index()
Of course, all the new rows will be at the end, so the result would need to be sorted if that is an issue.
Intermediate objects:
>>> tbl
               column
row
21.08.2020  {A, B, C}
22.08.2020     {A, B}
>>> missing
row
21.08.2020     {}
22.08.2020    {C}
Name: column, dtype: object
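For comparison, the same idea can be sketched without the explicit loop and with the sort included. This is a variation, not the code from the answer: it assumes a plain groupby in place of pivot_table, keeps the answer's fill value of 0, and the names present and filler are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# Every category that appears anywhere in the frame.
categories = set(df["column"])

# Categories actually present for each row value.
present = df.groupby("row")["column"].apply(set)

# One filler line per (row, missing category), using 0 as the placeholder value.
filler = pd.DataFrame(
    [(row, cat, 0) for row, cats in present.items() for cat in sorted(categories - cats)],
    columns=["row", "column", "value"],
)

# Append the filler lines and sort so they sit next to their row.
df2 = pd.concat([df, filler]).sort_values(["row", "column"]).reset_index(drop=True)
print(df2)
```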
Solution 4:
Here is an alternative. It sets the row and column columns as the new index, gets all possible combinations of the values in the row and column columns, and joins (how='outer') an empty DataFrame that has those row and column combinations as its index:
from itertools import product
new_index = product(set(df.row.array), set(df.column.array))
df = df.set_index(["row", "column"])
# build the MultiIndex explicitly; newer pandas versions no longer accept names= in pd.Index
new_index = pd.DataFrame(index=pd.MultiIndex.from_tuples(list(new_index), names=["row", "column"]))
df.join(new_index, how="outer").reset_index().astype({"value": "Int8"}) # if you are keen on nullable integers
          row column  value
0  21.08.2020      A     43
1  21.08.2020      A     36
2  21.08.2020      B     36
3  21.08.2020      C     28
4  22.08.2020      A     16
5  22.08.2020      B     40
6  22.08.2020      B     34
7  22.08.2020      C   <NA>
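A closely related sketch, assuming you prefer to let pandas build the combinations: MultiIndex.from_product creates every row/column pair, and a left merge attaches the existing values while keeping duplicates. The names full and result are illustrative, and Int64 is used here instead of Int8 simply to avoid overflow on larger values.

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# Every (row, column) combination as a two-column frame.
full = pd.MultiIndex.from_product(
    [df["row"].unique(), df["column"].unique()], names=["row", "column"]
).to_frame(index=False)

# Left merge: each combination appears at least once; missing pairs get NaN.
result = full.merge(df, on=["row", "column"], how="left")

# Optional: nullable integers, so the gap shows as <NA> rather than a float NaN.
result["value"] = result["value"].astype("Int64")
print(result)
```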