
Pandas: How To Include All Columns For All Rows Although Value Is Missing In A Dataframe With A Long Format?

This may sound like a strange question at first, but I found it hard to find 'standard' terms when talking about elements of data of a long format. So I thought I'd just as well us

Solution 1:

This differs from the previous case because the same row can hold multiple values for the same column. Number the duplicates with cumcount() first, then pivot and stack:

df['key']=df.groupby(['row','column']).cumcount()

df1 = pd.pivot_table(df,index='row',columns=['key','column'],values='value')

df1 = df1.stack(level=[0,1],dropna=False).to_frame('value').reset_index()

df1 = df1[df1.key.eq(0) | df1['value'].notna()]
df1
Out[97]: 
           row  key column  value
0   21.08.2020    0      A   43.0
1   21.08.2020    0      B   36.0
2   21.08.2020    0      C   28.0
3   21.08.2020    1      A   36.0
6   22.08.2020    0      A   16.0
7   22.08.2020    0      B   40.0
8   22.08.2020    0      C    NaN
10  22.08.2020    1      B   34.0
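
As a small illustration of why the helper key is needed, the sketch below rebuilds the sample data from this question and shows only the cumcount() step. The variable names are just for illustration; the sample values are the ones used in the other answers.

```python
import pandas as pd

# Sample data in long format; (21.08.2020, A) and (22.08.2020, B) occur twice.
df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# cumcount() numbers repeated (row, column) pairs 0, 1, 2, ...
# so each duplicate gets its own key and survives the pivot.
df["key"] = df.groupby(["row", "column"]).cumcount()
print(df)
```

Without this key, pivot_table would collapse the duplicates into a single aggregated cell.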

Solution 2:

I found an approach with pd.pivot_table() in combination with unstack():

import pandas as pd
df=pd.DataFrame({'row': {0: '21.08.2020',
  1: '21.08.2020',
  2: '21.08.2020',
  3: '21.08.2020',
  4: '22.08.2020',
  5: '22.08.2020',
  6: '22.08.2020'},
 'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
 'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}})

df1 = pd.pivot_table(df,index='row',columns='column',values='value').unstack().reset_index() 
print(df1)

Output

  column         row     0
0      A  21.08.2020  39.5
1      A  22.08.2020  16.0
2      B  21.08.2020  36.0
3      B  22.08.2020  37.0
4      C  21.08.2020  28.0
5      C  22.08.2020   NaN

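The column order of the resulting dataframe is arguably messed up, though. Note also that duplicate entries are averaged (43 and 36 become 39.5), because pivot_table aggregates with the mean by default.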
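
If the column order matters, one possible cleanup is sketched below. It assumes the averaging of duplicates is acceptable, and the names wide and tidy are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# pivot_table aggregates duplicates with the mean by default.
wide = pd.pivot_table(df, index="row", columns="column", values="value")

# unstack() returns a Series indexed by (column, row); name the values,
# restore the original column order, and sort by row, then column.
tidy = (
    wide.unstack()
        .reset_index(name="value")[["row", "column", "value"]]
        .sort_values(["row", "column"], ignore_index=True)
)
print(tidy)
```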

Solution 3:

Here is a naive approach - uses a for loop.

import collections
import pandas as pd

data = {'row': {0: '21.08.2020', 1: '21.08.2020', 2: '21.08.2020',
                3: '21.08.2020', 4: '22.08.2020', 5: '22.08.2020',
                6: '22.08.2020'},
        'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
        'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}}

df = pd.DataFrame(data)

categories = set(df.column.unique())
tbl = pd.pivot_table(df[['row','column']],values='column',index='row',aggfunc=set)

missing = tbl.column.apply(categories.difference)
missing = filter(lambda x:x[1],missing.items())

d = collections.defaultdict(list)
# d = {'row': [], 'column': [], 'value': []}
for row, col in missing:
    for cat in col:
        d['row'].append(row)
        d['column'].append(cat)
        d['value'].append(0)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df2 = pd.concat([df, pd.DataFrame(d)]).reset_index()

Of course all the new values will be at the end and it would need to be sorted if that is an issue.


Intermediate objects:

>>> tbl
               column
row                  
21.08.2020  {A, B, C}
22.08.2020     {A, B}
>>> missing
row
21.08.2020     {}
22.08.2020    {C}
Name: column, dtype: object
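
If the position of the new rows matters, a stable sort puts them next to their siblings. The following is a compact sketch of the same idea rather than the exact code above: it collects the categories seen per row with groupby instead of pivot_table, and fills missing combinations with 0 as this answer does.

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

categories = set(df["column"])
# Set of categories present for each row.
present = df.groupby("row")["column"].agg(set)

# One filler record per (row, missing category).
filler = [
    {"row": row, "column": cat, "value": 0}
    for row, seen in present.items()
    for cat in sorted(categories - seen)
]

df2 = pd.concat([df, pd.DataFrame(filler)], ignore_index=True)
# A stable sort keeps duplicate (row, column) entries in their original order.
df2 = df2.sort_values(["row", "column"], kind="stable", ignore_index=True)
print(df2)
```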

Solution 4:

Here is an alternative. It sets the row and column columns as the index, builds every possible combination of their values, and outer-joins (how='outer') an empty dataframe indexed by those combinations:

from itertools import product
new_index = product(set(df.row.array), set(df.column.array))
df = df.set_index(["row", "column"])
new_index = pd.DataFrame([], index=pd.Index(new_index, names=["row", "column"]))
df.join(new_index, how="outer").reset_index().astype({"value": "Int8"}) # if you are keen on nullable integers

          row column  value
0  21.08.2020      A     43
1  21.08.2020      A     36
2  21.08.2020      B     36
3  21.08.2020      C     28
4  22.08.2020      A     16
5  22.08.2020      B     40
6  22.08.2020      B     34
7  22.08.2020      C   <NA>
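
The same all-combinations idea can also be sketched with pd.MultiIndex.from_product and a left merge, which avoids itertools. This is a variation on the answer above, not its exact code; the names full and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "row": ["21.08.2020"] * 4 + ["22.08.2020"] * 3,
    "column": ["A", "A", "B", "C", "A", "B", "B"],
    "value": [43, 36, 36, 28, 16, 40, 34],
})

# Every (row, column) combination as a two-column frame.
full = pd.MultiIndex.from_product(
    [df["row"].unique(), df["column"].unique()],
    names=["row", "column"],
).to_frame(index=False)

# A left merge keeps every combination; duplicates in df are preserved,
# and combinations without data get a missing value.
result = full.merge(df, on=["row", "column"], how="left")
result = result.astype({"value": "Int64"})  # nullable integers, optional
print(result)
```

Because the merge starts from the full grid, the result comes back ordered by row and column without an extra sort.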
