Splitting A Dataframe Based On Condition
Solution 1:
Use ==, not is, to test equality. Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also: Is there a difference between == and is in Python?
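As a minimal, self-contained illustration of the difference (plain Python objects, unrelated to the original data):

a = [1, 2]
b = [1, 2]
print(a == b)   # True: the two lists contain equal values
print(a is b)   # False: they are two distinct objects in memory
print(a is a)   # True: a name is always identical to itself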
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually, as your first and second masks are inverses of each other. You can therefore use the bitwise inverse operator ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
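For example, a small sketch (the column name is only illustrative) of negating one mask both ways:

import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, 2134, np.nan]})

mask = df['medical_plan_id'].isnull()    # compute the expensive mask once
inverted = ~mask                         # bitwise inverse of the existing mask
also_inverted = operator.invert(mask)    # same result via the operator module

print(inverted.equals(also_inverted))    # True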
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values are represented in the NumPy arrays used by Pandas as np.nan, and np.nan != np.nan by design.
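To make this concrete, a short sketch with throwaway values:

import numpy as np
import pandas as pd

print(np.nan == np.nan)          # False: NaN never compares equal, not even to itself

s = pd.Series([np.nan, '', 2134])
print((s == np.nan).tolist())    # [False, False, False]: equality cannot detect nulls
print(s.isnull().tolist())       # [True, False, False]: the dedicated method can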
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something you should, as a programmer, look to avoid. In this case, there's no need to create two new variables; you can use GroupBy with dict to give a dictionary of dataframes, with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Solution 2:
Another variant is to unpack df.groupby, which is iterable and yields tuples (the first item being the group key and the second the corresponding sub-dataframe). For instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1), (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is conventionally used for variables you do not intend to keep. I have split the code across two lines for readability.
Full example
import pandas as pd

df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})

cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1), (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1