Pyspark: Remove Utf Null Character From Pyspark Dataframe
Solution 1:
Ah wait - I think I have it. Something like this seems to work (note the import, which the original snippet omitted):

```python
from pyspark.sql.functions import regexp_replace

null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))
```
And then applying it to all string columns:

```python
from pyspark.sql.functions import col, regexp_replace

null = u'\u0000'
string_columns = ['d', 'e']
new_df = df.select(
    *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
      for c in df.columns)
)
```
Solution 2:
You can use DataFrame.fillna() to replace null values. From the docs: "Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other." Note that this targets SQL NULL (None) values, not the literal u'\u0000' character handled in Solution 1.
Parameters:
- value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
- subset – optional list of column names to consider. Columns specified in subset that do not have a matching data type are ignored. For example, if value is a string and subset contains a non-string column, then the non-string column is simply ignored.
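As a sketch of those dict semantics outside Spark (hypothetical data, plain-Python analog rather than the real `df.fillna`): with a dict value, None is replaced per column using the mapping, and columns absent from the mapping are left untouched.

```python
def fillna_dict(rows, value):
    """Plain-Python analog of df.fillna(value) when value is a dict:
    replace None in each mapped column with its replacement value."""
    return [
        {c: (value[c] if v is None and c in value else v)
         for c, v in row.items()}
        for row in rows
    ]

rows = [
    {'d': None, 'e': u'x'},
    {'d': u'y', 'e': None},
]

# Only column 'e' is in the mapping, so None in 'd' survives.
filled = fillna_dict(rows, {'e': ''})
print(filled)
```

In real PySpark this corresponds to `df.fillna({'e': ''})`, where the `subset` argument is ignored because a dict was supplied.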