Ambiguous Behavior While Adding a New Column to StructType
I defined a function in PySpark:

def add_ids(X):
    schema_new = X.schema.add('id_col', LongType(), False)
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]])
Solution 1:
The mistake is here:
schema_new = X.schema.add("id_col", LongType(), False)
If you check the source you'll see that the add method modifies data in place.
It is easier to see with a simplified example:
from pyspark.sql.types import *

schema = StructType()
schema.add(StructField("foo", IntegerType()))

schema
# StructType(List(StructField(foo,IntegerType,true)))
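A quick illustration of the consequence, continuing the same session: every further call keeps appending to that same object, so repeated invocations silently grow the schema:

schema.add(StructField("bar", IntegerType()))
schema
# StructType(List(StructField(foo,IntegerType,true),StructField(bar,IntegerType,true)))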
As you can see, the schema object has been modified in place.

Instead of using the add method, you should rebuild the schema:
schema_new = StructType(schema.fields + [StructField("id_col", LongType(), False)])
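Applied back to the question's add_ids function, a minimal sketch could look like this (the trailing toDF call and the return statement are assumptions filling in the truncated post, and toDF on an RDD requires an active SparkSession):

from pyspark.sql.types import StructType, StructField, LongType

def add_ids(X):
    # Build a brand-new schema instead of mutating X.schema in place
    schema_new = StructType(X.schema.fields + [StructField("id_col", LongType(), False)])
    # Pair each row with its index and append the index as the last column
    rdd_with_id = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]])
    return rdd_with_id.toDF(schema_new)  # assumption: convert back to a DataFrame

Since X.schema is never touched, the input DataFrame keeps its original schema and only the returned DataFrame carries id_col.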
Alternatively, you can create a deep copy of the object first:

import copy

old_schema = StructType()
new_schema = copy.deepcopy(old_schema).add(StructField("foo", IntegerType()))

old_schema
# StructType(List())

new_schema
# StructType(List(StructField(foo,IntegerType,true)))
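In the context of the question, the same idea amounts to copying the DataFrame's schema before extending it (a sketch; X stands for the input DataFrame, as in the original function):

import copy
from pyspark.sql.types import StructField, LongType

# deepcopy first, then add: X.schema itself is left unchanged
schema_new = copy.deepcopy(X.schema).add(StructField("id_col", LongType(), False))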