
Ambiguous Behavior While Adding New Column To StructType

I defined a function in PySpark:

def add_ids(X):
    schema_new = X.schema.add('id_col', LongType(), False)
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]

Solution 1:

The mistake is here:

schema_new = X.schema.add("id_col", LongType(), False)

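If you check the source, you'll see that the add method modifies the schema in place and returns the same object, so schema_new and X.schema end up referring to the same StructType.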

It is easier to see on a simplified example:

from pyspark.sql.types import *

schema = StructType()
schema.add(StructField("foo", IntegerType()))

schema
StructType(List(StructField(foo,IntegerType,true)))

As you can see, the schema object itself has been modified.

Instead of using the add method, rebuild the schema:

schema_new = StructType(schema.fields + [StructField("id_col", LongType(), False)])

Alternatively, you can create a deep copy of the object:

import copy

old_schema = StructType()
new_schema = copy.deepcopy(old_schema).add(StructField("foo", IntegerType()))

old_schema
StructType(List())
new_schema
StructType(List(StructField(foo,IntegerType,true)))
