What Type Should The Dense Vector Be, When Using Udf Function In Pyspark?
I want to convert a list to a Vector in PySpark and then use that column to train a machine learning model. But my Spark version is 1.6.0, which does not have VectorUDT(). So what
Solution 1:
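You can use Vectors and VectorUDT with a UDF:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F
# Wrap Vectors.dense in a UDF whose declared return type is VectorUDT
ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
# Column 'a' holds the list; column 'b' receives the dense vector
df = df.withColumn('b', ud_f('a'))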
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+
df.printSchema()
root
 |-- a: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- b: vector (nullable = true)
For more about VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html
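The question mentions Spark 1.6.0, while the pyspark.ml.linalg module used in the answer only exists from Spark 2.0 onward. In 1.x releases, Vectors and VectorUDT are provided by pyspark.mllib.linalg instead. The following is a minimal sketch of a version-aware variant of the same UDF. The helper names linalg_module_for and make_to_vector_udf are hypothetical, chosen for illustration, and the null check is an added assumption about how missing arrays should be handled.

```python
import importlib


def linalg_module_for(spark_version):
    """Return the module path that provides Vectors and VectorUDT.

    Hypothetical helper: pyspark.ml.linalg was introduced in Spark 2.0;
    earlier releases ship the same names in pyspark.mllib.linalg.
    """
    major = int(spark_version.split(".")[0])
    return "pyspark.ml.linalg" if major >= 2 else "pyspark.mllib.linalg"


def make_to_vector_udf(spark_version):
    """Build a UDF that turns an array<double> column into a dense vector.

    Hypothetical helper; PySpark is imported lazily so that this
    definition can be loaded even where PySpark is not installed.
    """
    from pyspark.sql import functions as F

    linalg = importlib.import_module(linalg_module_for(spark_version))
    Vectors, VectorUDT = linalg.Vectors, linalg.VectorUDT
    # Assumption: a null array should stay null rather than raise an error.
    return F.udf(
        lambda r: Vectors.dense(r) if r is not None else None,
        VectorUDT(),
    )
```

With an existing SparkContext sc, this could be used as to_vector = make_to_vector_udf(sc.version) followed by df = df.withColumn('b', to_vector('a')). Note that the two vector types are not interchangeable: in Spark 2.x the pyspark.ml estimators expect pyspark.ml.linalg vectors, so the module should match the API the model comes from.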