
What Type Should the Dense Vector Be When Using a UDF in PySpark?

I want to convert a list column to a Vector in PySpark, and then feed that column to a machine learning model for training. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should the dense vector be?

Solution 1:

You can use Vectors and VectorUDT with a UDF:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', ud_f('a'))
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)

For more on VectorUDT, see the source: http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html
