Collect Set By Group?
I am working with Hive data pulled into a Python Jupyter notebook using a Hive wrapper written in Python. I have terabytes of data like the following:

Table 1 (time=t1):
uid  colA
1
Solution 1:
Here is a summarized solution to the problem using PySpark. Hope this works for you.
Global imports:
import pyspark.sql.functions as F
import pyspark.sql.types as T
Table 2 creation code:
# Build Table 2: group by uid and collect the distinct colA values into a set
df1 = (sc.parallelize([
        [1, 'A'], [1, 'B'], [1, 'C'], [2, 'A'], [2, 'B'], [3, 'C'], [3, 'D']
    ]).toDF(['uid', 'colA'])
    .groupBy("uid")
    .agg(F.collect_set("colA").alias("colA")))
df1.show()
+---+---------+
|uid| colA|
+---+---------+
| 1|[A, B, C]|
| 2| [A, B]|
| 3| [C, D]|
+---+---------+
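Since the original data already lives in Hive, the same collect_set aggregation can also be pushed down as a query instead of building the DataFrame in the driver. A minimal sketch, assuming a Hive-enabled SparkSession named spark and a hypothetical source table table1 with uid and colA columns:

from pyspark.sql import SparkSession

# Assumption: a Hive-enabled session; `table1` is a hypothetical table name
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# collect_set is a built-in Hive/Spark SQL aggregate: one row per uid,
# with the distinct colA values gathered into an array
df1_sql = spark.sql("""
    SELECT uid, collect_set(colA) AS colA
    FROM table1
    GROUP BY uid
""")
df1_sql.show()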
Table 3 creation code:
df2 = sc.parallelize([[1, ['A', 'B']], [2, ['B']], [3, ['C', 'D', 'E']]]).toDF(['uid', 'colA'])

# UDF that returns the elements of y that are not already in x
def diffUdfFunc(x, y):
    return list(set(y).difference(set(x)))

diffUdf = F.udf(diffUdfFunc, T.ArrayType(T.StringType()))
# Join on uid and keep, per uid, the values present in df2 but missing from df1
finaldf = (df1.withColumnRenamed("colA", "colA1").join(df2, "uid")
           .withColumnRenamed("colA", "colA2")
           .withColumn("diffCol", diffUdf(F.col("colA1"), F.col("colA2"))))
finaldf.select("uid", F.col("diffCol").alias("colA")).where(F.size("colA") > 0).show()
+---+----+
|uid|colA|
+---+----+
| 3| [E]|
+---+----+
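As a side note, on Spark 2.4 or later the Python UDF can be avoided entirely: pyspark.sql.functions.array_except computes the "elements of one array that are missing from the other" directly in the JVM, which is typically faster at this scale. A sketch under that version assumption:

# array_except(colA2, colA1): values in df2's list that are absent from df1's set
finaldf2 = (df1.withColumnRenamed("colA", "colA1")
            .join(df2, "uid")
            .withColumnRenamed("colA", "colA2")
            .withColumn("diffCol", F.array_except("colA2", "colA1")))
finaldf2.select("uid", F.col("diffCol").alias("colA")).where(F.size("colA") > 0).show()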