Collect Set By Group?
I am working with Hive data pulled into a Python Jupyter notebook using a Hive wrapper written in Python. I have terabytes of data like the following:

Table 1 (time=t1):
uid  colA
1
Solution 1:
Here is a summarized solution to the problem using PySpark. Hope this works for you.
Global imports:
import pyspark.sql.functions as F
import pyspark.sql.types as T
Table 2 creation code:
# Build Table 2: group by uid and collect the distinct colA values into a set
df1 = (sc.parallelize([
        [1, 'A'], [1, 'B'], [1, 'C'], [2, 'A'], [2, 'B'], [3, 'C'], [3, 'D']
    ]).toDF(['uid', 'colA'])
    .groupBy("uid")
    .agg(F.collect_set("colA").alias("colA")))
df1.show()
+---+---------+
|uid| colA|
+---+---------+
| 1|[A, B, C]|
| 2| [A, B]|
| 3| [C, D]|
+---+---------+
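Since the original data already lives in Hive, the same collect_set aggregation can also be pushed down as a query instead of building the DataFrame in the driver. A minimal sketch, assuming a Hive-enabled SparkSession named spark and a hypothetical source table table1 with uid and colA columns:

from pyspark.sql import SparkSession

# Assumption: a Hive-enabled session; `table1` is a hypothetical table name
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# collect_set is a built-in Hive/Spark SQL aggregate: one row per uid,
# with the distinct colA values gathered into an array
df1_sql = spark.sql("""
    SELECT uid, collect_set(colA) AS colA
    FROM table1
    GROUP BY uid
""")
df1_sql.show()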
Table 3 creation code:
df2 = sc.parallelize([[1, ['A', 'B']], [2, ['B']], [3, ['C', 'D', 'E']]]).toDF(['uid', 'colA'])

# UDF that returns the elements of y that are not already in x
def diffUdfFunc(x, y):
    return list(set(y).difference(set(x)))

diffUdf = F.udf(diffUdfFunc, T.ArrayType(T.StringType()))
# Join on uid and keep, per uid, the values present in df2 but missing from df1
finaldf = (df1.withColumnRenamed("colA", "colA1").join(df2, "uid")
           .withColumnRenamed("colA", "colA2")
           .withColumn("diffCol", diffUdf(F.col("colA1"), F.col("colA2"))))
finaldf.select("uid", F.col("diffCol").alias("colA")).where(F.size("colA") > 0).show()
+---+----+
|uid|colA|
+---+----+
| 3| [E]|
+---+----+
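As a side note, on Spark 2.4 or later the Python UDF can be avoided entirely: pyspark.sql.functions.array_except computes the "elements of one array that are missing from the other" directly in the JVM, which is typically faster at this scale. A sketch under that version assumption:

# array_except(colA2, colA1): values in df2's list that are absent from df1's set
finaldf2 = (df1.withColumnRenamed("colA", "colA1")
            .join(df2, "uid")
            .withColumnRenamed("colA", "colA2")
            .withColumn("diffCol", F.array_except("colA2", "colA1")))
finaldf2.select("uid", F.col("diffCol").alias("colA")).where(F.size("colA") > 0).show()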