How To Compare Strings Without Case Sensitivity In Spark RDD?

I have the following dataset:

drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000

Wanted to find out number of A's and B's from above Dat

Solution 1:

To compare strings without case sensitivity, convert the column to lower case first with the lower() function from pyspark.sql.functions, then match against a lower-case literal. So you could try:

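from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session (the pyspark shell already provides `spark`).
spark = SparkSession.builder.getOrCreate()

logData = spark.createDataFrame(
    [
     (0,'aB'),
     (1,'AaA'),
     (2,'bA'),
     (3,'bB')
    ],
    ('id', "value")
)
# Lower-case the column so that 'a' matches both 'a' and 'A',
# then count the rows that contain it (3 rows for this data).
numAs = logData.filter(F.lower(logData.value).contains('a')).count()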
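Since the question title mentions RDDs, the same row-level check can be sketched on an RDD of plain strings instead of a DataFrame. This is only a minimal sketch: the helper names contains_ignore_case, count_matching_rows and rdd_demo are hypothetical, and the demo assumes a local Spark installation and uses the same four sample values as above.

```python
def contains_ignore_case(text: str, needle: str) -> bool:
    """Return True if `needle` occurs in `text`, ignoring case."""
    return needle.lower() in text.lower()


def count_matching_rows(rdd, needle: str) -> int:
    """Count the string elements of `rdd` that contain `needle`, ignoring case."""
    return rdd.filter(lambda line: contains_ignore_case(line, needle)).count()


def rdd_demo() -> int:
    """Hypothetical demo; needs a working Spark installation to run."""
    # Imported lazily so the helpers above stay usable without Spark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.parallelize(["aB", "AaA", "bA", "bB"])
    return count_matching_rows(lines, "a")
```

Calling rdd_demo() from a pyspark environment should count the rows containing 'a' or 'A'. The comparison itself is ordinary Python, so it runs inside the filter lambda on the executors.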

You mention 'I am using the following code to find out num of A's and number of B's.' Note that the filter above counts the number of rows that contain the character. If you want to count the actual occurrences of a character instead, you could do something like:

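def count_char_in_col(col: str, char: str):
    # Remove every character except `char` from the lower-cased column;
    # the length of what remains is the number of occurrences in that row.
    return F.length(F.regexp_replace(F.lower(F.col(col)), "[^" + char.lower() + "]", ""))

# Sum the per-row counts over the whole DataFrame.
logData.select(count_char_in_col('value','a')).groupBy().sum().collect()[0][0]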

which in the above example will return 5 (1 + 3 + 1 + 0 across the four rows).
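The regex trick deletes every character that is not the target and measures what is left. A plain-Python sketch of the same logic, using the standard re module, makes the arithmetic easy to check without a Spark session. The helper name count_char_ignore_case is hypothetical, and re.escape is an added precaution (not part of the answer above) for characters that are special inside a character class. Spark's regexp_replace uses Java regular expressions, so treat this as an illustration of the idea, not a drop-in equivalent.

```python
import re


def count_char_ignore_case(text: str, char: str) -> int:
    """Count occurrences of a single character in `text`, ignoring case."""
    # Build a negated character class such as [^a]; re.escape guards
    # against characters like ']' or '^' that are special inside [...].
    pattern = "[^" + re.escape(char.lower()) + "]"
    # Delete everything except the target character, then count what remains.
    return len(re.sub(pattern, "", text.lower()))


values = ["aB", "AaA", "bA", "bB"]
per_row = [count_char_ignore_case(v, "a") for v in values]
total = sum(per_row)
print(per_row, total)  # [1, 3, 1, 0] 5
```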

Hope this helps!

