Monday, December 12, 2022

PySpark - hash() function creates duplicates - solved

Aim - To create a unique integer identifier from a column in your PySpark DataFrame.

Issue:- The hash() function in PySpark is prone to hash collisions: two different source column values can end up with the same hash value, as in the example below.

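A minimal sketch of the issue (standing in for the original screenshot): it assumes a DataFrame loaded from a Parquet path and a string column named Product_Name, both of which are placeholder names, not details from the original post.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source data; replace the path with your own dataset.
df = spark.read.parquet("/path/to/products")

# F.hash() is a 32-bit Murmur3 hash, so two different Product_Name values
# can map to the same Product_Id once the row count grows large enough.
df = df.withColumn("Product_Id", F.hash(F.col("Product_Name")))

# List hash values shared by more than one distinct source value (collisions).
(df.groupBy("Product_Id")
   .agg(F.countDistinct("Product_Name").alias("distinct_values"))
   .filter(F.col("distinct_values") > 1)
   .show())
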
Solution - Use the xxhash64() PySpark function to reduce such hash collisions. See the example below, where the Product_Id_2 column, created with the xxHash64 algorithm, now returns unique values.

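A sketch of the fix under the same assumptions (df and Product_Name remain placeholder names; Product_Id_2 is the column name used above):

from pyspark.sql import functions as F

# xxhash64() produces a 64-bit hash, which makes collisions far less likely
# for the same data than the 32-bit hash() function.
df = df.withColumn("Product_Id_2", F.xxhash64(F.col("Product_Name")))

# Re-running the collision check on Product_Id_2 should now return no rows
# for typical data volumes.
(df.groupBy("Product_Id_2")
   .agg(F.countDistinct("Product_Name").alias("distinct_values"))
   .filter(F.col("distinct_values") > 1)
   .show())
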
Note:- The new column created with xxhash64() is a bigint (64-bit long), not a 32-bit integer.
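
A quick way to confirm the resulting types, under the same assumptions as the sketches above:

# hash() yields an integer column, while xxhash64() yields a long (bigint) column.
df.select("Product_Id", "Product_Id_2").printSchema()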