
Wednesday, February 7, 2024

Working with the com.crealytics.spark.excel package for Excel files in Azure Synapse

This post should help at least some of you who are trying to get the com.crealytics.spark.excel package up and running in your Synapse workspace and on your Spark pool. I will try to explain it in the simplest of steps.

Step 1 - Go to the MVN Repository and download the latest JAR file for the crealytics spark-excel package.



Step 2 - Once the file is downloaded, go to your Synapse workspace, open the Manage tab, and then the Workspace packages tab.


Step 3 - Upload the JAR file to Workspace packages; it should show up on the list with a provisioning status of Succeeded, see below.



Step 4 - Once the package is uploaded, go to Manage > Apache Spark pools > Packages and select spark-excel_2.12-3.5.0_0.20.3.jar from the list. It is important that session-level packages are allowed and that the Spark pool is restarted after this step. See screenshots below.


That's it on the configuration side. Now, in your notebook, you can use a code snippet like the one below to read from an Excel file.

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(ReadPath)

where ReadPath contains the path to the Excel file in your data lake. You can play around with more options on this piece of code. Hope this helps, and please let us know in the comments.
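For reference, here is a slightly fuller sketch of the same read. The lake path and sheet name are hypothetical placeholders, and 'dataAddress' is an optional spark-excel setting for targeting a specific sheet or range; adjust these to your workbook.

# Hypothetical path to an Excel file in the data lake - replace with your own
ReadPath = "abfss://container@datalake.dfs.core.windows.net/raw/sample.xlsx"

df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")              # first row holds column names
      .option("inferSchema", "true")         # let Spark guess column types
      .option("dataAddress", "'Sheet1'!A1")  # optional: a specific sheet/range
      .load(ReadPath))

df.show(5)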


Note - If you have higher environments, make sure you repeat these steps there as well.

Wednesday, November 23, 2022

Synapse Analytics - Create Delta Lake Database or Schema or Table fails with java.lang.IllegalArgumentException: null

Issue - Trying to create a Delta Lake database, a schema, or a Delta Lake table fails with the error java.lang.IllegalArgumentException: null path.

Error sample screenshot


For a quick background, I am using a small-sized cluster with 3-8 worker nodes and a Synapse notebook to execute the syntax.

The error log goes as below:

Error: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: null path
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:193)
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:153)
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:133)

It hints at a path being null. 

Resolution:

The very first thing to check is whether the path you are passing while creating the table is correct. For example, see the syntax below:

CREATE TABLE lakedatabasename.tablename
USING DELTA
LOCATION 'abfss://container@datalake.dfs.core.windows.net/Path'

Check if the path exists by using the below syntax in a Spark notebook:

mssparkutils.fs.ls('abfss://container@datalake.dfs.core.windows.net/Path')

and this should list all the files you have under the path.



If files get listed, it's not an issue with your path and you can rule that out.
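If you want a friendlier check, a minimal sketch like the one below wraps the same listing call and prints a clear message when the path is missing; the path is the same hypothetical placeholder used above.

# Hypothetical path - replace with your own delta lake folder
path = 'abfss://container@datalake.dfs.core.windows.net/Path'

try:
    # mssparkutils returns FileInfo objects; print just the names
    for f in mssparkutils.fs.ls(path):
        print(f.name)
except Exception as e:
    print(f'Path check failed for {path}: {e}')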

The next thing to check is whether the user you are using to create the Delta Lake database/table has the 'Storage Blob Data Contributor' role on the storage account; this would be needed.

For me, both of the above debug steps were okay. Later I read that in Synapse, Delta Lake creates the metadata for a lake database in the default container, and I had actually given the default container the same name as my storage account during creation, so I had deleted the default container.

This is where the template I extracted during resource creation came in handy. I checked the template and found that I had the same name in the 'defaultDataLakeStorageFilesystemName' parameter as well, see screenshot below.



The fix was to recreate the default container so Synapse can create the Delta Lake database metadata inside it, and then re-run the same commands.
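If you prefer to script the container recreation instead of using the portal, below is a hedged sketch using the Azure SDK for Python. The account URL and container name are hypothetical placeholders, and it assumes the azure-storage-blob and azure-identity packages plus credentials with rights on the account.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical account URL - replace with your storage account
account_url = "https://datalake.blob.core.windows.net"

service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

# The container name must match the workspace's
# 'defaultDataLakeStorageFilesystemName' parameter
service.create_container("datalake")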




I raised this issue, and it is documented in the Microsoft Learn community here.

Monday, November 21, 2022

Partitioned column data not displaying in External Table over Delta Lake in Serverless SQL Pool

Issue: You are trying to create an external table over a Delta Lake folder path in your Azure Data Lake. The said Delta Lake table is partitioned over a few columns. For example, see below where the Delta Lake table is partitioned over the fields 'TheYear' and 'TheMonth'.

Delta Lake

The path was then created as an external table in Azure Synapse and queried, where you can see all columns from the Delta Lake table except for the partitioned columns, please see screenshot below.



Cause: This seems to be because this is not a feature the serverless SQL pool has started to support yet (reference).

Solution: However, there is an alternate way of achieving the same result using a SQL view with the OPENROWSET functionality. You can then use this view to query the table further.

Sample syntax as below:

    CREATE VIEW vName AS
    SELECT TOP 10 *
    FROM OPENROWSET(
            BULK 'Gold/Database/Table1/',
            DATA_SOURCE = 'DeltaLakeStorage',
            FORMAT = 'delta'
        )
        WITH (
            TheYear VARCHAR(10),
            TheMonth VARCHAR(10),
            type VARCHAR(100),
            count BIGINT
        ) AS [result]
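You can then query the view like any other SQL object over the serverless SQL endpoint. Below is a hedged usage sketch from Python using pyodbc; the server, database, and credentials are hypothetical placeholders.

import pyodbc

# Hypothetical serverless endpoint and credentials - replace with your own
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 18 for SQL Server};'
    'SERVER=myworkspace-ondemand.sql.azuresynapse.net;'
    'DATABASE=mydb;UID=myuser;PWD=mypassword'
)

# The partitioned columns now come back like any other column
for row in conn.execute('SELECT TheYear, TheMonth, [count] FROM vName'):
    print(row[0], row[1], row[2])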



Monday, August 16, 2021

Not authorized to view container in Azure Blob under Storage Account

Error - Below is the error description from inside a newly created container under the storage account we created:

  • This request is not authorized to perform this operation. RequestId:342dca41-601e-004f-0dc9-9291aa000000 Time:2021-08-16T18:10:08.0322108Z 
  • This storage account's 'Firewalls and virtual networks' settings may be blocking access to storage services. Try adding your client IP address ('xx.xx.xx.xxx') to the firewall exceptions, or by allowing access from 'all networks' instead of 'selected networks'




Cause: The client IP address of your machine is not added as an exception to the firewall settings under the parent storage account.


Solution: 

1. The solution is to add your IP address as an exception under 'Networking' > 'Firewall and Virtual Networks' (a scripted alternative is sketched after these steps).



2. Click on 'Add your client IP Address (xx.xx.xxx.xxx)'.




3. Now click on 'Save' in the top left corner under 'Firewall and Virtual Networks'.


4. Now navigate back to the container, and you should be able to access it.
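As mentioned in step 1, if you would rather script the firewall exception than click through the portal, below is a hedged sketch using the Azure SDK for Python. The subscription, resource group, account name, and IP address are hypothetical placeholders, and it assumes the azure-mgmt-storage and azure-identity packages plus permissions to update the account.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import IPRule, StorageAccountUpdateParameters

# Hypothetical values - replace with your own
subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "my-resource-group"
account_name = "mystorageaccount"
client_ip = "xx.xx.xx.xxx"  # your client IP address

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Fetch the current network rules and append the client IP as an exception
account = client.storage_accounts.get_properties(resource_group, account_name)
rules = account.network_rule_set
rules.ip_rules.append(IPRule(ip_address_or_range=client_ip))

client.storage_accounts.update(
    resource_group, account_name,
    StorageAccountUpdateParameters(network_rule_set=rules),
)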