CRUD operations on Parquet files with Azure Data Factory
If you ever need to perform CRUD operations on Parquet files in ADF, this article offers a few hints.
Have you ever wondered how to manage CRUD (Create, Read, Update, Delete) operations on Parquet files in the ADF (Azure Data Factory) environment? I recently worked on a project that required us to manage CRUD on Parquet files, and I want to share my experience with you.
Assumptions: the upstream process can export a Full data set, a Delta data set (inserted and updated records since the last run) and a Primary Key data set for each table. Our upstream database is an Azure SQL DB in which every table has a primary key and a timestamp field such as CreateDate (Datetime data type).
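For illustration, here is a minimal sketch of the kind of source table these assumptions describe; the table and column names are hypothetical:

```sql
-- Hypothetical upstream table: it has a primary key and a timestamp column,
-- which is all the process below relies on.
CREATE TABLE dbo.Customer (
    CustomerId   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(200) NOT NULL,
    CreateDate   DATETIME      NOT NULL DEFAULT GETDATE()
);
```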
Full Export from Azure SQL to ADLS Parquet
The initial full export is built with ADF pipelines; I used a metadata-driven approach to loop through each table for export.
In the Copy Data activity, the source is Azure SQL and the sink is ADLS.
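As a rough sketch, the source query for each iteration of that loop might look like the following; in ADF the statement would be assembled dynamically from the table metadata (the names here are hypothetical):

```sql
-- Full export: the ForEach loop substitutes each table name from the
-- parameter table into a plain SELECT used as the Copy Data source, e.g.
-- @concat('SELECT * FROM ', item().SchemaName, '.', item().TableName)
SELECT *
FROM dbo.Customer;
```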
The parameters for the tables are stored in a separate table, with a watermark column that captures the last export. Changes to the tables are captured and exported by a second pipeline: first we look up the watermark value for each table, then we load the records with a timestamp after the last update (this is the watermarking process), and finally we update the parameter table with the current pipeline execution timestamp.
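A minimal sketch of that watermarking round trip, assuming a hypothetical control table named dbo.WatermarkControl:

```sql
-- 1) Lookup activity: fetch the last watermark for the table being processed.
SELECT LastWatermark
FROM dbo.WatermarkControl
WHERE TableName = 'dbo.Customer';

-- 2) Copy Data source: load only records with a timestamp after the watermark.
--    The '@{...}' placeholder stands for the value returned by the Lookup.
SELECT *
FROM dbo.Customer
WHERE CreateDate > '@{activity(''LookupWatermark'').output.firstRow.LastWatermark}';

-- 3) After a successful delta export, advance the watermark to the
--    current pipeline execution timestamp.
UPDATE dbo.WatermarkControl
SET LastWatermark = '@{pipeline().TriggerTime}'
WHERE TableName = 'dbo.Customer';
```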
We also create a Primary Key data set. This data set will be used later to compare and identify records deleted since the last run.
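The Primary Key data set is simply a projection of the key column(s); sketched for the same hypothetical table:

```sql
-- Primary Key data set: export only the keys. Comparing these keys against
-- the previous full set later reveals rows that were deleted at the source.
SELECT CustomerId
FROM dbo.Customer;
```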
Now that we have all the data sets, we can build a Data Flow for the CRUD operation. The Data Flow performs the following tasks (a T-SQL sketch of the equivalent set logic follows the list):
- Build a Deleted data set
- Identify Not Changed data
- Identify Updated data
- Identify Inserted data
- Union Not Changed with Updated and Inserted
- Remove Deleted data from the Union set
- Create or replace the final data set
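ADF Mapping Data Flows build these steps visually (Exists, Union and Sink transformations), so there is no single script behind them to paste here. As a sketch only, the equivalent set logic expressed in T-SQL against hypothetical staging tables would be:

```sql
-- Equivalent set logic for the Data Flow, written in T-SQL for clarity.
-- FullPrev  = previous full Parquet data set
-- Delta     = inserted and updated rows since the last run
-- PkCurrent = current primary keys at the source
WITH Deleted AS (
    -- keys present in the previous full set but gone from the source
    SELECT f.CustomerId
    FROM FullPrev AS f
    WHERE NOT EXISTS (SELECT 1 FROM PkCurrent AS p WHERE p.CustomerId = f.CustomerId)
),
NotChanged AS (
    -- previous rows with no matching key in the delta
    SELECT f.*
    FROM FullPrev AS f
    WHERE NOT EXISTS (SELECT 1 FROM Delta AS d WHERE d.CustomerId = f.CustomerId)
),
Updated AS (
    -- delta rows whose key already existed in the previous full set
    SELECT d.*
    FROM Delta AS d
    WHERE EXISTS (SELECT 1 FROM FullPrev AS f WHERE f.CustomerId = d.CustomerId)
),
Inserted AS (
    -- delta rows whose key is new
    SELECT d.*
    FROM Delta AS d
    WHERE NOT EXISTS (SELECT 1 FROM FullPrev AS f WHERE f.CustomerId = d.CustomerId)
),
Unioned AS (
    SELECT * FROM NotChanged
    UNION ALL SELECT * FROM Updated
    UNION ALL SELECT * FROM Inserted
)
-- drop the deleted keys and write the result as the new full data set,
-- replacing the previous Parquet file
SELECT u.*
FROM Unioned AS u
WHERE NOT EXISTS (SELECT 1 FROM Deleted AS x WHERE x.CustomerId = u.CustomerId);
```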
Summary: this process lets us run CRUD operations on Parquet data sources. I’m fully aware that other technologies can accomplish similar tasks, but our goal here was to use only the ADF service.