About
This howto show you how to perform diff operation between 2 SQL Tables with the data diff command.
Steps
Load Data
In this step, we load the csv resources that we want to compare into sqlite
- Load the original data resource (also known as the source, old, from or left resource)
tabul data replace diff/characters_original.csv@howto characters_original@sqlite
Transfer results
input target latency record_count error_code error_message
---------------------------------- -------------------------- ------- ------------ ---------- -------------
diff/characters_original.csv@howto characters_original@sqlite 0.50s 5
- The new data resource (also known as the target, right, to resource)
tabul data replace diff/characters_new.csv@howto characters_new@sqlite
Transfer results
input target latency record_count error_code error_message
----------------------------- --------------------- ------- ------------ ---------- -------------
diff/characters_new.csv@howto characters_new@sqlite 0.53s 3
The data
The fact that records are missing can be easily spotted but there are more subtle differences.
tabul data print \
--strict-selection \
characters_new@sqlite \
characters_original@sqlite
characters_new@sqlite
Last Name First Name Birth Date Wikipedia Page
-------------- ---------- ---------- ---------------------------------------------------
Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman Don 1935-12-23 https://en.wikipedia.org/wiki/Don_Norman
Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman
characters_original@sqlite
last_name first_name birth_date wikipedia_page
----------------- ---------- ---------- --------------------------------------------------------------
Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman Don 1935-12-25 https://en.wikipedia.org/wiki/Don_Norman
Harbison Carnagey Dale 1888-11-24 https://en.wikipedia.org/wiki/Dale_Carnegie
Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman
Pittman Bob 1953-12-28 https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)
Failure Behavior
By default, a diff that detects differences will fail (ie return a bad exit code).
- The --no-fail option changes this behavior.
- You don't need to set it normally but as we monitor our documentation for failure, we don't want to fail
Perform the diff
All diff can be executed with the data diff command by giving as argument the source and target data resource.
tabul data diff \
--no-fail \
characters_original@sqlite characters_new@sqlite
Diff Summary Report
The data resources are not equals.
from to equals record_count change_count
-------------------------- --------------------- ------ ------------ ------------
characters_original@sqlite characters_new@sqlite 5 6
Perform a diff and generate a unified report
In this step, we will asked for more details by asking for a unified report type
tabul data diff \
--no-fail \
--report-type unified \
characters_original@sqlite characters_new@sqlite
Unified Diff Report between the resources (characters_original@sqlite) and (characters_new@sqlite).
- �[31mred�[0m for deleted
- �[32mgreen�[0m for added
- �[34mblue�[0m for type change (example: number as varchar)
The data resources are not equals.
diff_status last_name first_name birth_date wikipedia_page
----------- ----------------- ---------- ---------- --------------------------------------------------------------
= Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+ �[0m�[32mNorman �[0m�[32mDon �[0m�[32m1935-12-23 �[0m�[32mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[32m+ �[0m�[32mKahneman �[0m�[32mDaniel �[0m�[32m1934-03-05 �[0m�[32mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m- �[0m�[31mNorman �[0m�[31mDon �[0m�[31m1935-12-25 �[0m�[31mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[31m- �[0m�[31mHarbison Carnagey �[0m�[31mDale �[0m�[31m1888-11-24 �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
�[31m- �[0m�[31mKahneman �[0m�[31mDaniel �[0m�[31m1934-03-05 �[0m�[31mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m- �[0m�[31mPittman �[0m�[31mBob �[0m�[31m1953-12-28 �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m
We can see that the diff is not line by line as if it was a text file. In the next step, we will tell the diff that we want a diff on the last_name.
Perform a diff with a column driver
To improve our comparison, we will define the unique column of our data resource with the --driver-columns option.
tabul data diff \
--no-fail \
--report-type unified \
--driver-columns "Last Name" \
characters_original@sqlite characters_new@sqlite
Unified Diff Report between the resources (characters_original@sqlite) and (characters_new@sqlite).
- �[31mred�[0m for deleted
- �[32mgreen�[0m for added
- �[34mblue�[0m for type change (example: number as varchar)
The data resources are not equals.
diff_status last_name first_name birth_date wikipedia_page
----------- ----------------- ---------- ---------- --------------------------------------------------------------
= Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+1 �[0mNorman Don �[32m1935-12-23 �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-1 �[0mNorman Don �[31m1935-12-25 �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m- �[0m�[31mHarbison Carnagey �[0m�[31mDale �[0m�[31m1888-11-24 �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
= Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman
�[31m- �[0m�[31mPittman �[0m�[31mBob �[0m�[31m1953-12-28 �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m
This time, the unified report tell us that we need to:
- update the birth date of the Norman record (+1,-1, in the 1 change)
- delete the records
- Harbison Carnagey
- and Pittman
from the original resource to obtain the new resource
Advanced Diff with a pipeline
The diff operation is an intermediate pipeline operation that has a lot of arguments that are not available with the diff command line.
If you want to perform an advanced diff, create a pipeline.