Learning Tabulify - Step 10 - Data Diff, a step by step
About
Data Diff is the cornerstone of every development because it validates the processing of data.
Tabulify ships with a Data diff operation that allows you to compare:
- the data content
- and the data structure
The diff operation (comparison) is implemented with the data diff command.
Data to compare
For the sake of simplicity, we will compare two csv data resources but you may compare any content data resources such as a SQL table or SQL query
- The source data resource (also known as the original, from or left resource)
last_name,first_name,birth_date,wikipedia_page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-25,https://en.wikipedia.org/wiki/Don_Norman
Harbison Carnagey,Dale,1888-11-24,https://en.wikipedia.org/wiki/Dale_Carnegie
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman
Pittman,Bob,1953-12-28,https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)
- The target data resource (also known as the new, to or right resource)
Last Name,First Name,Birth Date,Wikipedia Page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-23,https://en.wikipedia.org/wiki/Don_Norman
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman
The fact that a record is missing can be easily spotted but there is a more subtle differences.
Failure Behavior
By default, a diff that detects differences will fail (ie return a bad exit code).
- The --no-fail option changes this behavior.
- You don't need to set it normally but as we monitor our documentation for failure, we don't want to fail
The Summary report
All diff can be executed with the data diff command by giving as argument the source and target data resource.
tabul data diff \
--no-fail \
diff/characters_original.csv@howto diff/characters_new.csv@howto
Diff Summary Report
The data resources are not equals.
from to equals record_count change_count
---------------------------------- ----------------------------- ------ ------------ ------------
diff/characters_original.csv@howto diff/characters_new.csv@howto 5 6
The default report is to show a high level report that lists the source and target resources and the result of the comparison:
- equals: Does the resources are equals
- record_count: How many records were read from the from resource.
- change_count: How many records changes were detected
There is a total of:
- 5 records compared
- 6 records changes were detected.
It doesn't seem quite right, let's take a look at it by asking a unified report to see the difference in details.
The Unified report
To get a diff at the record level, you set the report-type option to the unified value.
tabul data diff \
--no-fail \
--report-type unified \
diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
- �[31mred�[0m for deleted
- �[32mgreen�[0m for added
- �[34mblue�[0m for type change (example: number as varchar)
The data resources are not equals.
diff_status last_name first_name birth_date wikipedia_page
----------- ----------------- ---------- ---------- --------------------------------------------------------------
= Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+ �[0m�[32mNorman �[0m�[32mDon �[0m�[32m1935-12-23 �[0m�[32mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[32m+ �[0m�[32mKahneman �[0m�[32mDaniel �[0m�[32m1934-03-05 �[0m�[32mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m- �[0m�[31mNorman �[0m�[31mDon �[0m�[31m1935-12-25 �[0m�[31mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[31m- �[0m�[31mHarbison Carnagey �[0m�[31mDale �[0m�[31m1888-11-24 �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
�[31m- �[0m�[31mKahneman �[0m�[31mDaniel �[0m�[31m1934-03-05 �[0m�[31mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m- �[0m�[31mPittman �[0m�[31mBob �[0m�[31m1953-12-28 �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m
This report will shows you how to get from:
- the original, from, left, source resource
- to the new, to, right, target resource
It tells us that we need to:
- add 2 rows
- and delete 4 rows
from the original resource to obtain the new resource
It's a total of 6 changes as seen previously in the summary report.
But why does the diff did not compare the Norman record to each other ?
This is because their is no known identifier for the record, therefore the diff was executed by record hash and not by last name.
The next step will show you how to define the unique columns to drive a diff comparison.
The Unified Report with Driver Columns Definition
To improve our comparison, we will define the unique column of our data resource with the --driver-columns option.
tabul data diff \
--no-fail \
--report-type unified \
--driver-columns "Last Name" \
diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
- �[31mred�[0m for deleted
- �[32mgreen�[0m for added
- �[34mblue�[0m for type change (example: number as varchar)
The data resources are not equals.
diff_status last_name first_name birth_date wikipedia_page
----------- ----------------- ---------- ---------- --------------------------------------------------------------
= Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+1 �[0mNorman Don �[32m1935-12-23 �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-1 �[0mNorman Don �[31m1935-12-25 �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m- �[0m�[31mHarbison Carnagey �[0m�[31mDale �[0m�[31m1888-11-24 �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
= Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman
�[31m- �[0m�[31mPittman �[0m�[31mBob �[0m�[31m1953-12-28 �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m
This time, the unified report tell us that we need to:
- update the birth date of the Norman record (+1,-1, in the 1 change)
- delete the records
- Harbison Carnagey
- and Pittman
from the original resource to obtain the new resource
Conclusion
This page ends the learning guide of Tabulify at the command line with Tabul where we have learned to perform data operation one command at a time.