Learning Tabulify - Step 10 - Data Diff, a step by step

Learning Tabulify - Step 10 - Data Diff, a step by step

About

Data Diff is the cornerstone of every development because it validates the processing of data.

Tabulify ships with a Data diff operation that allows you to compare:

The diff operation (comparison) is implemented with the data diff command.

Data to compare

For the sake of simplicity, we will compare two csv data resources but you may compare any content data resources such as a SQL table or SQL query

  • The source data resource (also known as the original, from or left resource)
last_name,first_name,birth_date,wikipedia_page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-25,https://en.wikipedia.org/wiki/Don_Norman
Harbison Carnagey,Dale,1888-11-24,https://en.wikipedia.org/wiki/Dale_Carnegie
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman
Pittman,Bob,1953-12-28,https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)


  • The target data resource (also known as the new, to or right resource)
Last Name,First Name,Birth Date,Wikipedia Page
Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman,Don,1935-12-23,https://en.wikipedia.org/wiki/Don_Norman
Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman


The fact that a record is missing can be easily spotted but there is a more subtle differences.

Failure Behavior

By default, a diff that detects differences will fail (ie return a bad exit code).

  • The --no-fail option changes this behavior.
  • You don't need to set it normally but as we monitor our documentation for failure, we don't want to fail

The Summary report

All diff can be executed with the data diff command by giving as argument the source and target data resource.

tabul data diff \
  --no-fail \
  diff/characters_original.csv@howto diff/characters_new.csv@howto
Diff Summary Report
The data resources are not equals.
from                                 to                              equals   record_count   change_count
----------------------------------   -----------------------------   ------   ------------   ------------
diff/characters_original.csv@howto   diff/characters_new.csv@howto                       5              6

The default report is to show a high level report that lists the source and target resources and the result of the comparison:

  • equals: Does the resources are equals
  • record_count: How many records were read from the from resource.
  • change_count: How many records changes were detected

There is a total of:

  • 5 records compared
  • 6 records changes were detected.

It doesn't seem quite right, let's take a look at it by asking a unified report to see the difference in details.

The Unified report

To get a diff at the record level, you set the report-type option to the unified value.

tabul data diff \
    --no-fail \
    --report-type unified \
    diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
 - �[31mred�[0m   for deleted
 - �[32mgreen�[0m for added
 - �[34mblue�[0m  for type change (example: number as varchar)
The data resources are not equals.
diff_status   last_name           first_name   birth_date   wikipedia_page
-----------   -----------------   ----------   ----------   --------------------------------------------------------------
=             Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+             �[0m�[32mNorman              �[0m�[32mDon          �[0m�[32m1935-12-23   �[0m�[32mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[32m+             �[0m�[32mKahneman            �[0m�[32mDaniel       �[0m�[32m1934-03-05   �[0m�[32mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m-             �[0m�[31mNorman              �[0m�[31mDon          �[0m�[31m1935-12-25   �[0m�[31mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[31m-             �[0m�[31mHarbison Carnagey   �[0m�[31mDale         �[0m�[31m1888-11-24   �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
�[31m-             �[0m�[31mKahneman            �[0m�[31mDaniel       �[0m�[31m1934-03-05   �[0m�[31mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m-             �[0m�[31mPittman             �[0m�[31mBob          �[0m�[31m1953-12-28   �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m

This report will shows you how to get from:

  • the original, from, left, source resource
  • to the new, to, right, target resource

It tells us that we need to:

  • add 2 rows
  • and delete 4 rows

from the original resource to obtain the new resource

It's a total of 6 changes as seen previously in the summary report.

But why does the diff did not compare the Norman record to each other ?

This is because their is no known identifier for the record, therefore the diff was executed by record hash and not by last name.

The next step will show you how to define the unique columns to drive a diff comparison.

The Unified Report with Driver Columns Definition

To improve our comparison, we will define the unique column of our data resource with the --driver-columns option.

tabul data diff \
   --no-fail \
   --report-type unified \
   --driver-columns "Last Name" \
   diff/characters_original.csv@howto diff/characters_new.csv@howto
Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto).
 - �[31mred�[0m   for deleted
 - �[32mgreen�[0m for added
 - �[34mblue�[0m  for type change (example: number as varchar)
The data resources are not equals.
diff_status   last_name           first_name   birth_date   wikipedia_page
-----------   -----------------   ----------   ----------   --------------------------------------------------------------
=             Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+1            �[0mNorman              Don          �[32m1935-12-23   �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-1            �[0mNorman              Don          �[31m1935-12-25   �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-             �[0m�[31mHarbison Carnagey   �[0m�[31mDale         �[0m�[31m1888-11-24   �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
=             Kahneman            Daniel       1934-03-05   https://en.wikipedia.org/wiki/Daniel_Kahneman
�[31m-             �[0m�[31mPittman             �[0m�[31mBob          �[0m�[31m1953-12-28   �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m

This time, the unified report tell us that we need to:

  • update the birth date of the Norman record (+1,-1, in the 1 change)
  • delete the records
    • Harbison Carnagey
    • and Pittman

from the original resource to obtain the new resource

Conclusion

This page ends the learning guide of Tabulify at the command line with Tabul where we have learned to perform data operation one command at a time.




Related Pages
Data Operation - Diff (Comparison)

A diff is a intermediate operation that will perform a diff between: its input and a target resource defined by the target-data-uri argument The output is the diff data uri report. The diff may...
Learning Tabulify - Step 9 - How to fill a data resource with generated data ?

Tabulify integrates natively a data generator. You can generate realistic production data and start working on your project right away. anonymize production data in your development environment because...
Tabul - Data Diff Command (Compare)

data diff is a tabul command that performs a data diff (compare) against one or several data resources. You can diff: content data resource the structure of data resource The data selector...

Task Runner