How to diff a SQL table ?

About

This howto show you how to perform diff operation between 2 SQL Tables with the data diff command.

Steps

Load Data

In this step, we load the csv resources that we want to compare into sqlite

  • Load the original data resource (also known as the source, old, from or left resource)
tabul data replace diff/characters_original.csv@howto characters_original@sqlite
Transfer results
input                                target                       latency   record_count   error_code   error_message
----------------------------------   --------------------------   -------   ------------   ----------   -------------
diff/characters_original.csv@howto   characters_original@sqlite   0.50s                5

  • The new data resource (also known as the target, right, to resource)
tabul data replace diff/characters_new.csv@howto characters_new@sqlite
Transfer results
input                           target                  latency   record_count   error_code   error_message
-----------------------------   ---------------------   -------   ------------   ----------   -------------
diff/characters_new.csv@howto   characters_new@sqlite   0.53s                3

The data

The fact that records are missing can be easily spotted but there are more subtle differences.

tabul data print \
    --strict-selection \
    characters_new@sqlite \
    characters_original@sqlite
characters_new@sqlite
Last Name        First Name   Birth Date   Wikipedia Page
--------------   ----------   ----------   ---------------------------------------------------
Schwarzenegger   Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman           Don          1935-12-23   https://en.wikipedia.org/wiki/Don_Norman
Kahneman         Daniel       1934-03-05   https://en.wikipedia.org/wiki/Daniel_Kahneman



characters_original@sqlite
last_name           first_name   birth_date   wikipedia_page
-----------------   ----------   ----------   --------------------------------------------------------------
Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
Norman              Don          1935-12-25   https://en.wikipedia.org/wiki/Don_Norman
Harbison Carnagey   Dale         1888-11-24   https://en.wikipedia.org/wiki/Dale_Carnegie
Kahneman            Daniel       1934-03-05   https://en.wikipedia.org/wiki/Daniel_Kahneman
Pittman             Bob          1953-12-28   https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)

Failure Behavior

By default, a diff that detects differences will fail (ie return a bad exit code).

  • The --no-fail option changes this behavior.
  • You don't need to set it normally but as we monitor our documentation for failure, we don't want to fail

Perform the diff

All diff can be executed with the data diff command by giving as argument the source and target data resource.

tabul data diff \
  --no-fail \
  characters_original@sqlite characters_new@sqlite
Diff Summary Report
The data resources are not equals.
from                         to                      equals   record_count   change_count
--------------------------   ---------------------   ------   ------------   ------------
characters_original@sqlite   characters_new@sqlite                       5              6

Perform a diff and generate a unified report

In this step, we will asked for more details by asking for a unified report type

tabul data diff \
    --no-fail \
    --report-type unified \
    characters_original@sqlite characters_new@sqlite
Unified Diff Report between the resources (characters_original@sqlite) and (characters_new@sqlite).
 - �[31mred�[0m   for deleted
 - �[32mgreen�[0m for added
 - �[34mblue�[0m  for type change (example: number as varchar)
The data resources are not equals.
diff_status   last_name           first_name   birth_date   wikipedia_page
-----------   -----------------   ----------   ----------   --------------------------------------------------------------
=             Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+             �[0m�[32mNorman              �[0m�[32mDon          �[0m�[32m1935-12-23   �[0m�[32mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[32m+             �[0m�[32mKahneman            �[0m�[32mDaniel       �[0m�[32m1934-03-05   �[0m�[32mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m-             �[0m�[31mNorman              �[0m�[31mDon          �[0m�[31m1935-12-25   �[0m�[31mhttps://en.wikipedia.org/wiki/Don_Norman�[0m
�[31m-             �[0m�[31mHarbison Carnagey   �[0m�[31mDale         �[0m�[31m1888-11-24   �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
�[31m-             �[0m�[31mKahneman            �[0m�[31mDaniel       �[0m�[31m1934-03-05   �[0m�[31mhttps://en.wikipedia.org/wiki/Daniel_Kahneman�[0m
�[31m-             �[0m�[31mPittman             �[0m�[31mBob          �[0m�[31m1953-12-28   �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m

We can see that the diff is not line by line as if it was a text file. In the next step, we will tell the diff that we want a diff on the last_name.

Perform a diff with a column driver

To improve our comparison, we will define the unique column of our data resource with the --driver-columns option.

tabul data diff \
   --no-fail \
   --report-type unified \
   --driver-columns "Last Name" \
   characters_original@sqlite characters_new@sqlite
Unified Diff Report between the resources (characters_original@sqlite) and (characters_new@sqlite).
 - �[31mred�[0m   for deleted
 - �[32mgreen�[0m for added
 - �[34mblue�[0m  for type change (example: number as varchar)
The data resources are not equals.
diff_status   last_name           first_name   birth_date   wikipedia_page
-----------   -----------------   ----------   ----------   --------------------------------------------------------------
=             Schwarzenegger      Arnold       1947-07-30   https://en.wikipedia.org/wiki/Arnold_Schwarzenegger
�[32m+1            �[0mNorman              Don          �[32m1935-12-23   �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-1            �[0mNorman              Don          �[31m1935-12-25   �[0mhttps://en.wikipedia.org/wiki/Don_Norman
�[31m-             �[0m�[31mHarbison Carnagey   �[0m�[31mDale         �[0m�[31m1888-11-24   �[0m�[31mhttps://en.wikipedia.org/wiki/Dale_Carnegie�[0m
=             Kahneman            Daniel       1934-03-05   https://en.wikipedia.org/wiki/Daniel_Kahneman
�[31m-             �[0m�[31mPittman             �[0m�[31mBob          �[0m�[31m1953-12-28   �[0m�[31mhttps://en.wikipedia.org/wiki/Robert_Pittman_(media_executive)�[0m

This time, the unified report tell us that we need to:

  • update the birth date of the Norman record (+1,-1, in the 1 change)
  • delete the records
    • Harbison Carnagey
    • and Pittman

from the original resource to obtain the new resource

Advanced Diff with a pipeline

The diff operation is an intermediate pipeline operation that has a lot of arguments that are not available with the diff command line.

If you want to perform an advanced diff, create a pipeline.




Related Pages
Data Operation - Diff (Comparison)

A diff is a intermediate operation that will perform a diff between: its input and a target resource defined by the target-data-uri argument The output is the diff data uri report. The diff may...

Task Runner