Data Operation - Diff (Comparison)

Undraw Data Processing

About

A diff is a intermediate operation that will perform a diff between:

The output is the diff data uri report.

The diff may be performed:

Howto / Examples

Cli

The diff operation is also available via the tabul data diff command

Arguments

Argument Default Description
data-origin record The origin of the data to diff
driver-columns The columns name that drive the diff (Should be set to the unique columns normally). Default to
the record id (ie row id, line id) for a data comparison
the column name for a data structure comparison
equality-type loss define the equality strictness. Possible values: loss, strict
fail true If true, the operation fails if the diff shows a change
max-change-count 100 The maximum number of change allowed before failing
report-type summary The type of report
report-data-uri See report-data-uri a template data uri where the diff report should be stored
report-density dense If sparse, show only the changes in a unified report (known as context in textual data diff)
report-diff-columns status List of diff columns added to the unified report
report-diff-column-prefix diff_ A prefix added the diff columns
terminal-colors true if detected color the output in the terminal
target-data-uri a template data uri that will define and select the target resource to compare (if not set, the comparison is done against an empty set)
target-data-def The data attributes defined in a data definition format

Arguments Detailed

Target Data Uri

The template data uri may be not given as argument. If this is the case, the comparison will except an empty resource.

Report Data Uri

diff-data-uri is a template data uri that defines the data uri of the accumulator diff report

By default, the diff result is stored in memory.

diff_${input_logical_name}_${target_logical_name}@memory

Note:

  • the input and target resource attributes should be prefixed respectively with input_ and target_

If you plan to make a diff on millions of records that may saturate the memory, you should select another connection.

Report Type

The report-type argument defines the report generated:

Name Grain Accumulator A record is
summary resource No Yes/No equality report with diff stats
unified record Yes input and target record

An accumulator report is a report that will accumulate the changes. You can set their location with the Diff Data Uri

Data Origin

The data-origin defines the origin of the data to compare.

Name Description
record (default) the records of the data resource
structure the structure (column definitions) of the data resource
attributes the scalar data attributes of the data resource

Driver Columns

The driver columns are the columns that drives the comparison process.

For instance, if you compare two resources that have an unique id column,

  • if the diff has fetched:
    • a input record with a id value of 1
    • a target record with a id value of 2
  • the diff will:
    • mark the input record has missing in the target
    • fetch records only from the input until it finds the record with the id 2 or above.

By default, the driver columns are:

  • the primary key columns if found
  • the first unique key columns if found
  • otherwise all columns.

Note: the equality on driver columns is always loose.

Why? If this not the case, the diff would report a added and deleted record change because the record identifier would not match and it will be difficult to spot it. With a loose equality,

  • the record are the same
  • but the driver columns get a blue color (the color of loose equality)

Equality Type

The equality-type defines the strictness of the equality about the data type

  • If loss
    • the character 2 will be equal to the number 2
    • float equality will also be lose
    • empty string or string with only white space will be equal to null
    • the string yes would be equal to the boolean true
  • If strict, the equality will fail if the data type is not the same

Note: in a strict equality, if the data type of the driver column are not the same, the string version is used to determine the order of the diff.

Terminal-Colors

If true, the following terminal colors escape sequences are applied:

  • green for an added change
  • red for a deleted change
  • blue for a type change (ie loose equality)

Report Type

Unified

An unified report has the following structure (columns):

Status Column

The status column gives a short description of the change.

The format is:

(+|-)n (n)

where:

  • + for added record in the target (ie record found in the target but not in the input)
  • - for deleted records in the target (ie record found in the input but not in the target)
  • n is the change id (added only when the change is a value modification)
  • (n) is the record id if no driver_columns was specified

An empty status means no change.

Example:

value modification without driver_columns,

status  color
------  -----
+1 (2)  blue
-1 (2)  red

where:

  • this is the first change, ie the 1 in +1 and -1,
  • + - the target record
  • - - the input record
  • (2) - 2nd record in the target and input

Diff Columns

With the report-diff-columns, you can add any of the following columns:

Name Description
change_id change identifier
The third change would have the number 3)
The last number on the report would show the total number of change found.
colors A color definition for the print operation
id a report sequence identifier to locate the report records
origin the origin of the record
* Source means that the record comes from the source
* Target means that record comes from the target
* Blank means that the record is the same in the source and in the target
origin_id the record id of the input or target origin (row id, line id, …)
status The type of change

With the report-diff-column-prefix, you can change the prefix added to the name.

Summary

A summary report is a report with one line that contains the following columns:

Name Type Description
from varchar the input data uri (also known as the original, source or left resource)
to varchar the target data uri (also known as the new, target or right resource)
equals boolean the equality result (yes/no)
record_count long the number of record compared
change_count long the number of record changed
A record deletion, addition or update count towards 1 change

In case of multiple input/target, you can aggregate the summary report with the concat operation.




Related HowTo
Undraw Data Processing
How to diff a SQL table ?

This howto show you how to perform diff operation between 2 SQL Tables with the data diff command. In this step, we load the csv resources that we want to compare into sqlite Load the original...
Undraw Data Processing
Learning Tabulify - Step 10 - Data Diff, a step by step

Data Diff is the cornerstone of every development because it validates the processing of data. Tabulify ships with a Data diff operation that allows you to compare: the data content and the data...
Undraw Data Processing
Oracle - How to execute an Anonymous Code Block

In Oracle, an anonymous code block can be created with the DECLARE/BEGIN/END block. The next steps will be using the howto...
Undraw Data Processing
Postgres - Anonymous Code Block

In Postgres, an anonymous code block is executed via the non-standard SQL DO statement. The next steps will be using the postgres howto connection....
Undraw Data Processing
SQL Server - Anonymous Code Block

In , an anonymous code block can be created with the BEGIN/END block. The next steps will be using the howto connection....

Task Runner