Data Operation - Diff (Comparison)

About

A diff is a intermediate operation that will perform a diff between:

its input
and a target resource defined by the target-data-uri argument

The output is the diff data uri report.

The diff may be performed:

at the metadata level, meaning
- on the data resource attributes
- on the data structure
or and at the data level:
- logically: record by record
- physically: block by block

Howto / Examples

Cli

The diff operation is also available via the tabul data diff command

Arguments

Argument	Default	Description
data-origin	record	The origin of the data to diff
driver-columns		The columns name that drive the diff (Should be set to the unique columns normally). Default to the record id (ie row id, line id) for a data comparison the column name for a data structure comparison
equality-type	loss	define the equality strictness. Possible values: loss, strict
fail	true	If true, the operation fails if the diff shows a change
max-change-count	100	The maximum number of change allowed before failing
report-type	summary	The type of report
report-data-uri	See report-data-uri	a template data uri where the diff report should be stored
report-density	dense	If sparse, show only the changes in a unified report (known as context in textual data diff)
report-diff-columns	status	List of diff columns added to the unified report
report-diff-column-prefix	diff_	A prefix added the diff columns
terminal-colors	true if detected	color the output in the terminal
target-data-uri		a template data uri that will define and select the target resource to compare (if not set, the comparison is done against an empty set)
target-data-def		The data attributes defined in a data definition format

Arguments Detailed

Target Data Uri

The template data uri may be not given as argument. If this is the case, the comparison will except an empty resource.

Report Data Uri

diff-data-uri is a template data uri that defines the data uri of the accumulator diff report

By default, the diff result is stored in memory.

diff_${input_logical_name}_${target_logical_name}@memory

Note:

the input and target resource attributes should be prefixed respectively with input_ and target_

If you plan to make a diff on millions of records that may saturate the memory, you should select another connection.

Report Type

The report-type argument defines the report generated:

Name	Grain	Accumulator	A record is
summary	resource	No	Yes/No equality report with diff stats
unified	record	Yes	input and target record

An accumulator report is a report that will accumulate the changes. You can set their location with the Diff Data Uri

Data Origin

The data-origin defines the origin of the data to compare.

Name	Description
record (default)	the records of the data resource
structure	the structure (column definitions) of the data resource
attributes	the scalar data attributes of the data resource

Driver Columns

The driver columns are the columns that drives the comparison process.

For instance, if you compare two resources that have an unique id column,

if the diff has fetched:
- a input record with a id value of 1
- a target record with a id value of 2
the diff will:
- mark the input record has missing in the target
- fetch records only from the input until it finds the record with the id 2 or above.

By default, the driver columns are:

the primary key columns if found
the first unique key columns if found
otherwise all columns.

Note: the equality on driver columns is always loose.

Why? If this not the case, the diff would report a added and deleted record change because the record identifier would not match and it will be difficult to spot it. With a loose equality,

the record are the same
but the driver columns get a blue color (the color of loose equality)

Equality Type

The equality-type defines the strictness of the equality about the data type

If loss
- the character 2 will be equal to the number 2
- float equality will also be lose
- empty string or string with only white space will be equal to null
- the string yes would be equal to the boolean true
If strict, the equality will fail if the data type is not the same

Note: in a strict equality, if the data type of the driver column are not the same, the string version is used to determine the order of the diff.

Terminal-Colors

If true, the following terminal colors escape sequences are applied:

green for an added change
red for a deleted change
blue for a type change (ie loose equality)

Report Type

Unified

An unified report has the following structure (columns):

at minima the status diff column and optionally other diff_columns
and the columns of the input/target

Status Column

The status column gives a short description of the change.

The format is:

(+|-)n (n)

where:

+ for added record in the target (ie record found in the target but not in the input)
- for deleted records in the target (ie record found in the input but not in the target)
n is the change id (added only when the change is a value modification)
(n) is the record id if no driver_columns was specified

An empty status means no change.

Example:

value modification without driver_columns,

status  color
------  -----
+1 (2)  blue
-1 (2)  red

where:

this is the first change, ie the 1 in +1 and -1,
+ - the target record
- - the input record
(2) - 2nd record in the target and input

Diff Columns

With the report-diff-columns, you can add any of the following columns:

Name	Description
change_id	change identifier The third change would have the number 3) The last number on the report would show the total number of change found.
colors	A color definition for the print operation
id	a report sequence identifier to locate the report records
origin	the origin of the record * Source means that the record comes from the source * Target means that record comes from the target * Blank means that the record is the same in the source and in the target
origin_id	the record id of the input or target origin (row id, line id, …)
status	The type of change

With the report-diff-column-prefix, you can change the prefix added to the name.

Summary

A summary report is a report with one line that contains the following columns:

Name	Type	Description
from	varchar	the input data uri (also known as the original, source or left resource)
to	varchar	the target data uri (also known as the new, target or right resource)
equals	boolean	the equality result (yes/no)
record_count	long	the number of record compared
change_count	long	the number of record changed A record deletion, addition or update count towards 1 change

In case of multiple input/target, you can aggregate the summary report with the concat operation.