About
A diff is a intermediate operation that will perform a diff between:
- its input
- and a target resource defined by the target-data-uri argument
The output is the diff data uri report.
The diff may be performed:
- at the metadata level, meaning
- on the data resource attributes
- on the data structure
- or and at the data level:
- logically: record by record
- physically: block by block
Howto / Examples
Cli
The diff operation is also available via the tabul data diff command
Arguments
| Argument | Default | Description |
|---|---|---|
| data-origin | record | The origin of the data to diff |
| driver-columns | The columns name that drive the diff (Should be set to the unique columns normally). Default to the record id (ie row id, line id) for a data comparison the column name for a data structure comparison |
|
| equality-type | loss | define the equality strictness. Possible values: loss, strict |
| fail | true | If true, the operation fails if the diff shows a change |
| max-change-count | 100 | The maximum number of change allowed before failing |
| report-type | summary | The type of report |
| report-data-uri | See report-data-uri | a template data uri where the diff report should be stored |
| report-density | dense | If sparse, show only the changes in a unified report (known as context in textual data diff) |
| report-diff-columns | status | List of diff columns added to the unified report |
| report-diff-column-prefix | diff_ | A prefix added the diff columns |
| terminal-colors | true if detected | color the output in the terminal |
| target-data-uri | a template data uri that will define and select the target resource to compare (if not set, the comparison is done against an empty set) | |
| target-data-def | The data attributes defined in a data definition format |
Arguments Detailed
Target Data Uri
The template data uri may be not given as argument. If this is the case, the comparison will except an empty resource.
Report Data Uri
diff-data-uri is a template data uri that defines the data uri of the accumulator diff report
By default, the diff result is stored in memory.
diff_${input_logical_name}_${target_logical_name}@memory
Note:
- the input and target resource attributes should be prefixed respectively with input_ and target_
If you plan to make a diff on millions of records that may saturate the memory, you should select another connection.
Report Type
The report-type argument defines the report generated:
| Name | Grain | Accumulator | A record is |
|---|---|---|---|
| summary | resource | No | Yes/No equality report with diff stats |
| unified | record | Yes | input and target record |
An accumulator report is a report that will accumulate the changes. You can set their location with the Diff Data Uri
Data Origin
The data-origin defines the origin of the data to compare.
| Name | Description |
|---|---|
| record (default) | the records of the data resource |
| structure | the structure (column definitions) of the data resource |
| attributes | the scalar data attributes of the data resource |
Driver Columns
The driver columns are the columns that drives the comparison process.
For instance, if you compare two resources that have an unique id column,
- if the diff has fetched:
- a input record with a id value of 1
- a target record with a id value of 2
- the diff will:
- mark the input record has missing in the target
- fetch records only from the input until it finds the record with the id 2 or above.
By default, the driver columns are:
- the primary key columns if found
- the first unique key columns if found
- otherwise all columns.
Note: the equality on driver columns is always loose.
Why? If this not the case, the diff would report a added and deleted record change because the record identifier would not match and it will be difficult to spot it. With a loose equality,
- the record are the same
- but the driver columns get a blue color (the color of loose equality)
Equality Type
The equality-type defines the strictness of the equality about the data type
- If loss
- the character 2 will be equal to the number 2
- float equality will also be lose
- empty string or string with only white space will be equal to null
- the string yes would be equal to the boolean true
- If strict, the equality will fail if the data type is not the same
Note: in a strict equality, if the data type of the driver column are not the same, the string version is used to determine the order of the diff.
Terminal-Colors
If true, the following terminal colors escape sequences are applied:
- green for an added change
- red for a deleted change
- blue for a type change (ie loose equality)
Report Type
Unified
An unified report has the following structure (columns):
- at minima the status diff column and optionally other diff_columns
- and the columns of the input/target
Status Column
The status column gives a short description of the change.
The format is:
(+|-)n (n)
where:
- + for added record in the target (ie record found in the target but not in the input)
- - for deleted records in the target (ie record found in the input but not in the target)
- n is the change id (added only when the change is a value modification)
- (n) is the record id if no driver_columns was specified
An empty status means no change.
Example:
value modification without driver_columns,
status color
------ -----
+1 (2) blue
-1 (2) red
where:
- this is the first change, ie the 1 in +1 and -1,
- + - the target record
- - - the input record
- (2) - 2nd record in the target and input
Diff Columns
With the report-diff-columns, you can add any of the following columns:
| Name | Description |
|---|---|
| change_id | change identifier The third change would have the number 3) The last number on the report would show the total number of change found. |
| colors | A color definition for the print operation |
| id | a report sequence identifier to locate the report records |
| origin | the origin of the record * Source means that the record comes from the source * Target means that record comes from the target * Blank means that the record is the same in the source and in the target |
| origin_id | the record id of the input or target origin (row id, line id, …) |
| status | The type of change |
With the report-diff-column-prefix, you can change the prefix added to the name.
Summary
A summary report is a report with one line that contains the following columns:
| Name | Type | Description |
|---|---|---|
| from | varchar | the input data uri (also known as the original, source or left resource) |
| to | varchar | the target data uri (also known as the new, target or right resource) |
| equals | boolean | the equality result (yes/no) |
| record_count | long | the number of record compared |
| change_count | long | the number of record changed A record deletion, addition or update count towards 1 change |
In case of multiple input/target, you can aggregate the summary report with the concat operation.