A diff is a intermediate operation that will perform a diff between:
The output is the diff data uri report.
The diff may be performed:
The diff operation is also available via the tabul data diff command
| Argument | Default | Description |
|---|---|---|
| data-origin | record | The origin of the data to diff |
| driver-columns | The columns name that drive the diff (Should be set to the unique columns normally). Default to the record id (ie row id, line id) for a data comparison the column name for a data structure comparison |
|
| equality-type | loss | define the equality strictness. Possible values: loss, strict |
| fail | true | If true, the operation fails if the diff shows a change |
| max-change-count | 100 | The maximum number of change allowed before failing |
| report-type | summary | The type of report |
| report-data-uri | See report-data-uri | a template data uri where the diff report should be stored |
| report-density | dense | If sparse, show only the changes in a unified report (known as context in textual data diff) |
| report-diff-columns | status | List of diff columns added to the unified report |
| report-diff-column-prefix | diff_ | A prefix added the diff columns |
| terminal-colors | true if detected | color the output in the terminal |
| target-data-uri | a template data uri that will define and select the target resource to compare (if not set, the comparison is done against an empty set) | |
| target-data-def | The data attributes defined in a data definition format |
The template data uri may be not given as argument. If this is the case, the comparison will except an empty resource.
diff-data-uri is a template data uri that defines the data uri of the accumulator diff report
By default, the diff result is stored in memory.
diff_${input_logical_name}_${target_logical_name}@memory
Note:
If you plan to make a diff on millions of records that may saturate the memory, you should select another connection.
The report-type argument defines the report generated:
| Name | Grain | Accumulator | A record is |
|---|---|---|---|
| summary | resource | No | Yes/No equality report with diff stats |
| unified | record | Yes | input and target record |
An accumulator report is a report that will accumulate the changes. You can set their location with the Diff Data Uri
The data-origin defines the origin of the data to compare.
| Name | Description |
|---|---|
| record (default) | the records of the data resource |
| structure | the structure (column definitions) of the data resource |
| attributes | the scalar data attributes of the data resource |
The driver columns are the columns that drives the comparison process.
For instance, if you compare two resources that have an unique id column,
By default, the driver columns are:
Note: the equality on driver columns is always loose.
Why? If this not the case, the diff would report a added and deleted record change because the record identifier would not match and it will be difficult to spot it. With a loose equality,
The equality-type defines the strictness of the equality about the data type
Note: in a strict equality, if the data type of the driver column are not the same, the string version is used to determine the order of the diff.
If true, the following terminal colors escape sequences are applied:
An unified report has the following structure (columns):
The status column gives a short description of the change.
The format is:
(+|-)n (n)
where:
An empty status means no change.
Example:
value modification without driver_columns,
status color
------ -----
+1 (2) blue
-1 (2) red
where:
With the report-diff-columns, you can add any of the following columns:
| Name | Description |
|---|---|
| change_id | change identifier The third change would have the number 3) The last number on the report would show the total number of change found. |
| colors | A color definition for the print operation |
| id | a report sequence identifier to locate the report records |
| origin | the origin of the record * Source means that the record comes from the source * Target means that record comes from the target * Blank means that the record is the same in the source and in the target |
| origin_id | the record id of the input or target origin (row id, line id, …) |
| status | The type of change |
With the report-diff-column-prefix, you can change the prefix added to the name.
A summary report is a report with one line that contains the following columns:
| Name | Type | Description |
|---|---|---|
| from | varchar | the input data uri (also known as the original, source or left resource) |
| to | varchar | the target data uri (also known as the new, target or right resource) |
| equals | boolean | the equality result (yes/no) |
| record_count | long | the number of record compared |
| change_count | long | the number of record changed A record deletion, addition or update count towards 1 change |
In case of multiple input/target, you can aggregate the summary report with the concat operation.