---json
{
"aliases": [
{ "path": ":docs:op:compare" }
],
"page_id": "xh2mvotwt2ky7k4pjza6x"
}
---
====== Data Operation - Diff (Comparison) ======
===== About =====
A ''diff'' is a [[:docs:flow:intermediate|intermediate operation]] that will perform a diff between:
* its [[:docs:flow:input|input]]
* and a target resource defined by the [[#target data uri|target-data-uri]] argument
The [[:docs:flow:output|output]] is the [[#diff-target-uri|diff data uri report]].
The ''diff'' may be performed:
* at the [[docs:resource:metadata|metadata level]], meaning
* on the [[docs:resource:attribute|data resource attributes]]
* on the [[docs:resource:structure|data structure]]
* or and at the data level:
* logically: record by record
* physically: block by block
===== Howto / Examples =====
* [[howto:getting_started:10_resource_comparison]]
* [[:howto:sql_table:diff|]]
===== Cli =====
The ''diff'' operation is also available via the [[:docs:tabul:data:diff|tabul data diff]] command
===== Arguments =====
^ Argument ^ Default ^ Description ^
| [[#data origin|data-origin]] | ''record'' | The origin of the data to diff |
| [[#driver columns|driver-columns]] | | The columns name that drive the diff (Should be set to the unique columns normally). Default to \\ the record id (ie row id, line id) for a data comparison \\ the column name for a data structure comparison |
| [[#equality type|equality-type]] | ''loss'' | define the [[#equality type|equality strictness]]. Possible values: ''loss'', ''strict'' |
| ''fail'' | true | If true, the operation fails if the diff shows a change |
| ''max-change-count'' | 100 | The maximum number of change allowed before failing |
| [[#report type|report-type]] | ''summary'' | The type of report |
| [[#report-data-uri]] | See [[#report data uri|report-data-uri]] | a [[docs:flow:template_data_uri|template data uri]] where the diff report should be stored |
| ''report-density'' | ''dense'' | If sparse, show only the changes in a unified report (known as context in textual data diff) |
| ''report-diff-columns'' | [[#status]] | List of [[#diff columns|diff columns added to the unified report]] |
| ''report-diff-column-prefix'' | ''diff_'' | A prefix added the diff columns |
| [[#terminal-colors]] | ''true'' if detected | color the output in the terminal |
| [[#target-data-uri|target-data-uri]] | | a [[docs:flow:template_data_uri|template data uri]] that will define and select the target resource to compare (if not set, the comparison is done against an empty set) |
| ''target-data-def'' | | The [[docs:resource:attribute|data attributes]] defined in a [[:docs:resource:data-definition|data definition]] format |
===== Arguments Detailed =====
==== Target Data Uri ====
The [[docs:flow:template_data_uri|template data uri]] may be not given as argument. If this is the case, the comparison will except an empty resource.
==== Report Data Uri ====
''diff-data-uri'' is a [[:docs:flow:template_data_uri|template data uri]] that defines the [[:docs:resource:data_uri|data uri]] of the accumulator [[#report type|diff report]]
By default, the diff result is stored in [[:docs:connection:memory|memory]].
diff_${input_logical_name}_${target_logical_name}@memory
Note:
* the ''input'' and ''target'' [[:docs:resource:attribute|resource attributes]] should be prefixed respectively with ''input_'' and ''target_''
If you plan to make a diff on millions of [[:docs:resource:record|records]] that may saturate the memory, you should select another [[docs:connection:connection|connection]].
==== Report Type ====
The ''report-type'' argument defines the report generated:
^ Name ^ Grain ^ Accumulator ^ A record is ^
| ''summary'' | ''resource'' | No | Yes/No equality report with diff stats |
| [[#unified]] | ''record'' | Yes | input and target [[:docs:resource:record|record]] |
An accumulator report is a report that will accumulate the changes.
You can set their location with the [[#Diff Data Uri|Diff Data Uri]]
==== Data Origin ====
The ''data-origin'' defines the origin of the data to compare.
^ Name ^ Description ^
| ''record'' (default) | the [[:docs:resource:record|records]] of the data resource |
| ''structure'' | the [[:docs:resource:structure|structure (column definitions)]] of the data resource |
| ''attributes'' | the [[:docs:resource:attribute|scalar data attributes]] of the data resource |
==== Driver Columns ====
The ''driver columns'' are the columns that drives the comparison process.
For instance, if you compare two resources that have an unique id column,
* if the diff has fetched:
* a input record with a id value of ''1''
* a target record with a id value of ''2''
* the diff will:
* mark the input record has missing in the target
* fetch records only from the input until it finds the record with the id ''2'' or above.
By default, the ''driver columns'' are:
* the ''primary key'' columns if found
* the ''first unique key'' columns if found
* otherwise all [[:docs:resource:column|columns]].
Note: the equality on ''driver columns'' is always [[#equality-type|loose]].
Why? If this not the case, the diff would report a ''added'' and ''deleted'' record change because the record identifier would not match and it will be difficult to spot it. With a [[#equality-type|loose]] equality,
* the record are the same
* but the ''driver columns'' get a blue color (the [[#colors|color]] of [[#equality type|loose equality]])
==== Equality Type ====
The ''equality-type'' defines the strictness of the equality about the [[docs:data_type:data_type|data type]]
* If ''loss''
* the character 2 will be equal to the number 2
* float equality will also be lose
* empty string or string with only white space will be equal to ''null''
* the string ''yes'' would be equal to the [[:docs:data_type:boolean|boolean]] ''true''
* If ''strict'', the equality will fail if the [[docs:data_type:data_type|data type]] is not the same
Note: in a ''strict'' equality, if the data type of the driver column are not the same, the string version is used to determine the order of the diff.
==== Terminal-Colors ====
If ''true'', the following terminal colors [[https://en.wikipedia.org/wiki/ANSI_escape_code|escape sequences]] are applied:
* ''green'' for an ''added'' change
* ''red'' for a ''deleted'' change
* ''blue'' for a [[docs:data_type:data_type|type]] change (ie [[#equality type|loose equality]])
===== Report Type =====
==== Unified ====
An unified report has the following [[:docs:resource:structure|structure (columns)]]:
* at minima the [[#status column|status diff column]] and optionally other [[#diff columns]]
* and the columns of the input/target
=== Status Column ===
The ''status'' column gives a short description of the change.
The format is:
(+|-)n (n)
where:
* ''+'' for added record in the target (ie record found in the target but not in the input)
* ''-'' for deleted records in the target (ie record found in the input but not in the target)
* ''n'' is the change id (added only when the change is a value modification)
* ''(n)'' is the record id if no [[#driver columns]] was specified
An empty status means no change.
Example:
**value modification** without [[#driver columns]],
```
status color
------ -----
+1 (2) blue
-1 (2) red
```
where:
* this is the first change, ie the ''1'' in ''+1'' and ''-1'',
* ''+'' - the target record
* ''-'' - the input record
* ''(2)'' - 2nd record in the target and input
=== Diff Columns ===
With the `report-diff-columns`, you can add any of the following columns:
^ Name ^ Description ^
| ''change_id'' | change identifier \\ The third change would have the number 3) \\ The last number on the report would show the total number of change found. |
| ''colors'' | A [[:docs:op:print#color|color definition]] for the [[:docs:op:print|print operation]] |
| ''id'' | a report sequence identifier to locate the report records |
| ''origin'' | the origin of the record \\ * ''Source'' means that the record comes from the [[docs:flow:source|source]] \\ * ''Target'' means that record comes from the [[docs:flow:target|target]] \\ * Blank means that the record is the same in the [[docs:flow:source|source]] and in the [[docs:flow:target|target]] |
| ''origin_id'' | the record id of the input or target origin (row id, line id, ...) |
| [[#status column|status]] | The type of change |
With the `report-diff-column-prefix`, you can change the prefix added to the name.
==== Summary ====
A summary report is a report with one line that contains the following columns:
^ Name ^ Type | Description ^
| ''from'' | varchar | the input [[:docs:resource:data_uri|data uri]] (also known as the ''original'', ''source'' or ''left'' resource) |
| ''to'' | varchar | the target [[:docs:resource:data_uri|data uri]] (also known as the ''new'', ''target'' or ''right'' resource) |
| ''equals'' | boolean | the equality result (yes/no) |
| ''record_count'' | long | the number of record compared |
| ''change_count'' | long | the number of record changed \\ A record deletion, addition or update count towards 1 change |
In case of multiple input/target, you can aggregate the summary report with the [[:docs:op:concat|concat operation]].