About
transfer is an intermediate operation that executes the following data resource operations:
- copy (Default)
In general, if you need to move or process data, you will transfer it.
The data transfer action permits you to:
- copy a local file to an SFTP server
- and more
Example
In a Pipeline
```yaml
pipeline:
  # ...
  - name: "Store"
    operation: transfer
    args:
      target-data-uri: '${logicalName}.csv@tmp'
      transfer-operation: insert
      target-operation: replace
  ...
```
CLI
The transfer operation is performed by the tabul data transfer command.
Arguments
| Name | Default | Description |
|---|---|---|
| target-data-uri | - | A template data uri (Mandatory) |
| target-name-sanitization | true | If true, the target name is sanitized to comply with the naming convention of the target system. Example: for sql, the name foo-bar would become foo_bar |
| target-operation | - | A resource operation that will happen on the target before the transfer |
| source-operation | - | A resource operation that will happen on the source after the transfer |
| Pipeline Properties | | |
| output-type | targets | The output: targets (the target resources), inputs (the input resources are passed) or results (the results of the transfer step are passed) |
| processing-type | batch if the target is not a pattern, stream otherwise | The type of processing |
| Cross Transfer Properties | | |
| source-fetch-size | 10000 | The number of records fetched from the source at once (ie the network message size from the source system) |
| buffer-size | 2 x target-worker x source-fetch-size | The maximum number of records stored in transit between the source and the target |
| target-batch-size | 1000 | The number of records sent to the target system at once in a batch (ie the network message size to the target system) |
| target-commit-frequency | 2^31 (infinite) | The number of batch sends that will trigger a commit to the target system |
| target-worker | 1 | The number of threads that send batches to the target system |
| with-parameters | true | If the target system is a sql database, SQL bind variables will be used |
| metrics-data-uri | - | A data uri where the transfer metrics will be saved |
| Transfer Properties | | |
| transfer-operation | copy for a file system, insert otherwise | A transfer operation (copy, insert, upsert, update, …) |
| transfer-upsert-type | merge | The transfer upsert type (merge, insert-update, update-insert, …) |
| transfer-mapping-method | position if the target has no structure, name otherwise | How the source and target columns are mapped (name, position, map-by-name or map-by-position) |
| transfer-mapping-columns | - | The map of source to target columns when the mapping method is map-by-name or map-by-position |
| transfer-mapping-strict | true | If true, the transfer will fail if a source column cannot be mapped to a target column with the mapping method name or position |
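The sketch below combines a few of these arguments in a single pipeline step, following the example above. It is only an illustration: the target data uri (sales@dw), the connection name and the tuning values are assumptions, not recommendations.

```yaml
pipeline:
  # ...
  - name: "Load"
    operation: transfer
    args:
      # hypothetical target on a connection named 'dw'
      target-data-uri: 'sales@dw'
      transfer-operation: insert
      # illustrative cross-transfer tuning values
      source-fetch-size: 20000
      target-batch-size: 5000
      target-worker: 4
```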
Transfer Operation
The transfer operation supports the following values for the transfer-operation argument.
Basic Operations
Alias Operations
We also support alias operations. They are operations derived from the basic ones.
| Name | Alias |
|---|---|
| concat | alias of insert with only one target (ie a target-data-uri without target pattern) |
| move | alias of copy with source drop |
| rename | alias of move on the same connection (Metadata Operation) |
| replace | alias of copy with target drop |
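For instance, a move (a copy followed by a source drop) could be declared as in the sketch below. The archive connection name is an assumption, and the source resources are assumed to be passed in by a previous pipeline step.

```yaml
pipeline:
  # ... a previous step selects the files to move ...
  - name: "Archive"
    operation: transfer
    args:
      # hypothetical archive connection
      target-data-uri: '${logicalName}.csv@archive'
      # move = copy, then drop the source
      transfer-operation: move
```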
Default Transfer Operation
The default transfer operation is system dependent:
- copy is the default for a file system
- insert is the default for a database system (if the target does not exist, it is created by default)
Transfer mapping method
The transfer-mapping-method argument defines how the source columns are mapped to the target columns.
There are 4 values:
- name, the column name of the source should match a column name of the target
  - by default, if a column of the source is not found in the target, the execution will stop
  - if you want a loose mapping without any error, you can set the transfer-mapping-strict attribute to false. The source column will then not be added to the transfer.
- position, the column position of the source will match the column position of the target
  - by default, if the number of columns in the source does not match the number of columns in the target, the execution will stop
  - if you want a loose mapping without any error, you can set the transfer-mapping-strict attribute to false.
- map-by-name, the argument transfer-mapping-columns should contain a map of column names (see the sketch after this list) where:
  - the keys are the source column names
  - the values are the target column names
- map-by-position, the argument transfer-mapping-columns should contain a map of column positions where:
  - the keys are the source column positions
  - the values are the target column positions
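As a sketch, a map-by-name mapping could look like the block below. The target data uri and the column names are placeholders, and expressing transfer-mapping-columns as a YAML map is an assumption based on the description above.

```yaml
pipeline:
  # ...
  - name: "Load"
    operation: transfer
    args:
      # hypothetical sql target
      target-data-uri: 'customer@dw'
      transfer-operation: insert
      transfer-mapping-method: map-by-name
      # source column name -> target column name (illustrative)
      transfer-mapping-columns:
        cust_id: customer_id
        cust_name: customer_name
```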
Resource Operation
The following values may be used for the target-operation and source-operation arguments.
Transfer Upsert Type
We support the upsert operation via the following methods:
| Type | Batch Mode | Definition |
|---|---|---|
| merge | Yes | The merge is performed as a single statement (generally a merge statement but, if not supported, it may be an insert on conflict) |
| insert | Yes | When the target has no unique constraint, an insert is performed |
| insert-update | No | An insert is executed, on error an update is performed |
| update-insert | No | An update is executed, on error an insert is performed |
Note that a batch mode cannot be applied to an insert-update or update-insert because we need to get the result of the first operation (update or insert) to apply the second conditionally.
If the merge statement is not implemented by the database or the driver, we switch to:
- insert-update if the target resource is empty
- update-insert if the target resource is not empty
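A minimal sketch of an upsert step that forces the insert-update method, assuming a hypothetical target table product@dw:

```yaml
pipeline:
  # ...
  - name: "Sync"
    operation: transfer
    args:
      # hypothetical target with a unique constraint
      target-data-uri: 'product@dw'
      transfer-operation: upsert
      # skip the merge statement and use insert, then update on error
      transfer-upsert-type: insert-update
```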
Target Batch Size
The target-batch-size defines the size of a batch.
Batch mode
If the target system supports batch update, we don't send each SQL statement one by one but grouped in a unit called a batch.
This mode is supported only when:
- with-parameters is set to true
- the transfer operation is performed in one statement (ie not with the insert-update or update-insert upsert types)
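For example, a batch-friendly configuration could look like the sketch below; the target data uri and the numbers are illustrative assumptions.

```yaml
pipeline:
  # ...
  - name: "Load"
    operation: transfer
    args:
      # hypothetical sql target
      target-data-uri: 'orders@dw'
      transfer-operation: insert
      # bind variables are required for batch mode
      with-parameters: true
      # 1000 records per batch, commit every 10 batches (illustrative)
      target-batch-size: 1000
      target-commit-frequency: 10
```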
Reporting
Results
If you set the output-type argument to results, a result data path will be returned (in place of the targets) and it will contain the following columns:
Metrics
If the argument metrics-data-uri is given, a metrics data resource will be created with the following columns:
- run - the run id
- timestamp - the metrics timestamp
- metric - the name of the metric
- value - the value of the metric
- worker - the name of the worker (thread)
The following metrics will be recorded:
- BufferSize: the size of the memory buffer between the source and the target
- BufferMaxSize: the maximum size of the buffer
- BufferRatio: the ratio of BufferSize to BufferMaxSize
- Commits: the number of commits
- Records: the number of records
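As a sketch, the metrics can be captured by pointing metrics-data-uri at a data resource; the uri below is a placeholder.

```yaml
pipeline:
  # ...
  - name: "Store"
    operation: transfer
    args:
      target-data-uri: '${logicalName}.csv@tmp'
      # hypothetical resource receiving the run, timestamp, metric, value and worker columns
      metrics-data-uri: 'transfer_metrics@tmp'
```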
Cross Transfer
A transfer can happen:
- on the same connection
- or between two different connections (also called a cross-transfer)
Tabulify Optimization: If two connections share the same credentials and the same system URL, the transfer will be optimized and considered local. This way, the transfer is applied only on the system metadata and it happens faster because the data is not moved.
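As an illustrative sketch, a cross-transfer step differs only in that the target data uri points to another connection. The connection name dw is an assumption, and the source resources are assumed to be passed in from a previous pipeline step on the other connection.

```yaml
pipeline:
  # ... a previous step selects resources on the source connection ...
  - name: "Cross Load"
    operation: transfer
    args:
      # hypothetical target connection
      target-data-uri: '${logicalName}@dw'
      transfer-operation: insert
```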