howto:getting_started:9_data

Data Generation

Tabulify integrates natively a data generator.

You can generate realistic production data and start working on your project right away.

Because the data is fake but realistic, you don't need to:

anonymize production data in your development environment because of the privacy laws
or create any acceptance environment.

The fill command

The data fill operation is an operation that will select target data resource and fill them with data.

Tabulify supports two mode:

auto - the data generated is automatically chosen
generator - the data generated is defined in a file called the generator

The fill operation is supported by the data fill command.

Auto Fill

Let's first delete all data with the data truncate command to get a clean schema.

tabul data truncate *@sqlite

The below fill command will fill all tables with auto-generated data

tabul data fill *@sqlite

Transfer results
input                           target                          latency   record_count   error_code   error_message
-----------------------------   -----------------------------   -------   ------------   ----------   -------------
call_center@memgen              call_center@sqlite              0.33s              100                
catalog_page@memgen             catalog_page@sqlite             0.11s              100                
catalog_sales@memgen            catalog_sales@sqlite            0.34s              100                
customer@memgen                 customer@sqlite                 0.17s              100                
customer_address@memgen         customer_address@sqlite         0.15s              100                
customer_demographics@memgen    customer_demographics@sqlite    0.18s              100                
date_dim@memgen                 date_dim@sqlite                 0.54s              100                
household_demographics@memgen   household_demographics@sqlite   0.9s               100                
income_band@memgen              income_band@sqlite              0.9s               100                
item@memgen                     item@sqlite                     0.25s              100                
promotion@memgen                promotion@sqlite                0.18s              100                
ship_mode@memgen                ship_mode@sqlite                0.13s              100                
store@memgen                    store@sqlite                    0.27s              100                
store_sales@memgen              store_sales@sqlite              0.25s              100                
time_dim@memgen                 time_dim@sqlite                 0.13s              100                
warehouse@memgen                warehouse@sqlite                0.16s              100                
web_page@memgen                 web_page@sqlite                 0.12s              100                
web_sales@memgen                web_sales@sqlite                0.26s              100                
web_site@memgen                 web_site@sqlite                 0.23s              100

The data fill command loads 100 records for each table because this is the default value of the max-record-count option (This option defines the number of records generated).

Query 11

By running the query 11 (of the query lesson), we don't get any data back.

tabul data print '(sqlite/query_11.sql@tpcds_query)@sqlite'
# The quotes are only mandatory in bash because parenthesis are a bash token (ie subshell)

(sqlite/query_11.sql@tpcds_query)@sqlite
customer_id   customer_first_name   customer_last_name   customer_email_address
-----------   -------------------   ------------------   ----------------------

Why ? Because the query 11 is based on time data of the year 2001 and unfortunately the auto-generated data does not contain 2001 in the d_year column.

tabul data head --limit 10 date_dim@sqlite

The first 10 rows of the data resource (date_dim@sqlite):
d_date_sk   d_date_id   d_date       d_month_seq   d_week_seq   d_quarter_seq   d_year   d_dow   d_moy   d_dom   d_qoy   d_fy_year   d_fy_quarter_seq   d_fy_week_seq   d_day_name   d_quarter_name   d_holiday   d_weekend   d_following_holiday   d_first_dom   d_last_dom   d_same_day_ly   d_same_day_lq   d_current_day   d_current_week   d_current_month   d_current_quarter   d_current_year
---------   ---------   ----------   -----------   ----------   -------------   ------   -----   -----   -----   -----   ---------   ----------------   -------------   ----------   --------------   ---------   ---------   -------------------   -----------   ----------   -------------   -------------   -------------   --------------   ---------------   -----------------   --------------
        1   a           2025-12-25             4            0              10        4       4       0       4       3           5                  1              10   k            u                j           b           v                               4            3               8               0   u               h                r                 t                   n
        2   b           2026-01-01             6            8               7        2       3       6       6       6           6                  4               0   u            h                f           s           o                               6            2               7               5   l               w                k                 h                   a
        3   c           2025-12-24             7            6               7        6       5      10       5       0           0                  5               9   i            c                l           e           g                              10            6               4               5   k               d                l                 j                   w
        4   d           2025-12-30             0            9               0        0       6       1       6       8           2                  3               4   m            e                o           b           z                               1            2               9               2   h               n                z                 c                   i
        5   e           2025-12-31             9            5               4        5       8       9       0      10           3                  1               8   j            j                m           j           g                               8            7               7               8   p               x                s                 d                   h
        6   f           2025-12-24             6            8               2        0       8       3       3       3           3                  9               3   p            o                x           l           h                               0            3               3               7   g               x                o                 j                   v
        7   g           2026-01-01             9            0               8        1       1       2       8       0           3                  5               0   z            f                l           m           i                               2            9               2               3   d               z                d                 m                   q
        8   h           2025-12-27             9            4               9        2       7       7       5       3           5                  2               5   l            p                p           r           x                               0            8               0               3   f               g                i                 l                   j
        9   i           2026-01-01             7            6               6        2       1       4       6       2           8                  8               1   c            z                k           j           g                               4            1               1               3   f               t                h                 j                   m
       10   j           2026-01-02             4            3               2        0       4       4       9       2           0                  9               7   x            n                z           s           u                               1            9               9               0   n               q                v                 l                   u

To update the column dyear with data from the year 2001, we will use a generator in the next section.

Generator

A generator is a file that contains the data generation definition.

For each column, a column data generator is defined that control the data generated.

The below generator generates one year of data with two columns:

d_date that has a date sequence generator to generate date data from 2001-01-01 and upwards
dyear that has a expression generator that extracts the year of the d_date column.

kind: generator
spec:
  LogicalName: date_dim
  # 1000 records to be sure that we have one record by day, the default is 100
  max-record-count: 1000
  Columns:
    - name: d_date
      type: date
      comment: A column with a sequence generator that generates a date sequence from 2001-01-01 and after
      data-supplier:
        type: sequence
        arguments:
          start: 2001-01-01
          step: 1
    - name: d_year
      type: integer
      precision: 4
      comment: A column with a expression generator that extracts the year from the date column
      data-supplier:
        type: expression
        arguments:
          column-variable: d_date
          expression: "d_date.getFullYear()"

This generator is also a content resource and therefore you can use it as any tabular resource and take a look at the data generated

tabul data head date_dim_2001--generator.yml@howto

The first 10 rows of the data resource (date_dim_2001@memgen):
d_date       d_year
----------   ------
2001-01-01     2001
2001-01-02     2001
2001-01-03     2001
2001-01-04     2001
2001-01-05     2001
2001-01-06     2001
2001-01-07     2001
2001-01-08     2001
2001-01-09     2001
2001-01-10     2001

Fill with generators

After having created a generator for the date_dim table, we can pass it to the data fill command with the –generator-selector option to make the data generation more controled.

tabul data fill --generator-selector date_dim_2001--generator.yml@howto   *@sqlite

As the option generator-selector is a resource selector, you can create a generator for each table where you want to customize the generated data and select them with a glob pattern.

Output:

Transfer results
input                           target                          latency   record_count   error_code   error_message
-----------------------------   -----------------------------   -------   ------------   ----------   -------------
call_center@memgen              call_center@sqlite              0.33s              100                
catalog_page@memgen             catalog_page@sqlite             0.11s              100                
catalog_sales@memgen            catalog_sales@sqlite            0.34s              100                
customer@memgen                 customer@sqlite                 0.16s              100                
customer_address@memgen         customer_address@sqlite         0.20s              100                
customer_demographics@memgen    customer_demographics@sqlite    0.15s              100                
date_dim@memgen                 date_dim@sqlite                 0.968s            1000                
household_demographics@memgen   household_demographics@sqlite   0.9s               100                
income_band@memgen              income_band@sqlite              0.10s              100                
item@memgen                     item@sqlite                     0.24s              100                
promotion@memgen                promotion@sqlite                0.18s              100                
ship_mode@memgen                ship_mode@sqlite                0.10s              100                
store@memgen                    store@sqlite                    0.23s              100                
store_sales@memgen              store_sales@sqlite              0.23s              100                
time_dim@memgen                 time_dim@sqlite                 0.12s              100                
warehouse@memgen                warehouse@sqlite                0.13s              100                
web_page@memgen                 web_page@sqlite                 0.15s              100                
web_sales@memgen                web_sales@sqlite                0.36s              100                
web_site@memgen                 web_site@sqlite                 0.22s              100

And the query 11 is now giving back a result. The generated data is minimal and should be further defined.

tabul data print '(sqlite/query_11.sql@tpcds_query)@sqlite'
# The quotes are only mandatory in bash because parenthesis are a bash token (ie subshell)

(sqlite/query_11.sql@tpcds_query)@sqlite
customer_id   customer_first_name   customer_last_name   customer_email_address
-----------   -------------------   ------------------   ----------------------
bg            o                     v                    o
cj            t                     a                    d

Learn how to compare data resource.

Data Resource Diff

Table of Contents

Learning Tabulify - Step 9 - How to fill a data resource with generated data ?

Data Generation

The fill command

Auto Fill

Query 11

Generator

Fill with generators

Next