Data Generation
Tabulify integrates natively a data generator.
You can generate realistic production data and start working on your project right away.
Because the data is fake but realistic, you don't need to:
- anonymize production data in your development environment because of the privacy laws
- or create any acceptance environment.
The fill command
The data fill operation is an operation that will select target data resource and fill them with data.
Tabulify supports two mode:
- auto - the data generated is automatically chosen
- generator - the data generated is defined in a file called the generator
The fill operation is supported by the data fill command.
Auto Fill
Let's first delete all data with the data truncate command to get a clean schema.
tabul data truncate *@sqlite
The below fill command will fill all tables with auto-generated data
tabul data fill *@sqlite
Transfer results
input target latency record_count error_code error_message
----------------------------- ----------------------------- ------- ------------ ---------- -------------
call_center@memgen call_center@sqlite 0.37s 100
catalog_page@memgen catalog_page@sqlite 0.12s 100
catalog_sales@memgen catalog_sales@sqlite 0.36s 100
customer@memgen customer@sqlite 0.16s 100
customer_address@memgen customer_address@sqlite 0.21s 100
customer_demographics@memgen customer_demographics@sqlite 0.15s 100
date_dim@memgen date_dim@sqlite 0.82s 100
household_demographics@memgen household_demographics@sqlite 0.10s 100
income_band@memgen income_band@sqlite 0.10s 100
item@memgen item@sqlite 0.32s 100
promotion@memgen promotion@sqlite 0.19s 100
ship_mode@memgen ship_mode@sqlite 0.12s 100
store@memgen store@sqlite 0.30s 100
store_sales@memgen store_sales@sqlite 0.23s 100
time_dim@memgen time_dim@sqlite 0.13s 100
warehouse@memgen warehouse@sqlite 0.17s 100
web_page@memgen web_page@sqlite 0.16s 100
web_sales@memgen web_sales@sqlite 0.34s 100
web_site@memgen web_site@sqlite 0.26s 100
The data fill command loads 100 records for each table because this is the default value of the max-record-count option (This option defines the number of records generated).
Query 11
By running the query 11 (of the query lesson), we don't get any data back.
tabul data print '(sqlite/query_11.sql@tpcds_query)@sqlite'
# The quotes are only mandatory in bash because parenthesis are a bash token (ie subshell)
(sqlite/query_11.sql@tpcds_query)@sqlite
customer_id customer_first_name customer_last_name customer_email_address
----------- ------------------- ------------------ ----------------------
Why ? Because the query 11 is based on time data of the year 2001 and unfortunately the auto-generated data does not contain 2001 in the d_year column.
tabul data head --limit 10 date_dim@sqlite
The first 10 rows of the data resource (date_dim@sqlite):
d_date_sk d_date_id d_date d_month_seq d_week_seq d_quarter_seq d_year d_dow d_moy d_dom d_qoy d_fy_year d_fy_quarter_seq d_fy_week_seq d_day_name d_quarter_name d_holiday d_weekend d_following_holiday d_first_dom d_last_dom d_same_day_ly d_same_day_lq d_current_day d_current_week d_current_month d_current_quarter d_current_year
--------- --------- ---------- ----------- ---------- ------------- ------ ----- ----- ----- ----- --------- ---------------- ------------- ---------- -------------- --------- --------- ------------------- ----------- ---------- ------------- ------------- ------------- -------------- --------------- ----------------- --------------
1 a 2025-10-31 7 2 4 7 5 1 8 0 0 6 4 k t q x x 10 6 8 4 u p u d t
2 b 2025-11-05 7 3 0 1 6 8 9 5 3 0 10 p m u f j 8 2 10 9 m m z x h
3 c 2025-11-02 6 6 4 2 1 6 3 1 3 7 7 w k p i c 3 2 4 5 l l q m m
4 d 2025-11-04 3 1 1 5 1 1 5 0 10 4 5 r g k f b 0 1 2 8 b w h p x
5 e 2025-11-04 5 8 8 2 10 7 1 4 1 2 6 n x l w p 7 7 0 3 x o f h l
6 f 2025-11-06 7 3 0 3 5 10 8 5 10 6 10 a c n l w 7 1 7 10 e k h t m
7 g 2025-11-10 3 0 4 7 0 0 4 4 1 0 0 o y r f q 0 8 10 7 l d c l u
8 h 2025-11-02 8 10 8 7 2 10 7 8 5 7 0 i o n a i 9 2 5 10 k z c x k
9 i 2025-11-10 6 2 2 8 6 3 3 9 10 0 1 w z a m h 4 3 8 8 n w n l y
10 j 2025-11-09 2 10 5 8 3 4 10 9 8 4 7 p h u m i 10 8 2 0 o u o r w
To update the column dyear with data from the year 2001, we will use a generator in the next section.
Generator
A generator is a file that contains the data generation definition.
For each column, a column data generator is defined that control the data generated.
The below generator generates one year of data with two columns:
- d_date that has a date sequence generator to generate date data from 2001-01-01 and upwards
- dyear that has a expression generator that extracts the year of the d_date column.
kind: generator
spec:
LogicalName: date_dim
Columns:
- name: d_date
type: date
comment: A column with a sequence generator that generates a date sequence from 2001-01-01 and after
data-supplier:
type: sequence
arguments:
start: 2001-01-01
step: 1
- name: d_year
type: integer
precision: 4
comment: A column with a expression generator that extracts the year from the date column
data-supplier:
type: expression
arguments:
column-variable: d_date
expression: "d_date.getFullYear()"
This generator is also a content resource and therefore you can use it as any tabular resource and take a look at the data generated
tabul data head date_dim_2001--generator.yml@howto
The first 10 rows of the data resource (date_dim_2001@memgen):
d_date d_year
---------- ------
2001-01-01 2001
2001-01-02 2001
2001-01-03 2001
2001-01-04 2001
2001-01-05 2001
2001-01-06 2001
2001-01-07 2001
2001-01-08 2001
2001-01-09 2001
2001-01-10 2001
Fill with generators
After having created a generator for the date_dim table, we can pass it to the data fill command with the –generator-selector option to make the data generation more controled.
tabul data fill --generator-selector date_dim_2001--generator.yml@howto *@sqlite
As the option generator-selector is a resource selector, you can create a generator for each table where you want to customize the generated data and select them with a glob pattern.
Output:
Transfer results
input target latency record_count error_code error_message
----------------------------- ----------------------------- ------- ------------ ---------- -------------
call_center@memgen call_center@sqlite 0.37s 100
catalog_page@memgen catalog_page@sqlite 0.13s 100
catalog_sales@memgen catalog_sales@sqlite 0.37s 100
customer@memgen customer@sqlite 0.19s 100
customer_address@memgen customer_address@sqlite 0.16s 100
customer_demographics@memgen customer_demographics@sqlite 0.13s 100
date_dim@memgen date_dim@sqlite 0.351s 100
household_demographics@memgen household_demographics@sqlite 0.11s 100
income_band@memgen income_band@sqlite 0.9s 100
item@memgen item@sqlite 0.35s 100
promotion@memgen promotion@sqlite 0.18s 100
ship_mode@memgen ship_mode@sqlite 0.11s 100
store@memgen store@sqlite 0.29s 100
store_sales@memgen store_sales@sqlite 0.23s 100
time_dim@memgen time_dim@sqlite 0.15s 100
warehouse@memgen warehouse@sqlite 0.16s 100
web_page@memgen web_page@sqlite 0.12s 100
web_sales@memgen web_sales@sqlite 0.28s 100
web_site@memgen web_site@sqlite 0.28s 100
And the query 11 is now giving back a result. The generated data is minimal and should be further defined.
tabul data print '(sqlite/query_11.sql@tpcds_query)@sqlite'
# The quotes are only mandatory in bash because parenthesis are a bash token (ie subshell)
(sqlite/query_11.sql@tpcds_query)@sqlite
customer_id customer_first_name customer_last_name customer_email_address
----------- ------------------- ------------------ ----------------------
Next
Learn how to compare data resource.