How to load and analyze YAML documents in a database?

About

This howto will show you how to load and analyze yaml document (files) into a relational database.

This example uses the sqlite howto database but you may use any other relational database

Steps

tabul data drop --no-strict-selection users@sqlite

Understanding YAML file

At its core a YAML file is just a JSON file styled in a block style ¹⁾.

Tabulify will treat them by default as JSON. When you load a Yaml, you are loading a Json file (ie the yaml style attribute has a value of json)

This block style yaml file:

tabul data print --type text yaml/phonebook/Arnold.yaml@howto

phones:
  - "704-555-0150"
  - "202-555-0150"
city: Los Angeles

is equivalent to this json file

tabul data print \
   --pipe-mode \
   --attribute style=json \
   yaml/phonebook/Arnold.yaml@howto | jq

{
  "phones": [
    "704-555-0150",
    "202-555-0150"
  ],
  "city": "Los Angeles"
}

Note that:

style=json is optional as this is the default
--pipe-mode is used to pipe a raw json into jq (pipe-mode deletes the headers and will not print control characters)

Yaml relational structure

By default, a Yaml file is seen as a relation with one column called json

The structure of the YAML file can be seen with the data struct command.

For instance, for the Don.yaml file, there is one column:

at the position 1
called yaml (the value of the column name attribute)
with the json data type

tabul data describe yaml/phonebook/Don.yaml@howto

Structure of the resource yaml/phonebook/Don.yaml@howto
position   name   type   precision   scale   primary_key   nullable   autoincrement   comment
--------   ----   ----   ---------   -----   -----------   --------   -------------   -------
1          yaml   json   0           0                     x

The Yaml files

To be able to follow along the SQLite examples of the JSON documentation, we will load the yaml data (ie city and phones numbers in a user table.

The Yaml files are located in the subdirectory yaml/phonebook of the howto directory where there is one file by user.

With the data list command, we can list them.

tabul data list yaml/phonebook/*.yaml@howto

path                         media_type
--------------------------   ----------
yaml/phonebook/Arnold.yaml   text/yaml
yaml/phonebook/Bob.yaml      text/yaml
yaml/phonebook/Dale.yaml     text/yaml
yaml/phonebook/Daniel.yaml   text/yaml
yaml/phonebook/Don.yaml      text/yaml

Loading the YAML files

The below data insert command will load the YAML files into the user table of the sqlite howto database.

tabul data drop --no-strict-selection user@sqlite

tabul data insert yaml/phonebook/*.yaml@howto  user@sqlite

Transfer results
input                              target        latency   record_count   error_code   error_message
--------------------------------   -----------   -------   ------------   ----------   -------------
yaml/phonebook/Arnold.yaml@howto   user@sqlite   0.53s                1
yaml/phonebook/Bob.yaml@howto      user@sqlite   0.8s                 1
yaml/phonebook/Dale.yaml@howto     user@sqlite   0.7s                 1
yaml/phonebook/Daniel.yaml@howto   user@sqlite   0.9s                 1
yaml/phonebook/Don.yaml@howto      user@sqlite   0.8s                 1

You can inspect the content with the head command

tabul data head user@sqlite

The first 10 rows of the data resource (user@sqlite):
yaml
--------------------------------------------------------------------------------------------------------------------
{\n  "phones": [\n    "704-555-0150",\n    "202-555-0150"\n  ]\n  ,\n  "city": "Los Angeles"\n}\n
{\n  "phones": [\n    "323-423-2238",\n    "747-444-3222",\n    "475-202-7142"\n  ]\n  ,\n  "city": "New York"\n}\n
{\n  "phones": [\n    "731-414-7236",\n    "256-233-3557",\n    "409-645-3618"\n  ]\n  ,\n  "city": "New York"\n}\n
{\n  "phones": [\n    "425-898-3481",\n    "360-637-7972",\n    "205-533-8271"\n  ]\n  ,\n  "city": "Princeton"\n}\n
{\n  "phones": [\n    "470-293-7517"\n  ]\n  ,\n  "city": "San Diego"\n}\n

Changing the name of the YAML column

The name of the YAML column is by default yaml but we can change it by setting the columnName json attribute

After running the below head command, we can see that the column name has changed to data

tabul data head --attribute columnName=data yaml/phonebook/Don.yaml@howto

The first 10 rows of the data resource (yaml/phonebook/Don.yaml@howto):
data
--------------------------------------------------------------------------
{\n  "phones": [\n    "470-293-7517"\n  ]\n  ,\n  "city": "San Diego"\n}\n

Adding a column with the logical name of the JSON file

Because the YAML content of files does not have any kind of identifier, we can see that the data is actually worthless because we can't relate the data to a user.

To be able to add context to file loaded, Tabulify supports the concept of virtual columns where it's possible to add any attribute information about the file loaded.

The virtual columns are specified with the –virtual-column option that has the following syntax:

--virtual-column virtualColumnName=resourceAttributeName

where:

virtualColumnName is the name of the column added
resourceAttributeName is the name of resource attribute.

For instance, if we want to add:

the logical name attribute
into a column called name,

we would add –virtual-column name=logical_name

With the print command, the full command would look like below.

tabul data print \
    --attribute columnName=data \
    --virtual-column name=logical_name \
    yaml/phonebook/Bob.yaml@howto

yaml/phonebook/Bob.yaml@howto
data                                                                                                                  name
-------------------------------------------------------------------------------------------------------------------   ----
{\n  "phones": [\n    "323-423-2238",\n    "747-444-3222",\n    "475-202-7142"\n  ]\n  ,\n  "city": "New York"\n}\n   Bob

Reloading the YAML files with the virtual column name

To reload the YAML files:

with the extra column name and the file logical name
with a column data containing the YAML document

the syntax of the tabul data replace would be:

tabul data replace \
   --source-attribute columnName=data \
   --virtual-column name=logical_name \
   yaml/phonebook/*@howto  user@sqlite

where:

replace will drop the target user@sqlite
--source-attribute columnName=data will set the JSON attribute columnName to data
--virtual-column name=logical_name will add the virtual column names name with the logical name attribute of the source
yaml/phonebook/*@howto is the source data selector
user@sqlite is the target uri

Output:

Transfer results
input                              target        latency   record_count   error_code   error_message
--------------------------------   -----------   -------   ------------   ----------   -------------
yaml/phonebook/Arnold.yaml@howto   user@sqlite   0.64s                1
yaml/phonebook/Bob.yaml@howto      user@sqlite   0.10s                1
yaml/phonebook/Dale.yaml@howto     user@sqlite   0.9s                 1
yaml/phonebook/Daniel.yaml@howto   user@sqlite   0.7s                 1
yaml/phonebook/Don.yaml@howto      user@sqlite   0.8s                 1

The YAML data loaded has now the name column as identifier.

tabul data print user@sqlite

user@sqlite
data                                                                                                                   name
--------------------------------------------------------------------------------------------------------------------   ------
{\n  "phones": [\n    "704-555-0150",\n    "202-555-0150"\n  ]\n  ,\n  "city": "Los Angeles"\n}\n                      Arnold
{\n  "phones": [\n    "323-423-2238",\n    "747-444-3222",\n    "475-202-7142"\n  ]\n  ,\n  "city": "New York"\n}\n    Bob
{\n  "phones": [\n    "731-414-7236",\n    "256-233-3557",\n    "409-645-3618"\n  ]\n  ,\n  "city": "New York"\n}\n    Dale
{\n  "phones": [\n    "425-898-3481",\n    "360-637-7972",\n    "205-533-8271"\n  ]\n  ,\n  "city": "Princeton"\n}\n   Daniel
{\n  "phones": [\n    "470-293-7517"\n  ]\n  ,\n  "city": "San Diego"\n}\n                                             Don

Performing JSON SQL statement

Inspired from the SQLite examples of the JSON documentation - section 4.3.1, we will now run the below SQL with JSON function.

-- https://www.sqlite.org/json1.html#examples_using_json_each_and_json_tree_
SELECT distinct name
FROM user, json_tree(user.data, '$.phones')
WHERE value LIKE '704-%';

To see the result of this SQL query, we will use the print command and see that Arnold is the user that have a phone number that starts with 704.

tabul data print '(sqlite/user_phone_json_from_yaml.sql@howto)@sqlite'
# The quotes are only mandatory in bash because parenthesis are a bash token (ie subshell)

(sqlite/user_phone_json_from_yaml.sql@howto)@sqlite
name
------
Arnold

If you want to explore the JSON SQL function of SQLite, you can read the Sqlite JSON documentation.

¹⁾

Yaml Block Style

About

Understanding YAML file

Yaml relational structure

The Yaml files

Loading the YAML files

Changing the name of the YAML column

Adding a column with the logical name of the JSON file

Reloading the YAML files with the virtual column name

Performing JSON SQL statement

Next