---json { "name": "10 Data Diff", "page_id": "repj70wjne9kncbcqm21l" } --- ====== Learning Tabulify - Step 10 - Data Diff, a step by step ====== ===== About ===== ''Data Diff'' is the cornerstone of every development because it validates the processing of data. ''Tabulify'' ships with a [[docs:op:diff|Data diff operation]] that allows you to compare: * the [[docs:resource:content|data content]] * and the [[docs:resource:structure|data structure]] The [[docs:op:diff|diff operation (comparison)]] is implemented with the [[docs:tabul:data:diff|data diff command]]. ===== Data to compare ===== For the sake of simplicity, we will compare two [[docs:resource:csv|csv data resources]] but you may compare any [[docs:resource:content|content data resources]] such as a [[docs:resource:sql_table|SQL table]] or [[docs:resource:sql_select|SQL query]] * The ''source'' data resource (also known as the ''original'', ''from'' or ''left'' resource) last_name,first_name,birth_date,wikipedia_page Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger Norman,Don,1935-12-25,https://en.wikipedia.org/wiki/Don_Norman Harbison Carnagey,Dale,1888-11-24,https://en.wikipedia.org/wiki/Dale_Carnegie Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman Pittman,Bob,1953-12-28,https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive) * The ''target'' data resource (also known as the ''new'', ''to'' or ''right'' resource) Last Name,First Name,Birth Date,Wikipedia Page Schwarzenegger,Arnold,1947-07-30,https://en.wikipedia.org/wiki/Arnold_Schwarzenegger Norman,Don,1935-12-23,https://en.wikipedia.org/wiki/Don_Norman Kahneman,Daniel,1934-03-05,https://en.wikipedia.org/wiki/Daniel_Kahneman The fact that a record is missing can be easily spotted but there is a more subtle differences. ===== Failure Behavior ===== By default, a diff that detects differences will fail (ie return a bad exit code). * The `--no-fail` option changes this behavior. * You don't need to set it normally but as we monitor our documentation for failure, we don't want to fail ===== The Summary report ===== All diff can be executed with the [[docs:tabul:data:diff|data diff command]] by giving as argument the [[docs:flow:source|source]] and [[docs:flow:target|target]] data resource. tabul data diff \ --no-fail \ diff/characters_original.csv@howto diff/characters_new.csv@howto Diff Summary Report The data resources are not equals. from to equals record_count change_count ---------------------------------- ----------------------------- ------ ------------ ------------ diff/characters_original.csv@howto diff/characters_new.csv@howto 5 6 The default report is to show a high level report that lists the ''source'' and ''target'' resources and the result of the comparison: * ''equals'': Does the resources are equals * ''record_count'': How many records were read from the ''from'' resource. * ''change_count'': How many records changes were detected There is a total of: * ''5'' records compared * ''6'' records changes were detected. It doesn't seem quite right, let's take a look at it by asking a unified report to see the difference in details. ===== The Unified report ===== To get a diff at the [[:docs:resource:record|record]] level, you set the ''report-type'' option to the ''unified'' value. tabul data diff \ --no-fail \ --report-type unified \ diff/characters_original.csv@howto diff/characters_new.csv@howto Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto). - red for deleted - green for added - blue for type change (example: number as varchar) The data resources are not equals. diff_status last_name first_name birth_date wikipedia_page ----------- ----------------- ---------- ---------- -------------------------------------------------------------- = Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger + Norman Don 1935-12-23 https://en.wikipedia.org/wiki/Don_Norman + Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman - Norman Don 1935-12-25 https://en.wikipedia.org/wiki/Don_Norman - Harbison Carnagey Dale 1888-11-24 https://en.wikipedia.org/wiki/Dale_Carnegie - Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman - Pittman Bob 1953-12-28 https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive) This report will shows you how to get from: * the ''original'', ''from'', ''left'', ''source'' resource * to the ''new'', ''to'', ''right'', ''target'' resource It tells us that we need to: * add ''2'' rows * and delete ''4'' rows from the ''original'' resource to obtain the ''new'' resource It's a total of ''6'' changes as seen previously in the ''summary report''. But why does the diff did not compare the ''Norman'' record to each other ? This is because their is no known identifier for the record, therefore the diff was executed by ''record hash'' and not by ''last name''. The next step will show you how to define the unique columns to drive a diff comparison. ===== The Unified Report with Driver Columns Definition ===== To improve our comparison, we will define the unique column of our data resource with the ''%%--driver-columns%%'' option. tabul data diff \ --no-fail \ --report-type unified \ --driver-columns "Last Name" \ diff/characters_original.csv@howto diff/characters_new.csv@howto Unified Diff Report between the resources (diff/characters_original.csv@howto) and (diff/characters_new.csv@howto). - red for deleted - green for added - blue for type change (example: number as varchar) The data resources are not equals. diff_status last_name first_name birth_date wikipedia_page ----------- ----------------- ---------- ---------- -------------------------------------------------------------- = Schwarzenegger Arnold 1947-07-30 https://en.wikipedia.org/wiki/Arnold_Schwarzenegger +1 Norman Don 1935-12-23 https://en.wikipedia.org/wiki/Don_Norman -1 Norman Don 1935-12-25 https://en.wikipedia.org/wiki/Don_Norman - Harbison Carnagey Dale 1888-11-24 https://en.wikipedia.org/wiki/Dale_Carnegie = Kahneman Daniel 1934-03-05 https://en.wikipedia.org/wiki/Daniel_Kahneman - Pittman Bob 1953-12-28 https://en.wikipedia.org/wiki/Robert_Pittman_(media_executive) This time, the unified report tell us that we need to: * update the ''birth date'' of the ''Norman'' record (''+1,-1'', in the ''1'' change) * delete the records * ''Harbison Carnagey'' * and ''Pittman'' from the ''original'' resource to obtain the ''new'' resource ===== Conclusion ===== This page ends the learning guide of ''Tabulify'' at the command line with ''Tabul'' where we have learned to perform data operation one command at a time. ''Tabulify'' supports also the creation of workflows with the ''pipeline'' scripting language where you bind all this operations in pipelines. To learn more about the ''pipeline'' scripting language, we have create the ''pipeline learning guide'' where you will learn how to create a data warehouse and analyse weblog event of a website. Flow Learning Guide - How to analyze weblog ?