NOTE: This article is technical and assumes you already have experience with the SQL query language.
Introduction
If you connect Chain.io directly to a supported database system (like Snowflake), Chain.io uses our Canonical Data Schema to write your data.
The Canonical Data Schema is a common structure that we've developed which represents the major common data structures across supply chain and logistics business processes. Each of Chain.io's input plugins reads data from an external system and converts it into this common structure. This allows you to integrate multiple different systems into the same database without worrying about different field names and data structures.
For example, you might have one freight forwarder who sends you an XML file with a field called <PortOfArrival>USJFK</PortOfArrival>
and another that sends you a JSON file with {"arrival_port": "USJFK"}
. Chain.io will write both of these values to the shipment table in the shipment.arrival_unlocode
field.
Database Structure
Because you may want to look at supply chain data from many different perspectives, the Chain.io Canonical Data Schema uses a Graph data structure. In a Graph structure, data objects (also called "nodes") aren't linked directly to each other and instead are linked via "edges" which are typed. In SQL terms think of edges as many-to-many records where the relationship type is stored in the many-to-many table.
Let's look at a simple example. Let's say I'm a freight forwarder and I have a real world shipment where I bill my customer $1,000 for freight and in turn pay the carrier $900 for that freight.
I want to look at a shipment and find all of the invoices that I've sent to my customer related to that shipment. In Chain.io, the shipment nodes are all stored in the shipment table. All invoices for freight related services are stored in a table called service_invoice.
The service_invoice table is going to contain both the $1,000 invoice to my customer and the $900 invoice from the carrier. I can find the right invoice by linking in the right edge type using the flow_links table like this.
SELECT shipment.forwarder_reference, custInvoice.invoice_number
FROM shipment
LEFT JOIN flow_links custInvoiceLink ON custInvoiceLink.parent_uuid = shipment.uuid AND custInvoiceLink.link_type = 'shipment_to_customer_invoice' AND shipment.execution_uuid = custInvoiceLink.execution_uuid
LEFT JOIN service_invoice custInvoice ON custInvoiceLink.child_uuid = custInvoice.uuid AND custInvoice.execution_uuid = custInvoiceLink.exeuction_uuid
WHERE shipment.uuid = '85c7190f-4b82-4548-af28-8ad277fb11f7'
Here's what's going on in the joins of this query:
- Start with the shipment table since that's the first node in our graph
- Join the flow_links table where
- The shipment.uuid equals the flow_links.parent_uuid. Each link always has a parent_uuid and a child_uuid. See the canonical documentation to know which node is the parent and which is the child.
- The link_type = 'shipment_to_customer_invoice'. The link_type type tells us the logical reason that the two nodes are connected together. Always include the link_type in your query. It is common for two nodes to be connected together with multiple link types. For example, a single party might be linked to a shipment as both the shipment_to_buyer and shipment_to_consignee if they are both paying for and receiving the product.
- The shipment.execution_uuid equals the flow_links.execution_uuid. Chain.io populates your database each time data is processed with a separate set of records. This is done so you can monitor and build analytics on how your shipments & orders change over time. Every record as an exeuction_uuid field that you can join together to make sure you're not accidentally linking data from different transactions.
- Join the service_invoice table to the flow_links table using the child_uuid and the execution_uuid
Now let's add one more node. Suppose we want to also see the individual lines on the invoice. We just add one more edge and one more node. We'll use the service_invoice_to_service_invoice_line link type and the service_invoice_line node:
SELECT shipment.forwarder_reference, custInvoice.invoice_number, invLine.total_cost
FROM shipment
LEFT JOIN flow_links custInvoiceLink ON custInvoiceLink.parent_uuid = shipment.uuid AND custInvoiceLink.link_type = 'shipment_to_customer_invoice' AND shipment.execution_uuid = custInvoiceLink.execution_uuid
LEFT JOIN service_invoice custInvoice ON custInvoiceLink.child_uuid = custInvoice.uuid AND custInvoice.execution_uuid = custInvoiceLink.execution_uuid
LEFT JOIN flow_links invLineLink ON invLineLink.parent_uuid = custInvoice.uuid and invLineLink.link_type = 'service_invoice_to_service_invoice_line' AND custInvoice.execution_uuid = invLineLink.exeuction_uuid
LEFT JOIN service_invoice_line invLine ON invLineLink.child_uuid = invLine.uuid and invLine.execution_uuid = invLineLink.execution_uuid
WHERE shipment.uuid = '85c7190f-4b82-4548-af28-8ad277fb11f7'
Database Layout
The database layout is fully documented within your Chain.io account. Go to Workspace Settings and in the Developer Center section, click Canonical Schema.
Each node table is listed on the left. Click on them to see both the fields and link_types that can be used with that type.
Node Metadata Fields
In addition to the data fields listed in the documentation, each node table contains the following additional columns which are useful in your JOINs and WHERE clauses.
- uuid - This is the primary key for each record.
- execution_uuid - The flow execution_uuid visible on the flow execution screen. Use this to isolate your joins to only connect nodes from the same data transaction
- flow_started - This is a timestamp of the time that the flow started processing in Chain.io. Use this to order your executions properly. All records with the same execution_uuid will have the same flow_started time.
Additional Tables
flow_executions
This table contains metadata about each flow execution and can be used to link data back to it's source in Chain.io using either the Chain.io UI or our GraphQL API.
- flow_uuid - The UUID of the flow that populated the table. This is the same for each execution within that flow and is helpful if you only want to look at data that ran through a single process
- execution_uuid - The flow execution_uuid visible on the flow execution screen. Use this to isolate your joins to only connect nodes from the same data transaction
- workspace_uuid - This is the UUID of your Chain.io workspace. Use this if you are populating the same database from multiple Chain.io accounts and want to isolate on that dimension
- flow_started - This is a timestamp of the time that the flow started processing in Chain.io. Use this to order your exeuctions properly.
flow_links
This table represents the edges in the graph structure. Refer to the Canonical Schema documentation in the Chain.io portal to see what type of table is represented by the parent_uuid & child_uuid fields.
- execution_uuid - The flow execution_uuid visible on the flow execution screen. Use this to isolate your joins to only connect nodes from the same data transaction
- parent_uuid - The uuid of the parent node represented by the link.
- child_uuid - The uuid of the child node represented by the link
- link_type - The logical type of link that is represented by the record.
chainio_metadata
This table contains metadata used by the Chain.io backend processes to keep your database in sync.