Design a Schema for Distributed SQL

Tutorial

Analyze access patterns, build colocation chains, choose distribution zones, and verify your schema design with system views and EXPLAIN plans.

ignite3gridgain9

Intermediate|75 min|data-modeling

Tested onApache Ignite 3.1.0GridGain 9.1.8

Introduction

In a single-node database, schema design is about data integrity. You normalize to reduce redundancy, add indexes to speed queries, and move on. Where data physically lives never matters because it all lives on one machine.

A distributed database changes that equation. Every table is split into partitions spread across multiple nodes, and a JOIN is only local if the joined rows happen to be on the same node. When they are not, data moves across the network. If that sounds like sharding, it is. Both Apache Ignite 3 and GridGain 9 hash-shard every table, and the design choices you make in DDL control which queries stay local and which pay the network cost. This tutorial walks through that decision process using the Music Store schema.

Product compatibility

Apache Ignite 3 and GridGain 9 share the same distributed SQL engine. Everything in this tutorial (DDL syntax, system views, query plans) works identically on both products. Select your product version in the tabs where the CLI command differs.

Prerequisites

A running 3-node cluster with the Music Store dataset from Start Your Local Ignite 3 Development Cluster
Completed the Foundations path: Create Tables and Query Data with SQL and Use Transactions in Ignite 3

Returning to these tutorials? Verify your environment.

Check that the cluster is running and the Music Store data is loaded:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT COUNT(*) AS tracks FROM Track;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT COUNT(*) AS tracks FROM Track;"

Expected result: 3503. If the query succeeds, your environment is ready.

If the containers are stopped, restart them from the directory containing your docker-compose.yml:

docker compose up -d

Data persists across restarts. Wait 15-30 seconds for the nodes to rejoin, then re-run the check above.

If the cluster was destroyed (docker compose down), start the containers and re-initialize:

Apache Ignite 3
GridGain 9

docker compose up -d

Wait 10 seconds for the nodes to start, then initialize the cluster:

curl -X POST http://localhost:10300/management/v1/cluster/init \
  -H "Content-Type: application/json" \
  -d '{"metaStorageNodes":["node1","node2","node3"],"cmgNodes":[],"clusterName":"my-cluster"}'

Download the schema and data files:

curl -sO /assets/dataset/music-store-schema.sql
curl -sO /assets/dataset/music-store-data.sql

Copy the files into the container and load them:

docker cp music-store-schema.sql ignite3-node1:/tmp/
docker cp music-store-data.sql ignite3-node1:/tmp/
docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql --file /tmp/music-store-schema.sql
docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql --file /tmp/music-store-data.sql

The schema loader prints "Updated 0 rows" for each DDL statement. The data loader prints row counts per batch. Both commands emit jline reflection warnings that are cosmetic and safe to ignore.

docker compose up -d

Wait 10 seconds for the nodes to start, then load the license and initialize the cluster:

LICENSE=$(jq -Rs . gridgain-license.json)

curl -X POST http://localhost:10300/management/v1/cluster/init \
  -H "Content-Type: application/json" \
  -d '{"metaStorageNodes":["node1","node2","node3"],"cmgNodes":[],"clusterName":"my-cluster","license":'"$LICENSE"'}'

Download the schema and data files:

curl -sO /assets/dataset/music-store-schema.sql
curl -sO /assets/dataset/music-store-data.sql

Copy the files into the container and load them:

docker cp music-store-schema.sql gridgain9-node1:/tmp/
docker cp music-store-data.sql gridgain9-node1:/tmp/
docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql --file /tmp/music-store-schema.sql
docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql --file /tmp/music-store-data.sql

The schema loader prints "Updated 0 rows" for each DDL statement. The data loader prints row counts per batch. Both commands emit jline reflection warnings that are cosmetic and safe to ignore.

Re-run the check above to verify 3503 tracks are loaded.

What You Will Learn

You will work through a repeatable five-step process for designing a distributed SQL schema, apply it to the Music Store domain, then apply it independently to a new domain fragment. Along the way you will learn:

How sharding and colocation determine query performance in a distributed cluster
How to analyze a domain model and build colocation chains from access patterns
How to choose between partitioned and replicated distribution zones
How to express these decisions in SQL DDL with CREATE ZONE, COLOCATE BY, and composite primary keys
How to verify your design using system views and EXPLAIN plans

The Distributed SQL Shift

In a single-node RDBMS, the primary key exists for one purpose: uniqueness. Where a row physically lives on disk is an implementation detail you never think about.

In a sharded database, the primary key takes on a second role: it controls which node stores the row. The assignment happens in three steps:

The engine hashes a designated column (the shard key) on the row being written.
The hash maps the row to one of a fixed number of partitions.
Each partition lives on a specific node, so the hash transitively picks the node.

Primary key design is no longer just about uniqueness. It directly determines where data lives in the cluster, and that changes the rules for every operation:

Concept	Single-Node RDBMS	Distributed SQL (Ignite/GridGain)
Primary key	Uniqueness constraint	Uniqueness + partition routing
JOIN	Always local (single disk)	Local only if data is colocated
Foreign key	Enforced referential integrity	Not enforced; colocation replaces the FK role
Table placement	All on one disk	Assigned to a distribution zone
Replication	Backup for high availability	Active replicas; more replicas improve reads but slow writes

The critical row is JOIN. A join is local only when the joined rows land on the same node. When they do not, the database shuffles data across the network. Your RDBMS schema will not break if you move it to a distributed cluster. Every query still returns correct results. But without colocation, every join pays a network cost, and that cost grows linearly with data volume.

Both products give you two separate mechanisms to control this. Distribution zones control where data lives: how many partitions, how many replicas, which storage engine. COLOCATE BY controls which data lives together: which rows from different tables share the same partition. Most distributed databases conflate these into a single mechanism. Keeping them separate is what makes the design process clear.

New to sharding? Here is the short version.

A sharded database splits each table into pieces (called shards or partitions) and distributes them across multiple nodes. Each row is assigned to a shard by hashing a designated column (the shard key). Queries that include the shard key route to one node. Queries that omit it must check every node. Joins between two tables are local only if both tables are sharded on the same key and the joined rows land on the same node.

Coming from another sharded database? Here is the terminology mapping.

Sharding concept	Ignite/GridGain term	DDL keyword
Shard	Partition	`PARTITIONS` in `CREATE ZONE`
Shard key	Colocation column	`COLOCATE BY` in `CREATE TABLE`
Sharding configuration	Distribution zone	`CREATE ZONE`
Co-sharding (related tables on the same shard)	Colocation	`COLOCATE BY` with a shared column

Analyze the Domain

Now that you understand what colocation does and why it matters, the next question is how to decide what gets colocated with what. Schema design for us starts with queries, not tables. The process for translating access patterns into a distributed schema has five steps:

Identify access patterns to find which tables are queried together
Find root entities that anchor each relationship chain
Build colocation chains from root to leaf using shared columns
Accept cross-chain tradeoffs where chains meet
Identify reference data that benefits from full replication

The rest of this step walks through each one with the Music Store. Here is the data model you have been querying since the first tutorial:

The PK and FK markers show primary keys and foreign key references. In a single-node database, these relationships are enforced by the engine and every join is local. The five-step process below determines how to preserve join locality when these tables are distributed across a cluster.

Identifying access patterns

The question is not "what entities do I have?" but "how does my application access them?" The Music Store has two hot paths:

Catalog browsing: "Show me an artist and all their albums and tracks." This traverses Artist to Album to Track.

Purchase history: "Show me a customer's invoices and line items." This traverses Customer to Invoice to InvoiceLine.

Each of these paths defines a chain of tables that are frequently joined together. If every table in a chain lands on the same partition, those joins execute locally without touching the network. That is the goal of colocation.

Finding the root entities

Every chain has a root entity that anchors the hierarchy. Artist is the root of the catalog chain, and Customer is the root of the commerce chain. Every other table relates back to one of these roots through parent-child relationships.

Building colocation chains

Terminology

A colocation chain is an Ignite/GridGain term for a hierarchy of parent-child tables that share a shard key lineage so they land on the same partition. Other distributed databases call this interleaved tables, colocation groups, or simply parent-child sharding. The concept is the same: related rows from different tables are placed together on one node so joins between them stay local.

A colocation chain links each child table to its parent through a shared column:

Catalog chain: Album colocates by ArtistId, Track colocates by AlbumId. All albums for the same artist are stored on the same partition. All tracks for the same album are on the same partition as the album.
Commerce chain: Invoice colocates by CustomerId, InvoiceLine colocates by InvoiceId. All invoices for the same customer share a partition. All line items for the same invoice share a partition with the invoice.

Solid arrows are colocation relationships: data on the same partition, joins are local. Dashed arrows are cross-chain references: data on different partitions, joins move data across the network. The italicized column in each composite PK is the colocation column. Reference tables (pink) are replicated to every node, so their dashed arrows never cause data movement.

Key constraint

The colocation column (shard key) must be part of the primary key. This is a hard constraint, and it is the same constraint other sharded databases enforce. The next step explores what this means for your table definitions.

Accepting cross-chain tradeoffs

A table can only colocate by one column.

Track colocates by AlbumId (catalog chain) and InvoiceLine colocates by InvoiceId (commerce chain), so a query like "what genre does customer #1 buy most?" has to join across both chains. Because Track and InvoiceLine belong to different colocation hierarchies, they live on different partitions, and the database must move data between nodes to complete the join.

That is the fundamental tradeoff. You cannot colocate everything, so you pick the hot paths and optimize those. The Music Store optimizes for catalog browsing and purchase history. Cross-chain analytics still work, just at higher latency.

Identifying reference data

Some tables are small, rarely updated, and joined by many queries. The Music Store has two: Genre (25 rows) and MediaType (5 rows). These are lookup tables. Rather than colocating them with one chain (which would not help queries from the other chain), replicate them so every node has a local copy. Any join with Genre or MediaType is always local, regardless of which chain initiated the query.

Verifying against the cluster

Query the system views to see how the Music Store schema implements these decisions:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT TABLE_NAME, ZONE_NAME, TABLE_COLOCATION_COLUMNS FROM SYSTEM.TABLES WHERE SCHEMA_NAME = 'PUBLIC' ORDER BY TABLE_NAME;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT TABLE_NAME, ZONE_NAME, TABLE_COLOCATION_COLUMNS FROM SYSTEM.TABLES WHERE SCHEMA_NAME = 'PUBLIC' ORDER BY TABLE_NAME;"

TABLE_NAME    | ZONE_NAME            | TABLE_COLOCATION_COLUMNS
--------------+----------------------+-------------------------
ALBUM         | MUSICSTORE           | ARTISTID
ARTIST        | MUSICSTORE           | ARTISTID
CUSTOMER      | MUSICSTORE           | CUSTOMERID
EMPLOYEE      | MUSICSTORE           | EMPLOYEEID
GENRE         | MUSICSTOREREPLICATED | GENREID
INVOICE       | MUSICSTORE           | CUSTOMERID
INVOICELINE   | MUSICSTORE           | INVOICEID
MEDIATYPE     | MUSICSTOREREPLICATED | MEDIATYPEID
PLAYLIST      | MUSICSTORE           | PLAYLISTID
PLAYLISTTRACK | MUSICSTORE           | PLAYLISTID
TRACK         | MUSICSTORE           | ALBUMID

ZONE_NAME shows which zone owns each table, and TABLE_COLOCATION_COLUMNS shows the column that determines partition assignment. Genre and MediaType are in MUSICSTOREREPLICATED (full replication), while everything else is in MUSICSTORE (partitioned). The colocation columns match the chains: Album uses ARTISTID, Invoice uses CUSTOMERID, Track uses ALBUMID.

Checkpoint:The SYSTEM.TABLES query returns 11 rows. Genre and MediaType show MUSICSTOREREPLICATED as their zone. All other tables show MUSICSTORE. The colocation columns match the access pattern chains described above.

Composite Primary Keys

If you have worked with other sharded databases, you have likely seen the same constraint that applies here:

The colocation column (shard key) must be part of the primary key.
Album's primary key is (AlbumId, ArtistId) rather than just (AlbumId) because ArtistId is the colocation column.
Without ArtistId in the PK, a primary key lookup cannot route to a single partition and would need to scan all of them.

This changes three tables in the Music Store:

Album PK: (AlbumId, ArtistId) instead of (AlbumId). ArtistId is the colocation column.
Track PK: (TrackId, AlbumId) instead of (TrackId). AlbumId is the colocation column.
Invoice PK: (InvoiceId, CustomerId) instead of (InvoiceId). CustomerId is the colocation column.

Query the system views to see the PK and colocation ordinals:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT TABLE_NAME, COLUMN_NAME, PK_COLUMN_ORDINAL, COLOCATION_COLUMN_ORDINAL FROM SYSTEM.TABLE_COLUMNS WHERE TABLE_NAME IN ('ARTIST','ALBUM','TRACK','INVOICE') AND PK_COLUMN_ORDINAL IS NOT NULL ORDER BY TABLE_NAME, PK_COLUMN_ORDINAL;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT TABLE_NAME, COLUMN_NAME, PK_COLUMN_ORDINAL, COLOCATION_COLUMN_ORDINAL FROM SYSTEM.TABLE_COLUMNS WHERE TABLE_NAME IN ('ARTIST','ALBUM','TRACK','INVOICE') AND PK_COLUMN_ORDINAL IS NOT NULL ORDER BY TABLE_NAME, PK_COLUMN_ORDINAL;"

TABLE_NAME | COLUMN_NAME | PK_COLUMN_ORDINAL | COLOCATION_COLUMN_ORDINAL
-----------+-------------+-------------------+--------------------------
ALBUM      | ALBUMID     | 0                 | null
ALBUM      | ARTISTID    | 1                 | 0
ARTIST     | ARTISTID    | 0                 | 0
INVOICE    | INVOICEID   | 0                 | null
INVOICE    | CUSTOMERID  | 1                 | 0
TRACK      | TRACKID     | 0                 | null
TRACK      | ALBUMID     | 1                 | 0

The PK_COLUMN_ORDINAL and COLOCATION_COLUMN_ORDINAL columns are zero-indexed positions. A value of 0 means "first," 1 means "second," and null means the column does not participate in that role.

Look at the Album rows. ALBUMID has PK ordinal 0 (it is the first column in the primary key) but its colocation ordinal is null, meaning it plays no role in partition routing. ARTISTID has PK ordinal 1 (second column in the PK) and colocation ordinal 0 (the first, and in this case only, colocation column). The takeaway: it is the colocation column, not the leading PK column, that determines which partition stores the row.

How composite keys affect query routing

Because the colocation column controls partition routing, different WHERE clauses produce fundamentally different query plans. Run all three variations to see this in practice:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "EXPLAIN PLAN FOR SELECT * FROM Album WHERE AlbumId = 5 AND ArtistId = 1;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "EXPLAIN PLAN FOR SELECT * FROM Album WHERE AlbumId = 5 AND ArtistId = 1;"

KeyValueGet
    table: PUBLIC.ALBUM
    fieldNames: [ALBUMID, ARTISTID, TITLE]
    key: [5, 1]
    est: (rows=1)

A KeyValueGet: no partition scan, no table scan, just a direct read from the one partition that owns the row.

Now query by the colocation column alone:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "EXPLAIN PLAN FOR SELECT * FROM Album WHERE ArtistId = 1;"

Exchange
    distribution: single
    est: (rows=116)
  IndexScan
      table: PUBLIC.ALBUM
      index: IFK_ALBUMARTISTID
      type: SORTED
      predicate: =(ARTISTID, 1)
      fieldNames: [ALBUMID, ARTISTID, TITLE]
      collation: [ARTISTID ASC]
      est: (rows=116)

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "EXPLAIN PLAN FOR SELECT * FROM Album WHERE ArtistId = 1;"

Exchange
    distribution: single
    est: (rows=1)
  TableScan
      table: PUBLIC.ALBUM
      predicate: =(ARTISTID, 1)
      fieldNames: [ALBUMID, ARTISTID, TITLE]
      est: (rows=1)

Colocation column only. The query routes to the single partition that stores ArtistId = 1. Apache Ignite 3 uses an IndexScan on the IFK_ALBUMARTISTID secondary index to find matching rows within that partition, while GridGain 9 uses a TableScan. The planner makes different cost model choices between versions, but the routing behavior is the same: one partition, not all of them.

Now query by a non-colocation PK column:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "EXPLAIN PLAN FOR SELECT * FROM Album WHERE AlbumId = 5;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "EXPLAIN PLAN FOR SELECT * FROM Album WHERE AlbumId = 5;"

Exchange
    distribution: single
    est: (rows=1)
  TableScan
      table: PUBLIC.ALBUM
      predicate: =(ALBUMID, 5)
      fieldNames: [ALBUMID, ARTISTID, TITLE]
      est: (rows=1)

Both products produce a TableScan here. The planner cannot route to a single partition because AlbumId alone does not identify one, so the scan fans out to every partition. There is no secondary index on AlbumId to help.

The contrast between the second and third queries is the core lesson. Both filter by a single PK column, but the ArtistId query routes to one partition while the AlbumId query scans all of them. The difference is entirely determined by which column is the colocation column.

The cost of composite keys

Composite keys enable colocation, but they widen every reference. Track's PK becomes (TrackId, AlbumId) instead of (TrackId), and InvoiceLine's becomes (InvoiceLineId, InvoiceId) instead of (InvoiceLineId). Any table that references Album must carry both columns for a point lookup. The tradeoff is real, but it is not Ignite-specific. Every distributed SQL database pays this cost in some form. The benefit, colocated joins that execute locally without network hops, is what makes the wider keys worthwhile.

Checkpoint:All three EXPLAIN plans execute without errors. The full PK query shows KeyValueGet. The colocation-column query routes to a single partition (IndexScan on Ignite 3, TableScan on GridGain 9). The non-colocation query shows TableScan across all partitions on both products.

Distribution Zones

A distribution zone is a named configuration that bundles replication and partitioning settings. When you write ZONE MusicStore in a CREATE TABLE statement, you are binding that table to the zone's policies for how many partitions to create, how many replicas to maintain, and which storage engine to use.

The Music Store uses two zones:

Zone	Partitions	Replicas	Tables	Purpose
MusicStore	25	2	Artist, Album, Track, Customer, Invoice, InvoiceLine, Employee, Playlist, PlaylistTrack	Transactional data. Partitioned across nodes. 2 replicas balance write performance with fault tolerance.
MusicStoreReplicated	25	3	Genre, MediaType	Reference data. 3 replicas on a 3-node cluster means every node has a full copy. Joins with these tables never cross the network.

Query the zone configuration:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT ZONE_NAME, ZONE_PARTITIONS, ZONE_REPLICAS, IS_DEFAULT_ZONE FROM SYSTEM.ZONES ORDER BY ZONE_NAME;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT ZONE_NAME, ZONE_PARTITIONS, ZONE_REPLICAS, IS_DEFAULT_ZONE FROM SYSTEM.ZONES ORDER BY ZONE_NAME;"

ZONE_NAME            | ZONE_PARTITIONS | ZONE_REPLICAS | IS_DEFAULT_ZONE
---------------------+-----------------+---------------+----------------
Default              | 25              | 1             | true
MUSICSTORE           | 25              | 2             | false
MUSICSTOREREPLICATED | 25              | 3             | false

The Default zone (1 replica, 25 partitions) is pre-created by the cluster. The Music Store schema specifies zones explicitly on every table, so the default zone is unused. If you omit the ZONE clause from CREATE TABLE, the table goes into this default zone automatically.

Now verify how partitions are distributed across nodes. This query aggregates partition state by node and zone:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT NODE_NAME, ZONE_NAME, COUNT(*) AS PARTITIONS, SUM(ESTIMATED_ROWS) AS TOTAL_ROWS FROM SYSTEM.LOCAL_ZONE_PARTITION_STATES WHERE ZONE_NAME IN ('MUSICSTORE','MUSICSTOREREPLICATED') GROUP BY NODE_NAME, ZONE_NAME ORDER BY ZONE_NAME, NODE_NAME;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT NODE_NAME, ZONE_NAME, COUNT(*) AS PARTITIONS, SUM(ESTIMATED_ROWS) AS TOTAL_ROWS FROM SYSTEM.LOCAL_ZONE_PARTITION_STATES WHERE ZONE_NAME IN ('MUSICSTORE','MUSICSTOREREPLICATED') GROUP BY NODE_NAME, ZONE_NAME ORDER BY ZONE_NAME, NODE_NAME;"

NODE_NAME | ZONE_NAME            | PARTITIONS | TOTAL_ROWS
----------+----------------------+------------+-----------
node1     | MUSICSTORE           | 18         | 13099
node2     | MUSICSTORE           | 17         | 9755
node3     | MUSICSTORE           | 15         | 8300
node1     | MUSICSTOREREPLICATED | 25         | 30
node2     | MUSICSTOREREPLICATED | 25         | 30
node3     | MUSICSTOREREPLICATED | 25         | 30

MusicStore partitions are distributed across nodes (18/17/15), with a slight imbalance that is normal for rendezvous hashing with 25 partitions across 3 nodes. MusicStoreReplicated shows 25 partitions on every node: 3 replicas across 3 nodes means full replication, so every node has a complete copy of Genre and MediaType.

Your numbers will differ

Partition counts per node and row estimates vary by cluster. The pattern is consistent: MusicStore partitions are distributed, MusicStoreReplicated partitions are on every node.

The decision framework

Use this framework when choosing between partitioned and replicated zones:

Replicate when: The table is small (hundreds of rows), rarely updated, and frequently joined by tables in multiple colocation chains. Genre and MediaType fit this profile.
Partition when: The table is large or growing, frequently updated, and has parent-child relationships that benefit from colocation. Artist, Album, Track, Customer, Invoice, InvoiceLine all fit this profile.

One-way door

Tables are permanently bound to their zone. You cannot move a table to a different zone after creation. If you need different replication or partitioning for a table, you must drop and recreate it. Choose your zone assignment carefully.

DDL walkthrough

Walk through the DDL that creates the Music Store schema. Each clause maps to a design decision:

-- Zone: where data lives, how it's replicated
CREATE ZONE IF NOT EXISTS MusicStore
    WITH REPLICAS=2, PARTITIONS=25, STORAGE_PROFILES='default';

-- Root entity: Artist is the colocation anchor for the catalog chain
CREATE TABLE IF NOT EXISTS Artist (
    ArtistId INT NOT NULL,
    Name VARCHAR(120) NOT NULL,
    PRIMARY KEY (ArtistId)
) ZONE MusicStore;
-- ArtistId is both PK and colocation column (default when only one PK column)

-- Child entity: Album colocates by ArtistId
CREATE TABLE IF NOT EXISTS Album (
    AlbumId INT NOT NULL,
    ArtistId INT NOT NULL,
    Title VARCHAR(160) NOT NULL,
    PRIMARY KEY (AlbumId, ArtistId)
) COLOCATE BY (ArtistId) ZONE MusicStore;
-- ArtistId in the PK because it's the colocation column

-- Secondary index: supports joins and lookups by ArtistId within a partition
CREATE INDEX IF NOT EXISTS IFK_AlbumArtistId ON Album (ArtistId);

-- Grandchild: Track colocates by AlbumId
CREATE TABLE IF NOT EXISTS Track (
    TrackId INT NOT NULL,
    AlbumId INT,
    Name VARCHAR(200) NOT NULL,
    MediaTypeId INT NOT NULL,
    GenreId INT,
    Composer VARCHAR(220),
    Milliseconds INT NOT NULL,
    Bytes INT,
    UnitPrice DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (TrackId, AlbumId)
) COLOCATE BY (AlbumId) ZONE MusicStore;

The reference data zone uses full replication:

-- Reference zone: full replication for small lookup tables
CREATE ZONE IF NOT EXISTS MusicStoreReplicated
    WITH REPLICAS=3, PARTITIONS=25, STORAGE_PROFILES='default';

CREATE TABLE IF NOT EXISTS Genre (
    GenreId INT NOT NULL,
    Name VARCHAR(120),
    PRIMARY KEY (GenreId)
) ZONE MusicStoreReplicated;

CREATE TABLE IF NOT EXISTS MediaType (
    MediaTypeId INT NOT NULL,
    Name VARCHAR(120),
    PRIMARY KEY (MediaTypeId)
) ZONE MusicStoreReplicated;

Read the DDL clause by clause: ZONE picks the placement policy, COLOCATE BY declares the routing key, the composite PK ensures the routing key is available for lookups, and REPLICAS controls fault tolerance. Each line is a design decision expressed as SQL.

Always specify COLOCATE BY

Without an explicit COLOCATE BY, the engine uses the entire primary key to choose which partition stores each row. For Album with PK (AlbumId, ArtistId), that means two albums by the same artist can end up on different partitions because their AlbumId values differ. The colocation chain breaks: Artist and Album are no longer guaranteed to share a partition, and the join between them may cross the network.

Adding COLOCATE BY (ArtistId) fixes this. It tells the engine to partition Album rows by ArtistId alone, ignoring AlbumId. Every album by Artist 1 lands on the same partition as the Artist 1 row, which is what makes the join local. Always specify COLOCATE BY on child tables.

Checkpoint:The SYSTEM.ZONES query shows MusicStore with 2 replicas and MusicStoreReplicated with 3 replicas. The partition distribution query shows MusicStoreReplicated on all 3 nodes (25/25/25).

Apply the Process

The Music Store has a third relationship chain: Playlist and PlaylistTrack. Apply the five-step process to this domain fragment.

Step 1: Identify the access pattern. "Show me all tracks in a playlist." This traverses Playlist to PlaylistTrack to Track.

Step 2: Find the root entity. Playlist is the root.

Step 3: Build the colocation chain. PlaylistTrack should colocate by PlaylistId so that all entries in a playlist are stored on the same partition as the playlist itself.

Step 4: Accept cross-chain tradeoffs. PlaylistTrack references Track, which belongs to the catalog chain (colocated by AlbumId). The join from PlaylistTrack to Track crosses chains. This is acceptable because the primary access pattern ("show me the tracks in this playlist") colocates the playlist with its membership list, and the track details can be fetched in a second query if needed.

Step 5: Reference data? No. Playlists are user-created and can grow to any size.

Now write the DDL. Before looking at the solution below, write the CREATE TABLE statements for Playlist and PlaylistTrack based on the analysis above. Consider:

What is the primary key for Playlist?
What is the primary key for PlaylistTrack? (Remember: the colocation column must be in the PK.)
Which zone should these tables use?
Does PlaylistTrack need a COLOCATE BY clause?

Solution: Playlist chain DDL

-- Root entity: Playlist is the chain anchor
-- Simple PK because PlaylistId is both the identity and the colocation column
CREATE TABLE IF NOT EXISTS Playlist (
    PlaylistId INT NOT NULL,
    Name VARCHAR(120),
    PRIMARY KEY (PlaylistId)
) ZONE MusicStore;

-- Child entity: PlaylistTrack colocates by PlaylistId
-- Composite PK (PlaylistId, TrackId) because the colocation column must be in the PK
-- COLOCATE BY (PlaylistId) ensures all entries for the same playlist share a partition
CREATE TABLE IF NOT EXISTS PlaylistTrack (
    PlaylistId INT NOT NULL,
    TrackId INT NOT NULL,
    PRIMARY KEY (PlaylistId, TrackId)
) COLOCATE BY (PlaylistId) ZONE MusicStore;
-- TrackId references the catalog chain, so PlaylistTrack-to-Track joins cross chains

This is exactly how the Music Store schema defines these tables. Query SYSTEM.TABLES to confirm:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT TABLE_NAME, ZONE_NAME, TABLE_COLOCATION_COLUMNS FROM SYSTEM.TABLES WHERE TABLE_NAME IN ('PLAYLIST','PLAYLISTTRACK');"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT TABLE_NAME, ZONE_NAME, TABLE_COLOCATION_COLUMNS FROM SYSTEM.TABLES WHERE TABLE_NAME IN ('PLAYLIST','PLAYLISTTRACK');"

TABLE_NAME    | ZONE_NAME  | TABLE_COLOCATION_COLUMNS
--------------+------------+-------------------------
PLAYLIST      | MUSICSTORE | PLAYLISTID
PLAYLISTTRACK | MUSICSTORE | PLAYLISTID

Checkpoint:You wrote DDL for the Playlist chain that matches the existing schema: Playlist with a simple PK, PlaylistTrack with a composite PK and COLOCATE BY (PlaylistId), both in the MusicStore zone.

Colocated vs Cross-Chain Joins

So far, every query has touched a single table. Joins are where colocation pays off, because the planner can execute a join locally when both tables share a partition, or must shuffle data between nodes when they do not.

Colocated join: Artist to Album

Artist and Album share the catalog colocation chain. A join filtered by ArtistId should execute within a single partition:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "EXPLAIN PLAN FOR SELECT a.Name, al.Title FROM Artist a JOIN Album al ON a.ArtistId = al.ArtistId WHERE a.ArtistId = 1;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "EXPLAIN PLAN FOR SELECT a.Name, al.Title FROM Artist a JOIN Album al ON a.ArtistId = al.ArtistId WHERE a.ArtistId = 1;"

Apache Ignite 3
GridGain 9

Exchange
    distribution: single
    est: (rows=1)
  Project
      fieldNames: [NAME, TITLE]
      est: (rows=1)
    MergeJoin
        predicate: =(ARTISTID$0, ARTISTID)
        type: inner
        est: (rows=1)
      IndexScan
          table: PUBLIC.ALBUM
          index: IFK_ALBUMARTISTID
          type: SORTED
          collation: [ARTISTID ASC]
          est: (rows=347)
      Sort
          collation: [ARTISTID ASC]
          est: (rows=1)
        IndexScan
            table: PUBLIC.ARTIST
            index: ARTIST_PK
            type: HASH
            predicate: =(ARTISTID, 1)
            est: (rows=1)

Exchange
    distribution: single
    est: (rows=347)
  Project
      fieldNames: [NAME, TITLE]
      est: (rows=347)
    MergeJoin
        predicate: =(ARTISTID$0, ARTISTID)
        type: inner
        est: (rows=347)
      IndexScan
          table: PUBLIC.ALBUM
          index: IFK_ALBUMARTISTID
          type: SORTED
          collation: [ARTISTID ASC]
          est: (rows=347)
      Sort
          collation: [ARTISTID ASC]
          est: (rows=1)
        TableScan
            table: PUBLIC.ARTIST
            predicate: =(ARTISTID, 1)
            est: (rows=1)

A MergeJoin between Album (IndexScan on IFK_ALBUMARTISTID) and Artist. The single Exchange at the top collects results to the coordinator, but there is no Exchange between the join inputs. Both tables are colocated by ArtistId, so the join executes locally on one node without moving data. The planner accesses the Artist table differently between products (IndexScan on the primary key vs TableScan), but the routing and join strategy are the same.

Cross-chain join: Customer to Track

A query that joins across the commerce chain and catalog chain requires data movement:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "EXPLAIN PLAN FOR SELECT c.FirstName, t.Name FROM Customer c JOIN Invoice i ON c.CustomerId = i.CustomerId JOIN InvoiceLine il ON i.InvoiceId = il.InvoiceId JOIN Track t ON il.TrackId = t.TrackId WHERE c.CustomerId = 1;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "EXPLAIN PLAN FOR SELECT c.FirstName, t.Name FROM Customer c JOIN Invoice i ON c.CustomerId = i.CustomerId JOIN InvoiceLine il ON i.InvoiceId = il.InvoiceId JOIN Track t ON il.TrackId = t.TrackId WHERE c.CustomerId = 1;"

Apache Ignite 3
GridGain 9

Project
    fieldNames: [FIRSTNAME, NAME]
    est: (rows=1232866)
  MergeJoin
      predicate: =(TRACKID$0, TRACKID)
      est: (rows=1232866)
    Exchange                          <-- data movement
        distribution: single
      Sort
        TableScan
            table: PUBLIC.TRACK
            est: (rows=3503)
    HashJoin
        predicate: =(INVOICEID$0, INVOICEID)
        est: (rows=2346)
      Exchange                        <-- data movement
          distribution: single
        IndexScan
            table: PUBLIC.INVOICELINE
            index: IFK_INVOICELINETRACKID
            est: (rows=2240)
      MergeJoin
          predicate: =(CUSTOMERID$0, CUSTOMERID)
          est: (rows=7)
        Exchange                      <-- data movement
            distribution: single
          IndexScan
              table: PUBLIC.INVOICE
              index: IFK_INVOICECUSTOMERID
              est: (rows=412)
        Exchange                      <-- data movement
            distribution: single
          Sort
            IndexScan
                table: PUBLIC.CUSTOMER
                index: CUSTOMER_PK
                predicate: =(CUSTOMERID, 1)
                est: (rows=1)

Project
    fieldNames: [FIRSTNAME, NAME]
    est: (rows=1)
  MergeJoin
      predicate: =(CUSTOMERID, CUSTOMERID$0)
      est: (rows=1)
    Exchange                          <-- data movement
        distribution: single
      Sort
        TableScan
            table: PUBLIC.CUSTOMER
            predicate: =(CUSTOMERID, 1)
            est: (rows=1)
    HashJoin
        predicate: =(INVOICEID, INVOICEID$0)
        est: (rows=1)
      Exchange                        <-- data movement
          distribution: single
        IndexScan
            table: PUBLIC.INVOICE
            index: IFK_INVOICECUSTOMERID
            est: (rows=1)
      MergeJoin
          predicate: =(TRACKID, TRACKID$0)
          est: (rows=1)
        Exchange                      <-- data movement
            distribution: single
          IndexScan
              table: PUBLIC.INVOICELINE
              index: IFK_INVOICELINETRACKID
              est: (rows=1)
        Exchange                      <-- data movement
            distribution: single
          Sort
            TableScan
                table: PUBLIC.TRACK
                est: (rows=1)

Both plans show four Exchange nodes (marked with arrows), each representing data moving between nodes. The planners choose different join orders between products (Ignite 3 starts with Track, GridGain 9 starts with Customer), but the structural lesson is the same: crossing colocation chains forces data movement.

The Customer-to-Invoice join stays within the commerce chain. The InvoiceLine-to-Track join crosses into the catalog chain, which forces the planner to shuffle both tables to a common node.

Compare the two plans:

	Colocated (Artist-Album)	Cross-chain (Customer-Track)
Exchange nodes	1 (result collection only)	4 (data movement)
Join execution	Local on one node	Data shuffled between nodes
Performance at scale	Constant (single partition)	Degrades linearly with data volume

At the Music Store's current scale, both queries return instantly, so the difference is academic. The next step loads production-volume data where it stops being academic.

Checkpoint:The colocated EXPLAIN plan shows no Exchange operators between the join inputs. The cross-chain plan shows four Exchange operators, one for each table scan that must be shuffled to complete the join.

Verify at Scale

With 59 customers and 2,240 invoice lines, every query runs fast regardless of how the schema is designed. To see schema design decisions make a real difference, load a dataset that simulates production volume in the commerce chain.

Download the commerce data generator and run it against your cluster. The generator uses the Data Streamer API to load 3.4 million rows in approximately 30 seconds:

curl -sO /assets/tools/commerce-data-generator.jar

java -jar commerce-data-generator.jar --customers 50000

The JAR connects to localhost:10800 by default. If your cluster uses a different address, add --host node1:10800. The generator reads existing TrackIds from the Chinook catalog, so the Music Store schema must be loaded first. The same JAR works with both Apache Ignite 3 and GridGain 9.

When it finishes, the cluster will contain approximately 50,000 customers, 518,000 invoices, and 2.85 million invoice lines alongside the existing Music Store catalog.

Production context

This dataset simulates two years of customer activity for a music store. In a production deployment, this data arrives continuously from order processing pipelines via Kafka Connect or the Data Streamer API. The schema design determines whether those writes are colocated (fast, single-partition) and whether dashboard queries serve results from a single partition. The Data Streamer tutorial covers the Data Streamer API for high-throughput ingestion.

Verify the partition distribution at scale

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT NODE_NAME, ZONE_NAME, COUNT(*) AS PARTITIONS, SUM(ESTIMATED_ROWS) AS TOTAL_ROWS FROM SYSTEM.LOCAL_ZONE_PARTITION_STATES WHERE ZONE_NAME = 'MUSICSTORE' GROUP BY NODE_NAME, ZONE_NAME ORDER BY NODE_NAME;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT NODE_NAME, ZONE_NAME, COUNT(*) AS PARTITIONS, SUM(ESTIMATED_ROWS) AS TOTAL_ROWS FROM SYSTEM.LOCAL_ZONE_PARTITION_STATES WHERE ZONE_NAME = 'MUSICSTORE' GROUP BY NODE_NAME, ZONE_NAME ORDER BY NODE_NAME;"

NODE_NAME | ZONE_NAME  | PARTITIONS | TOTAL_ROWS
----------+------------+------------+-----------
node1     | MUSICSTORE | 18         | 2463124
node2     | MUSICSTORE | 17         | 2339985
node3     | MUSICSTORE | 15         | 2060905

3.4 million rows distributed across three nodes. The EXPLAIN plans for the Album table do not change at this scale (the Album table itself is the same size), but the commerce queries that join across millions of rows show the real impact of colocation. Run a colocated commerce query and a cross-chain aggregation side by side to see the timing difference.

Colocated commerce query

All invoices for a specific customer, staying within the commerce colocation chain:

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT c.FirstName, c.LastName, COUNT(i.InvoiceId) AS OrderCount, SUM(i.Total) AS TotalSpent FROM Customer c JOIN Invoice i ON c.CustomerId = i.CustomerId WHERE c.CustomerId = 1001 GROUP BY c.FirstName, c.LastName;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT c.FirstName, c.LastName, COUNT(i.InvoiceId) AS OrderCount, SUM(i.Total) AS TotalSpent FROM Customer c JOIN Invoice i ON c.CustomerId = i.CustomerId WHERE c.CustomerId = 1001 GROUP BY c.FirstName, c.LastName;"

FIRSTNAME | LASTNAME | ORDERCOUNT | TOTALSPENT
----------+----------+------------+-----------
Raj       | Garcia   | 3          | 51.48

Because Customer and Invoice are colocated by CustomerId, the entire join executes on a single partition without touching the network. On a 3-node cluster with 50,000 customers, expect this to return in roughly 3-4 seconds.

Cross-chain aggregation

Top-selling tracks by artist across the entire commerce dataset. This joins InvoiceLine (commerce chain) through Track, Album, and Artist (catalog chain):

Apache Ignite 3
GridGain 9

docker exec ignite3-node1 /opt/ignite3cli/bin/ignite3 sql \
  "SELECT ar.Name AS Artist, t.Name AS Track, COUNT(*) AS TimesSold FROM InvoiceLine il JOIN Track t ON il.TrackId = t.TrackId JOIN Album al ON t.AlbumId = al.AlbumId JOIN Artist ar ON al.ArtistId = ar.ArtistId GROUP BY ar.Name, t.Name ORDER BY TimesSold DESC LIMIT 10;"

docker exec gridgain9-node1 /opt/gridgain9cli/bin/gridgain9 sql \
  "SELECT ar.Name AS Artist, t.Name AS Track, COUNT(*) AS TimesSold FROM InvoiceLine il JOIN Track t ON il.TrackId = t.TrackId JOIN Album al ON t.AlbumId = al.AlbumId JOIN Artist ar ON al.ArtistId = ar.ArtistId GROUP BY ar.Name, t.Name ORDER BY TimesSold DESC LIMIT 10;"

ARTIST       | TRACK                   | TIMESSOLD
-------------+-------------------------+----------
Iron Maiden  | The Number Of The Beast | 4120
Iron Maiden  | The Trooper             | 4060
Iron Maiden  | Hallowed Be Thy Name    | 4055
Iron Maiden  | 2 Minutes To Midnight   | 3964
Iron Maiden  | The Evil That Men Do    | 3393
Iron Maiden  | Iron Maiden             | 3258
Iron Maiden  | Wrathchild              | 3240
Iron Maiden  | Fear Of The Dark        | 3179
Iron Maiden  | Heaven Can Wait         | 2548
Led Zeppelin | I Can't Quit You Baby   | 2534

Nine of the top 10 tracks are Iron Maiden. The Chinook catalog has 213 Iron Maiden tracks (more than any other artist), and the generator's random invoice lines reference real TrackIds, so Iron Maiden songs appear in more invoices by sheer weight of catalog presence.

More interesting than the results is the time it took. This query joins across both colocation chains: InvoiceLine (commerce) to Track to Album to Artist (catalog). The database shuffled 2.85 million InvoiceLine rows across the network to join them with the catalog tables. Expect roughly 7-8 seconds on a 3-node cluster, about twice the colocated commerce query. At 3.4 million rows the gap is noticeable. At production volumes (hundreds of millions of rows), it becomes the dominant factor in query latency, because cross-chain joins degrade linearly with data while colocated joins stay constant.

Checkpoint:The partition distribution query shows over 2 million rows per node. The colocated commerce query returns in a few seconds. The cross-chain aggregation shows Iron Maiden dominating the results but takes noticeably longer.

Common Anti-Patterns

Before designing your own schema, review these common mistakes:

Anti-Pattern	Why It Fails	Fix
Porting an RDBMS schema unchanged	No colocation. All joins cross the network.	Analyze access patterns and add `COLOCATE BY`.
Simple PK on child tables	Cannot colocate without the parent's key in the PK.	Composite PK with the parent's key column.
Omitting `COLOCATE BY`	Defaults to all PK columns. Requires the full PK in `WHERE` for single-partition routing.	Always specify `COLOCATE BY` explicitly.
One zone for everything	Reference data and transactional data have different replication needs.	Separate zones: partitioned for transactional, replicated for reference.
Not verifying the design	Incorrect colocation is invisible until production load.	Query `SYSTEM.TABLES` and `SYSTEM.ZONES` after every schema change.

Summary

Schema design in distributed SQL is performance design. The DDL you write decides where data lands, and where data lands decides how fast your queries run.

The five-step process applies to any domain:

Identify access patterns to find which tables are queried together
Find root entities that anchor each relationship chain
Build colocation chains from root to leaf using shared columns
Accept cross-chain tradeoffs where colocation chains meet
Identify reference data that benefits from full replication

Each of those decisions maps to a DDL clause: CREATE ZONE for placement and replication, COLOCATE BY for the routing key, and composite primary keys for partition assignment. System views let you verify the design after the fact, and the scale test showed the payoff: colocated queries stay on one partition while cross-chain queries pay for every row they shuffle across the network.

What's next:

Implement Schema with Java Annotations (coming soon) expresses the same design as code, with compile-time validation and automatic DDL generation.
Choose the Right Data Access Pattern (coming soon) shows how your schema design shapes which API view (KeyValueView, RecordView, or SQL) is the right choice for each operation.
Stream Data with the Data Streamer API (coming soon) builds on this schema to load data at high throughput, where colocated writes matter for ingestion performance.

Was this helpful?

Introduction​

Prerequisites​

What You Will Learn​

The Distributed SQL Shift​

Analyze the Domain​

Identifying access patterns​

Finding the root entities​

Building colocation chains​

Accepting cross-chain tradeoffs​

Identifying reference data​

Verifying against the cluster​

Composite Primary Keys​

How composite keys affect query routing​

The cost of composite keys​

Distribution Zones​

The decision framework​

DDL walkthrough​

Apply the Process​

Colocated vs Cross-Chain Joins​

Colocated join: Artist to Album​

Cross-chain join: Customer to Track​

Verify at Scale​

Verify the partition distribution at scale​

Colocated commerce query​

Cross-chain aggregation​

Common Anti-Patterns​

Summary​