Reference tables are relatively small tables that do not need to be distributed and are present on every node in the cluster. They are both created and written to on the master aggregator. Reference tables are implemented via primary-secondary replication to every node in the cluster from the master aggregator. Replication enables reference tables to be dynamic: updates that you perform to a reference table on the master aggregator are quickly reflected on every machine in the cluster.
aggregators can take advantage of reference tables’ ubiquity by pushing joins between reference tables and a distributed table onto the leaves. Imagine you have a distributed
clicks table storing billions of records and a smaller
customers table with just a few million records. Since the
customers table is small, it can be replicated on every node in the cluster. If you run a join between the
clicks table and the
customers table, then the bulk of the work for the join will occur on the leaves.
Reference tables are a convenient way to implement dimension tables.