Best practices guide

The following sections describe best practices for administering and operating a Dremio cluster.

Think in Terms of Several Discrete Data Reflections

Data Reflections let administrators take an iterative approach to performance optimization. Because Data Reflections do not require any change in the behavior of the Data Consumer, administrators can add and refine Data Reflections on an ongoing basis with little to no impact on ongoing workloads.

Optimize Data Reflections

To determine the optimal set of Data Reflections, administrators should isolate known query patterns into groups that do not interact with one another. Having more discrete query groups means:

  • Smaller Data Reflections on disk.

  • More efficient Data Reflection maintenance.

  • More efficient query execution.

Keep in mind that a single query can use multiple Data Reflections and a single Data Reflection can serve many queries.
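
As a rough illustration of the grouping described above, the following sketch partitions query patterns into groups that have nothing in common. It assumes that two patterns "interact" when they read at least one common dataset; that definition, the helper name group_query_patterns, and the sample pattern and dataset names are all hypothetical and are not part of Dremio.

```python
from collections import defaultdict


def group_query_patterns(patterns):
    """Group query patterns that share at least one dataset.

    `patterns` maps a pattern name to the set of datasets it reads.
    Patterns are linked transitively, so the result is the set of
    connected components (computed with a small union-find).
    """
    parent = {name: name for name in patterns}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every pattern to the first pattern seen using the same dataset.
    first_user = {}
    for name, datasets in patterns.items():
        for dataset in datasets:
            if dataset in first_user:
                union(name, first_user[dataset])
            else:
                first_user[dataset] = name

    groups = defaultdict(set)
    for name in patterns:
        groups[find(name)].add(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical query patterns and the datasets they read.
patterns = {
    "daily_sales_dashboard": {"sales", "stores"},
    "store_ranking": {"stores", "regions"},
    "web_traffic_report": {"clickstream"},
}
groups = group_query_patterns(patterns)
for group in groups:
    print(group)
```

Each resulting group is a candidate for its own small set of Data Reflections. In practice the criterion for "interaction" may be finer than shared datasets (for example, shared columns or join paths), so treat the output as a starting point.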

Accelerate a query pattern

Dremio supports two types of Data Reflections: Raw Reflections and Aggregation Reflections. If a known query pattern returns row-level information, Raw Reflections are appropriate. If the query returns summarized data based on GROUP BY expressions or aggregations (e.g., SUM, AVG, COUNT, MIN, MAX), then an Aggregation Reflection is appropriate.
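
The rule of thumb above can be sketched as a simple text check. This is a deliberately naive heuristic, not how Dremio chooses or matches Data Reflections: the function name suggest_reflection_type is hypothetical, and the regular expressions ignore SQL comments, string literals, and subqueries.

```python
import re

# Aggregate functions named in the text, followed by an opening parenthesis.
AGGREGATE_PATTERN = re.compile(r"\b(SUM|AVG|COUNT|MIN|MAX)\s*\(", re.IGNORECASE)
GROUP_BY_PATTERN = re.compile(r"\bGROUP\s+BY\b", re.IGNORECASE)


def suggest_reflection_type(sql: str) -> str:
    """Return "aggregation" for summarizing queries, otherwise "raw"."""
    if GROUP_BY_PATTERN.search(sql) or AGGREGATE_PATTERN.search(sql):
        return "aggregation"
    return "raw"


# Hypothetical query patterns.
summary_query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
detail_query = "SELECT order_id, amount FROM sales WHERE region = 'EU'"

print(suggest_reflection_type(summary_query))
print(suggest_reflection_type(detail_query))
```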

Aggregation

Dremio can pre-aggregate data at multiple levels of granularity. Then, at query time, Dremio can determine how to further aggregate the data as appropriate. Administrators can create Aggregation Reflections that include the finest granularity as well as the coarsest granularity, and Dremio will automatically aggregate at the appropriate level at query time.
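
To see why data pre-aggregated at a fine granularity can answer coarser questions, consider the following sketch. It illustrates the general roll-up idea only and does not describe Dremio's internal implementation; the sample rows and function names are hypothetical. Note that an average is rolled up from stored sums and counts rather than by averaging averages.

```python
from collections import defaultdict

# Hypothetical row-level data: (day, region, amount).
rows = [
    ("2022-06-01", "EU", 100.0),
    ("2022-06-01", "EU", 50.0),
    ("2022-06-02", "EU", 25.0),
    ("2022-06-01", "US", 200.0),
]


def pre_aggregate(rows):
    """Fine granularity: one (sum, count) pair per (day, region)."""
    agg = defaultdict(lambda: [0.0, 0])
    for day, region, amount in rows:
        agg[(day, region)][0] += amount
        agg[(day, region)][1] += 1
    return {key: tuple(value) for key, value in agg.items()}


def roll_up_by_region(fine):
    """Coarser granularity computed from the pre-aggregated data only."""
    coarse = defaultdict(lambda: [0.0, 0])
    for (_day, region), (total, count) in fine.items():
        coarse[region][0] += total
        coarse[region][1] += count
    return {
        region: {"sum": total, "count": count, "avg": total / count}
        for region, (total, count) in coarse.items()
    }


fine = pre_aggregate(rows)
by_region = roll_up_by_region(fine)
print(by_region)
```

The roll-up never touches the original rows, which is the property that lets a pre-aggregated result serve queries at a coarser level than the one at which it was built.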

Calculated Fields

For calculated fields that are frequently used by Data Consumers, administrators have a few different options for accelerating these calculations:

  • Add the calculated field to a virtual dataset - The administrator can add a new column that provides the calculation. Depending on the expression, Dremio may be able to match the new column without making the Data Consumers explicitly use the new column. Otherwise, they will need to include the new column in their queries.

  • Use a Supporting Anchor Dataset - The administrator can create a Supporting Anchor Dataset that includes the calculated field along with other fields from the dataset, and Dremio will automatically use the associated Data Reflection to accelerate the query.

Last modified: June 22, 2022
