Distkeys and Sortkeys are Redshift-only column designations that can help speed up query performance.
Sortkeys
What is a Sortkey?
A sortkey is a designation given to a column that allows Redshift to optimize its query planning process. Assigning a sortkey can help Redshift quickly find data when executing a query.
How Many Can a Table Have?
Redshift allows for up to 400 columns in a table to be designated as sortkeys. Civis's import feature currently supports a maximum of two sortkeys per table.
Which Columns Should be Sortkeys?
A good rule of thumb is to designate any column that is commonly used in JOINs or in the “WHERE” clause of a SQL query as a sortkey. A prominent example would be a date column. Here’s an example from Amazon’s documentation:
“For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98% of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned.”
When it comes to choosing the best sortkey, Amazon has this to say:
- If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
- Queries will be more efficient because they can skip entire blocks that fall outside the time range.
- If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
- Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
- If you frequently join a table, specify the join column as both the sort key and the distribution key.
- This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
What About Tables That Add/Remove Data Often?
When you add data to a table, it’s sorted but also temporarily stored in a separate region on disk. To keep your table fully sorted you’ll occasionally need to run a VACUUM command on it. Remember that vacuuming is resource-intensive so you’ll want to run the command when you expect there to be minimal activity on the cluster. You can execute a VACUUM in a scheduled SQL Script.
Distkeys
What is a Distkey?
A distkey is a way for Redshift to decide which row goes to which node of a cluster. This can be useful when joining datasets together because it lets Redshift know where to easily locate the queried data.
Imagine if you have information stored in a bunch of different file cabinets. Within those cabinets you have two types of information: a file on each student and report cards. If the student's file and report cards were all put into your set of file cabinets at random, it would take you a long time to match up students with report cards. But if you knew that a student’s file and their report card were in the same filing cabinet, you’re going to be able to match them up much quicker. That’s what assigning a distkey does.
How Many Can a Table Have?
Because Redshift uses a distkey to decide on which node to place the row of data, there can only be one per table.
Which Column Should Be a Distkey?
The column that is most commonly used in joins should be the distkey. Remember that the column that you’re joining on should be the distkey in both tables to fully gain the benefit of the distkey designation.
Assigning Keys to a Table
If you’re creating a table from scratch, your code would look something like this:
CREATE TABLE example_table ( column_1 int distkey, column_2 int, column_3 int) SORTKEY(column_1,column_3);
You can use ALTER TABLE to change a table's distkey if desired.
When it comes to distkeys, beware of data skewing. This is where a table’s data is stored unevenly across nodes within a cluster. Data skewing can have a negative impact on query performance. An example of this would be if you had a table of every registered voter in California and Alaska and you set “state” as the distkey. One of the nodes would have all of the California voters while another node would contain Alaska voters, which, given the population difference between the two states, would skew distribution. Your goal should be to have a table’s data spread approximately evenly across the nodes in the cluster.
Comments
0 comments
Please sign in to leave a comment.