Troubleshooting Redshift Slowness

There are multiple reasons why a Redshift cluster may appear slow. This document will detail potential causes and solutions.

Running Queries

There are multiple ways that running queries can cause Redshift slowness. It's important to know how to find out what queries are running on your cluster. To do this you can run the following SQL statement:

SELECT *
FROM stv_recents
WHERE status = 'Running'
ORDER BY duration DESC;

It is recommended that you evaluate the longest-running queries to see if one is potentially blocking others. If so, you may want to cancel the query. The result set from the above query contains a "pid" column that you can use by running:

CANCEL <pid>;

Queue

By default your Redshift cluster is configured to allow for 10 queries to be running simultaneously. This means that if you have more than that number of queries trying to run at the same time, some queries may be queued. Use the query from the Running Queries section of this document to see how many queries are running on your cluster. If there are 10+ queries running, you may be running into queueing. You will need to make a determination on whether there are queries that should be canceled or if you should let everything run.

Table Locking

One reason a query might not be completing is that a table referenced in the query is currently locked. There are certain operations (such as DROP TABLE and TRUNCATE) that will lock a table and prevent other queries from completing until the lock is released. See what queries are running on your cluster and determine if there are any DROPs or TRUNCATEs happening on a table that is referenced in your query. It may be that the operation is stuck, or somehow there's a deadlock situation. Our general advice is to cancel the longest-running query that may be causing a lock to see if that allows for other queries to complete.

Poorly Performing Queries

Unoptimized queries will run for longer and consume more database compute resources. Longer running queries can also make queueing or table locking problems worse.

Redshift provides system tables with information about query performance. STL_QUERY_METRICS provides information about completed queries and STV_QUERY_METRICS provides information about running queries. By joining these tables to the corresponding STL_QUERY or STV_QUERY table it is possible to identify the Platform Job and Run that executed a given query. Using these tables you can identify poorly performing queries. The example below will use STL_QUERY_METRICS and STL_QUERY.

Run this query to identify poorly performing queries:

select

q.userid, split_part(label, '-', 2 ) as platform_job_id, split_part(label, '-', 3 ) as platform_run_id,  querytxt,

query_cpu_time, query_blocks_read, query_execution_time, query_cpu_usage_percent, query_temp_blocks_to_disk, return_row_count,

nested_loop_join_row_count, join_row_count, scan_row_count, cpu_skew, io_skew, segment_execution_time, query_queue_time

from

stl_query q, svl_query_metrics_summary qms

where qms.query = q.query

and label like('console-%-%')

order by

   query_temp_blocks_to_disk desc nulls last,

   return_row_count desc nulls last,

   nested_loop_join_row_count desc nulls last,

   join_row_count desc nulls last,

   scan_row_count desc nulls last

The ordering in the where clause is a rough approximation of most resource intensive to least resource intensive.

Once you have identified poorly performing queries use the EXPLAIN command for additional information. Reference the Redshift documentation on analyzing and improving queries for more information on how to better optimize your queries.

VACUUM

VACUUMing is a resource-intensive command that re-sorts rows and reclaims space on your cluster. Civis does not run any VACUUMing on your cluster by default. If you run a VACUUM on your cluster, these are best done overnight, but there are scenarios in which a VACUUM may be running once business hours begin. Check if a VACUUM is running on your cluster and cancel it if you'd like.

Table Structure

Unsorted Tables

Tables that have a large percentage of unsorted rows can cause query slowness. AWS recommends that any tables with 20% or greater unsorted rows should be VACUUMed. This query will produce a list of tables that meet this criteria:

select trim(pgn.nspname) as schema, 
trim(a.name) as table, id as tableid, 
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, dist_ratio.ratio::decimal(10,4) as skew, 
a.rows, 
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows , 
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted 
from (select db_id, id, name, sum(rows) as rows, 
sum(rows)-sum(sorted_rows) as unsorted_rows 
from stv_tbl_perm a 
group by db_id, id, name) as a 
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes 
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid, 
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort , 
max(attsortkeyord) as n_sortkeys, 
max(attencodingtype) as max_enc 
from pg_attribute group by 1) as det 
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio 
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice ) 
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null and pct_unsorted >= 20
order by pct_unsorted desc nulls last;

You can then run "VACUUM SORT ONLY schema.table" for any relevant tables.

Skewed Tables

Tables that have a high skew value may indicate a suboptimal distkey designation. AWS recommends that any tables with a skew value of 4.00 or higher should consider a different distkey. See Distkeys and Sortkeys. This query will show you tables with a high skew value:

select trim(pgn.nspname) as schema
, trim(a.name) as table
, id as tableid
, decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey
, dist_ratio.ratio::decimal(10,4) as skew
, b.mbytes
, a.rows
from (select db_id, id, name, sum(rows) as rows, 
sum(rows)-sum(sorted_rows) as unsorted_rows 
from stv_tbl_perm a 
group by db_id, id, name) as a 
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes 
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid, 
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort , 
max(attsortkeyord) as n_sortkeys, 
max(attencodingtype) as max_enc 
from pg_attribute group by 1) as det 
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio 
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice ) 
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null and skew >= 4
order by skew desc, mbytes desc;

Articles in this section

Running Queries

Queue

Table Locking

Poorly Performing Queries

VACUUM

Table Structure

Unsorted Tables

Skewed Tables

Comments

Articles in this section

Running Queries

Queue

Table Locking

Poorly Performing Queries

VACUUM

Table Structure

Unsorted Tables

Skewed Tables

Related articles