- Published on
Redshift Deep Dive Best Practices
- Authors
- Name
- Sanjay Krishna
- sanjaykrishna
Deep Dive Redshift
Data Ingestion:
- Small Files instead of Large Files using COPY command
- Split larger input files into smaller files. This helps use all compute nodes in a cluster rather than just the first one if a single large file is used.
- Make the number of files a multiple of the number of slices in your cluster.
- Use INSERT INTO SELECT from external S3 tables
- Aggregate the incoming data
- Select subset of columns and / or rows (instead of SELECT * )
- Manipulate incoming column data with SQL
- Load data in alternative file formats such as AWS ION, Grok, Parquet, ORC etc.
- Best Practices
- Save cluster resources for querying and reporting rather than on ELT
- Filtering / aggregating incoming data can improve performance over COPY
- Redshift is designed for OLAP workloads
- Small write operations (~1 - 10 rows) has a similar cost to a larger write (~100K rows)
- DELETE command uses logical deletion and since the smallest immutable blocks in Redshift use 1MB of data, it is recommended to use it wisely.
- Add a vacuum statement (VACUUM) after each delete operation to remove ghost rows. This is normally done automatically by Redshift.
- AUTO VACUUM DELETE will reclaim space & AUTO TABLE SORT will sort needed portions of the table.
- If you are aware of your workload times for typical ELT operations, it is recommended to run VACUUM during off-peak periods.
- VACUUM BOOST is also another option (must be run only during off-peak hours, as it is very resource intensive)
- Run ANALYZE, typically after an ingest operation is complete on just the columns that WHERE predicates are filtered on.
- AWS has redshift utility script to run check what tables can be regularly analyzed and / or vacuumed based on certain set of thresholds. This can be run as a part of a monthly / periodical cluster maintenance.
Workload Management(WLM):
- SQA
- Redshift automatically identifies (using ML) detects short-running queries and runs them within the short query queue, if queuing occurs.
- It is enabled by default
- Concurrency Scaling
- When queues are full, queries are routed to a transient (spun-up temporary in a few seconds) Redshift clusters.
- Concurrency Scaling needs to be enabled. It is turned off by default as it does cost some $$. But there are a few caveats.
- For every 24 hours that your main Redshift cluster is in use, a 1 hour credit for Concurrency Scaling is accrued. Basically it means that Concurrency Scaling should be free for most Redshift customers.
- It is recommended to use automatic WLM if your workload is highly unpredictable or if you are using the old default WLM.
- Use WLM to limit ingestion / ELT concurrency to 2 - 3.
- To maximize query throughput, use WLM to throttle the number of concurrent queries to 15 or less.
- It is recommended to the save the
superuser
queue for administration tasks and canceling queries. - Concurrency Scaling does come with a few limitations :
- It doesn't support queries on tables that use interleaved sort keys.
- It doesn't support queries on temporary tables.
- It doesn't support queries that access external resources that are protected by restrictive network or virtual private cloud (VPC) configurations.
- It doesn't support queries that contain Python user-defined functions (UDFs).
- It doesn't support queries that access system tables, PostgreSQL catalog tables, or no-backup tables.
- It doesn’t support COPY or UNLOAD queries that access an external resource that has restrictive resource policies. External resources can include Amazon S3 buckets or DynamoDB tables. Policies can restrict access to a specific source VPC (
aws:sourceVpc
), source VPC endpoint (aws:sourceVpce
), or source IP address (aws:SourceIp
). In some cases, you might need to remove policies on an external resource. Doing this means that COPY or UNLOAD queries accessing the resource are sent to the concurrency-scaling cluster. - Amazon Redshift concurrency scaling for write operations is not supported for DDL operations, such as CREATE TABLE or ALTER TABLE.
- It doesn't support ANALYZE for the COPY command.
- It doesn't support write operations on a target table where DISTSTYLE is set to ALL.
- It doesn't support COPY from Amazon Redshift Spectrum or Amazon EMR.
- It doesn't support write operations on tables with identity columns.
- Amazon Redshift supports concurrency scaling for write operations on only Amazon Redshift RA3 nodes, specifically ra3.16xlarge, ra3.4xlarge, and ra3.xlplus. Concurrency scaling for write operations isn't supported on other node types.
Query Monitoring Rules(QMR):
- QMR is an extension of WLM
- It allows the automatic handling of poorly written queries (yes, we’ve all and wrote those!)
- It helps assign rules to a query that meets a particular criteria. For example, a rule can be setup to abort a long running query, or skipped.
- QMR helps wasteful use of a cluster and also log resource-intensive queries that could be candidates for query tuning for performance improvements.
- QMR + Redshift Advisor can serve as a good resource for optimizing usage of your Redshift cluster for optimum performance.
- Use QMR rather than WLM to set query timeouts.
Sources:
[1]: AWS Docs
[2]: Stack Overflow 1.
[4]: AWS Docs for VACUUM