Performance optimizations in Apache Impala
历史和动机
SQL on Apache Hadoop
- SQL
- Run on top of HDFS
- Supported various file formats
- Converted query operators to map-reduce jobs
- Run at scale
- Fault-tolerant
- High startup-costs/materialization overhead (slow…)
impala performance optimizations
query planning
2-phase cost-based optimizer
- Phase 1: Generate a single node plan (transformations, join ordering, static partition pruning, runtime filters)
- Phase 2: Convert the single node plan into a distributed query execution plan (add exchange nodes, decide join strategy)
Query fragments (units of work): - Parts of query execution tree
- Each fragment is executed in one or more impalads
metadata & statistics
Metadata
• Table metadata (HMS) and block level (HDFS) information are cached to speed-up query time
• Cached data is stored in FlatBuffers to save space and avoid excessive GC
• Metadata loading from HDFS/S3/ADLS uses a thread pool to speedup the operation when needed
Statistics
• Impala uses an HLL to compute Number of distinct values (NDV)
• HLL is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
• HLL size is 1KB per column
• A Novel implementation of sampled stats is coming soon
Query optimizations based on statistics
本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自 WBINGのBLOG!
评论