Performance optimizations in Apache Impala

历史和动机

SQL on Apache Hadoop

  • SQL
  • Run on top of HDFS
  • Supported various file formats
  • Converted query operators to map-reduce jobs
  • Run at scale
  • Fault-tolerant
  • High startup-costs/materialization overhead (slow…)

impala performance optimizations

query planning

2-phase cost-based optimizer

  • Phase 1: Generate a single node plan (transformations, join ordering, static partition pruning, runtime filters)
  • Phase 2: Convert the single node plan into a distributed query execution plan (add exchange nodes, decide join strategy)
    Query fragments (units of work):
  • Parts of query execution tree
  • Each fragment is executed in one or more impalads

metadata & statistics

Metadata
• Table metadata (HMS) and block level (HDFS) information are cached to speed-up query time
• Cached data is stored in FlatBuffers to save space and avoid excessive GC
• Metadata loading from HDFS/S3/ADLS uses a thread pool to speedup the operation when needed
Statistics
• Impala uses an HLL to compute Number of distinct values (NDV)
• HLL is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
• HLL size is 1KB per column
• A Novel implementation of sampled stats is coming soon

Query optimizations based on statistics

Author: weibingo
Link: https://wbice.cn/article/impala-3-0.html
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.