Apache Hive _ Stinger_ Petabyte Scale SQL, IN Hadoop Presentation.pptx_第1页
Apache Hive _ Stinger_ Petabyte Scale SQL, IN Hadoop Presentation.pptx_第2页
Apache Hive _ Stinger_ Petabyte Scale SQL, IN Hadoop Presentation.pptx_第3页
Apache Hive _ Stinger_ Petabyte Scale SQL, IN Hadoop Presentation.pptx_第4页
Apache Hive _ Stinger_ Petabyte Scale SQL, IN Hadoop Presentation.pptx_第5页
已阅读5页,还剩21页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Apache Hive and Stinger: SQL in Hadoop,Arun Murthy (acmurthy) Alan Gates (alanfgates) Owen OMalley (owen_omalley) hortonworks,YARN: Taking Hadoop Beyond Batch,Page 2,Applications Run Natively IN Hadoop,HDFS2 (Redundant, Reliable Storage),YARN (Cluster Resource Management),BATCH (MapReduce),INTERACTI

2、VE (Tez),STREAMING (Storm, S4,),GRAPH (Giraph),IN-MEMORY (Spark),HPC MPI (OpenMPI),ONLINE (HBase),OTHER (Search) (Weave),Store ALL DATA in one place Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service,Hadoop Beyond Batch with YARN,HADOOP 1,HDFS (redundant, re

3、liable storage),MapReduce (cluster resource management ,SELECT a.state,JOIN (a, c) SELECT c.price,SELECT b.id,JOIN(a, b) GROUP BY a.state COUNT(*) AVERAGE(c.price),M,M,M,R,R,M,M,R,M,M,R,M,M,R,HDFS,HDFS,HDFS,M,M,M,R,R,R,M,M,R,R,SELECT a.state, c.itemId,JOIN (a, c),JOIN(a, b) GROUP BY a.state COUNT(*)

4、 AVERAGE(c.price),SELECT b.id,Tez avoids unneeded writes to HDFS,Tez Sessions, because Map/Reduce query startup is expensive Tez Sessions Hot containers ready for immediate use Removes task and job launch overhead (5s 30s) Hive Session launch/shutdown in background (seamless, user not aware) Submits

5、 query plan directly to Tez Session Native Hadoop service, not ad-hoc,Tez Delivers Interactive Query - Out of the Box!,Page 8,Stinger Project (announced February 2013),Batch AND Interactive SQL-IN-Hadoop,Stinger InitiativeA broad, community-based effort to drive the next generation of HIVE,Coming So

6、on: Hive on Apache Tez Query Service Buffer Cache Cost Based Optimizer (Optiq) Vectorized Processing,Hive 0.11, May 2013: Base Optimizations SQL Analytic Functions ORCFile, Modern File Format,Hive 0.12, October 2013: VARCHAR, DATE Types ORCFile predicate pushdown Advanced Optimizations Performance B

7、oosts via YARN,SpeedImprove Hive query performance by 100X to allow for interactive query times (seconds),Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB,SQL Support broadest range of SQL semantics for analytic applications running against Hadoop,all IN Hadoop,Go

8、als:,Hive 0.12,SPEED: Increasing Hive Performance,Performance Improvements included in Hive 12 Base & advanced query optimization Startup time improvement Join optimizations,Interactive Query Times across ALL use cases Simple and advanced queries in seconds Integrates seamlessly with existing tools

9、Currently a 100 x improvement in just nine months,Stinger Phase 3: Interactive Query In Hadoop,Page 12,190 x Improvement,200 x Improvement,Query 27: Pricing Analytics using Star Schema Join Query 82: Inventory Analytics Joining 2 Large Fact Tables,All Results at Scale Factor 200 (Approximately 200GB

10、 Data),41.1s,4.2s,39.8s,4.1s,TPC-DS Query 52,TPC-DS Query 55,Query Time in Seconds,Speed: Delivering Interactive Query,Test Cluster: 200 GB Data (Impala: Parquet Hive: ORCFile) 20 Nodes, 24GB RAM each, 6x disk each,Hive 0.12,Query 52: Star Schema Join Query 5: Star Schema Join,22s,9.8s,31s,6.7s,TPC-

11、DS Query 28,TPC-DS Query 12,Query Time in Seconds,Speed: Delivering Interactive Query,Test Cluster: 200 GB Data (Impala: Parquet Hive: ORCFile) 20 Nodes, 24GB RAM each, 6x disk each,Hive 0.12,Query 28: Vectorization Query 12: Complex join (M-R-R pattern),AMPLab Big Data Benchmark,Page 15,45s,63s,63s

12、,9.4s,AMPLab Query 1a,AMPLab Query 1b,AMPLab Query 1c,Query Time in Seconds (lower is better),1.6s,2.3s,AMPLab Query 1: Simple Filter Query,AMPLab Big Data Benchmark,Page 16,466s,104.3s,490s,118.3s,552s,172.7s,AMPLab Query 2a,AMPLab Query 2b,AMPLab Query 2c,Query Time in Seconds (lower is better),AM

13、PLab Query 2: Group By IP Block and Aggregate,AMPLab Big Data Benchmark,Page 17,Query Time in Seconds (lower is better),AMPLab Query 3: Correlate Page Rankings and Revenues Across Time,How Stinger Phase 3 Delivers Interactive Query,Page 18,SQL: Enhancing SQL Semantics,SQL Compliance Hive 12 provides

14、 a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop,ORC File Format,Columnar format for complex data types Built into Hive from 0.11 Support for Pig and MapReduce via HCat Two levels of compression Lightweight type-specific and generic Built in i

15、ndexes Every 10,000 rows with position information Min, Max, Sum, Count of each column Supports seek to row number,Page 20,SCALE: Interactive Query at Petabyte Scale,Sustained Query Times Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale,131 GB (78% Smaller),File Size

16、 Comparison Across Encoding Methods Dataset: TPC-DS Scale 500 Dataset,221 GB (62% Smaller),Encoded with Text,Encoded with RCFile,Encoded with ORCFile,Encoded with Parquet,505 GB (14% Smaller),585 GB (Original Size),Larger Block Sizes Columnar format arranges columns adjacent within the file for comp

17、ression & fast access,Impala,Hive 12,Smaller Footprint Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster,ORC File Format,Hive 0.12 Predicate Push Down Improved run length encoding Adaptive string dictionaries Padding stripes to HDFS block boundaries Trunk St

18、ripe-based Input Splits Input Split elimination Vectorized Reader Customized Pig Load and Store functions,Page 22,Vectorized Query Execution,Designed for Modern Processor Architectures Avoid branching in the inner loop. Make the most use of L1 and L2 cache. How It Works Process records in batches of 1,000 rows Gen

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论