Impala v Hive - DataSciences版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

相关主题
● What's the best way to convert text/csv file into PARQUET	● Hive ODBC caching/latency的问题
● How to load csv file converted from excel file into Cloudera Hive or Impala?	● DS对数据库需要了解多少？
● Re: MapR Technologies continue hiring a lot of positions (转载)	● 贴个工作
● 讨论，（Big）Data Engineer到底是个什么职位	● data scientist对sql要求高吗
● 请问大家有没有直接用java全程写mapreduce的程序的？	● 大数据这个东西，如果用hive，岂不是跟SQL差不多了
● 如何学习Hadoop?	● 招data scientist intern立即开始
● 求Google 的 Data Science 有关的位置内推 (转载)	● 请问有没有Pig Hive Hadoop SQL的速成课？
● Data scientist--Zillow电面	● 请问data scientist 相关职务，面试要准备什么?

相关话题的讨论汇总
话题: hive话题: impala话题: sql话题: mapreduce话题: data

进入DataSciences版参与讨论

(共1页)

d*2
发帖数: 2053

http://vision.cloudera.com/impala-v-hive/
by Mike Olson
December 22, 2013
We introduced Cloudera Impala more than a year ago. It was a good launch for
us — it made our platform better in ways that mattered to our customers,
and it’s allowed us to win business that was previously unavailable because
earlier products simply couldn’t tackle interactive SQL workloads.
As a side effect, though, that launch ignited fierce competition among
vendors for SQL market share in the Apache Hadoop ecosystem, with claims and
counter-claims flying. Chest-beating on performance abounds (and we like
our numbers pretty well), but I want to approach the matter from a different
direction here.
I get asked all the time about Cloudera’s decision to develop Impala from
the ground up as a new project, rather than improving the existing Apache
Hive project. If there’s existing code, the thinking goes, surely it’s
best to start there — right?
Well, no. We thought long and hard about it, and we concluded that the best
thing to do was to create a new open source project, designed on different
principles from Hive. Impala is that system. Our experiences over the last
year increase our conviction on that strategy.
Let me walk you through our thinking.
Origins of Pig and Hive
The very earliest versions of Hadoop supported exactly one way of getting at
data: MapReduce. The platform was used extensively at Yahoo! and Facebook
to store, process and analyze huge amounts of data. Both companies found it
tremendously valuable, but both spent a lot of time writing Java code to run
under MapReduce to get useful work out of the system.
Both wanted to work faster and to let non-programmers get at their data
directly. Yahoo! invented a language of its own, called Apache Pig, for that
purpose. Facebook, likely thinking about using existing skills rather than
training people with new ones, built the SQL system called Hive.
The two work in a very similar way. A user types a query in either Pig or
Hive. A parser reads the query, figures out what the user wants and runs a
series of MapReduce jobs that do the work. By using MapReduce, Pig and Hive
were simple to build, so that Yahoo!’s and Facebook’s engineers could
spend their time on other new features.
Sins of the Past
A common criticism of Hadoop is that MapReduce is a batch data processing
engine. It turns out to be a hugely useful batch data processing engine that
’s drastically changed the data management industry, but it’s a fair
criticism: By design, MapReduce is a heavyweight, high-latency execution
framework. Jobs are slow to start and have all kinds of overhead.
Pig and Hive inherit those properties.
Early on, that didn’t matter. No one was pointing their BI tools at Hadoop
for analytics and reporting. Yahoo! and Facebook wanted more users to get at
their data directly. If those users had to sit for a while staring at their
terminal screens while the work happened, it wasn’t that big a deal. Slow
queries were better than no queries. Many of those early workloads were, in
any case, about data transformation, not query or analytics. Batch was fine
for that.
Tools that use ODBC and JDBC to talk to relational databases are available
from lots of vendors. Making those tools work against Hadoop is a really big
deal — the number of casual business users who can get at data that way is
enormous.
We wrote and shipped the first widely-available ODBC driver for Hive in our
early years. Big established vendors used it to connect their reporting
tools to our platform. The experience of using those tools was excruciating.
A user would construct a query and push a button, and would wait… and wait
… and wait. Decades of experience had taught people to expect real-time
responses from their databases. Hive, built on MapReduce, couldn’t deliver.
In fact, Hive’s performance meant that some of the vendor tools flat out
didn’t work. They assumed that the database had crashed when answers didn’
t come back promptly, and reported an error to the user.
That led to a lot of angry users.
Creating Impala
We knew we had to make SQL a first-class citizen in the Hadoop ecosystem, so
we sat down and began working through the alternatives.
There are decades of experience in the database industry in building scale-
out distributed query processing engines. It’s an active area for academic
and commercial research. Plenty of very good, very fast distributed SQL
systems exist. None of them uses a MapReduce-style architecture.
We also looked to Google, which created the Hadoop architecture in the first
place. Google had developed a new distributed query processing engine of
its own, not based on MapReduce, to get SQL access to its big data.
We elected, after thinking through all the options, to do exactly the same
thing. We were optimistic about its performance, so named it Impala.
We hired a leader for the project, Marcel Kornacker, who had worked for a
time at Google and who did his graduate work in database systems at UC
Berkeley in the 1990s. Marcel assembled a team of seasoned distributed
systems and database kernel engineers, and they got to work.
Every node in a Cloudera cluster has Impala code installed, waiting for SQL
queries to execute. That code lives right alongside the MapReduce engine on
every node — and the Apache HBase engine (not based on MapReduce), and the
Search engine (not based on MapReduce), and third-party engines like SAS and
Apache Spark that customers may choose to deploy. Those engines all have
access to exactly the same data, and users can choose which of them to use,
based on the problem they’re trying to solve.
Impala doesn’t have to translate a SQL query into another processing
framework, like the map/shuffle/reduce operations on which Hive depends
today. As a result, Impala doesn’t suffer the latencies that those
operations impose. An Impala query begins to deliver results right away.
Because it’s designed from the ground up for SQL query execution, not as a
general-purpose distributed processing system like MapReduce, it’s able to
deliver much better performance for that particular workload.
Why Not Improve Hive?
Put bluntly: We chose to build Impala because Hive is the wrong architecture
for real-time distributed SQL processing. The landscape of parallel SQL
databases is densely populated. No traditional relational vendor — IBM,
Oracle, Microsoft, ParAccel, Greenplum, Teradata, Netezza, Vertica — uses
anything like MapReduce. The next generation of shared-nothing scale-out
systems — Google’s F1, notably, but also lesser-known options like NuoDB
— all rely on native distributed query processing, not on a MapReduce
foundation.
Facebook built Hive on MapReduce early because it was the shortest path to
SQL on Hadoop. That was a sensible decision given the requirements of the
time. Recently, however, the company announced Presto, its next-generation
query processing engine for real-time access to data via SQL. It’s built,
like Impala, new, from the ground up, as a distributed query processing
engine.
It’s certainly possible to improve Hive, and to tune MapReduce, to speed
things up and to reduce latencies. By design, though, Hive will always
impose overhead and incur performance penalties that successor systems,
built as native distributed SQL engines, avoid.
Tuning MapReduce to run Hive better is a mistake for another reason.
MapReduce is excellent at general-purpose batch processing workloads, and
working to specialize it for a specific query workload may impose a tax on
non-SQL users. Back in the early days, the idea that a single engine would
do all the work in Hadoop made sense. Today we see a proliferation of
special-purpose engines (HBase, Apache Accumulo, Search, Myrrix, Spark and
others). These extend the capabilities of the platform as a whole without
compromising the workloads for which each is specifically tuned.
The Impala Community
We worked on Impala for nearly two years at Cloudera before publishing it
under the Apache software license in late 2012. We did that because we think
real-time SQL on Hadoop is a big deal, and we didn’t want to signal our
intentions before we could actually deliver the product. We have a
considerable team working on it.
Hive can’t compete on performance with a modern distributed query engine
for real-time SQL. There are, though, components in the Hive ecosystem that
don’t rely on MapReduce. We have worked hard to integrate those with Impala
, and continue to contribute enhancements to Hive as a result of our work.
We collaborate, for example, with the community on HCatalog and the Hive
metastore as a metadata repository. HCatalog is central to our Enterprise
Data Hub strategy, since sharing data among lots of different engines
requires shared metadata that describes it. Our use of the Hive metastore
and HCatalog means that Hive, Impala, MapReduce, Pig, and REST applications
can share tables.
Twitter, Cloudera and Criteo collaborate on Parquet, a columnar format that
lets Impala run analytic database workloads much faster. Contributors are
working on integrating Parquet with Cascading, Pig and even Hive.
User- and role-based authentication and access control are crucial for
database applications. Cloudera created the Sentry project to define and
enforce those security policies for Impala and collaborates with Oracle and
others there. Without Sentry, every Hive user gets to see every table and
every record that belongs to every other Hive user. We’ve contributed the
code to integrate Sentry into Hive but it’s not uniformly included in
commercial distributions today.
Who Ships Impala?
Open source is good because it means that enterprise users can work with the
company that offers the best mix of software and services at the best price
. They’re not locked into a single vendor.
Impala, to succeed, must be shipped and supported by lots of companies.
Naturally, you can get Impala from Cloudera. It’s core to our platform and
central to our Enterprise Data Hub strategy. That means it’s available from
our resellers, as well. Oracle ships the entire Enterprise Data Hub
software suite from Cloudera, including Impala, on its Big Data Appliance.
Cloud vendors including the Softlayer subsidiary of IBM, Verizon Business
Systems, CenturyLink Savvis, T-Systems and more deliver “Enterprise Data
Hub as a Service” in their managed clouds, Impala included.
Amazon delivers its own big data service called Elastic MapReduce, or EMR.
Just last week, Amazon announced availability of Impala — the third engine
included as part of its EMR offering, along with MapReduce and HBase. Amazon
’s endorsement was really exciting for us. It’s a clear vote of confidence
in our architecture, and a huge new collection of potential users. That, in
turn, means that more applications and tools vendors will integrate with
Impala.
Most interestingly, in my view, is the fact that a head-to-head Cloudera
competitor now offers Impala to its customers. MapR customers can use Impala
on the company’s commercial product.
For more than five years, Cloudera has been building and shipping new open
source code in the Hadoop ecosystem. We have contributors and committers
working across a huge range of projects. We’ve created projects — Apache
Flume, Apache Sqoop, Hue, Apache Bigtop — that have been picked up by our
competitors as part of their offerings.
Impala is merely one more example on that list, and just one more piece of
evidence that our commitment to the open source platform is deep and real.
No lip service here — we’ve been living up to those words since our very
first days. The fact that you can get Impala today from more than eight
different vendors, including both our partners and our competitors,
absolutely validates our open source strategy.
Why So Many SQL Engines?
Survey the landscape and you’ll see that there are at least as many SQL
engines running on Hadoop as there are vendors in the market. Cloudera
offers Impala. IBM likes its BigSQL offering. Pivotal is clearly committed
to HAWQ. Hortonworks has declared its allegiance to Hive. Hadapt, an early
entrant, has expanded its ambitions to take on NoSQL workloads as well, but
continues to deliver SQL in the Hadoop framework. And so on.
This is the same dynamic that has driven the established relational database
players for several decades. Everyone implements a standard language
specification, and all the vendors compete on performance and extended
services. There is far too much existing product in the market to collapse
those offerings into a single one. Frankly, there’s too much money at stake
to drive shared investment.
There’s good news, though. Users are insulated from difficulties because
the applications that talk to those engines over SQL, JDBC and ODBC are
portable. Cloudera believes that open source is fundamental, and that Impala
will win. We nevertheless recognize that the language standard forces us to
compete on a mix of power and performance combined with the best ecosystem
of third-party products that integrate with our product. That competitive
dynamic is the one that is best for our customers and the market at large.
Whither Pig and Hive?
Pig, the Yahoo!-created dataflow language, has no competition from existing
vendors. The language is powerful, but its user base is too small to attract
the attention of big vendors with alternative implementations. There may be
better ways than MapReduce to run Pig queries, but based on the workloads
it supports (mostly transformations), that doesn’t matter; the software is
plenty good in its current form, and the Pig community continues to enhance
it.
Hive, on the other hand, is SQL, and SQL is intergalactic dataspeak.
Everyone knows it. Everybody expects it to be fast. Lots of companies
support it.
Hive is, therefore, sure to be under fierce competitive pressure. As more
big vendors enter the market, that gets steadily worse. We don’t believe
that the implementation can keep up with the demands of high-performance
interactive query and analytics. A horse can only go so fast, no matter how
hard you whip it. Users, and the tool and application vendors that serve
them, will migrate to better offerings.
Hive handles data processing jobs just fine, and it’s used widely in the
Cloudera customer base for just that purpose. It has, after all, been part
of Cloudera’s commercial offering for the past five years. Because it’s
got such presence in the installed base, we’ll continue to support it. We’
re especially keen to drive areas of overlap with Impala — witness the Hive
integration of Sentry and Parquet, described above.
But we must have a first-class, real-time, open source SQL engine in our
Enterprise Data Hub. Our forward bet for data analysis, real-time query and
high-performance SQL applications is solidly on Impala. We may, of course,
be biased, but we’re convinced that it’s already the very best SQL
offering available in the Hadoop ecosystem. We’re very pleased at its
uptake by customers and by other players in the market. We are excited about
the services and features that will roll out in the next several releases.
- See more at: http://vision.cloudera.com/impala-v-hive/#sthash.jSUzIhQ1.dpuf

l******n
发帖数: 9344

摘要，好长

for
because
and

【在 d*2 的大作中提到】

: http://vision.cloudera.com/impala-v-hive/
: by Mike Olson
: December 22, 2013
: We introduced Cloudera Impala more than a year ago. It was a good launch for
: us — it made our platform better in ways that mattered to our customers,
: and it’s allowed us to win business that was previously unavailable because
: earlier products simply couldn’t tackle interactive SQL workloads.
: As a side effect, though, that launch ignited fierce competition among
: vendors for SQL market share in the Apache Hadoop ecosystem, with claims and
: counter-claims flying. Chest-beating on performance abounds (and we like

(共1页)

进入DataSciences版参与讨论

相关主题
● 请问data scientist 相关职务，面试要准备什么?	● 请问大家有没有直接用java全程写mapreduce的程序的？
● Big data是下一个大坑吗	● 如何学习Hadoop?
● Re: 请问大数据问题和以前的数据挖掘有什么区别？ (转载)	● 求Google 的 Data Science 有关的位置内推 (转载)
● 克劳迪娅包怎么用啊	● Data scientist--Zillow电面
● What's the best way to convert text/csv file into PARQUET	● Hive ODBC caching/latency的问题
● How to load csv file converted from excel file into Cloudera Hive or Impala?	● DS对数据库需要了解多少？
● Re: MapR Technologies continue hiring a lot of positions (转载)	● 贴个工作
● 讨论，（Big）Data Engineer到底是个什么职位	● data scientist对sql要求高吗

相关话题的讨论汇总
话题: hive话题: impala话题: sql话题: mapreduce话题: data

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天