Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.
Hive and other frameworks built on MapReduce are best suited for long-running batch jobs, such as Extract, Transform, and Load (ETL) jobs. Impala does not replace the batch processing frameworks built on MapReduce, such as Hive.
What is Impala?
Cloudera Impala is an excellent choice for programmers for running queries on HDFS and Apache HBase as it doesn’t require data to be moved or transformed prior to processing.
Cloudera Impala easily integrates with the Hadoop ecosystem, as its file and data formats, metadata, security, and resource management frameworks are the same as those used by MapReduce, Apache Hive, Apache Pig, and other Hadoop software.
It is designed to combine the strengths of Hadoop with the familiar SQL support and multi-user performance of a traditional database.
Its unified resource management across frameworks has made it the de facto standard for open-source interactive business intelligence tasks.
- Familiar SQL interface that data scientists and analysts already know.
- Ability to query high volumes of data (“big data”) in Apache Hadoop.
- Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity hardware.
- Ability to share data files between different components with no copy or export/import step; for example, to write with Pig, transform with Hive, and query with Impala. Impala can read from and write to Hive tables, enabling simple data interchange using Impala for analytics on Hive-produced data.
- A single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics.
The Impala daemon is one of the core components of Impala. It runs on every node in the CDH cluster and is identified by the impalad process. It reads and writes data files, and it accepts queries transmitted from the impala-shell command, ODBC, JDBC, or Hue.
The main functions of the Impala daemon are as follows:
- Reads and writes to data files.
- Accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC.
- Parallelizes the queries and distributes work across the cluster.
- Transmits intermediate query results back to the central coordinator.
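To make the "parallelize, distribute, and send intermediate results back" idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Impala code: the function names, the round-robin split, and the thread pool standing in for daemons are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, num_daemons):
    """Deal rows round-robin into one chunk per (hypothetical) daemon."""
    return [rows[i::num_daemons] for i in range(num_daemons)]


def partial_count(chunk):
    """Work done by one daemon: count rows per key in its own chunk."""
    return Counter(key for key, _ in chunk)


def coordinator_count(rows, num_daemons=3):
    """Fan the work out, then merge the intermediate results."""
    chunks = split_rows(rows, num_daemons)
    # Threads stand in for daemons running on separate nodes.
    with ThreadPoolExecutor(max_workers=num_daemons) as pool:
        partials = list(pool.map(partial_count, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Toy data: (region, amount) pairs, roughly "SELECT region, COUNT(*) ... GROUP BY region"
rows = [("us", 10), ("de", 7), ("us", 3), ("fr", 1), ("de", 2), ("us", 8)]
result = coordinator_count(rows)
```

The merged result does not depend on how many chunks the rows are split into, which is the property that lets a coordinator spread the same query over more or fewer daemons.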
The Impala daemon can be deployed directly in the HDFS cluster, as the architecture above shows: each DataNode runs one Impala daemon. Alternatively, Impala can be set up on a separate cluster, with the Impala daemons communicating remotely with the storage layer, whether that is HDFS, S3, or another store.
The Impala daemons are in continuous communication with the Statestore to confirm which daemons are healthy and available to accept new tasks.
The Impala Statestore checks the health of all Impala daemons on all the DataNodes in the Hadoop cluster. It runs as a process named statestored, and only one such process is needed, on one host in the cluster.
If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the Statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable Impala daemon.
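The health-tracking role described above can be sketched as a small toy model. Everything here is an assumption for illustration: the class name, the timeout value, and the choice to have daemons push heartbeats are simplifications and do not reproduce the actual statestored protocol.

```python
class Statestore:
    """Toy stand-in for statestored: tracks liveness and notifies subscribers."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}    # daemon name -> time of last heartbeat
        self.subscribers = {}  # daemon name -> callback(set of healthy names)

    def register(self, name, callback, now):
        self.subscribers[name] = callback
        self.last_seen[name] = now

    def heartbeat(self, name, now):
        self.last_seen[name] = now

    def healthy(self, now):
        """Daemons heard from within the timeout window."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout_s}

    def broadcast(self, now):
        """Tell every healthy daemon which daemons are currently healthy."""
        members = self.healthy(now)
        for name in members:
            self.subscribers[name](members)
        return members


views = {}  # what each daemon currently believes the membership to be


def make_callback(name):
    def on_update(members):
        views[name] = set(members)
    return on_update


store = Statestore(timeout_s=5)
for name in ("impalad-1", "impalad-2", "impalad-3"):
    store.register(name, make_callback(name), now=0)

store.heartbeat("impalad-1", now=8)
store.heartbeat("impalad-2", now=8)
# impalad-3 stays silent, so it falls outside the timeout window.
members = store.broadcast(now=10)
```

After the broadcast, the two surviving daemons hold a membership view that no longer contains the silent one, so a coordinator consulting that view would not schedule work on it.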
Impala Catalog Service
The Catalog Service relays metadata changes made by Impala SQL statements to all the Impala daemons in the cluster. It is physically represented by a daemon process named catalogd, and only one such process is needed, on one host in the cluster. Because catalog requests are passed through the Statestore, the statestored and catalogd processes are generally run on the same host.
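The effect of this relaying can be sketched with a toy model in which every daemon keeps a local metadata cache. The class names, the version counter, and the direct method call are assumptions for illustration; in a real cluster the updates travel through the Statestore rather than by direct calls.

```python
class Daemon:
    """Toy daemon that keeps a local copy of table metadata."""

    def __init__(self, name):
        self.name = name
        self.metadata_cache = {}  # table name -> list of column names

    def apply_update(self, table, columns):
        if columns is None:
            # A dropped table is removed from the cache if present.
            self.metadata_cache.pop(table, None)
        else:
            self.metadata_cache[table] = list(columns)


class CatalogService:
    """Toy stand-in for catalogd: pushes each change to every daemon."""

    def __init__(self, daemons):
        self.daemons = list(daemons)
        self.version = 0  # hypothetical counter, bumped on every change

    def publish(self, table, columns):
        self.version += 1
        for daemon in self.daemons:
            daemon.apply_update(table, columns)
        return self.version


daemons = [Daemon(f"impalad-{i}") for i in range(1, 4)]
catalog = CatalogService(daemons)

catalog.publish("sales", ["id", "amount"])            # comparable to CREATE TABLE
catalog.publish("sales", ["id", "amount", "region"])  # comparable to adding a column
version = catalog.publish("tmp_sales", None)          # drop of a table nobody cached

caches = [d.metadata_cache for d in daemons]
```

Because every change goes through a single publisher, all daemons end up with identical caches, which is why a table created through one daemon becomes queryable through the others.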
Impala Query Processing Interfaces
Impala offers three interfaces for processing queries:
- impala-shell: After setting up Impala using the Cloudera VM, you can start the Impala shell by typing the command impala-shell in the terminal.
- Hue interface: You can process Impala queries in the Hue browser, which includes an Impala query editor where you type and execute queries. You need to log in to Hue first to access this editor.
- ODBC/JDBC drivers: Like other databases, Impala offers ODBC and JDBC drivers. Using these drivers, you can connect to Impala from programming languages that support them and build applications that process queries in Impala.
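As a small illustration of driving the impala-shell interface from a program, the sketch below only builds the command line and does not execute it. The `-i` (daemon address) and `-q` (query) options are standard impala-shell options; the host name, the table name `sales`, and port 21000 are assumptions, so check the port your cluster actually uses.

```python
import shlex


def build_impala_shell_command(query, host="localhost", port=21000):
    """Return the argv list for a one-off impala-shell query.

    host and port are placeholders; 21000 is assumed here and may
    differ depending on the Impala version and configuration.
    """
    return ["impala-shell", "-i", f"{host}:{port}", "-q", query]


cmd = build_impala_shell_command(
    "SELECT COUNT(*) FROM sales",   # hypothetical table
    host="impalad-1.example.com",   # hypothetical daemon host
)

# Shell-safe rendering, useful for logging or copy-pasting.
printable = shlex.join(cmd)
print(printable)
```

On a machine where impala-shell is installed, the list in `cmd` could be passed to `subprocess.run(cmd, capture_output=True, text=True)` to run the query and collect its output.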
Impala Execution Procedure
Whenever a user submits a query through any of these interfaces, it is accepted by one of the Impala daemons in the cluster. That daemon acts as the coordinator for that particular query.
After receiving the query, the coordinator uses the table schema from the Hive metastore to verify that the query is valid. It then collects from the HDFS NameNode the locations of the data required to execute the query, and sends this information to the other Impala daemons (impalad) so that they can execute the query.
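The three coordinator steps (validate against the schema, look up data locations, hand work to daemons) can be sketched as follows. This is a toy model under stated assumptions: the dictionaries stand in for the Hive metastore and the HDFS NameNode, and the "prefer a daemon that holds the block, then the least loaded one" rule is an illustrative policy, not Impala's actual scheduler.

```python
def plan_query(table, columns, schema, block_locations, daemons):
    """Return a mapping of daemon -> list of blocks it should scan."""
    # Step 1: validate the query against the table schema.
    if table not in schema:
        raise ValueError(f"unknown table: {table}")
    unknown = [c for c in columns if c not in schema[table]]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")

    # Steps 2 and 3: look up where each block lives and assign it to a
    # daemon, preferring hosts that already hold a copy of the block.
    assignments = {d: [] for d in daemons}
    for block, hosts in sorted(block_locations[table].items()):
        local = [h for h in hosts if h in assignments]
        candidates = local or daemons
        target = min(candidates, key=lambda d: len(assignments[d]))
        assignments[target].append(block)
    return assignments


# Hypothetical metastore and NameNode contents.
schema = {"sales": ["id", "region", "amount"]}
block_locations = {
    "sales": {
        "blk_1": ["node-a", "node-b"],
        "blk_2": ["node-b", "node-c"],
        "blk_3": ["node-c", "node-a"],
    }
}
daemons = ["node-a", "node-b", "node-c"]

plan = plan_query("sales", ["region", "amount"], schema, block_locations, daemons)
```

With this input each daemon is assigned one block that is stored on its own node, which illustrates why running a daemon on every DataNode lets most reads stay local.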
Impala vs Apache Hive
| Impala | Apache Hive |
|---|---|
| Impala is an excellent choice for programmers running queries on HDFS and Apache HBase, as it doesn’t require data to be moved or transformed. | Apache Hive is a data warehouse infrastructure built on the Hadoop platform for performing data-intensive tasks such as querying, analysis, processing, and visualization. |
| Impala does runtime code generation for “big loops” using LLVM (Low Level Virtual Machine). | Hive generates query expressions at compile time. |
| Impala avoids startup overhead because its daemon processes are started at boot time and are always ready to process a query. | Every Hive query has the problem of a “cold start”. |
| Features: HDFS and Apache HBase storage support; recognizes Hadoop file formats (text, LZO, SequenceFile, Avro, RCFile, and Parquet); supports Hadoop security (Kerberos authentication); fine-grained, role-based authorization with Apache Sentry; uses the same metadata, ODBC driver, and SQL syntax as Apache Hive. | Features: indexing for accelerated processing; support for storage types such as plain text, RCFile, HBase, and ORC; metadata storage in an RDBMS, reducing the time needed for semantic checks during query execution; SQL-like queries that are implicitly converted into MapReduce, Tez, or Spark jobs; built-in user-defined functions (UDFs) to manipulate strings, dates, and other data-mining tools. |
| Impala responds quickly through massively parallel processing. | Hive translates queries into MapReduce jobs under the hood, which adds overhead. |
| Impala applies raw processing power to deliver very fast analytic results. | Hive is a more universal and pluggable language. |
| Impala is an ideal choice when starting a new project. | For an upgrade project where compatibility and speed are equally important, Hive is an ideal choice. |