Aggregation Queries in Apache Hive | Apache Hive

DataValley Team
August 13, 2020
10:02 am

Introduction

Data aggregation is the process of gathering and expressing data in a summary to get more information about particular groups based on specific conditions. HiveQL offers several built-in aggregate functions, such as max, min, avg,..etc. It also supports advanced aggregation using keywords such as Variance and Standard Deviation and different types of window functions. In this article, we will demonstrate HiveQL aggregation functions with examples.

Prerequisites

Let’s start by creating our Managed table from ‘orders.csv’ file

Orders Dataset Download

CREATE TABLE Orders(ProductID int ,
                    Quantity string ,
                    UnitPrice decimal,
                    Discount int,
                    OrderID int,
                    CustomerID string ,
                    OrdersID int,
                    ShipVia int ,
                    Freight decimal,
                    ShipName string,
                    ShipAddress string,
                    ShipCity string,
                    ShipRegion string,
                    ShipPostalCode int,
                    ShipCountry string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE LOCATION 'hdfs://<name node host name>:<name node port>/<path of the data files>';

Refer to our previous article about Hive tables to know the difference between Managed and External tables

Apache Hive Table Types | Apache Hive

Aggregation functions

SUM

Returns the sum of the column in the group or sum of the distinct values of the column in the group.

SELECT sum(Quantity) FROM Orders;

SELECT Sum(Quantity) FROM Orders WHERE ShipCountry = ‘France’;

Count

In Count function, we have three variations we can use as following:

COUNT(*) – Returns the total number of retrieved rows, including rows containing NULL values;
COUNT(<expression>) – Returns the number of rows for which the supplied expression is non-NULL;
COUNT(DISTINCT expr[, expr]) – Returns the number of rows for which the supplied expression(s) are unique

SELECT COUNT(*) FROM Orders;

SELECT COUNT(distinct ShipCountry) FROM Orders;

SELECT COUNT(*) FROM Orders WHERE UnitPrice >10;

Average

Returns the average of the elements in the group or the average of the distinct values of the column in the group.

AVG(<expression or column name>)
AVG(DISTINCT <expression or column name>)

SELECT AVG(UnitPrice) FROM Orders WHERE ShipCountry = 'Germany';

Minimum

Returns the minimum of the column in the group.

MIN(<expression or column>

SELECT MIN(UnitPrice) FROM Orders;

Maximum

Returns the maximum of the column in the group.

MAX(<expression or column>

SELECT MAX(UnitPrice) FROM Orders;

Variance

Returns the variance of a numeric column in the group.

VARIANCE(<expression or column>

SELECT VARIANCE(Quantity) FROM Orders;

Standard Deviation

Returns the Standard Deviation of a numeric column in the group.

STDDEV_POP(<expression or column>

SELECT STDDEV_POP(UnitPrice) FROM Orders;

Covariance

Returns the population covariance of a pair of numeric columns in the group.

COVAR_POP(<expression or column> , <expression or column>)

SELECT COVAR_POP(UnitPrice, OrderID) FROM Orders;

Correlation

Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.

CORR(<expression or column> , <expression or column>)

SELECT CORR(UnitPrice, OrderID) FROM Orders;

Collections

Returns a set of objects with duplicate elements eliminated, repeated values will be rejected in the output you will get a distinct list of your input column or expression in the output.

COLLECT_SET(<expression or column>)

SELECT COLLECT_SET(UnitPrice) FROM Orders;

Histogram

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.

SELECT HISTOGRAM_NUMERIC(UnitPrice,5) FROM Orders;