Introduction
Data aggregation is the process of gathering and expressing data in a summary to get more information about particular groups based on specific conditions. HiveQL offers several built-in aggregate functions, such as max, min, avg,..etc. It also supports advanced aggregation using keywords such as Variance and Standard Deviation and different types of window functions. In this article, we will demonstrate HiveQL aggregation functions with examples.
Prerequisites
Let’s start by creating our Managed table from ‘orders.csv’ file
CREATE TABLE Orders(ProductID int , Quantity string , UnitPrice decimal, Discount int, OrderID int, CustomerID string , OrdersID int, ShipVia int , Freight decimal, ShipName string, ShipAddress string, ShipCity string, ShipRegion string, ShipPostalCode int, ShipCountry string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 'hdfs://<name node host name>:<name node port>/<path of the data files>';
Refer to our previous article about Hive tables to know the difference between Managed and External tables
Aggregation functions
SUM
Returns the sum of the column in the group or sum of the distinct values of the column in the group.
SELECT sum(Quantity) FROM Orders;
SELECT Sum(Quantity) FROM Orders WHERE ShipCountry = ‘France’;
Count
In Count function, we have three variations we can use as following:
COUNT(*) – Returns the total number of retrieved rows, including rows containing NULL values;
COUNT(<expression>) – Returns the number of rows for which the supplied expression is non-NULL;
COUNT(DISTINCT expr[, expr]) – Returns the number of rows for which the supplied expression(s) are unique
SELECT COUNT(*) FROM Orders;
SELECT COUNT(distinct ShipCountry) FROM Orders;
SELECT COUNT(*) FROM Orders WHERE UnitPrice >10;
Average
Returns the average of the elements in the group or the average of the distinct values of the column in the group.
AVG(<expression or column name>)
AVG(DISTINCT <expression or column name>)
SELECT AVG(UnitPrice) FROM Orders WHERE ShipCountry = 'Germany';

Minimum
Returns the minimum of the column in the group.
MIN(<expression or column>
SELECT MIN(UnitPrice) FROM Orders;

Maximum
Returns the maximum of the column in the group.
MAX(<expression or column>
SELECT MAX(UnitPrice) FROM Orders;

Variance
Returns the variance of a numeric column in the group.
VARIANCE(<expression or column>
SELECT VARIANCE(Quantity) FROM Orders;

Standard Deviation
Returns the Standard Deviation of a numeric column in the group.
STDDEV_POP(<expression or column>
SELECT STDDEV_POP(UnitPrice) FROM Orders;

Covariance
Returns the population covariance of a pair of numeric columns in the group.
COVAR_POP(<expression or column> , <expression or column>)
SELECT COVAR_POP(UnitPrice, OrderID) FROM Orders;

Correlation
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
CORR(<expression or column> , <expression or column>)
SELECT CORR(UnitPrice, OrderID) FROM Orders;
Collections
Returns a set of objects with duplicate elements eliminated, repeated values will be rejected in the output you will get a distinct list of your input column or expression in the output.
COLLECT_SET(<expression or column>)
SELECT COLLECT_SET(UnitPrice) FROM Orders;

Histogram
Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.
SELECT HISTOGRAM_NUMERIC(UnitPrice,5) FROM Orders;