Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Using Spark SQL to run Hive workloads provides not only the simplicity of SQL-like queries but also taps into the exceptional speed and performance provided by Spark. Spark SQL is an Apache Spark module for structured data processing. One of its most popular use cases is to read and write Hive tables with connectivity to a persistent Hive metastore, supporting Hive SerDes and user-defined functions.

Starting from version 1.2.0, Apache Spark has supported queries written in HiveQL. HiveQL is a SQL-like language that produces data queries containing business logic, which can be converted to Spark jobs. However, this feature is only supported by YARN or standalone Spark mode. To run HiveQL-based data workloads with Spark on Kubernetes mode, engineers must embed their SQL queries into programmatic code such as PySpark, which requires additional effort to manually change code.
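To make the manual effort concrete, such a wrapper typically has to substitute `-hivevar`-style parameters and split the script into statements itself before handing each one to `spark.sql()`. The following is a minimal, hypothetical sketch of that wrapping logic; the helper name `render_hive_script` and the naive split on `;` are illustrative assumptions, not part of any EMR or Spark API:

```python
def render_hive_script(script, hivevars):
    """Substitute ${name} hivevars and split a HiveQL script into statements,
    the way a hand-written PySpark wrapper would before calling spark.sql().
    Naive: assumes no ';' inside string literals and '--' comments on their own lines."""
    for name, value in hivevars.items():
        script = script.replace("${%s}" % name, str(value))
    # Drop full-line comments, then split the remainder on ';'.
    lines = [line for line in script.splitlines() if not line.strip().startswith("--")]
    return [s.strip() for s in "\n".join(lines).split(";") if s.strip()]

if __name__ == "__main__":
    sql = (
        "-- demo\n"
        "USE hiveonspark;\n"
        "SELECT * FROM testtable WHERE key=${Key_ID};\n"
    )
    for stmt in render_hive_script(sql, {"Key_ID": 238}):
        print(stmt)
        # A real wrapper would call spark.sql(stmt) on an active SparkSession here.
```

Every change to the Hive script then risks a matching change in wrapper code like this, which is exactly the overhead the Spark SQL job driver described below removes.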

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that you can use to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS).

Amazon EMR on EKS release 6.7.0 and later include the ability to run SparkSQL through the StartJobRun API. As a result of this enhancement, customers can now supply SQL entry-point files and run HiveQL queries as Spark jobs on EMR on EKS directly. The feature is available in all AWS Regions where EMR on EKS is available.

Use case

FINRA is one of the largest Amazon EMR customers running SQL-based workloads using the Hive on Spark approach. FINRA, the Financial Industry Regulatory Authority, is a private sector regulator responsible for analyzing equities and options trading activity in the US. To look for fraud, market manipulation, insider trading, and abuse, FINRA's technology group has developed a robust set of big data tools in the AWS Cloud to support these activities.

FINRA centralizes all its data in Amazon Simple Storage Service (Amazon S3) with a remote Hive metastore on Amazon Relational Database Service (Amazon RDS) to manage their metadata. They use various AWS analytics services, such as Amazon EMR, to enable their analysts and data scientists to apply advanced analytics techniques to interactively develop and test new surveillance patterns and improve investor protection. To make these interactions more efficient and productive, FINRA modernized their Hive workloads in Amazon EMR from its legacy Hive on MapReduce to Hive on Spark, which resulted in query performance gains between 50 and 80 percent.

Going forward, FINRA wants to further innovate the interactive big data platform by moving from a monolithic design pattern to a job-centric paradigm, so that it can fulfill future capacity requirements as its business grows. The capability of running Hive workloads using SparkSQL directly with EMR on EKS is one of the key enablers that helps FINRA continuously pursue that goal.

Additionally, EMR on EKS offers the following benefits to accelerate adoption:

  • Fine-grained access controls (IRSA) that are job-centric to harden customers' security posture
  • Minimized adoption effort because it enables direct Hive query submission as a Spark job without code changes
  • Reduced run costs by consolidating multiple software versions for Hive or Spark, unifying artificial intelligence and machine learning (AI/ML) and extract, transform, and load (ETL) pipelines into a single environment
  • Simplified cluster management through multi-Availability Zone support and highly responsive autoscaling and provisioning
  • Reduced operational overhead by hosting multiple compute and storage types or CPU architectures (x86 & Arm64) in a single configuration
  • Increased application reusability and portability supported by custom Docker images, which allow them to encapsulate all necessary dependencies

Running Hive SQL queries on EMR on EKS


Make sure that you have AWS Command Line Interface (AWS CLI) version 1.25.70 or later installed. If you're running AWS CLI version 2, you need version 2.7.31 or later. Use the following command to check your AWS CLI version:

aws --version

If necessary, install or update to the latest version of the AWS CLI.

Solution overview

To get started, let's look at the following diagram. It illustrates a high-level architectural design and the different services that can be used in the Hive workload. To match FINRA's use case, we chose an Amazon RDS database as the remote Hive metastore. Alternatively, you can use the AWS Glue Data Catalog as the metastore for Hive if needed. For more details, see the aws-sample GitHub project.

The minimum required infrastructure is:

  • An S3 bucket to store a Hive SQL script file
  • An Amazon EKS cluster with EMR on EKS enabled
  • An Amazon RDS for MySQL database in the same virtual private cloud (VPC) as the Amazon EKS cluster
  • A standalone Hive metastore service (HMS) running on the EKS cluster or a small Amazon EMR on EC2 cluster with the Hive application installed

For a quick start, run the sample CloudFormation deployment, which provisions the minimum required infrastructure described above.

Create a Hive script file

Store a few lines of Hive queries in a single file, then upload the file to your S3 bucket, which can be found in your AWS Management Console on the AWS CloudFormation Outputs tab. Search for the key value of CODEBUCKET, as shown in the preceding screenshot. For a quick start, you can skip this step and use the sample file stored at s3://<YOUR_S3BUCKET>/app_code/job/set-of-hive-queries.sql. The following is a code snippet from the sample file:

-- drop database in case switching between different Hive metastores
DROP DATABASE IF EXISTS hiveonspark CASCADE;
CREATE DATABASE hiveonspark;
USE hiveonspark;

--create hive managed table
DROP TABLE IF EXISTS testtable purge;
CREATE TABLE IF NOT EXISTS testtable (`key` INT, `value` STRING) USING hive;
LOAD DATA LOCAL INPATH '/usr/lib/spark/examples/src/main/resources/kv1.txt' INTO TABLE testtable;
SELECT * FROM testtable WHERE key=238;

-- test1: add column
ALTER TABLE testtable ADD COLUMNS (`arrayCol` Array<int>);
-- test2: insert
SELECT * FROM testtable WHERE key=238;
-- test3: UDF
CREATE TEMPORARY FUNCTION hiveUDF AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode';
SELECT `key`,`value`,hiveUDF(arrayCol) FROM testtable WHERE key=238;
-- test4: CTAS table with parameter
DROP TABLE IF EXISTS ctas_testtable purge;
CREATE TABLE ctas_testtable AS
SELECT * FROM testtable;
SELECT * FROM ctas_testtable WHERE key=${Key_ID};
-- test5: external table mapped to S3
CREATE EXTERNAL TABLE IF NOT EXISTS amazonreview(
  marketplace string, 
  customer_id string, 
  review_id string, 
  product_id string, 
  product_parent string, 
  product_title string, 
  star_rating integer, 
  helpful_votes integer, 
  total_votes integer, 
  vine string, 
  verified_purchase string, 
  review_headline string, 
  review_body string, 
  review_date date, 
  year integer)
STORED AS PARQUET
LOCATION 's3://${S3Bucket}/app_code/data/toy/';
SELECT count(*) FROM amazonreview;

Submit the Hive script to EMR on EKS

First, set up the required environment variables. See the following shell script:

export VIRTUAL_CLUSTER_ID=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='VirtualClusterId'].OutputValue" --output text)
export EMR_ROLE_ARN=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='EMRExecRoleARN'].OutputValue" --output text)
export S3BUCKET=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='CODEBUCKET'].OutputValue" --output text)

Connect to the demo EKS cluster:

echo `aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?starts_with(OutputKey,'eksclusterEKSConfig')].OutputValue" --output text` | bash
kubectl get svc

Ensure the entryPoint path is correct, then submit set-of-hive-queries.sql to EMR on EKS:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sparksql-test \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSqlJobDriver": {
      "entryPoint": "s3://'$S3BUCKET'/app_code/job/set-of-hive-queries.sql",
      "sparkSqlParameters": "-hivevar S3Bucket='$S3BUCKET' -hivevar Key_ID=238"}}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.sql.warehouse.dir": "s3://'$S3BUCKET'/warehouse/",
          "spark.hive.metastore.uris": "thrift://hive-metastore:9083"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'

Note that the shell script referenced the set-of-hive-queries.sql Hive script file as an entry point script. It uses the sparkSqlJobDriver attribute, not the usual sparkSubmitJobDriver designed for Spark applications. In the sparkSqlParameters section, we pass in two environment variables S3Bucket and Key_ID to the Hive script.
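If you prefer submitting programmatically rather than through the AWS CLI, the same request can be assembled for boto3's emr-containers client, whose start_job_run call accepts the equivalent keyword arguments. The sketch below only builds the payload; the helper name and the assumption that the CloudFormation outputs are passed in as plain strings are ours:

```python
def build_start_job_run_request(virtual_cluster_id, execution_role_arn, s3_bucket):
    """Assemble a StartJobRun payload for a Spark SQL job, mirroring the
    CLI example: SQL entry point, hivevars, remote HMS, and S3 logging."""
    return {
        "virtualClusterId": virtual_cluster_id,
        "name": "sparksql-test",
        "executionRoleArn": execution_role_arn,
        "releaseLabel": "emr-6.8.0-latest",
        "jobDriver": {
            "sparkSqlJobDriver": {
                "entryPoint": f"s3://{s3_bucket}/app_code/job/set-of-hive-queries.sql",
                "sparkSqlParameters": f"-hivevar S3Bucket={s3_bucket} -hivevar Key_ID=238",
            }
        },
        "configurationOverrides": {
            "applicationConfiguration": [
                {
                    "classification": "spark-defaults",
                    "properties": {
                        "spark.sql.warehouse.dir": f"s3://{s3_bucket}/warehouse/",
                        "spark.hive.metastore.uris": "thrift://hive-metastore:9083",
                    },
                }
            ],
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": f"s3://{s3_bucket}/elasticmapreduce/emr-containers"
                }
            },
        },
    }

# Submitting would then be:
#   import boto3
#   boto3.client("emr-containers").start_job_run(**request)
```

Building the payload as a plain dict also makes it easy to unit test the job configuration without touching AWS.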

The property "spark.hive.metastore.uris": "thrift://hive-metastore:9083" sets a connection to a Hive Metastore Service (HMS) called hive-metastore, which is running as a Kubernetes service on the demo EKS cluster as shown in the following screenshot. If you're running the thrift service on Amazon EMR on EC2, the URI should be thrift://<YOUR_EMR_MASTER_NODE_DNS_NAME>:9083. If you chose AWS Glue Data Catalog as your Hive metastore, replace the entire property with "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory".
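Because this single spark-defaults property is the only difference between the two metastore choices, it can be convenient to isolate it when templating job configurations. A hedged sketch; the function name and shape are illustrative, not an EMR API:

```python
def metastore_properties(kind, endpoint="hive-metastore"):
    """Return the spark-defaults properties for the chosen metastore backend.
    'hms' points Spark at a thrift Hive Metastore Service at the given
    endpoint; 'glue' swaps in the AWS Glue Data Catalog client factory."""
    if kind == "hms":
        return {"spark.hive.metastore.uris": f"thrift://{endpoint}:9083"}
    if kind == "glue":
        return {
            "spark.hadoop.hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    raise ValueError(f"unknown metastore kind: {kind}")
```

The returned dict can be merged into the "properties" map of the spark-defaults classification shown above.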

Finally, check the job status using the kubectl command line tool: kubectl get po -n emr --watch

Expected output

  1. Go to the Amazon EMR console.
  2. In the side menu, choose Virtual clusters, then select the HiveDemo cluster. You can see an entry for the SparkSQL test job.
  3. Choose the Spark UI link to monitor each query's duration and status on a web interface.
  4. To query the Amazon RDS based Hive metastore, you need a MySQL client tool installed. To make it easier, the sample CloudFormation template has installed the query tool on the master node of a small Amazon EMR on EC2 cluster.
  5. Find the EMR master node by running the following command:
aws ec2 describe-instances --filter Name=tag:project,Values=$stack_name Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query 'Reservations[].Instances[].InstanceId[]'

  6. Go to the Amazon EC2 console and connect to the master node through Session Manager.
  7. Before querying the MySQL RDS database (the Hive metastore), run the following commands on your local machine to get the credentials:
    export secret_name=$(aws cloudformation describe-stacks --stack-name $stack_name --query "Stacks[0].Outputs[?OutputKey=='HiveSecretName'].OutputValue" --output text)
    export HOST_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.host')
    export PASSWORD=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.password')
    export DB_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.dbname')
    export USER_NAME=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.username')
    echo -e "\n host: $HOST_NAME\n DB: $DB_NAME\n password: $PASSWORD\n username: $USER_NAME\n"

  8. After connecting through Session Manager, query the Hive metastore from your Amazon EMR master node:
    mysql -u admin -P 3306 -p -h <YOUR_HOST_NAME>
    Enter password: <YOUR_PASSWORD>
    # Query the metastore
    MySQL[(none)]> use HiveEMRonEKS;
    MySQL[HiveEMRonEKS]> select * from DBS;
    MySQL[HiveEMRonEKS]> select * from TBLS;
    MySQL[HiveEMRonEKS]> exit;

  9. Validate the Hive tables (created by set-of-hive-queries.sql) through the interactive Hive CLI tool.

Note: Your query environment must have the Hive client tool installed and a connection to your Hive metastore or AWS Glue Data Catalog. For testing purposes, you can connect to the same Amazon EMR on EC2 master node and query your Hive tables. The EMR cluster has been pre-configured with the required setups.

sudo su
hive
hive> show databases;

Clean up

To avoid incurring future charges, delete the generated resources if you don't need the solution anymore. Run the following cleanup script:

curl main/deployment/app_code/ | bash

Go to the CloudFormation console and manually delete the remaining resources if needed.


Conclusion

Amazon EMR on EKS releases 6.7.0 and higher include a Spark SQL job driver so that you can directly run Spark SQL scripts via the StartJobRun API. Without any modifications to your existing Hive scripts, you can directly execute them as a SparkSQL job on Amazon EMR on EKS.

FINRA is one of the largest Amazon EMR customers. It runs over 400 Hive clusters for its analysts who need to interactively query multi-petabyte data sets. Modernizing its Hive workloads with SparkSQL gives FINRA a 50 to 80 percent query performance improvement. The support to run Spark SQL through the StartJobRun API in EMR on EKS has further enabled FINRA's innovation in data analytics.

In this post, we demonstrated how to submit a Hive script to Amazon EMR on EKS and run it as a SparkSQL job. We encourage you to give it a try and are keen to hear your feedback and suggestions.

About the authors

Amit Maindola is a Senior Data Architect focused on big data and analytics at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Melody Yang is a Senior Big Data Solutions Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to support their success in data transformation. Her areas of interest are open-source frameworks and automation, data engineering, and DataOps.
