Automate alerting and reporting for AWS Glue job resource utilization


Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset. Many organizations today are using AWS Glue to build ETL pipelines that bring data from disparate sources and store the data in repositories like a data lake, database, or data warehouse for further consumption. These organizations are looking for ways they can reduce cost across their IT environments and still be operationally performant and efficient.

Picture a scenario where you, the VP of Data and Analytics, are in charge of your data and analytics environments and workloads running on AWS, and you manage a team of data engineers and analysts. This team is allowed to create AWS Glue for Spark jobs in development, test, and production environments. During testing, one of the jobs wasn't configured to automatically scale its compute resources, resulting in jobs timing out and costing the organization more than anticipated. The next steps usually include completing an analysis of the jobs, looking at cost reports to see which account generated the spike in usage, going through logs to see when what happened with the job, and so on. After the ETL job has been corrected, you may want to implement monitoring and set standard alert thresholds for your AWS Glue environment.

This post will help organizations proactively monitor and cost optimize their AWS Glue environments by providing an easier path for teams to measure the efficiency of their ETL jobs and align configuration details with organizational requirements. Included is a solution you will be able to deploy that will notify your team via email about any Glue job that has been configured incorrectly. Additionally, a weekly report is generated and sent via email that aggregates resource usage and provides cost estimates per job.

AWS Glue cost considerations

AWS Glue for Apache Spark jobs are provisioned with a number of workers and a worker type. These jobs can use the G.1X, G.2X, G.4X, G.8X, or Z.2X (Ray) worker types, which map to data processing units (DPUs). DPUs include a certain amount of CPU, memory, and disk space. The following table contains more details.

Worker Type  DPUs  vCPUs  Memory (GB)  Disk (GB)
G.1X         1     4      16           64
G.2X         2     8      32           128
G.4X         4     16     64           256
G.8X         8     32     128          512
Z.2X         2     8      32           128

For example, if a job is provisioned with 10 workers of the G.1X worker type, the job will have access to 40 vCPU and 160 GB of RAM to process data, and double that when using G.2X. Over-provisioning workers can lead to increased cost, due to not all workers being utilized efficiently.

In April 2022, Auto Scaling for AWS Glue was released for AWS Glue version 3.0 and later, which includes AWS Glue for Apache Spark and streaming jobs. Enabling auto scaling on your Glue for Apache Spark jobs allows you to only allocate workers as needed, up to the worker maximum you specify. We recommend enabling auto scaling for your AWS Glue 3.0 and 4.0 jobs because this feature will help reduce cost and optimize your ETL jobs.
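If you manage job configurations programmatically, auto scaling is controlled by the --enable-auto-scaling job parameter. The following is a minimal boto3 sketch, separate from the solution in this post; the job name is a placeholder, and the existing role and command are carried over so the update doesn't drop them:

import boto3

glue = boto3.client("glue")

# Fetch the current definition so the update keeps the job's role and command.
job = glue.get_job(JobName="my-etl-job")["Job"]

glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": "4.0",    # auto scaling requires Glue 3.0 or later
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,   # acts as the worker maximum when auto scaling is on
        "DefaultArguments": {
            **job.get("DefaultArguments", {}),
            "--enable-auto-scaling": "true",
        },
    },
)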

Amazon CloudWatch metrics are also a great way to monitor your AWS Glue environment by creating alarms for certain metrics like average CPU or memory usage. To learn more about how to use CloudWatch metrics with AWS Glue, refer to Monitoring AWS Glue using Amazon CloudWatch metrics.
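For example, the following hedged sketch creates an alarm on a single job's aggregate driver heap usage with boto3, assuming job metrics are enabled for that job; the job name, SNS topic ARN, and threshold are placeholders to adjust for your environment:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="glue-my-etl-job-high-heap",
    Namespace="Glue",
    MetricName="glue.driver.jvm.heap.usage",   # fraction of driver heap in use
    Dimensions=[
        {"Name": "JobName", "Value": "my-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},  # aggregate across job runs
        {"Name": "Type", "Value": "gauge"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.9,                             # alarm when average heap usage exceeds 90%
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:glue-alerts"],
)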

The following solution provides a simple way to set AWS Glue worker and job duration thresholds, configure monitoring, and receive emails for notifications on how your AWS Glue environment is performing. If a Glue job finishes having exceeded the worker or job duration thresholds, the solution will notify you after the job run has completed, failed, or timed out.

Solution overview

The following diagram illustrates the solution architecture.

When you deploy this application via the AWS Serverless Application Model (AWS SAM), it will ask which AWS Glue worker and job duration thresholds you want to set to monitor the AWS Glue for Apache Spark and AWS Glue for Ray jobs running in that account. The solution will use these values as the decision criteria when invoked. The following is a breakdown of each step in the architecture:

  1. Any AWS Glue for Apache Spark job that succeeds, fails, stops, or times out is sent to Amazon EventBridge.
  2. EventBridge picks up the event from AWS Glue and triggers an AWS Lambda function (a sketch of the matching rule pattern follows this list).
  3. The Lambda function processes the event and determines if the data and analytics team should be notified about the particular job run. The function performs the following tasks:
    1. The function sends an email using Amazon Simple Notification Service (Amazon SNS) if needed.
      • If the AWS Glue job succeeded or was stopped without going over the worker or job duration thresholds, or is tagged to not be monitored, no alerts or notifications are sent.
      • If the job succeeded but ran with a worker or job duration threshold higher than allowed, or the job either failed or timed out, Amazon SNS sends a notification to the designated email with information about the AWS Glue job, run ID, and reason for alerting, along with a link to the specific run ID on the AWS Glue console.
    2. The function logs the job run information to Amazon DynamoDB for a weekly aggregated report delivered via email. The DynamoDB table has Time to Live enabled for 7 days, which keeps the storage to a minimum.
  4. Once a week, the data within DynamoDB is aggregated by a separate Lambda function with meaningful information like longest-running jobs, number of retries, failures, timeouts, cost analysis, and more.
  5. Amazon Simple Email Service (Amazon SES) is used to send the report because it can be better formatted than using Amazon SNS. The email is formatted via HTML output that provides tables for the aggregated job run data.
  6. The data and analytics team is notified about the ongoing job runs through Amazon SNS, and they receive the weekly aggregation report through Amazon SES.
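To make step 1 concrete, an EventBridge rule along these lines matches Glue job state-change events; this boto3 sketch is illustrative only (the SAM template creates its own rules), and the rule name and Lambda ARN are placeholders:

import json
import boto3

events = boto3.client("events")

# Match Glue job runs that succeed, fail, time out, or are stopped.
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"]},
}

events.put_rule(Name="glue-job-state-change", EventPattern=json.dumps(pattern))

# Send matching events to the processing Lambda function from step 2.
events.put_targets(
    Rule="glue-job-state-change",
    Targets=[{
        "Id": "glue-job-tracker",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:glue-job-tracker",
    }],
)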

Note that AWS Glue Python shell and streaming ETL jobs are not supported because they're out of scope for this solution.
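For illustration, a simplified sketch of the threshold check from step 3 might look like the following; the actual function in the repository is more complete, and the environment variable and attribute names here are assumptions:

import os
import time
import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")
dynamodb = boto3.resource("dynamodb")

WORKER_THRESHOLD = int(os.environ.get("GLUE_JOB_WORKER_THRESHOLD", "10"))
DURATION_THRESHOLD_MIN = int(os.environ.get("GLUE_JOB_DURATION_THRESHOLD", "480"))

def handler(event, context):
    detail = event["detail"]
    job_name, run_id, state = detail["jobName"], detail["jobRunId"], detail["state"]

    # Pull run details for the worker count and duration.
    run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
    workers = run.get("NumberOfWorkers", 0)
    duration_min = run.get("ExecutionTime", 0) / 60  # ExecutionTime is in seconds

    reasons = []
    if state in ("FAILED", "TIMEOUT"):
        reasons.append(f"job run state is {state}")
    if workers > WORKER_THRESHOLD:
        reasons.append(f"{workers} workers exceeds the threshold of {WORKER_THRESHOLD}")
    if duration_min > DURATION_THRESHOLD_MIN:
        reasons.append(f"ran {duration_min:.0f} minutes, threshold is {DURATION_THRESHOLD_MIN}")

    if reasons:
        sns.publish(
            TopicArn=os.environ["SNS_TOPIC_ARN"],
            Subject=f"AWS Glue job alert: {job_name}",
            Message=f"Run {run_id}: " + "; ".join(reasons),
        )

    # Log every run for the weekly report; the TTL attribute expires items after 7 days.
    dynamodb.Table(os.environ["TABLE_NAME"]).put_item(Item={
        "jobName": job_name,
        "jobRunId": run_id,
        "state": state,
        "workerType": run.get("WorkerType", "G.1X"),
        "workers": workers,
        "executionMinutes": int(duration_min),
        "ttl": int(time.time()) + 7 * 24 * 3600,
    })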

Prerequisites

It’s essential to have the next conditions:

  • An AWS account to deploy the solution to
  • Proper AWS Identity and Access Management (IAM) privileges to create the resources
  • The AWS SAM CLI to build and deploy the solution in your AWS environment

Deploy the solution

This AWS SAM application provisions the following resources:

  • Two EventBridge rules
  • Two Lambda functions
  • An SNS topic and subscription
  • A DynamoDB table
  • An SES subscription
  • The required IAM roles and policies

To deploy the AWS SAM application, complete the following steps:

Clone the aws-samples GitHub repository:

git clone https://github.com/aws-samples/aws-glue-job-tracker.git

Deploy the AWS SAM application:

cd aws-glue-job-tracker
sam deploy --guided

sam deploy configuration

Provide the following parameters:

  • GlueJobWorkerThreshold – Enter the maximum number of workers you want an AWS Glue job to be able to run with before sending a threshold alert. The default is 10. An alert will be sent if a Glue job runs with more workers than specified.
  • GlueJobDurationThreshold – Enter the maximum duration in minutes you want an AWS Glue job to run before sending a threshold alert. The default is 480 minutes (8 hours). An alert will be sent if a Glue job runs longer than the duration specified.
  • GlueJobNotifications – Enter an email or distribution list of those who need to be notified through Amazon SNS and Amazon SES. You can go to the SNS topic after the deployment is complete and add emails as needed (a subscription sketch follows this list).
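As a hedged example, additional recipients can be subscribed with boto3 after deployment; the topic ARN and address below are placeholders:

import boto3

boto3.client("sns").subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:glue-job-notifications",
    Protocol="email",
    Endpoint="data-team@example.com",  # each address must confirm the subscription
)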

To receive emails from Amazon SNS and Amazon SES, you must confirm your subscriptions. After the stack is deployed, check the email that was specified in the template and confirm by choosing the link in each message. When the application is successfully provisioned, it will begin monitoring your AWS Glue for Apache Spark job environment. The next time a job fails, times out, or exceeds a specified threshold, you will receive an email via Amazon SNS. For example, the following screenshot shows an SNS message about a job that succeeded but had a job duration threshold violation.

You might have jobs that need to run at a higher worker or job duration threshold, and you don't want the solution to evaluate them. You can simply tag that job with the key/value of remediate and false. The step function will still be invoked, but will use the PASS state when it recognizes the tag. For more information on job tagging, refer to AWS tags in AWS Glue.

Adding tags to glue job configuration
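Tags can be added on the AWS Glue console as shown in the preceding screenshot, or programmatically. The following sketch assumes the remediate/false key/value pair used by this solution, with a placeholder Region and job name:

import boto3

glue = boto3.client("glue")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Tag the job so the solution's evaluation passes over it.
glue.tag_resource(
    ResourceArn=f"arn:aws:glue:us-east-1:{account_id}:job/my-etl-job",
    TagsToAdd={"remediate": "false"},
)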

Configure weekly reporting

As mentioned previously, when an AWS Glue for Apache Spark job succeeds, fails, times out, or is stopped, EventBridge forwards this event to Lambda, where it logs specific information about each job run. Once a week, a separate Lambda function queries DynamoDB and aggregates your job runs to provide meaningful insights and recommendations about your AWS Glue for Apache Spark environment. This report is sent via email with a tabular structure as shown in the following screenshot. It's meant for top-level visibility so you're able to see your longest-running jobs over time, jobs that have had many retries, failures, and more. It also provides an overall cost calculation as an estimate of what each AWS Glue job will cost for that week. It should not be used as a guaranteed cost. If you would like to see the actual cost per job, the AWS Cost and Usage Report is the best resource to use. The following screenshot shows one table (of five total) from the AWS Glue report function.

weekly report
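To show the shape of the cost estimate, the following is a simplified sketch of the weekly aggregation; the report function in the repository is more complete, the item attribute names are assumptions, and the $0.44 per DPU-hour rate is the us-east-1 price at the time of writing (check current pricing for your Region):

from collections import defaultdict

import boto3

# DPUs per worker for each worker type, per the table earlier in this post.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8, "Z.2X": 2}
PRICE_PER_DPU_HOUR = 0.44  # assumed us-east-1 rate; verify for your Region

def estimate_weekly_cost(table_name):
    table = boto3.resource("dynamodb").Table(table_name)
    totals = defaultdict(float)

    # Scan the week's job runs (TTL keeps the table to roughly 7 days of data;
    # pagination is omitted for brevity).
    for item in table.scan()["Items"]:
        dpus = DPU_PER_WORKER.get(item.get("workerType"), 1) * int(item["workers"])
        hours = float(item["executionMinutes"]) / 60
        totals[item["jobName"]] += dpus * hours * PRICE_PER_DPU_HOUR

    return dict(totals)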

Clean up

If you don't want to run the solution anymore, delete the AWS SAM application for each account that it was provisioned in. To delete your AWS SAM stack, run the following command from your project directory:

sam delete

Conclusion

In this post, we discussed how you can monitor and cost-optimize your AWS Glue job configurations to comply with organizational standards and policy. This method can provide cost controls over AWS Glue jobs across your organization. Some other ways to help control the costs of your AWS Glue for Apache Spark jobs include the newly released AWS Glue Flex jobs and Auto Scaling. We also provided an AWS SAM application as a solution to deploy into your accounts. We encourage you to review the resources provided in this post to continue learning about AWS Glue. To learn more about monitoring and optimizing for cost using AWS Glue, please visit this recent blog post. It goes in depth on all the cost optimization options and includes a template that builds a CloudWatch dashboard for you with metrics about all of your Glue job runs.


About the authors

Michael Hamilton is a Sr. Analytics Solutions Architect focused on helping enterprise customers in the Southeast modernize and simplify their analytics workloads on AWS. He enjoys mountain biking and spending time with his wife and three children when not working.

Angus Ferguson is a Solutions Architect at AWS who is passionate about meeting customers across the world, helping them solve their technical challenges. Angus specializes in Data & Analytics with a focus on customers in the financial services industry.
