Visualize data quality scores and metrics generated by AWS Glue Data Quality


AWS Glue Data Quality lets you measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. AWS Glue Data Quality generates a substantial amount of operational runtime information during the evaluation of rulesets.

An operational scorecard is a mechanism used to evaluate and measure the quality of data processed and validated by AWS Glue Data Quality rulesets. It provides insights and metrics related to the performance and effectiveness of data quality processes.

In this post, we show how Amazon Athena and Amazon QuickSight work together to visualize operational metrics from AWS Glue Data Quality rule evaluations.

This post is Part 5 of a five-post series that explains how to build dashboards to measure and monitor your data quality.

Solution overview

The solution lets you build your AWS Glue Data Quality score and metrics dashboard using QuickSight. The following architecture diagram shows an overview of the complete pipeline.

There are six main steps in the data pipeline:

  1. Amazon EventBridge triggers an AWS Lambda function when the event pattern for AWS Glue Data Quality matches the defined rule. (Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality.)
  2. The Lambda function writes the AWS Glue Data Quality result to an Amazon Simple Storage Service (Amazon S3) bucket.
  3. An AWS Glue crawler crawls the results.
  4. The crawler builds a Data Catalog, so the data can be queried using Athena.
  5. We can analyze the data quality score and metrics using Athena SQL queries.
  6. We can query and submit the Athena data to QuickSight to create visuals for the dashboard.

In the following sections, we discuss these steps in more detail.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Have an AWS Identity and Access Management (IAM) role with permissions to extract data from an S3 bucket and write to the AWS Glue Data Catalog.
  2. Similarly, have a Lambda function execution role with access to AWS Glue and S3 buckets.
  3. Set up the Athena query result location. For more information, refer to Working with Query Results, Output Files, and Query History.
  4. Set up QuickSight permissions and enable Athena table and S3 bucket access.

Set up and deploy the Lambda pipeline

To test the solution, we can use the following AWS CloudFormation template. The CloudFormation template creates the EventBridge rule, Lambda function, and S3 bucket to store the data quality results.

If you deployed the CloudFormation template in the previous post, you don’t need to deploy it again in this step.

The following screenshot shows a line of code in which the Lambda function writes the results from AWS Glue Data Quality to an S3 bucket. As depicted, the data is stored in JSON format and organized by time period, which makes it convenient to access and analyze the data over time.
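Because the screenshot is not reproduced here, the following is a minimal sketch of how such a write could be organized. It is not the code from the CloudFormation template. The function name, the sample result, and the year=/month=/day= key layout are assumptions made for illustration; only the gluedataqualitylogs prefix and the field names come from this post.

```python
import json
from datetime import datetime, timezone


def build_result_object(result: dict, now: datetime, prefix: str = "gluedataqualitylogs"):
    """Return (key, body) for one data quality result.

    The year=/month=/day= layout is an assumption, chosen so that a crawler
    would expose year, month, and day as partition columns.
    """
    key = (
        f"{prefix}/year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"{result['resultid']}.json"
    )
    body = json.dumps(result)  # results are stored as JSON
    return key, body


# Hypothetical result record; a real one would come from the EventBridge event.
sample = {"resultid": "dqresult-example", "score": 0.75, "resultrun": []}
key, body = build_result_object(sample, datetime(2023, 6, 15, tzinfo=timezone.utc))
print(key)
```

In a real handler, the key and body would then be passed to an S3 write such as the boto3 put_object call, with the bucket taken from the stack outputs.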

Set up the AWS Glue Data Catalog using a crawler

Complete the following steps to create an AWS Glue crawler and set up the Data Catalog:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter data-quality-result-crawler, then choose Next.
  4. Under Data sources, choose Add a data source.
  5. For Data source, choose S3.
  6. For S3 path, enter the S3 path to your data source (s3://<AWS CloudFormation outputs key:DataQualityS3BucketNameOutputs>/gluedataqualitylogs/). Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality for details.
  7. Choose Add an S3 data source and choose Next.
  8. For Existing IAM role, choose your IAM role (GlueDataQualityLaunchBlogDemoRole-xxxx). Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality for details. Then choose Next.
  9. For Target database, choose Add database.
  10. For Database name, enter data-quality-result-database, then choose Create.
  11. For Table name prefix, enter dq_, then choose Next.
  12. Choose Create crawler.
  13. On the Crawlers page, select data-quality-result-crawler and choose Run.
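The console steps above can also be expressed against the AWS Glue API. The following is a minimal sketch, not part of the original walkthrough, that builds the arguments for a boto3 create_crawler call using the names from the steps. The bucket name is a hypothetical placeholder, the role name keeps this post's own placeholder suffix, and the sketch assumes the target database already exists.

```python
def build_crawler_request(bucket: str, role_name: str) -> dict:
    """Keyword arguments for the Glue CreateCrawler API, mirroring the console steps."""
    return {
        "Name": "data-quality-result-crawler",
        "Role": role_name,
        "DatabaseName": "data-quality-result-database",
        "TablePrefix": "dq_",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/gluedataqualitylogs/"}]},
    }


def create_and_run_crawler(request: dict) -> None:
    """Create the crawler and start it; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder runs without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical bucket name; use the value from your CloudFormation outputs.
request = build_crawler_request(
    "my-dq-results-bucket", "GlueDataQualityLaunchBlogDemoRole-xxxx"
)
print(request["Targets"]["S3Targets"][0]["Path"])
```

Calling create_and_run_crawler(request) would create and start the crawler. It is kept separate here so that the request can be inspected without touching an AWS account.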

When the crawler is complete, you can see the AWS Glue Data Catalog table definition.

After you create the table definition on the AWS Glue Data Catalog, you can use Athena to query the Data Catalog table.

Query the Data Catalog table using Athena

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 and the AWS Glue Data Catalog using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale.

The purpose of this step is to understand our data quality statistics at the table level as well as at the ruleset level. Athena provides simple queries to assist you with this task. Use the queries in this section to analyze your data quality metrics and create an Athena view to use to build a QuickSight dashboard in the next step.

Query 1

The following is a simple SELECT query on the Data Catalog table:

SELECT * FROM "data-quality-result-database"."dq_gluedataqualitylogs" LIMIT 10;

The following screenshot shows the output.

Before we run the second query, let’s check the schema for the table dq_gluedataqualitylogs.

The following screenshot shows the output.

The table shows that one of the columns, resultrun, is the array data type. To work with this column in QuickSight, we need to perform an additional step to transform it into multiple strings. This is necessary because QuickSight doesn’t support the array data type.

Query 2

Use the following query to review the data in the resultrun column:

SELECT resultrun FROM "data-quality-result-database"."dq_gluedataqualitylogs" LIMIT 10;

The following screenshot shows the output.

Query 3

The following query flattens an array into multiple rows using CROSS JOIN in conjunction with the UNNEST operator and creates a view on the selected columns:

CREATE OR REPLACE VIEW data_quality_result_view AS
SELECT "databasename","tablename",
"ruleset_name","runid", "resultid",
"state", "numrulessucceeded",
"numrulesfailed", "numrulesskipped",
"score", "year","month",
"day",runs.name,runs.result,
runs.evaluationmessage,runs.Description
FROM "dq_gluedataqualitylogs"
CROSS JOIN unnest(resultrun) AS t(runs)

The following screenshot shows the output.

Verify the columns that were created using the unnest operator.

The following screenshot shows the output.
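To make the effect of CROSS JOIN with unnest in Query 3 concrete, the following sketch reproduces the same flattening on in-memory rows. It only illustrates the semantics and is not how Athena runs the query. The sample rows and the function name are hypothetical, and the nested field names are illustrative.

```python
def unnest_resultrun(rows: list[dict]) -> list[dict]:
    """Emit one flat row per element of each row's resultrun array."""
    flat = []
    for row in rows:
        # Columns of the parent row are repeated on every output row.
        parent = {k: v for k, v in row.items() if k != "resultrun"}
        for run in row["resultrun"]:
            flat.append({**parent, **run})
    return flat


# Hypothetical sample data shaped like the crawled table.
rows = [
    {
        "tablename": "orders",
        "score": 0.5,
        "resultrun": [
            {"name": "Rule_1", "result": "PASS"},
            {"name": "Rule_2", "result": "FAIL"},
        ],
    },
    {"tablename": "customers", "score": 1.0, "resultrun": []},
]

for flat_row in unnest_resultrun(rows):
    print(flat_row)
```

As with the SQL form, a row whose array is empty produces no output rows, so the customers row does not appear in the flattened result.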

Query 4

Verify the Athena view created in the previous query:

SELECT * FROM data_quality_result_view LIMIT 10

The following screenshot shows the output.
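The same verification query can be submitted outside the console. The following is a minimal sketch, assuming boto3 and the Athena query result location from the prerequisites. The output location shown is a hypothetical placeholder, and the helper names are not from this post.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution ID; needs boto3 and credentials."""
    import boto3  # imported lazily so the request builder runs without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


request = build_athena_request(
    "SELECT * FROM data_quality_result_view LIMIT 10",
    "data-quality-result-database",
    "s3://my-athena-query-results/",  # hypothetical result location
)
print(request["QueryString"])
```

Athena runs queries asynchronously, so a caller would still need to poll the execution status before reading the results.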

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Complete the following steps:

  1. Sign in to the QuickSight console.
  2. In the upper right corner of the console, choose Admin/username, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. Under QuickSight access to AWS services, choose Add or remove.
  5. Choose Amazon Athena, then choose Next.
  6. Give QuickSight access to the S3 bucket where your data quality result is stored.

Create your datasets

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena view (data_quality_result_view). Complete the following steps:

  1. On the Datasets page, choose New dataset, then choose Athena.
  2. Choose the AWS Glue database that you created earlier.
  3. Select Import to SPICE (alternatively, you can select Directly query your data).
  4. Choose Visualize.

Build your dashboard

Create your analysis with one donut chart, one pivot table, one vertical stacked bar, and one funnel chart that use the different fields in the dataset. QuickSight offers a wide range of charts and visuals to help you create your dashboard. For more information, refer to Visual types in Amazon QuickSight.

Clean up

To avoid incurring future charges, delete the resources created in this post.

Conclusion

In this post, we provide insights into running Athena queries and building customized dashboards in QuickSight to understand data quality metrics. This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets.

To dive into the AWS Glue Data Quality APIs, refer to Data Quality API. To learn more about AWS Glue Data Quality, see the AWS Glue Data Quality Developer Guide. To learn more about QuickSight dashboards, refer to the Amazon QuickSight Developer Guide.


About the authors

Zack Zhou is a Software Development Engineer on the AWS Glue team.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architecture on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Avik Bhattacharjee is a Senior Partner Solutions Architect at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible, focusing on big data and analytics and AI/ML.

Amit Kumar Panda is a Data Architect at AWS Professional Services who is passionate about helping customers build scalable data analytics solutions to enable making critical business decisions.
