Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Customers are looking for tools that make it easier to migrate from other data warehouses, such as Google BigQuery, to Amazon Redshift and take advantage of the service's price-performance, ease of use, security, and reliability.
In this post, we show you how to use AWS native services to accelerate your migration from Google BigQuery to Amazon Redshift. We use AWS Glue, a fully managed, serverless ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors). We also add automation and flexibility to simplify migration of multiple tables to Amazon Redshift using the Custom Auto Loader Framework.
Solution overview
The solution provides a scalable and managed data migration workflow to migrate data from Google BigQuery to Amazon Simple Storage Service (Amazon S3), and then from Amazon S3 to Amazon Redshift. This pre-built solution scales to load data in parallel using input parameters.
The following architecture diagram shows how the solution works. It starts with setting up the migration configuration to connect to Google BigQuery, then converts the database schemas, and finally migrates the data to Amazon Redshift.
The workflow contains the following steps:
- A configuration file is uploaded to an S3 bucket you have chosen for this solution. This JSON file contains the migration metadata, namely the following:
  - A list of Google BigQuery projects and datasets.
  - A list of all tables to be migrated for each project and dataset pair.
- An Amazon EventBridge rule triggers an AWS Step Functions state machine to start migrating the tables (a sketch of this event wiring follows this list).
- The Step Functions state machine iterates on the tables to be migrated and runs an AWS Glue Python shell job to extract the metadata from Google BigQuery and store it in an Amazon DynamoDB table used for tracking the tables' migration status.
- The state machine iterates on the metadata from this DynamoDB table to run the table migration in parallel, based on the maximum number of migration jobs, without incurring limits or quotas on Google BigQuery. It performs the following steps:
  - Runs the AWS Glue migration job for each table in parallel.
  - Tracks the run status in the DynamoDB table.
  - After the tables have been migrated, checks for errors and exits.
- The data exported from Google BigQuery is saved to Amazon S3. We use Amazon S3 (even though AWS Glue jobs can write directly to Amazon Redshift tables) for a few specific reasons:
  - We can decouple the data migration and the data load steps.
  - It offers more control over the load steps, with the ability to reload the data or pause the process.
  - It provides fine-grained monitoring of the Amazon Redshift load status.
- The Custom Auto Loader Framework automatically creates schemas and tables in the target database and continuously loads data from Amazon S3 to Amazon Redshift.
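The CloudFormation template described later provisions this event wiring for you. Purely as an illustration of the mechanism, the following minimal boto3 sketch shows how an S3 "Object Created" event for the configuration file could be routed to a state machine; the bucket name, state machine ARN, and IAM role ARN are placeholders, not resources created by the stack.

```python
import json
import boto3

events = boto3.client("events")

# Placeholders -- replace with the bucket, state machine, and role from your deployment.
CONFIG_BUCKET = "my-migration-bucket"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:bq-to-s3-migration"
EVENTBRIDGE_ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-start-sfn-role"

# Match S3 "Object Created" events for the migration configuration file.
events.put_rule(
    Name="start-bq-migration-on-config-upload",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [CONFIG_BUCKET]},
            "object": {"key": [{"prefix": "bq-mig-config"}]},
        },
    }),
)

# Start the Step Functions state machine when the rule matches.
events.put_targets(
    Rule="start-bq-migration-on-config-upload",
    Targets=[{
        "Id": "StartMigrationStateMachine",
        "Arn": STATE_MACHINE_ARN,
        "RoleArn": EVENTBRIDGE_ROLE_ARN,
    }],
)
```

Note that this pattern only fires if EventBridge notifications are enabled on the bucket, which is covered in the test steps later in this post.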
A few additional points to note:
- If you have already created the target schema and tables in the Amazon Redshift database, you can configure the Custom Auto Loader Framework to not automatically detect and convert the schema.
- If you want more control over converting the Google BigQuery schema, you can use the AWS Schema Conversion Tool (AWS SCT). For more information, refer to Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (SCT).
- As of this writing, neither AWS SCT nor the Custom Auto Loader Framework supports the conversion of nested data types (record, array, and struct). Amazon Redshift supports semistructured data through the SUPER data type, so if your table uses such complex data types, you need to create the target tables manually.
To deploy the solution, there are two main steps:
- Deploy the solution stack using AWS CloudFormation.
- Deploy and configure the Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift.
Prerequisites
Before getting started, make sure you have the following:
- An account in Google Cloud, specifically a service account that has permissions to Google BigQuery.
- An AWS account with an AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI). The IAM user also needs permissions to create an IAM role and policies.
- An Amazon Redshift cluster or Amazon Redshift Serverless workgroup. If you don't have one, refer to Amazon Redshift Serverless.
- An S3 bucket. If you don't want to use one of your existing buckets, you can create a new one. Note the name of the bucket so you can pass it to the CloudFormation stack as an input parameter later.
- A configuration file with the list of tables to be migrated. This JSON file should follow the structure illustrated in the sketch after this list.
Alternatively, you can download the demo file, which uses the open dataset created by the Centers for Medicare & Medicaid Services.
In this example, we named the file bq-mig-config.json.
- Configure your Google account.
- Create an IAM role for AWS Glue (and note down the name of the IAM role).
- Subscribe to and activate the Google BigQuery Connector for AWS Glue.
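The exact schema of the configuration file is defined by the demo file linked above. The following is a minimal, illustrative sketch of building an equivalent file from the metadata described earlier (a list of projects and datasets, and the tables to migrate for each pair) and uploading it with boto3; the field names and bucket name are assumptions for illustration, not the authoritative format.

```python
import json
import boto3

# Illustrative structure only -- check the demo file for the exact field names
# expected by the solution.
migration_config = [
    {
        "project": "my-bq-project",          # Google BigQuery project
        "dataset": "cms_medicare",           # dataset within that project
        "tables": ["table_a", "table_b"],    # tables to migrate for this pair
    }
]

# Upload the configuration to the S3 bucket you will pass to the CloudFormation stack.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-migration-bucket",               # placeholder bucket name
    Key="bq-mig-config/bq-mig-config.json",     # folder referenced later in this post
    Body=json.dumps(migration_config).encode("utf-8"),
)
```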
Deploy the solution using AWS CloudFormation
To deploy the solution stack using AWS CloudFormation, complete the following steps:
- Choose Launch Stack:
This template provisions the AWS resources in the us-east-1 Region. If you want to deploy to a different Region, download the template bigquery-cft.yaml and launch it manually: on the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded.
The list of provisioned resources is as follows:
- An EventBridge rule to start the Step Functions state machine on the upload of the configuration file.
- A Step Functions state machine that runs the migration logic. The following diagram illustrates the state machine.
- An AWS Glue Python shell job used to extract the metadata from Google BigQuery. The metadata is stored in a DynamoDB table, with a calculated attribute to prioritize the migration job. By default, the connector creates one partition per 400 MB in the table being read (before filtering). As of this writing, the Google BigQuery Storage API has a maximum quota for parallel read streams, so we cap the number of worker nodes for tables larger than 400 GB. We also calculate the maximum number of jobs that can run in parallel based on these values (see the sketch after this list).
- An AWS Glue ETL job used to extract the data from each Google BigQuery table and save it in Amazon S3 in Parquet format.
- A DynamoDB table (bq_to_s3_tracking) used to store the metadata for each table to be migrated (size of the table, S3 path used to store the migrated data, and the number of workers needed to migrate the table).
- A DynamoDB table (bq_to_s3_maxstreams) used to store the maximum number of streams per state machine run. This helps us minimize job failures due to limits or quotas. Use the CloudFormation template to customize the names of the DynamoDB tables. The prefix for the DynamoDB tables is bq_to_s3.
- The IAM roles needed by the state machine and AWS Glue jobs.
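The following is a minimal sketch, under stated assumptions, of the kind of sizing logic described above. The 400 MB partition size is the connector default mentioned earlier; the stream quota, worker cap, and function names are placeholders for illustration rather than the solution's actual code.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

PARTITION_SIZE_BYTES = 400 * MB   # connector default: one partition per 400 MB read
LARGE_TABLE_BYTES = 400 * GB      # tables above this size get a capped stream count
ASSUMED_STREAM_QUOTA = 2_000      # placeholder for the BigQuery Storage API read-stream quota
STREAM_CAP = 100                  # placeholder cap for very large tables

def read_streams_for_table(table_size_bytes: int) -> int:
    """Estimate parallel read streams (one per 400 MB partition), capped for large tables."""
    streams = math.ceil(table_size_bytes / PARTITION_SIZE_BYTES)
    if table_size_bytes > LARGE_TABLE_BYTES:
        streams = min(streams, STREAM_CAP)
    return max(streams, 1)

def max_parallel_jobs(table_sizes_bytes: list[int], requested_parallelism: int) -> int:
    """Keep the total read streams of concurrent jobs under the assumed quota."""
    largest = max(read_streams_for_table(size) for size in table_sizes_bytes)
    return max(1, min(requested_parallelism, ASSUMED_STREAM_QUOTA // largest))

# Example: three tables of 10 GB, 150 GB, and 900 GB with InputMaxParallelism = 30.
sizes = [10 * GB, 150 * GB, 900 * GB]
print([read_streams_for_table(size) for size in sizes])
print(max_parallel_jobs(sizes, requested_parallelism=30))
```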
- Choose Next.
- For Stack name, enter a name.
- For Parameters, enter the parameters listed in the following table, then choose Create.
| CloudFormation Template Parameter | Allowed Values | Description |
|---|---|---|
| InputBucketName | S3 bucket name | The S3 bucket where the AWS Glue job stores the migrated data. The data is stored in a folder inside this bucket that the Custom Auto Loader Framework reads from (the s3-redshift-loader-source folder shown later in this post). |
| InputConnectionName | AWS Glue connection name; the default is glue-bq-connector-24 | The name of the AWS Glue connection that is created using the Google BigQuery connector. |
| InputDynamoDBTablePrefix | DynamoDB table name prefix; the default is bq_to_s3 | The prefix used when naming the two DynamoDB tables created by the solution. |
| InputGlueETLJob | AWS Glue ETL job name; the default is bq-migration-ETL | The name you want to give to the AWS Glue ETL job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path. |
| InputGlueMetaJob | AWS Glue Python shell job name; the default is bq-get-metadata | The name you want to give to the AWS Glue Python shell job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path. |
| InputGlueS3Path | S3 path; the default is s3://aws-glue-scripts-${AWS::Account}-${AWS::Region}/admin/ | The S3 path to which the stack copies the scripts for the AWS Glue jobs. Remember to replace ${AWS::Account} with the actual AWS account ID and ${AWS::Region} with the Region you plan to use, or provide your own bucket and prefix as a complete path. |
| InputMaxParallelism | Number of parallel migration jobs to run; the default is 30 | The maximum number of tables you want to migrate concurrently. |
| InputBQSecret | AWS Secrets Manager secret name | The name of the AWS Secrets Manager secret in which you stored the Google BigQuery credential. |
| InputBQProjectName | Google BigQuery project name | The name of your project in Google BigQuery in which you want to store temporary tables; you will need write permissions on the project. |
| StateMachineName | Step Functions state machine name (the template provides a default) | The name of the Step Functions state machine. |
| SourceS3BucketName | S3 bucket name; the default is aws-blogs-artifacts-public | The S3 bucket where the artifacts for this post are stored. Do not change the default. |
Deploy and configure the Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift
The Custom Auto Loader Framework utility makes data ingestion to Amazon Redshift simpler by automatically loading data files from Amazon S3 to Amazon Redshift. The files are mapped to the respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to Custom Auto Loader Framework.
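Under the hood, the framework issues Amazon Redshift COPY commands on your behalf. As a rough, hand-rolled equivalent (not the framework's actual code), the following sketch uses the Amazon Redshift Data API via boto3 to copy one Parquet file set into a table; the cluster, database, user, role, and paths are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholders -- the Custom Auto Loader Framework derives these from its configuration.
copy_sql = """
    COPY public.table_a
    FROM 's3://my-migration-bucket/s3-redshift-loader-source/table_a/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftDemoRole'
    FORMAT AS PARQUET;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="democluster",
    Database="dev",
    DbUser="demo",
    Sql=copy_sql,
)

# The Data API is asynchronous; poll describe_statement to check the load status.
print(redshift_data.describe_statement(Id=response["Id"])["Status"])
```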
To set up the Custom Auto Loader Framework, complete the following steps:
- Choose Launch Stack to deploy the CloudFormation stack in the us-east-1 Region:
- On the AWS CloudFormation console, choose Next.
- Provide the following parameters to help ensure the successful creation of resources. Make sure you have collected these values beforehand.
| Parameter Name | Allowed Values | Description |
|---|---|---|
| CopyCommandSchedule | cron(0/5 * ? * * *) | The EventBridge rules KickoffFileProcessingSchedule and QueueRSProcessingSchedule are triggered based on this schedule. The default is 5 minutes. |
| DatabaseName | dev | The Amazon Redshift database name. |
| DatabaseSchemaName | public | The Amazon Redshift schema name. |
| DatabaseUserName | demo | The Amazon Redshift user who has access to run COPY commands on the Amazon Redshift database and schema. |
| RedshiftClusterIdentifier | democluster | The Amazon Redshift cluster name. |
| RedshiftIAMRoleARN | arn:aws:iam::7000000000:role/RedshiftDemoRole | The role attached to the Amazon Redshift cluster, which has access to the S3 bucket. This role is used in COPY commands. |
| SourceS3Bucket | Your-bucket-name | The S3 bucket where the data is located. Use the same bucket you used to store the migrated data as indicated in the previous stack. |
| CopyCommandOptions | delimiter '\|' gzip | The additional COPY command data format parameters to use. |
| InitiateSchemaDetection | Yes | The setting to dynamically detect the schema prior to file upload. |
The following screenshot shows an example of our parameters.
- Choose Create.
- Monitor the progress of the stack creation and wait until it is complete.
- To verify the Custom Auto Loader Framework configuration, log in to the Amazon S3 console and navigate to the S3 bucket you provided as a value for the SourceS3Bucket parameter.
You should see that a new directory called s3-redshift-loader-source has been created.
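If you prefer the AWS SDK to the console, a quick sanity check like the following (a sketch with a placeholder bucket name) confirms the same thing:

```python
import boto3

s3 = boto3.client("s3")

# List the folder the Custom Auto Loader Framework created in the source bucket.
response = s3.list_objects_v2(
    Bucket="my-migration-bucket",            # the bucket passed as SourceS3Bucket
    Prefix="s3-redshift-loader-source/",
    Delimiter="/",
)
print(response.get("KeyCount", 0), "objects under s3-redshift-loader-source/")
```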
Test the solution
To test the solution, complete the following steps:
- Create the configuration file based on the prerequisites. You can also download the demo file.
- To set up the S3 bucket, on the Amazon S3 console, navigate to the folder bq-mig-config in the bucket you provided in the stack.
- Upload the config file into it.
- To enable EventBridge notifications to the bucket, open the bucket on the console and on the Properties tab, locate Event notifications.
- In the Amazon EventBridge section, choose Edit.
- Select On, then choose Save changes.
- On the AWS Step Functions console, monitor the run of the state machine (you can also do this programmatically, as shown in the sketch after this list).
- Monitor the status of the loads in Amazon Redshift. For instructions, refer to Viewing Current Loads.
- Open the Amazon Redshift Query Editor v2 and query your data.
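As a programmatic alternative to watching the console, the following minimal boto3 sketch lists the latest executions of the state machine; the ARN is a placeholder for the state machine created by the solution stack.

```python
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN -- use the state machine created by the solution stack.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:bq-to-s3-migration"

executions = sfn.list_executions(
    stateMachineArn=STATE_MACHINE_ARN,
    maxResults=5,
)["executions"]

for execution in executions:
    print(execution["name"], execution["status"], execution["startDate"])
```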
Pricing considerations
You might incur egress charges for migrating data out of Google BigQuery into Amazon S3. Review and calculate the cost of moving your data in your Google Cloud billing console. As of this writing, AWS Glue 3.0 or later charges $0.44 per DPU-hour, billed per second, with a 1-minute minimum for Spark ETL jobs. For more information, see AWS Glue Pricing. With auto scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage or micro-batch of the job run.
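For a rough sense of the AWS Glue portion of the cost, the following back-of-the-envelope sketch applies the published $0.44 per DPU-hour rate to an assumed job size and duration; the worker count and runtime are made-up examples, not benchmarks.

```python
# Illustrative AWS Glue cost estimate -- assumed values, not measured results.
DPU_HOUR_RATE = 0.44        # USD per DPU-hour for AWS Glue 3.0+ Spark ETL jobs
workers = 10                # assumed number of DPUs used by one migration job
runtime_minutes = 15        # assumed job duration

dpu_hours = workers * (runtime_minutes / 60)
estimated_cost = dpu_hours * DPU_HOUR_RATE
print(f"~{dpu_hours:.2f} DPU-hours, about ${estimated_cost:.2f} for this job")
```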
Clean up
To avoid incurring future charges, clean up your resources:
- Delete the CloudFormation solution stack.
- Delete the CloudFormation Custom Auto Loader Framework stack.
Conclusion
In this post, we demonstrated how to build a scalable and automated data pipeline to migrate your data from Google BigQuery to Amazon Redshift. We also highlighted how the Custom Auto Loader Framework can automate schema detection, create tables for your S3 files, and continuously load the files into your Amazon Redshift warehouse. With this approach, you can automate the migration of entire projects (even multiple projects at a time) from Google BigQuery to Amazon Redshift. This significantly improves data migration times into Amazon Redshift through automated parallelization of the table migration.
The auto-copy feature in Amazon Redshift simplifies automatic data loading from Amazon S3 with a simple SQL command; users can easily automate data ingestion from Amazon S3 to Amazon Redshift using the Amazon Redshift auto-copy preview feature.
For more information about the performance of the Google BigQuery Connector for AWS Glue, refer to Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery and learn how to migrate a large amount of data (1.9 TB) into Amazon S3 quickly (about 8 minutes).
To learn more about AWS Glue ETL jobs, see Simplify data pipelines with AWS Glue automatic code generation and workflows and Making ETL easier with AWS Glue Studio.
About the Authors
Tahir Aziz is an Analytics Solution Architect at AWS. He has worked on building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.
Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.
Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.
Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial services institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.
Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has over 12 years of experience helping organizations derive insights from their data.