
Move data from AWS S3 to GCP Cloud Storage on-demand using AWS Lambda and GCP Storage Transfer Service

Stefano Passador

--

There are two different approaches to synchronizing data from an AWS S3 bucket to a GCP Cloud Storage bucket. The first can be called PULL: any mechanism triggered by GCP (the target system) to synchronize data. This requires the target system to schedule a job in order to maintain a certain level of consistency between the two systems. The other is PUSH: when a new file is created or a specific trigger fires in AWS S3 (the source system), the source either moves data directly to the target system or creates a transfer job that takes care of the sync task.

Storage Transfer Service is a Google Cloud Platform service that makes cloud-to-cloud data transfer possible: for example, from AWS S3 (our case) or from Azure Blob Storage to GCP Cloud Storage (see https://cloud.google.com/storage-transfer-service#storage-transfer-service for more details). The service allows the creation of scheduled transfer jobs as well as one-shot jobs.

The two pull/push methods can be seen in our context as:

  • PULL: a GCP Storage Transfer Service job with a scheduled trigger that reads from AWS S3.
  • PUSH: a Lambda function, triggered when a new file is created in AWS S3, that launches a Storage Transfer Service job to perform the transfer.

Requirements

To try out the tutorial you’ll need an AWS and GCP account.

PULL

In the pull flow, we will create a scheduled Storage Transfer Service job that copies everything from the S3 bucket and then deletes all files from the source.

This way, we make sure that no file is transferred to the destination more than once.

Let's start with the initial setup on both platforms.

On AWS we have to:

  • Log in to the AWS console and create a new S3 bucket named storage-transfer-service-test-gcp (I'm creating it in eu-west-1), leaving all other options at their defaults.
  • Upload a single test file (test.txt) to the bucket.
  • Go to the security credentials page by clicking on your profile name and selecting “My Security Credentials”.
  • Select “Access keys” and then “Create New Access Key”.
  • Save the Access Key ID and Secret Access Key shown under “Show Access Key” (you can also download the key file).
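If you prefer to script the bucket creation and upload rather than clicking through the console, a minimal sketch with boto3 (assuming your AWS credentials are already configured locally, and using the same bucket and file names as above) could look like this:

```python
import boto3

# Sketch: create the source bucket and upload the test file with boto3,
# equivalent to the first two console steps above.
s3 = boto3.client('s3', region_name='eu-west-1')
s3.create_bucket(
    Bucket='storage-transfer-service-test-gcp',
    CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})
s3.upload_file('test.txt', 'storage-transfer-service-test-gcp', 'test.txt')
```

The access key itself still has to be created from the console (or IAM), since it is the credential we will later hand to GCP.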

On GCP we:

  • Log in to the GCP console.
  • Create a Cloud Storage bucket (I called mine storage-transfer-service-test-destination).
  • Enable the “Storage Transfer API”.
  • Open the Storage Transfer Service page by looking in the left sidebar for “Data Transfer” → “Transfer Service | cloud”.
  • Click on “Create transfer”.
  • When selecting the source system, click on “Amazon S3 bucket”.
  • As the Amazon S3 bucket name, use storage-transfer-service-test-gcp or whichever name you used before.
  • For the Access Key ID and Secret Access Key, insert the ones we generated earlier in the AWS console.
  • Then specify the destination bucket with the name of the Cloud Storage bucket you’ve just created (mine is storage-transfer-service-test-destination).
  • A set of “Transfer options” will appear; I select “Delete objects from source” (later we will go into more detail about the three options).
  • We now reach the “Configure transfer” step, where we can choose to run the transfer immediately, schedule it to run every day at a specific time, or schedule a single run for later.
  • Click on “Run now”.
  • The job is now created and starts running.

When the “Completed” status is reached, we can go to the S3 bucket and see that the files inside have been deleted by the transfer service. Inside the Cloud Storage bucket we will find the files that were previously in the S3 bucket.

Here are more details about each of the three transfer options we can select (note that the overwrite option can be combined with either delete option, but the two delete options are mutually exclusive):

  • Overwrite destination with source, even when identical: implies that when a source object and destination object have the same name, the source object will always overwrite the destination, even if the two objects are identical.
  • Delete objects from the source once they are transferred: any object successfully transferred to the destination will be deleted from the source. After the transfer, the object will only exist at the destination.
  • Delete an object from the destination if there is no version in the source: ensures your destination only has objects that are also found at the source. After the transfer, any object at the destination is deleted if no object with the same name is found at the source.

So, combining these options makes it possible to configure scheduled jobs that run every day and keep the two buckets synchronized, or to implement other behaviors.
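For reference, when a transfer job is created through the Storage Transfer API instead of the console (as we will do in the push section), these three options correspond to boolean fields of the job's transferOptions. A minimal Python fragment showing the mapping:

```python
# transferOptions fragment of a Storage Transfer Service job (v1 API).
# Each field corresponds to one of the three console options above; the
# two delete flags cannot both be set to True.
transfer_options = {
    'overwriteObjectsAlreadyExistingInSink': True,  # overwrite even when identical
    'deleteObjectsFromSourceAfterTransfer': True,   # delete from source once transferred
    'deleteObjectsUniqueInSink': False,             # delete from destination if missing at source
}
```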

PUSH

For the push method, we want to create a Storage Transfer Service job every time a new file is added to the AWS S3 bucket. This creates a sort of on-demand synchronization between the two buckets (although applying different sets of the “transfer options” can produce very different behaviors).

For the creation of the S3 and Cloud Storage buckets, the creation of the Access Key, and the upload of a file to the S3 bucket, you can follow the first steps of the previous section.

In order to do the “on-demand” transfer we have to:

  • Create a GCP Service Account with permission to create Storage Transfer Service jobs
  • Write a Lambda function that creates a Storage Transfer Service job through the API
  • Trigger the Lambda function whenever a new file is created/uploaded to AWS S3

The first thing we have to do is create a Service Account on GCP. To do that, go to the GCP IAM page, click on “Service Accounts” in the left sidebar, and create a new service account with the name and description you prefer and “Storage Transfer User” as the role. After the creation, open the three-dot menu in the Actions column for your new service account, click “Create key”, and choose the JSON type. A JSON key file will be downloaded; this is the one we will use later in our AWS Lambda function.

I create a Lambda function called StorageTransferServiceStarter, creating a new role with basic Lambda permissions.
A new page will appear in which you can modify the code of your Lambda.
The first thing we want to do is define the trigger of our Lambda function. To do that, in the “Configuration” → “Designer” view, press “Add trigger”. Specify that you want an S3 trigger for “All object create events” on the bucket we created previously.
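For reference, the S3 trigger invokes the handler with an event payload that describes the newly created object. A small sketch of how the bucket name and object key can be read from it (a hypothetical helper; the job we create later transfers the whole bucket, so this is mainly useful for logging or for filtering which objects should start a transfer):

```python
# Sketch: extracting bucket and key from the S3 event that triggered the Lambda.
def extract_s3_object(event):
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    return bucket, key
```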

We also have to give the Lambda function permission to read from S3. To do so, go to the “Permissions” tab and click on your role name. The page that opens shows the set of policies attached to the role; we now have to attach the policy (“Attach policies”) called “AmazonS3ReadOnlyAccess”.

Now we want to implement the Lambda function that interacts with the GCP Storage Transfer Service.
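Here is a simplified sketch of the function (the full code is in the repository linked at the end of the article). The key-file name, the environment-variable names, and the transfer options used here are placeholders chosen for this sketch, not necessarily the ones used in the repository:

```python
import datetime
import os

import googleapiclient.discovery
from oauth2client.service_account import ServiceAccountCredentials


def create_transfer_client(project_id, source_bucket, sink_bucket,
                           access_key, secret_access_key):
    """Builds an authenticated Storage Transfer Service client and
    creates a one-shot S3-to-Cloud Storage transfer job."""
    # The service-account JSON key is bundled with the deployment package
    # (the file name here is a placeholder).
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        'service-account-key.json',
        scopes=['https://www.googleapis.com/auth/cloud-platform'])
    client = googleapiclient.discovery.build(
        'storagetransfer', 'v1', credentials=credentials)

    # Using the same value for scheduleStartDate and scheduleEndDate makes
    # this a one-shot job. No job name is specified, so the API assigns a
    # unique one and repeated invocations do not clash.
    today = datetime.date.today()
    day = {'year': today.year, 'month': today.month, 'day': today.day}

    transfer_job = {
        'description': 'On-demand transfer triggered by AWS Lambda',
        'status': 'ENABLED',
        'projectId': project_id,
        'schedule': {
            'scheduleStartDate': day,
            'scheduleEndDate': day,
        },
        'transferSpec': {
            'awsS3DataSource': {
                'bucketName': source_bucket,
                'awsAccessKey': {
                    'accessKeyId': access_key,
                    'secretAccessKey': secret_access_key,
                },
            },
            'gcsDataSink': {'bucketName': sink_bucket},
            'transferOptions': {
                'deleteObjectsFromSourceAfterTransfer': True,
            },
        },
    }

    return client.transferJobs().create(body=transfer_job).execute()


def lambda_handler(event, context):
    # All parameters are read from Lambda environment variables
    # (the variable names are placeholders for this sketch).
    access_key = os.getenv('SOURCE_ACCESS_KEY_ID')
    secret_access_key = os.getenv('SOURCE_SECRET_ACCESS_KEY')
    project_id = os.getenv('GCP_PROJECT_ID')
    source_bucket = os.getenv('S3_BUCKET_NAME')
    sink_bucket = os.getenv('GCS_BUCKET_NAME')

    result = create_transfer_client(project_id, source_bucket, sink_bucket,
                                    access_key, secret_access_key)
    print('Created transfer job: {}'.format(result.get('name')))
    return result
```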

In the code we have lambda_handler(), which is the method called when the Lambda starts. Here we specify the parameters needed to create the Transfer Service job. You can see that the parameters access_key and secret_access_key are initialized through os.getenv(PARAM_NAME): this relies on Lambda environment variables, which let you define values that stay out of the code but can hold access keys (as in this case) or other configuration.

In create_transfer_client() there is the logic required to create the transfer job.
Note that using the same value for scheduleStartDate and scheduleEndDate in the transfer job specification creates a one-shot job.
I suggest you do not specify a name for the transfer job; otherwise only one job will ever be created, since a name clash will occur when the second one is submitted.

Be aware that this Lambda function has two requirements:

  • google-api-python-client
  • oauth2client

In order to deploy it on AWS, you must create a deployment package. To do that, I suggest following this guide: https://docs.aws.amazon.com/lambda/latest/dg/python-package.html#python-package-dependencies
You must have the AWS CLI installed (and run aws configure the first time) to deploy the Lambda function from your PC.

Now you can try uploading a file to the S3 bucket. The Lambda function will be triggered, the Storage Transfer Service job created, and the file moved and deleted from the source.

You can see the AWS Lambda code in the repository https://github.com/stefanopassador/AWSLambda-StorageTransferService

At the time of writing (27/11/2020), GCP Storage Transfer Service pricing is $0.04 per GB transferred.

If you have any tips, comments or questions, do not hesitate to contact me.
