Streaming uploads to Google Cloud Storage

Stefano Passador
2 min read · Aug 9, 2023

Google Cloud Storage is mainly used to store files, and a typical upload takes a few steps:

  1. Data is being collected
  2. Data is being stored (even if it is in memory)
  3. Data is uploaded to Google Cloud Storage

Figure: Usual data upload to Cloud Storage (batch)
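
The three batch steps above can be sketched roughly as follows. This is only an illustration: collect() and upload_batch() are hypothetical helper names, and blob is assumed to be a google.cloud.storage.Blob (for example storage.Client().bucket("my-bucket").blob("data.bin"), where both names are placeholders).

```python
import io
from typing import Iterable


def collect(chunks: Iterable[bytes]) -> io.BytesIO:
    """Steps 1 and 2: gather every chunk into an in-memory buffer."""
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer


def upload_batch(blob, chunks: Iterable[bytes]) -> None:
    """Step 3: upload only once all the data has been collected.

    `blob` is assumed to be a google.cloud.storage.Blob.
    """
    buffer = collect(chunks)
    # rewind=True makes the client seek back to the start of the buffer
    # before reading it.
    blob.upload_from_file(buffer, rewind=True)
```

Nothing reaches Cloud Storage until the last chunk has been written to the buffer, which is where the latency comes from.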

This is fine as long as the data is not too sensitive to latency, but when we work with streaming data that needs a low reaction time, this pattern becomes a constraint: nothing is uploaded until the whole file has been collected.

To partially overcome this issue, Google introduced the possibility of starting the upload before the file is fully created. The steps become:

  1. Chunk of data 1 is being collected
  2. Chunk of data 1 is being uploaded to Cloud Storage
  3. Chunk of data 2 is being collected
  4. Chunk of data 2 is being uploaded to Cloud Storage
  5. Chunk of data 3 is being collected
  6. Chunk of data 3 is being uploaded to Cloud Storage

Figure: Streaming data ingestion to Cloud Storage

The code for this is very simple: all the upload logic is hidden behind the upload_from_file() method (docs here). With enough incoming throughput, you will see the file created in Cloud Storage while the data is still being uploaded. If your data does not arrive continuously, you can limit the upload speed available to the system instead.
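
As a minimal sketch of this pattern, the snippet below wraps an iterator of byte chunks in a small file-like object and hands it to upload_from_file(). ChunkStream, stream_to_gcs, and the bucket and object names are hypothetical. The sketch assumes the google-cloud-storage Python client, and assumes that with no size given it performs a resumable upload that reads the stream one chunk_size at a time. It also assumes that no retry needs to rewind the stream, since ChunkStream cannot seek.

```python
from typing import Iterable, Iterator, Optional


class ChunkStream:
    """Minimal read-only file-like object over an iterator of byte chunks."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks: Iterator[bytes] = iter(chunks)
        self._buffer = bytearray()
        self._position = 0
        self._exhausted = False

    def read(self, size: Optional[int] = -1) -> bytes:
        if size is None:
            size = -1
        # Keep pulling chunks until `size` bytes are buffered or the source
        # ends. Returning fewer bytes than requested is reserved for the real
        # end of the stream, so a short read is never mistaken for the end.
        while not self._exhausted and (size < 0 or len(self._buffer) < size):
            try:
                self._buffer.extend(next(self._chunks))
            except StopIteration:
                self._exhausted = True
        if size < 0:
            size = len(self._buffer)
        data = bytes(self._buffer[:size])
        del self._buffer[:size]
        self._position += len(data)
        return data

    def tell(self) -> int:
        # Number of bytes handed out so far.
        return self._position


def stream_to_gcs(bucket_name: str, blob_name: str, chunks: Iterable[bytes]) -> None:
    """Upload chunks as they arrive (requires google-cloud-storage)."""
    from google.cloud import storage  # imported lazily; third-party package

    client = storage.Client()
    # chunk_size must be a multiple of 256 KiB; a small value sends data sooner.
    blob = client.bucket(bucket_name).blob(blob_name, chunk_size=256 * 1024)
    blob.upload_from_file(ChunkStream(chunks))
```

A call could look like stream_to_gcs("my-bucket", "events.log", produce_chunks()), where produce_chunks() is whatever generator yields your data as bytes.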

Unfortunately, there is a catch with this new pattern: the checksum.
When you upload a file that already exists in full, the Google Cloud Storage APIs compute a checksum of the uploaded file and discard it if the checksum fails. With a streaming upload, by contrast, Cloud Storage cannot apply that checksum, so you should run a manual one for consistency. Even with a manual checksum, possibly corrupted data remains accessible between the end of the upload and the manual check.
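
One possible shape for such a manual check, sketched under a few assumptions: we hash the chunks locally with MD5 as they pass through, then compare the result with the md5_hash that Cloud Storage reports for the finished object. HashingChunks and verify_or_delete are hypothetical names, and blob is assumed to be a google.cloud.storage.Blob whose md5_hash property holds the base64-encoded MD5 of the object.

```python
import base64
import hashlib
from typing import Iterable, Iterator


class HashingChunks:
    """Pass chunks through unchanged while feeding them to an MD5 digest."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks = chunks
        self._md5 = hashlib.md5()

    def __iter__(self) -> Iterator[bytes]:
        for chunk in self._chunks:
            self._md5.update(chunk)
            yield chunk

    def md5_base64(self) -> str:
        # Cloud Storage reports MD5 hashes base64-encoded, so match that form.
        return base64.b64encode(self._md5.digest()).decode("ascii")


def verify_or_delete(blob, expected_md5: str) -> bool:
    """Compare the stored object's MD5 with the local one.

    Deletes the object on mismatch. `blob` is assumed to be a
    google.cloud.storage.Blob.
    """
    blob.reload()  # refresh metadata so md5_hash reflects the stored object
    if blob.md5_hash == expected_md5:
        return True
    blob.delete()
    return False
```

The upload would consume the HashingChunks iterator instead of the raw chunks, and verify_or_delete(blob, hashing.md5_base64()) would run once the upload returns. The window described above still exists: the object is readable until the check completes.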

For supporting documentation, take a look at the Google documentation here.

If you have any questions or suggestions, please leave a comment!

Written by Stefano Passador

Degree in Computer Science - Master in Big Data Engineer - IT Enthusiast - Milan, Italy
