S3 Cost Saving: Archiving Compressed S3 Data Into Glacier

Totalcloud.io
5 min read · Jun 15, 2020

Amazon’s S3 has been a popular storage service for many years. One of its most sought-after features is its storage tiers, which let you move data between different levels of cost and access. Different tiers come with different benefits. We’ll be looking at two of them: S3 Standard and S3 Glacier.

Moving data from S3 Standard to Glacier is common practice. For one, Glacier is among the cheapest storage tiers available, and two, it’s purpose-built for archiving.

Glacier is used to store away data that you aren’t in any immediate need of. So let’s say you’ve got older records of customers or patients and you want to store them away for an indefinite amount of time. S3 Glacier is the perfect service for such scenarios.

Glacier charges you in two ways: one for the data transfer, and two for the storage itself, which is billed monthly. One of the ways people try to further reduce archiving costs is by compressing the data, which means less data is transferred and stored. The only problem is that compressing your data is a bit of a messy process. Glacier doesn’t have any built-in compression feature. The common solution is to use the S3DataNode object in AWS Data Pipeline and its GZIP compression option to do the task. This is the messy part: you have to take your data out of S3, load it into a pipeline, run the compression, and push the result to Glacier. All manually; it’s quite a hassle.
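For reference, the piece that actually does the compressing is the S3DataNode’s compression setting. Below is a minimal sketch of that one node, expressed in the key/value field format the Data Pipeline API expects; the bucket path is a placeholder, and Step 5 further down shows how this slots into a full pipeline definition.

```python
# Sketch of a single Data Pipeline object: an S3DataNode whose output is
# gzip-compressed. Placeholder bucket path, not a value from the article.
compressed_output_node = {
    "id": "CompressedOutput",
    "name": "CompressedOutput",
    "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://my-archive-staging-bucket/compressed/"},
        # This is the field that tells Data Pipeline to gzip the data it writes here
        {"key": "compression", "stringValue": "gzip"},
    ],
}
```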

We built an automated, no-code workflow that takes the same process and turns it into one seamless flow of events, running each of these tasks from the same place. With this workflow, compressing your data becomes the natural way to approach archiving, and you could potentially cut your costs by up to 90%. You only need one workflow with 8 nodes to make this use case a reality. No coding, no configuring on the AWS Console, nothing else.

Why and how archiving can save costs

Archiving has been a familiar concept for as long as data has been stored. The idea is to keep data in a way that isn’t meant to be accessed frequently, originally because of the sheer number of records being archived. In Amazon’s case, archiving is a cheaper alternative to regular storage simply because of the trade-off it puts forward: lower cost in exchange for slower, less frequent access.

This trade-off gives you more options for how your data is organized and how much you can reduce expenses. Your job is to pick the data for archival; this usually means personal records and older data sets. You could also analyze how actively your data is accessed to figure out which data sets are worth shipping off to Glacier.

So the very act of archiving is a first layer of cost-saving for your storage. However, we propose an additional layer: compression of your data. Compression reduces your costs after archiving. AWS charges a fixed rate for storage, calculated on how much data is stored, and compression shrinks the number that goes into that calculation. There is, yet again, the same trade-off as before: since compression takes some time, decompression and retrieval will add to the wait when you need the data back. But hey, if you’re gonna get wet, might as well go swimming.
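To make the two layers concrete, here is a back-of-the-envelope sketch. The prices and the compression ratio below are assumptions for illustration only; check current AWS pricing for your region before relying on the numbers.

```python
# Illustrative-only arithmetic for the two savings layers (archiving + compression).
# All prices and the compression ratio are assumed example values.
data_gb = 1000                   # 1 TB of archivable data
s3_standard_per_gb = 0.023       # assumed S3 Standard price, $/GB-month
glacier_per_gb = 0.004           # assumed S3 Glacier price, $/GB-month
compression_ratio = 0.4          # assumed: gzip shrinks the data to 40% of its size

cost_standard = data_gb * s3_standard_per_gb                            # ~$23/month left in S3 Standard
cost_glacier = data_gb * glacier_per_gb                                 # ~$4/month after archiving alone
cost_glacier_compressed = data_gb * compression_ratio * glacier_per_gb  # ~$1.60/month after compressing too

print(cost_standard, cost_glacier, cost_glacier_compressed)
```

Under these assumed numbers, archiving plus compression brings the monthly storage bill from roughly $23 down to roughly $1.60, which is where a figure in the 90% range comes from.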

How our workflow compresses data

Data compression is done by loading the S3 data into a different bucket and into the data pipeline. The workflow uses a bit of custom code for this specific process (since we’ve already created it, you can simply adopt it as a template). We also configure the pipeline from our workflow to enable the compression, and then you just wait a while for it to happen. The process itself is no different from ordinary GZIP compression; we’re just enabling it on a cloud service.

The process is explained in detail in the next section. Compressing the data reduces both the overall Glacier storage cost and the data transfer cost.

Process

I’ll expand on the workflow and what each node does. For this process, we only employ two AWS services: S3 and Data Pipeline.

Step 1: Triggering the workflow

The trigger node determines what action activates the workflow. In this case, it’s a trigger from an external application; this could be a request from a web page, an app, and so on.
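As a sketch, assuming the workflow exposes a webhook-style trigger, the external application might fire it with a plain HTTP request. The URL and payload below are hypothetical placeholders, not TotalCloud’s actual API.

```python
# Hypothetical external trigger: POST to the workflow's webhook endpoint.
import requests

resp = requests.post(
    "https://example.com/hooks/s3-glacier-archive",  # hypothetical webhook URL
    json={"sourceBucket": "my-data-bucket", "targetBucket": "my-archive-staging-bucket"},
)
resp.raise_for_status()
```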

Step 2: Custom Code to Collate the S3 Data

This node contains custom code that collects the S3 data from your bucket and prepares it to be redirected: sourceBucket is where the data is taken from, and targetBucket is where the data will be moved to.
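As a rough equivalent of what this step does, here is a boto3 sketch that copies every object from sourceBucket into targetBucket for staging. This is not the workflow’s actual custom code, and the bucket names are placeholders.

```python
# Sketch: stage S3 objects by copying them from the source bucket to the target bucket.
import boto3

s3 = boto3.client("s3")
source_bucket = "my-data-bucket"             # placeholder
target_bucket = "my-archive-staging-bucket"  # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=target_bucket,
            Key=obj["Key"],
            CopySource={"Bucket": source_bucket, "Key": obj["Key"]},
        )
```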

Step 3: Create DataPipeline

This action node creates the data pipeline where the S3 data will be compressed.
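Under the hood this amounts to a single Data Pipeline API call. A boto3 sketch, with a placeholder name and uniqueId:

```python
# Sketch: create an empty Data Pipeline and keep its ID for the next steps.
import boto3

dp = boto3.client("datapipeline")
pipeline = dp.create_pipeline(
    name="s3-gzip-archive-pipeline",           # placeholder name
    uniqueId="s3-gzip-archive-pipeline-001",   # idempotency token
)
pipeline_id = pipeline["pipelineId"]
```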

Step 4: Custom Code to Push the S3 Data Into the Pipeline

Step 5: Pipeline Definition

This node configures the compression of the S3 data that is moved into the pipeline and ensures its transfer to S3 Glacier.
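This is where the GZIP option from earlier comes in. Below is a minimal sketch of what such a definition could look like with boto3, assuming a CopyActivity between two S3DataNodes with the output node’s compression field set to gzip. Bucket paths, role names, and IDs are placeholders, not the workflow’s actual values. Note that the hand-off to Glacier itself is commonly handled on the S3 side, for example with a lifecycle rule on the target bucket that transitions the compressed objects to the Glacier storage class.

```python
# Sketch: upload a pipeline definition that copies staged data into a
# gzip-compressed output location. Placeholder buckets, roles, and IDs.
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE123"  # placeholder: the ID returned in Step 3

objects = [
    {   # Pipeline-wide defaults: on-demand run, IAM roles, log location
        "id": "Default", "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-archive-staging-bucket/logs/"},
        ],
    },
    {   # Temporary EC2 instance that runs the copy/compress work
        "id": "ArchiveResource", "name": "ArchiveResource",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "terminateAfter", "stringValue": "30 Minutes"},
        ],
    },
    {   # Input: the staged, uncompressed data from Step 2
        "id": "InputData", "name": "InputData",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-archive-staging-bucket/raw/"},
        ],
    },
    {   # Output: the same data, written back gzip-compressed
        "id": "CompressedOutput", "name": "CompressedOutput",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-archive-staging-bucket/compressed/"},
            {"key": "compression", "stringValue": "gzip"},
        ],
    },
    {   # The activity that ties input and output together
        "id": "CompressCopy", "name": "CompressCopy",
        "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "InputData"},
            {"key": "output", "refValue": "CompressedOutput"},
            {"key": "runsOn", "refValue": "ArchiveResource"},
        ],
    },
]

resp = dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
if resp["errored"]:
    print(resp["validationErrors"])
```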

Step 6: Pipeline Activation
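Activation is again a single API call. A boto3 sketch, assuming pipeline_id is the ID returned in Step 3:

```python
# Sketch: activate the pipeline so the copy/compress run starts.
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE123"  # placeholder
dp.activate_pipeline(pipelineId=pipeline_id)
```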

Step 7: Delay

A 600-second delay is set to allow the data transfer to happen before the next node is activated.
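If you were scripting the same thing yourself, the simplest equivalent is a fixed wait, optionally followed by a status check on the pipeline; the pipeline ID below is a placeholder.

```python
# Sketch: wait out the 600-second delay, then look at the pipeline's reported state.
import time
import boto3

time.sleep(600)  # mirrors the workflow's 600-second delay node

dp = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE123"  # placeholder
desc = dp.describe_pipelines(pipelineIds=[pipeline_id])
print(desc["pipelineDescriptionList"][0]["fields"])  # includes the pipeline's current state
```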

Step 8: Delete DataPipeline

This action node deletes the data pipeline after the compression and archiving are successfully done.
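The clean-up is a single API call as well. A boto3 sketch with a placeholder pipeline ID:

```python
# Sketch: remove the pipeline once the transfer has finished.
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-EXAMPLE123"  # placeholder: the pipeline created earlier
dp.delete_pipeline(pipelineId=pipeline_id)
```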

Conclusion

We have many other use cases similar to this one that we solve with our workflows. Our goal with custom workflows is to give you the freedom to find your own solutions to the many complications AWS puts forth. If you have a particular use case you want to work on, you can sign up with us and try your hand at it.


Totalcloud.io

TotalCloud helps cloud engineers embrace no-code AWS automation. We enable engineers to go script-less, saving more than 95% of engineering time.