Plain Text Keys as S3 Filename and S3 files can be grouped on Multitenancy columns #68


Open
abhayachauhan wants to merge 7 commits into master

Conversation

abhayachauhan

Some features I needed for my project - I'd love to hear your thoughts and whether you're interested in these features.

Also added tests for the plain key as filename in S3.
@abhayachauhan
Author

Also updated to be compatible with the Node 4 runtime in Lambda.

@rclark
Contributor

rclark commented Apr 20, 2016

Hi! Could you explain a little more what the use-case is for what you're calling a MultiTenancy column? I'm seeing that you're sending data to slightly different S3 locations?

As for the clear-text S3 keys, I would advise against this -- we implemented the hashed filenames as a way to add randomness to the S3 keys. Without this randomness, S3 can run into some very hard throughput limitations that can cripple the incremental backup if write loads on your DynamoDB table are above ~400 per second. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
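
To make that concrete, here's a rough sketch (not the replicator's exact code, and the key layout is only assumed for illustration) of why hashing helps -- the MD5 of the serialized DynamoDB key becomes the S3 object key, so writes spread evenly across the keyspace instead of clustering under one prefix:

```js
// Sketch only: the prefix/table/md5 layout is an assumption for illustration.
var crypto = require('crypto');

function backupKey(prefix, tableName, dynamoKey) {
  // dynamoKey is the record's key attributes, e.g. { id: { S: 'order-1234' } }
  var id = crypto.createHash('md5')
    .update(JSON.stringify(dynamoKey))
    .digest('hex');
  return [prefix, tableName, id].join('/');
}

console.log(backupKey('backups', 'my-table', { id: { S: 'order-1234' } }));
// => backups/my-table/<32 hex chars that look random to S3's partitioning>
```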

@abhayachauhan
Author

abhayachauhan commented Apr 21, 2016

Sorry, I should have provided some context. Here's an explanation:

  1. The first feature was the MultiTenancy column.
    This allows us to "group" incremental backups under dynamic prefixes within S3. In a multitenancy scenario, this lets us separate client data and easily migrate one client's data between environments, i.e. move all of a client's data from UAT to Production or vice versa.
  2. The second feature - the clear-text S3 key - was to enable us to easily identify the row we are looking for, as opposed to generating an MD5 of the key and correlating it to an S3 key.
    The S3 throughput limitation wasn't something I was aware of, but we use GUIDs (v4 in most cases), so they are fairly randomised, which I'm expecting to have a similar impact to MD5.
    In cases where we don't have GUIDs, I'm assuming we can fall back on MD5. (A rough sketch of what both options might look like follows this list.)
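
To illustrate what I have in mind, here's a rough sketch -- the option names below are placeholders, not necessarily what this PR implements:

```js
// Hypothetical illustration only: multiTenancyColumn / plainTextKeys are
// assumed option names, not the PR's actual configuration.
var crypto = require('crypto');

function backupKey(opts, record, dynamoKey) {
  var parts = [opts.prefix, opts.tableName];

  // Feature 1: group objects under a per-tenant prefix taken from a column
  // of the record, e.g. record.tenantId = 'client-42'.
  if (opts.multiTenancyColumn) parts.push(record[opts.multiTenancyColumn]);

  // Feature 2: use the plain-text key instead of its MD5 hash, so the object
  // is human-identifiable in the S3 console.
  if (opts.plainTextKeys) {
    parts.push(JSON.stringify(dynamoKey));
  } else {
    parts.push(crypto.createHash('md5').update(JSON.stringify(dynamoKey)).digest('hex'));
  }
  return parts.join('/');
}
```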

Both of these features could be worked around, but this just makes life a little easier. I was interested to find out whether you are keen to have these merged in (this PR hasn't been reviewed; it's just worth starting the conversation).

The last update I made was to leverage the Node v4 runtime on Lambda.

@rclark
Contributor

rclark commented Apr 21, 2016

I'm hesitant about both of these scenarios because of the potential to cause S3 throttling.

  • S3 throughput is controlled by partitioning -- each partition can support only so much throughput. The mapping from keys --> partitions depends entirely on the characters in the object keys. By grouping database records under particular prefixes (the MultiTenancy column here), you're opening the door for a particularly "hot" client to produce a particularly hot S3 partition. This can lead to S3 throttling writes across your bucket while the hot partition is split into more partitions to handle the load (see the sketch after this list).
  • Clear-text keys are nice, but if you don't use GUIDs or randomly distributed IDs in DynamoDB, then you again run the same risk of hot S3 partitions. Have you looked at the CLI tools to check an incremental record on S3 vs. its state in DynamoDB, or to look up the history of S3 incremental record versions for a particular DynamoDB key? I wonder if either of these tools can help your use-case?
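
Here's an illustration of the partitioning point (not replicator code -- the layouts below are made up): keys that all start with the same tenant prefix land on the same partition, while hash-first keys spread across the keyspace.

```js
// Illustration only: S3 routes keys to partitions by their leading characters.
var crypto = require('crypto');

var tenant = 'client-42'; // imagine this tenant does most of the writes
['a', 'b', 'c'].forEach(function (id) {
  var hash = crypto.createHash('md5').update(id).digest('hex');

  // Tenant-first layout: every key for this tenant shares the same leading
  // characters, so its writes concentrate on one partition.
  console.log('tenant-first:', tenant + '/' + hash);

  // Hash-first layout: leading characters vary per record, spreading writes.
  console.log('hash-first:  ', hash + '/' + tenant);
});
```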

@abhayachauhan
Author

Hey Ryan,

I absolutely understand your concerns regarding the throttling, and that is why I've left these as options that can be opted into, as opposed to being on by default.

The idea behind the existing prefix option is similar - it can also cause throttling issues. MultiTenancyColumn (a bad name) can be viewed as a dynamic version of the prefix, based on the data.

MD5 is a great solution for the throttling problem, but only if you don't use the prefix option. The problem it creates is correlating DynamoDB keys to their S3 keys if the records have been deleted - this could be impossible if you don't know the entire key itself.

I have had a look at the tools you linked, and they can be useful in creating an MD5 of the specified key (I'm not sure if they would work for deleted records?).

We are aiming to use this as a DR solution, which enables us to recover when a developer (or a security breach) accidentally deletes/updates records (or tables). We would need to roll back to a point in time, as opposed to knowing the specific key(s) we need to restore.

Out of interest, how reliable has the replicator tool been for you in terms of incremental backups to S3?

Abhaya

@rclark
Contributor

rclark commented Apr 22, 2016

We implement versioning on the S3 bucket where incremental backups land. With this in hand, the CLI tool is capable of finding the complete history of any dynamodb record, including deleted ones.
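
To illustrate what that looks like under the hood (bucket and key below are made up; the CLI tool wraps this kind of query for you), S3 versioning lets you list every version of a record's backup object, plus the delete markers:

```js
// Sketch using the aws-sdk for Node; names are hypothetical.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

s3.listObjectVersions({
  Bucket: 'my-backup-bucket',
  Prefix: 'backups/my-table/9e107d9d372bb6826bd81d3542a419d6' // md5 of the dynamo key
}, function (err, data) {
  if (err) throw err;
  // Every stored version of the record's backup object ...
  data.Versions.forEach(function (v) {
    console.log(v.LastModified, v.VersionId, v.IsLatest);
  });
  // ... plus delete markers, which is how deleted records stay discoverable.
  data.DeleteMarkers.forEach(function (d) {
    console.log('deleted at', d.LastModified, d.VersionId);
  });
});
```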

Further, we run a separate process that routinely scans the S3 incremental backup and rolls results into a single file. We call it a "snapshot" because it roughly represents the state of the entire table at some point in time. See https://github.com/mapbox/dynamodb-replicator/blob/master/s3-snapshot.js.

These files give us the ability to roll back the entire table to a previous state, though we are more inclined to roll back individual records if needs be, using S3 versioning and history.
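
Rolling back a single record is then roughly the following (a sketch only -- it assumes the backup object's body is the item in DynamoDB JSON form, which may not be exactly what the replicator stores, and all names are made up):

```js
// Sketch: restore one record from an earlier S3 object version.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();
var dynamo = new AWS.DynamoDB();

s3.getObject({
  Bucket: 'my-backup-bucket',
  Key: 'backups/my-table/9e107d9d372bb6826bd81d3542a419d6',
  VersionId: 'an-earlier-version-id' // picked from listObjectVersions output
}, function (err, data) {
  if (err) throw err;
  // Write the earlier state back into the table to roll that record back.
  dynamo.putItem({
    TableName: 'my-table',
    Item: JSON.parse(data.Body.toString())
  }, function (err) {
    if (err) throw err;
    console.log('record restored');
  });
});
```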

> Out of interest, how reliable has the replicator tool been for you in terms of incremental backups to S3?

👌 we love it. We've yet to encounter any evidence of data that was dropped from the dynamodb stream --> lambda --> s3 pipeline.

…eadme. Making changes for x-account permissions

Allowing the app to be packaged with custom env config

Renaming config files to be more forms specific

Naming packages does nothing for lambda

Updating readme to reflect additional config + scripts

improvements to powershell script

Add a canned ACL for the S3 upload so that we don't have permission issues cross-account.

Remove baked config. Add ACL permission. Alter package script