Plain Text Keys as S3 Filename and S3 files can be grouped on Multitenancy columns #68


Open
abhayachauhan wants to merge 7 commits into master

Conversation

abhayachauhan

Some features I needed for my project - I'd love to hear your thoughts and whether you're interested in these features.

Also added tests for the plain key as filename in S3.
@abhayachauhan
Author

Also updated to be compatible with the Node 4 runtime in Lambda.

@rclark
Contributor

rclark commented Apr 20, 2016

Hi! Could you explain a little more what the use-case is for what you're calling a MultiTenancy column? I'm seeing that you're sending data to slightly different S3 locations?

As for the clear-text S3 keys, I would advise against this -- we implemented the hashed filenames as a way to add randomness to the S3 keys. Without this randomness, S3 can run into some very hard throughput limitations that can cripple the incremental backup if write loads on your DynamoDB table are above ~400 per second. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html for more information.
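
To make that concrete, here's a rough sketch (not the replicator's exact code, and the key layout is only assumed for illustration) of why hashing helps -- the MD5 of the serialized DynamoDB key becomes the S3 object key, so writes spread evenly across the keyspace instead of clustering under one prefix:

```js
// Sketch only: the prefix/table/md5 layout is an assumption for illustration.
var crypto = require('crypto');

function backupKey(prefix, tableName, dynamoKey) {
  // dynamoKey is the record's key attributes, e.g. { id: { S: 'order-1234' } }
  var id = crypto.createHash('md5')
    .update(JSON.stringify(dynamoKey))
    .digest('hex');
  return [prefix, tableName, id].join('/');
}

console.log(backupKey('backups', 'my-table', { id: { S: 'order-1234' } }));
// => backups/my-table/<32 hex chars that look random to S3's partitioning>
```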

@abhayachauhan
Author

abhayachauhan commented Apr 21, 2016

Sorry, I should have provided some context. Here's an explanation:

  1. The first feature was the MultiTenancy column.
    This allows us to "group" incremental backups under dynamic prefixes within S3. In a multitenancy scenario, this lets us separate client data and easily migrate one client's data between environments, i.e. move all of a client's data from UAT to Production or vice versa.
  2. The second feature - the clear-text S3 key - was to enable us to easily identify the row we are looking for, as opposed to generating an MD5 of the key and correlating it to an S3 key.
    The S3 throughput limitation wasn't something I was aware of, but we use GUIDs (v4 in most cases), so they are fairly randomised, which I'm expecting to have a similar impact to MD5.
    In cases where we don't have GUIDs, I'm assuming we can fall back on MD5. (A rough sketch of what both options might look like follows this list.)
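
To illustrate what I have in mind, here's a rough sketch -- the option names below are placeholders, not necessarily what this PR implements:

```js
// Hypothetical illustration only: multiTenancyColumn / plainTextKeys are
// assumed option names, not the PR's actual configuration.
var crypto = require('crypto');

function backupKey(opts, record, dynamoKey) {
  var parts = [opts.prefix, opts.tableName];

  // Feature 1: group objects under a per-tenant prefix taken from a column
  // of the record, e.g. record.tenantId = 'client-42'.
  if (opts.multiTenancyColumn) parts.push(record[opts.multiTenancyColumn]);

  // Feature 2: use the plain-text key instead of its MD5 hash, so the object
  // is human-identifiable in the S3 console.
  if (opts.plainTextKeys) {
    parts.push(JSON.stringify(dynamoKey));
  } else {
    parts.push(crypto.createHash('md5').update(JSON.stringify(dynamoKey)).digest('hex'));
  }
  return parts.join('/');
}
```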

Both of these features could be worked around, but this just makes life a little easier. I was interested to find out whether you are keen to have these merged in (this PR hasn't been reviewed; it's just worth starting the conversation).

The last update I made was to leverage the Node v4 runtime on Lambda.

@rclark
Contributor

rclark commented Apr 21, 2016

I'm hesitant about both of these scenarios because of the potential to cause S3 throttling.

  • S3 throughput is controlled by partitioning -- each partition can support only so much throughput. The mapping from keys --> partitions depends entirely on the characters in the object keys. By grouping database records under particular prefixes (the MultiTenancy column here), you're opening the door for a particularly "hot" client to produce a particularly hot S3 partition. This can lead to S3 throttling writes across your bucket while the hot partition is split into more partitions to handle the load (see the sketch after this list).
  • Clear-text keys are nice, but if you don't use GUIDs or randomly distributed IDs in DynamoDB, then you again run the same risk of hot S3 partitions. Have you looked at the CLI tools to check an incremental record on S3 vs. its state in DynamoDB, or to look up the history of S3 incremental record versions for a particular DynamoDB key? I wonder if either of these tools can help your use-case?
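
Here's an illustration of the partitioning point (not replicator code -- the layouts below are made up): keys that all start with the same tenant prefix land on the same partition, while hash-first keys spread across the keyspace.

```js
// Illustration only: S3 routes keys to partitions by their leading characters.
var crypto = require('crypto');

var tenant = 'client-42'; // imagine this tenant does most of the writes
['a', 'b', 'c'].forEach(function (id) {
  var hash = crypto.createHash('md5').update(id).digest('hex');

  // Tenant-first layout: every key for this tenant shares the same leading
  // characters, so its writes concentrate on one partition.
  console.log('tenant-first:', tenant + '/' + hash);

  // Hash-first layout: leading characters vary per record, spreading writes.
  console.log('hash-first:  ', hash + '/' + tenant);
});
```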

@abhayachauhan
Author

Hey Ryan,

I absolutely understand your concerns regarding the throttling, and that is why I've left these as options that can be opted into, as opposed to being on by default.

The idea behind the existing prefix option is similar - it can also cause throttling issues. MultiTenancyColumn (a bad name) can be viewed as a dynamic version of the prefix, based on the data.

MD5 is a great solution for the throttling problem, but only if you don't use the prefix option. The problem it creates is correlating DynamoDB keys to their S3 keys if the records have been deleted - this could be impossible if you don't know the entire key itself.

I have had a look at the tools you linked, and they can be useful in creating an MD5 of the specified key (I'm not sure if they would work for deleted records?).

We are aiming to use this as a DR solution, which enables us to recover when a developer (or a security breach) accidentally deletes/updates records (or tables). We would need to roll back to a point in time, as opposed to knowing the specific key(s) we need to restore.

Out of interest, how reliable has the replicator tool been for you in terms of incremental backups to S3?

Abhaya

@rclark
Contributor

rclark commented Apr 22, 2016

We implement versioning on the S3 bucket where incremental backups land. With this in hand, the CLI tool is capable of finding the complete history of any dynamodb record, including deleted ones.
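
To illustrate what that looks like under the hood (bucket and key below are made up; the CLI tool wraps this kind of query for you), S3 versioning lets you list every version of a record's backup object, plus the delete markers:

```js
// Sketch using the aws-sdk for Node; names are hypothetical.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

s3.listObjectVersions({
  Bucket: 'my-backup-bucket',
  Prefix: 'backups/my-table/9e107d9d372bb6826bd81d3542a419d6' // md5 of the dynamo key
}, function (err, data) {
  if (err) throw err;
  // Every stored version of the record's backup object ...
  data.Versions.forEach(function (v) {
    console.log(v.LastModified, v.VersionId, v.IsLatest);
  });
  // ... plus delete markers, which is how deleted records stay discoverable.
  data.DeleteMarkers.forEach(function (d) {
    console.log('deleted at', d.LastModified, d.VersionId);
  });
});
```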

Further, we run a separate process that routinely scans the S3 incremental backup and rolls results into a single file. We call it a "snapshot" because it roughly represents the state of the entire table at some point in time. See https://github.com/mapbox/dynamodb-replicator/blob/master/s3-snapshot.js.

These files give us the ability to roll back the entire table to a previous state, though we are more inclined to roll back individual records if needs be, using S3 versioning and history.
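
Rolling back a single record is then roughly the following (a sketch only -- it assumes the backup object's body is the item in DynamoDB JSON form, which may not be exactly what the replicator stores, and all names are made up):

```js
// Sketch: restore one record from an earlier S3 object version.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();
var dynamo = new AWS.DynamoDB();

s3.getObject({
  Bucket: 'my-backup-bucket',
  Key: 'backups/my-table/9e107d9d372bb6826bd81d3542a419d6',
  VersionId: 'an-earlier-version-id' // picked from listObjectVersions output
}, function (err, data) {
  if (err) throw err;
  // Write the earlier state back into the table to roll that record back.
  dynamo.putItem({
    TableName: 'my-table',
    Item: JSON.parse(data.Body.toString())
  }, function (err) {
    if (err) throw err;
    console.log('record restored');
  });
});
```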

> Out of interest, how reliable has the replicator tool been for you in terms of incremental backups to S3?

👌 we love it. We've yet to encounter any evidence of data that was dropped from the dynamodb stream --> lambda --> s3 pipeline.

…eadme. Making changes for x-account permissions

Allowing the app to be packaged with custom env config

Renaming config files to be more forms specific

Naming packages does nothing for lambda

Updating readme to reflect additional config + scripts

improvements to powershell script

Add a canned ACL for the S3 upload so that we don't have permission issues cross-account.

Remove baked config. Add ACL permission. Alter package script