
Make gzipped data in FileStorage deterministic #1272

@Mr0grog

Description


For a while, I’ve been annoyed by intermittent failures in our tests around gzip compression for FileStorage::S3. Here’s a good example: https://app.circleci.com/pipelines/github/edgi-govdata-archiving/web-monitoring-db/2076/workflows/9e189bc7-d909-44bd-93a8-42df47719b2f/jobs/7501, where the output is something like:

Error:
FileStorage::S3Test#test_s3_storage_can_write_a_gzipped_stream:
WebMock::NetConnectNotAllowedError: Real HTTP connections are disabled. ...

You can stub this request with the following snippet:

stub_request(:put, "https://test-bucket.s3.us-west-2.amazonaws.com/something.txt").
  with(
    body: "\x1F\x8B\b\x00K\xC6~h\x00\x03\xF3H\xCD\xC9\xC9WH+\xCA\xCFU\b6V\x04\x00\x91-\xB5\xE6\x0E\x00\x00\x00",
    headers: { ... }).
  to_return(status: 200, body: "", headers: {})

registered request stubs:

stub_request(:put, "https://test-bucket.s3.us-west-2.amazonaws.com/something.txt").
  with(
    body: "\x1F\x8B\b\x00J\xC6~h\x00\x03\xF3H\xCD\xC9\xC9WH+\xCA\xCFU\b6V\x04\x00\x91-\xB5\xE6\x0E\x00\x00\x00",
    headers: { ... })

============================================================
    lib/file_storage/s3.rb:55:in 'FileStorage::S3#save_file'
    test/lib/file_storage/s3_test.rb:126:in 'block in <class:S3Test>'

Today I dug into it a bit and realized this happens because Ruby’s Zlib embeds the time of compression in the gzip stream: if the same text is compressed twice within the same second, the output is identical, but otherwise the bytes differ, which breaks the tests.
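The timestamp lives in the MTIME field of the gzip header (bytes 4–7, a little-endian Unix timestamp); a quick stdlib check makes the problem visible:

```ruby
require 'zlib'

data = Zlib.gzip('Hello from S3!')
# Bytes 4-7 of a gzip stream hold MTIME, a little-endian Unix timestamp
# that Zlib fills in with the current time by default.
mtime = data.byteslice(4, 4).unpack1('V')
puts Time.at(mtime)  # the moment the data was compressed
```

Note how the two stub bodies in the error output above differ only in that field (`K\xC6~h` vs. `J\xC6~h`) — a one-second difference in mtime.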

There are two ways to fix this for our tests:

  1. Change the test to match on the PUT URL only and assert that the body decompresses correctly instead of asserting that the compressed bodies are the same:

    test 's3 storage can write a gzipped file' do
      text = 'Hello from S3!'
      s3_put = stub_request(:put, 'https://test-bucket.s3.us-west-2.amazonaws.com/something.txt')
        .with(body: ActiveSupport::Gzip.compress(text), headers: { 'Content-Type' => 'text/plain', 'Content-Encoding' => 'gzip' })
        .to_return(status: 200, body: '', headers: {})

      storage = example_storage(gzip: true)
      storage.save_file('something.txt', text, content_type: 'text/plain')
      assert_requested(s3_put)
    end

    test 's3 storage can write a gzipped stream' do
      text = 'Hello from S3!'
      s3_put = stub_request(:put, 'https://test-bucket.s3.us-west-2.amazonaws.com/something.txt')
        .with(body: ActiveSupport::Gzip.compress(text), headers: { 'Content-Type' => 'text/plain', 'Content-Encoding' => 'gzip' })
        .to_return(status: 200, body: '', headers: {})

      storage = example_storage(gzip: true)
      storage.save_file('something.txt', StringIO.new(text), content_type: 'text/plain')
      assert_requested(s3_put)
    end

  2. Change how we compress to set the mtime and orig_name attributes predictably so we always get the same output for the same input bytes. Unfortunately we currently use ActiveSupport::Gzip.compress to do the compression:

    body: @gzip ? ActiveSupport::Gzip.compress(content.try(:read) || content) : content,

    …and it doesn’t let you set those properties. We’d need to write our own compression helper.
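A helper along these lines could do it — this is just a sketch (the name and keyword arguments are hypothetical, not anything the codebase has), pinning mtime and leaving orig_name unset:

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: gzip with a fixed mtime so identical input bytes
# always produce identical output bytes. orig_name is simply never set.
def deterministic_gzip(content, level: Zlib::DEFAULT_COMPRESSION)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output, level)
  gz.mtime = 1  # any fixed value works; must be set before writing data
  gz.write(content)
  gz.finish  # flush the gzip trailer without closing the StringIO
  output.string
end
```

Zlib also writes an OS byte into the header (the `\x03` visible in the stub bodies above), so the output is deterministic per platform rather than universally.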

(1) is expedient, but (2) is probably better, since it might also help us write to S3 less frequently (or avoid overwriting objects unnecessarily).
