
Make gzipped data in FileStorage deterministic #1272

@Mr0grog

Description


For a while, I’ve been annoyed by intermittent failures in our tests around gzip compression for FileStorage::S3. Here’s a good example: https://app.circleci.com/pipelines/github/edgi-govdata-archiving/web-monitoring-db/2076/workflows/9e189bc7-d909-44bd-93a8-42df47719b2f/jobs/7501, where the output is something like:

Error:
FileStorage::S3Test#test_s3_storage_can_write_a_gzipped_stream:
WebMock::NetConnectNotAllowedError: Real HTTP connections are disabled. ...

You can stub this request with the following snippet:

stub_request(:put, "https://test-bucket.s3.us-west-2.amazonaws.com/something.txt").
  with(
    body: "\x1F\x8B\b\x00K\xC6~h\x00\x03\xF3H\xCD\xC9\xC9WH+\xCA\xCFU\b6V\x04\x00\x91-\xB5\xE6\x0E\x00\x00\x00",
    headers: { ... }).
  to_return(status: 200, body: "", headers: {})

registered request stubs:

stub_request(:put, "https://test-bucket.s3.us-west-2.amazonaws.com/something.txt").
  with(
    body: "\x1F\x8B\b\x00J\xC6~h\x00\x03\xF3H\xCD\xC9\xC9WH+\xCA\xCFU\b6V\x04\x00\x91-\xB5\xE6\x0E\x00\x00\x00",
    headers: { ... })

============================================================
    lib/file_storage/s3.rb:55:in 'FileStorage::S3#save_file'
    test/lib/file_storage/s3_test.rb:126:in 'block in <class:S3Test>'

Today I dug into it a bit and realized this happens because Ruby’s Zlib embeds the time of compression in the gzip stream: if the same text is compressed twice within the same second, the output is identical, but otherwise the bytes differ, which breaks the tests.
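The timestamp lives in the MTIME field of the gzip header (bytes 4–7, a little-endian Unix timestamp); a quick stdlib check makes the problem visible:

```ruby
require 'zlib'

data = Zlib.gzip('Hello from S3!')
# Bytes 4-7 of a gzip stream hold MTIME, a little-endian Unix timestamp
# that Zlib fills in with the current time by default.
mtime = data.byteslice(4, 4).unpack1('V')
puts Time.at(mtime)  # the moment the data was compressed
```

Note how the two stub bodies in the error output above differ only in that field (`K\xC6~h` vs. `J\xC6~h`) — a one-second difference in mtime.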

There are two ways to fix this for our tests:

  1. Change the test to match on the PUT URL only and assert that the body decompresses correctly instead of asserting that the compressed bodies are the same:

    test 's3 storage can write a gzipped file' do
      text = 'Hello from S3!'
      s3_put = stub_request(:put, 'https://test-bucket.s3.us-west-2.amazonaws.com/something.txt')
        .with(body: ActiveSupport::Gzip.compress(text), headers: { 'Content-Type' => 'text/plain', 'Content-Encoding' => 'gzip' })
        .to_return(status: 200, body: '', headers: {})

      storage = example_storage(gzip: true)
      storage.save_file('something.txt', text, content_type: 'text/plain')
      assert_requested(s3_put)
    end

    test 's3 storage can write a gzipped stream' do
      text = 'Hello from S3!'
      s3_put = stub_request(:put, 'https://test-bucket.s3.us-west-2.amazonaws.com/something.txt')
        .with(body: ActiveSupport::Gzip.compress(text), headers: { 'Content-Type' => 'text/plain', 'Content-Encoding' => 'gzip' })
        .to_return(status: 200, body: '', headers: {})

      storage = example_storage(gzip: true)
      storage.save_file('something.txt', StringIO.new(text), content_type: 'text/plain')
      assert_requested(s3_put)
    end

  2. Change how we compress to set the mtime and orig_name attributes predictably so we always get the same output for the same input bytes. Unfortunately we currently use ActiveSupport::Gzip.compress to do the compression:

    body: @gzip ? ActiveSupport::Gzip.compress(content.try(:read) || content) : content,

    …and it doesn’t let you set those properties. We’d need to write our own compression helper.
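A helper along these lines could do it — this is just a sketch (the name and keyword arguments are hypothetical, not anything the codebase has), pinning mtime and leaving orig_name unset:

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: gzip with a fixed mtime so identical input bytes
# always produce identical output bytes. orig_name is simply never set.
def deterministic_gzip(content, level: Zlib::DEFAULT_COMPRESSION)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output, level)
  gz.mtime = 1  # any fixed value works; must be set before writing data
  gz.write(content)
  gz.finish  # flush the gzip trailer without closing the StringIO
  output.string
end
```

Zlib also writes an OS byte into the header (the `\x03` visible in the stub bodies above), so the output is deterministic per platform rather than universally.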

(1) is expedient, but (2) is probably better, since it might also help us write to S3 less frequently (or avoid overwriting objects unnecessarily).
