Output filer mangles bucket name when using s3:// schema and bucket name contains the characters "s3". #45

@JossWhittle

Description

Assume that TESK is deployed with a config file that sets the output endpoint to some S3 instance in http or https form:

[default]
endpoint_url=http://some.endpoint.com

Then, in the job JSON, the "url" for outputs is set to the following:

# This works! 
"url": "s3://output",

The s3 schema means "output" gets treated as the bucket name.
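As a quick sketch of how that split happens (this is the standard library's behaviour, not necessarily the filer's exact code):

```python
from urllib.parse import urlparse

# With the s3:// scheme, the component right after '//' becomes the netloc,
# which is what ends up being treated as the bucket name.
url = urlparse('s3://output')
print(url.scheme)  # s3
print(url.netloc)  # output
```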

# These all fail!  
"url": "s3://outputs3",
"url": "s3://s3output",

# Here's a less contrived example showing how this can happen even when you don't intentionally use "s3" to mean "s3"
"url": "s3://shoulders3486output",

The s3 schema is detected, but because the bucket name also contains "s3", it falsely triggers this regex:

match = re.search('^([^.]+).s3', self.netloc)

This mangles the bucket name, leading to a bucket-not-found error.
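The mangling is easy to reproduce in isolation. The pattern below is copied from the filer; everything else is an illustrative sketch. Note that the `.` in the pattern is unescaped, so it matches any character rather than a literal dot, which widens the false positives:

```python
import re

# Pattern copied from the filer; the unescaped '.' matches ANY character.
pattern = re.compile(r'^([^.]+).s3')

for netloc in ['output', 'outputs3', 'shoulders3486output']:
    match = pattern.search(netloc)
    if match:
        # The "bucket name" is silently truncated at the accidental 's3'.
        print(netloc, '->', match.group(1))
    else:
        print(netloc, '-> no match')
```

Running this shows `outputs3` truncated to `outpu` and `shoulders3486output` truncated to `shoulde`, while plain `output` is left alone.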

But we can trick it...

# This works! 
"url": "http://s3.foo.bar.baz/shoulders3486output",

HTTP is detected as the schema, but the netloc part of the URL contains "s3", so the transfer is treated as S3 due to this logic:

if 's3' in netloc:
    return S3Transput

The bucket name is now part of the URL path, not the netloc, so it doesn't get mangled.

The netloc (s3.foo.bar.baz) is never actually used for anything other than detecting whether the transfer is S3 or HTTP.
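To illustrate the trick (a hedged sketch; only the quoted routing check above is from the filer):

```python
from urllib.parse import urlparse

url = urlparse('http://s3.foo.bar.baz/shoulders3486output')

# 's3' appears in the netloc, so the routing check picks the S3 transput...
print(url.netloc)  # s3.foo.bar.baz

# ...but the bucket name now lives in the path, safely out of reach of
# the netloc regex, so it is not mangled.
print(url.path)    # /shoulders3486output
```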
