145 changes: 145 additions & 0 deletions docs/safe-haven-services/s3-service.md
Contributor

Can anyone do pip install?

Contributor

Minor points you may not care about just now:

  • Standardise "Important" signposting; they are all the same (unless one wants to use an md construct), so:

    • Under "Access arrangements", add - after it, like the first instance
    • Under "How to use" replace "Note!" with "Important -"
    • Under R usage replace "Note!" with "Important -" (needs to be done three times)
  • Remove "n" from "recover then memory"

Yes, anyone can run pip install as part of access to PyPI via the web proxy.

# S3 Service

There is no general-purpose S3 service within Safe Haven Services, unlike the [EIDF S3 service](../services/s3/).

However, there is an S3 service in SHS with the following caveats:
* it is only available to the Scottish National Safe Haven; other tenants by arrangement
* it is a read-only service, as a way of providing access to large collections of files
* it is not a storage solution for users wanting to create their own files

## Access arrangements

Access to buckets is via keys (Access Key and Secret Access Key) provided to the user by the Research Coordinator.

Suggested rewording: Some Safe Havens may provide you with access to data via S3. If this applies to your project, your Research Coordinator will provide you with an access key. This documentation will guide you through how to get access to your data from a terminal as well as programmatically via R and Python.

!!! important Files in S3 buckets are read-only. If you need to transform or make changes to any files, you will need to download them to your project space. If you download files, please be mindful of disk space by only downloading what is necessary and deleting them as soon as no longer needed.


## How to use the service
@2bPro (Aug 6, 2025)

## Environment setup


To access files you need the following information:

* Region is "us-east-1"
* Endpoint URL is "http://nsh-fs02:7070"
* Access key
* Secret access key
* The web proxy variables must be empty
@2bPro (Aug 6, 2025)

Your RC will provide you with a bucket name, access key ID (bucket name), and secret access key.


For example, when using the command line or a script:

|Environment variable |Value |
|-----------------------|----------------------|
| AWS_DEFAULT_REGION | us-east-1 |
| AWS_ENDPOINT_URL | http://nsh-fs02:7070 |
| AWS_ACCESS_KEY_ID | as provided by RC |
| AWS_SECRET_ACCESS_KEY | as provided by RC |
| http_proxy | no value |

```
export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL=http://nsh-fs02:7070
export AWS_ACCESS_KEY_ID=put_your_key_here
export AWS_SECRET_ACCESS_KEY=put_your_secret_here
export http_proxy=
```

## Use from the command line
@2bPro (Aug 6, 2025)

## Accessing data
### Command Line


First, install the `aws` command:
```
pip install awscli
```

To check it is installed, run `aws`; it will show some help text. If you get `command not found`, run it as `~/.local/bin/aws` instead.

The general syntax is `aws s3 cp s3://bucket/filename localfilename`, for example:
```
aws s3 cp s3://extraction_5_CT/123/456/789.dcm copy_of_789.dcm
```

If this command fails, it might be due to a proxy configuration in your environment. To temporarily turn off the proxy in the current terminal, run this first:
```
export http_proxy=
```

Why is this the case? This appears to be a workaround that could become confusing if the user then attempts to download other packages in the same terminal. Users will likely run this without understanding what it does or that they would have to run other installation commands in a separate terminal.

Author

I don't know if it's technically possible for the web proxy to be configured to pass traffic on to nsh-fs02. Maybe a question for Barry or similar. If so, it would reduce confusion, but on the other hand I don't think it's a good idea to put the web proxy between the client and the S3 server: all it does is add unnecessary load on the proxy and slow everything down, and there's no benefit from authentication either.

@2bPro (Aug 8, 2025)

Had a chat with Barry about this; he recommends setting the NO_PROXY variable in bashrc instead, so this would look like `export NO_PROXY="$NO_PROXY,nsh-fs02:7070"` (I tested this and can confirm it works). I also asked Susan, and users are never told to go and edit their bashrcs, which we may want to avoid. The other option is for systems to add this to their bashrcs (either for all users or on demand).

Author

Last time I tested this, the NO_PROXY variable was ignored!

Member

NO_PROXY will work here, if properly configured:

Routes to the proxy:

```
rmacleod@nsh-rc-desktop01:~$ NO_PROXY=''; curl -LI nsh-fs02:7070
HTTP/1.1 503 Service Unavailable
Server: squid
...
```

Routes directly to nsh-fs02:

```
rmacleod@nsh-rc-desktop01:~$ NO_PROXY=nsh-fs02; curl -LI nsh-fs02:7070
HTTP/1.1 400 Bad Request
Server: VERSITYGW
...
```

A more generalised solution would be to set `NO_PROXY=localhost,127.0.0.1,.nsh.loc` and then to fully-qualify the server as `nsh-fs02.nsh.loc`.

Author

NO_PROXY doesn't work everywhere: R ignores it and Python ignores it, although awscli obeys it. Python does obey the lowercase no_proxy; R ignores that too.


At this point it will probably complain that it can't locate your credentials. In fact it requires a bit more information in order to find the bucket: the region, endpoint, access key and secret key:
```
export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL=http://nsh-fs02:7070
export AWS_ACCESS_KEY_ID=put_your_key_here
export AWS_SECRET_ACCESS_KEY=put_your_secret_here
export http_proxy=
aws s3 cp s3://extraction_5_CT/123/456/789.dcm copy_of_789.dcm
```

It should go without saying that the access key details are confidential and must never be shared or allowed to be seen by others. Note that all file accesses are logged.

Avoid repeating this by placing it in the environment setup section. Maybe show this as an example env file and source command.


## Performance tips

These apply to programmatic methods as well, so why not put them after access methods?


* consume the file directly in memory if possible; don't save it to disk. Saving to disk will waste disk space and make your processing take 3 times longer. See the example code below.

Process S3 files directly in memory wherever possible. Saving files to disk is not recommended as this will harm performance (expected to be up to 3 times slower). If this cannot be avoided, please delete files when no longer needed to recover disk space.

* If you need to save into a file temporarily (e.g. whilst converting to NIFTI) then save into a RAM disk in `/run/user/$(id -u)/`, but delete it straight after use to recover the memory.
* If it's too large for RAM then save into a file on the system disk, not in your home directory, in `/tmp/$(id -u)/`, but check the disk has space first (using `df -h /tmp/`) and delete it straight after use to recover the disk space.
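
As a rough sketch of the first two tips using `boto3` (the bucket name and object path here are placeholders, and it assumes the `AWS_*` environment variables from the setup section are already exported and that the installed boto3 is recent enough to honour `AWS_ENDPOINT_URL`; fuller examples follow in the Python and R sections below):

```py
import io
import os

import boto3

resource = boto3.resource("s3")
obj = resource.Bucket("my_bucket").Object("studyid/seriesid/instanceid-an.dcm")

# Preferred: read the object straight into memory, without writing to disk
data = io.BytesIO(obj.get()["Body"].read())

# If a temporary file is unavoidable, put it on the per-user RAM disk and
# delete it straight after use to recover the memory
tmp_path = os.path.join(f"/run/user/{os.getuid()}", "temporary.dcm")
obj.download_file(tmp_path)
try:
    pass  # process or convert the temporary file here
finally:
    os.remove(tmp_path)
```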

## Python Usage

### Setup

```console
python3 -m virtualenv venv
. venv/bin/activate
pip install boto3
```

### Download an object
@2bPro (Aug 6, 2025)

"Download a file" or introduce object


```py
import boto3
resource = boto3.resource("s3")
bucket = resource.Bucket("epcc-test")
bucket.download_file("test.txt", "copy_of_test.txt")
```

Show how the env vars would be used to configure the connection, like you have in R.
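
A minimal sketch of how this could look, reusing the bucket and file names from the example above: boto3 picks up `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` from the environment on its own, and recent versions also honour `AWS_ENDPOINT_URL`; the same values can also be passed explicitly:

```py
import os

import boto3

# Explicit configuration using the same values as the environment variables
resource = boto3.resource(
    "s3",
    endpoint_url=os.environ.get("AWS_ENDPOINT_URL", "http://nsh-fs02:7070"),
    region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = resource.Bucket("epcc-test")
bucket.download_file("test.txt", "copy_of_test.txt")
```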

### Load object into pydicom dataset

Generalise this to "load a file for further processing" or similar, and give as an example passing it to pydicom.


```console
pip install pydicom
```

```py
import boto3
import io
import pydicom

resource = boto3.resource("s3")
bucket = resource.Bucket("Request4_CT")
obj = bucket.Object("1.2.840.113619.2.411.3.4077533701.216.1476084945.95/1.2.840.113619.2.411.3.4077533701.216.1476084945.102/CT.1.2.840.113619.2.411.3.4077533701.216.1476084945.104.99-an.dcm")
dcm_bytes = io.BytesIO(obj.get()["Body"].read())
ds = pydicom.dcmread(dcm_bytes)

print(ds["StudyInstanceUID"])
# (0020,000D) Study Instance UID UI: 1.2.840.113619.2.411.3.4077533701.216.1476084945.95
```

## R Usage

There are three different packages for R: you can install `s3`, `aws.s3`, or `paws`.

Where they use the environment variables shown above (aws.s3 seems to use slightly different ones, or ignore them), you may wish to add the details to your `.Renviron` file, e.g.
```
AWS_ACCESS_KEY_ID=<my access key id>
AWS_SECRET_ACCESS_KEY=<my secret key>
AWS_ENDPOINT_URL=http://nsh-fs02:7070
AWS_DEFAULT_REGION=us-east-1
```

The `paws` package is very comprehensive but slow to install and difficult to use.

The simpler `aws.s3` package can be used like this:
```
library(aws.s3)
my_bucket <- "the bucket name"
my_access_key <- "an access key"
my_secret_key <- "abigsecretkey"
my_region <- "us-east-1"
my_endpoint_host <- "nsh-fs02:7070"
my_object_path <- "studyid/seriesid/instanceid-an.dcm"
save_object(my_object_path, file = "output.dcm", bucket = my_bucket,
            base_url = my_endpoint_host, region = "", use_https = FALSE,
            key = my_access_key, secret = my_secret_key)
```

No examples of loading files are given here, as there are with Python. Examples must be equivalent.