Add s3-service.md to document the S3 service inside Safe Haven Services #238

Conversation
There is no general-purpose S3 service within Safe Haven Services, unlike the [EIDF S3 service](../services/s3/).

However there is an S3 service in SHS with the following caveats:

* it is only available to the Scottish National Safe Haven, other tenants by arrangement
* it is a read-only service, as a way of providing access to large collections of files
* it is not a storage solution for users wanting to create their own files

## Access arrangements

Access to buckets is via keys (Access Key and Secret Access Key) provided to the user by the Research Coordinator.
Suggested rewording: Some Safe Havens may provide you with access to data via S3. If this applies to your project, your Research Coordinator will provide you with an access key. This documentation will guide you through how to get access to your data from a terminal as well as programmatically via R and Python.
!!! important
    Files in S3 buckets are read-only. If you need to transform or make changes to any files, you will need to download them to your project space. If you download files, please be mindful of disk space by only downloading what is necessary and deleting them as soon as they are no longer needed.
## How to use the service
## Environment setup
To access files you need the following information:

* Region is "us-east-1"
* Endpoint URL is "http://nsh-fs02:7070"
* Access key
* Secret access key
* The web proxy variables must be empty
Your RC will provide you with a bucket name, access key ID (bucket name), and secret access key.
```
export http_proxy=
```
## Use from the command line
## Accessing data
### Command Line
If this command fails it might be due to a proxy configuration in your environment. To temporarily turn off the proxy in the current window use this first:

```
export http_proxy=
```
Why is this the case? This appears to be a workaround that could become confusing if the user then attempts to download other packages in the same terminal. Users will likely run this without understanding what it does or that they would have to run other installation commands in a separate terminal.
I don't know if it's technically possible for the web proxy to be configured to pass traffic onto nsh-fs02. Maybe a question for Barry or similar. If so it would reduce confusion, but on the other hand I don't think it's a good idea to put the web proxy between the client and the S3 server because all it does is cause additional unnecessary load on the proxy and slow everything down; there's no benefit from authentication either.
Had a chat with Barry about this, he recommends the use of the NO_PROXY variable set in bashrc instead, so this would look like `export NO_PROXY="$NO_PROXY,nsh-fs02:7070"` (I tested and can confirm it works). I also asked Susan, and users are never told to go and edit their bashrcs, which we may want to avoid. The other option is for systems to add this to their bashrcs (either for all users or on-demand).
Last time I tested this the NO_PROXY variable was ignored!
NO_PROXY will work here, if properly configured.

Routes to the proxy:

```
rmacleod@nsh-rc-desktop01:~$ NO_PROXY=''; curl -LI nsh-fs02:7070
HTTP/1.1 503 Service Unavailable
Server: squid
...
```

Routes directly to nsh-fs02:

```
rmacleod@nsh-rc-desktop01:~$ NO_PROXY=nsh-fs02; curl -LI nsh-fs02:7070
HTTP/1.1 400 Bad Request
Server: VERSITYGW
...
```

A more generalised solution would be to set `NO_PROXY=localhost,127.0.0.1,.nsh.loc` and then to fully-qualify the server as `nsh-fs02.nsh.loc`.
NO_PROXY doesn't work consistently: R ignores both NO_PROXY and the lowercase no_proxy, Python ignores NO_PROXY but does obey no_proxy, and awscli obeys NO_PROXY.
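Given that inconsistency, the most portable in-session workaround is probably to clear the proxy variables from the client process itself before any connection is made. A minimal Python sketch, assuming the standard proxy variable names and that nothing else in that session needs the proxy:

```python
import os

# Clear both spellings of the proxy variables for this process only,
# so traffic to nsh-fs02 is not routed through the web proxy.
# Other commands in the same terminal are unaffected, but any package
# installs run from this same Python process would also bypass the proxy.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)
```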
At this point it will probably complain that it can't locate your credentials. In fact it requires a bit more information in order to find the bucket: the region, endpoint, access key and secret key:

```
export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL=http://nsh-fs02:7070
export AWS_ACCESS_KEY_ID=put_your_key_here
export AWS_SECRET_ACCESS_KEY=put_your_secret_here
export http_proxy=
aws s3 cp s3://extraction_5_CT/123/456/789.dcm copy_of_789.dcm
```
It should go without saying that the access key details are confidential and must never be shared or allowed to be seen by others. Note that all file accesses are logged.
Avoid repeating this by placing it in the environment setup section. Maybe show this as an example env file and source command.
## Performance tips
These apply to programmatic methods as well, so why not put them after access methods?
* consume the file directly in memory if possible, don't save it to disk. Saving to disk will waste disk space and it will take 3 times longer to do your processing. See the example code below.
Process S3 files directly in memory wherever possible. Saving files to disk is not recommended as this will harm performance (expected to be up to 3 times slower). If this cannot be avoided, please delete files when no longer needed to recover disk space.
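A minimal sketch of that in-memory pattern with boto3 (assuming the AWS_* environment variables from the setup section are already set; the bucket and object names here are placeholders):

```python
import io
import boto3

s3 = boto3.resource("s3", endpoint_url="http://nsh-fs02:7070")
bucket = s3.Bucket("dummydata")  # placeholder bucket name

# Download the object straight into a memory buffer; nothing is written to disk.
buffer = io.BytesIO()
bucket.download_fileobj("studyid/seriesid/instanceid.dcm", buffer)
buffer.seek(0)  # rewind so the next reader starts at the beginning
```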
```
pip install boto3
```
### Download an object
"Download a file" or introduce object
```
import boto3
resource = boto3.resource("s3")
bucket = resource.Bucket("epcc-test")
bucket.download_file("test.txt", "copy_of_test.txt")
```
Show how the env vars would be used to configure the connection, like you have in R.
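A sketch of how that could look. This assumes the environment variables from the setup section are already exported: boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION automatically, and the endpoint is passed explicitly here in case the installed boto3 predates support for the AWS_ENDPOINT_URL variable:

```python
import boto3

# Credentials and region come from the AWS_* environment variables
# set during environment setup; only the endpoint is given explicitly.
resource = boto3.resource("s3", endpoint_url="http://nsh-fs02:7070")
bucket = resource.Bucket("epcc-test")
bucket.download_file("test.txt", "copy_of_test.txt")
```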
### Load object into pydicom dataset
Generalise this to "load a file for further processing" or similar, and give as an example passing it to pydicom.
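One possible shape for that generalised example, downloading into memory and then handing the buffer to pydicom (bucket name and object path are placeholders):

```python
import io
import boto3
import pydicom

resource = boto3.resource("s3", endpoint_url="http://nsh-fs02:7070")
bucket = resource.Bucket("dummydata")  # placeholder bucket name

# Load the file into memory for further processing...
buffer = io.BytesIO()
bucket.download_fileobj("studyid/seriesid/instanceid.dcm", buffer)
buffer.seek(0)

# ...then, as one example, parse it as a DICOM dataset.
dataset = pydicom.dcmread(buffer)
print(dataset.SOPInstanceUID)
```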
```
my_endpoint_host <- "nsh-fs02:7070"
my_object_path <- "studyid/seriesid/instanceid-an.dcm"
save_object( my_object_path, file = "output.dcm", bucket = my_bucket, base_url=my_endpoint_host, region="", use_https=FALSE, key = my_access_key, secret = my_secret_key )
```
No examples of loading files given as with Python. Examples must be equivalent.
* A "bucket" name (similar to a folder name) | ||
* An "access key" (similar to a username) | ||
* A "private key" (similar to a password) |
The user only needs the access key ID (or bucket name) and secret access key. I think this is confusing: it implies there are two access keys, one of which is "similar to a username", despite this not being used in the configuration?
Actually it is as described, the bucket name is not the access key
When defining the variables below, and as defined in the file received from the RC:

```
AWS_ACCESS_KEY_ID=bucketname
AWS_SECRET_ACCESS_KEY=password
```
The bucket name is like the directory name, the access key is like the username and the secret key is like the password, I'm not sure where you're getting the information that the access key is the same as the bucket name.
As specified in my previous comment, I'm getting it from the documentation and the file received from the RC. Maybe an example would help clarify where I'm coming from.

I have a user `bprodan` in study `2024-0000` and was told by the RC that I will be given access to a bucket called `dummydata`. According to the user documentation, which follows the standard minio convention, the configuration should be:

| variable | value |
|---|---|
| BUCKET_NAME | dummydata |
| ACCESS_KEY_ID | bprodan |
| SECRET_ACCESS_KEY | 1234 |

According to the RC documentation, keys are created at study level, so this should be more like:

| variable | value |
|---|---|
| BUCKET_NAME | dummydata |
| ACCESS_KEY_ID | 2024-0000 |
| SECRET_ACCESS_KEY | 1234 |

But instead, in the file received from the RC, I have:

| variable | value |
|---|---|
| BUCKET_NAME | dummydata |
| ACCESS_KEY_ID | dummydata |
| SECRET_ACCESS_KEY | 1234 |

Currently, the user documentation assumes knowledge of how minio is configured, and calling the access key ID a username just to then receive the bucket name would make a user question which value maps to that variable and if that is their username or not.
> The bucket name is like the directory name, the access key is like the username and the secret key is like the password, I'm not sure where you're getting the information that the access key is the same as the bucket name.
I agree with this, so perhaps the RC documentation needs to be updated?
As an aside, we are not using MinIO.
The technology is not important in this case; this is about user guidance on how to configure access. I am not disagreeing with Andrew's statement; I'm saying the way the documentation presents this assumes knowledge about what the access key ID variable represents, and is unclear about what value is associated with it.
```
save_object( my_object_path, file = "downloaded.dcm", bucket = my_bucket, base_url=my_endpoint_host, region="", use_https=FALSE, key = my_access_key, secret = my_secret_key )
```
Note! You need to have the region set in the environment variable and pass `region=""` to the functions, otherwise you get a `cannot resolve host` error. The `base_url` is just the host and port, no `http://` prefix, and `use_https` is false.
Note formatting is inconsistent within this page as well as compared with the rest of the docs. For important pieces of text, this is probably the most visible.
```
os.environ["AWS_ENDPOINT_URL"] = "http://nsh-fs02:7070"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
os.environ["AWS_ACCESS_KEY_ID"] = "dummydata"
os.environ["AWS_SECRET_ACCESS_KEY"] = "put_your_secret_here"
os.environ["http_proxy"] = ""
```
The user was already told to set the environment variables. Does boto automatically pick these up or is something like this needed?
If the answer to the pip install comment is that it's not a problem, this is an approval.
Can anyone do pip install?
Minor points you may not care about just now:

- Standardise "Important" signposting; they are all the same (unless one wants to use an md construct), so:
    - Under "Access arrangements", add "-" after it, like the first instance
    - Under "How to use", replace "Note!" with "Important -"
    - Under R usage, replace "Note!" with "Important -" (needs done three times)
- Remove "n" from "recover then memory"
Yes, anyone can run pip install as part of access to PyPi via the web proxy.
EIDF Documentation Pull Request
Description
Add s3-service.md to document the S3 service inside Safe Haven Services
Fixes #237
What has to be reviewed
s3-service.md