Add s3-service.md to document the S3 service inside Safe Haven Services #238

Conversation
There is no general-purpose S3 service within Safe Haven Services, unlike the [EIDF S3 service](../services/s3/).

However there is an S3 service in SHS with the following caveats:

* it is only available to the Scottish National Safe Haven, other tenants by arrangement
* it is a read-only service, as a way of providing access to large collections of files
* it is not a storage solution for users wanting to create their own files

## Access arrangements

Access to buckets is via keys (Access Key and Secret Access Key) provided to the user by the Research Coordinator.
Suggested rewording: Some Safe Havens may provide you with access to data via S3. If this applies to your project, your Research Coordinator will provide you with an access key. This documentation will guide you through how to get access to your data from a terminal as well as programmatically via R and Python.
!!! important
    Files in S3 buckets are read-only. If you need to transform or make changes to any files, you will need to download them to your project space. If you download files, please be mindful of disk space by only downloading what is necessary and deleting them as soon as they are no longer needed.
## How to use the service
## Environment setup
To access files you need the following information:

* Region is "us-east-1"
* Endpoint URL is "http://nsh-fs02:7070"
* Access key
* Secret access key
* The web proxy variables must be empty
Your RC will provide you with a bucket name, access key ID (bucket name), and secret access key.
```
export http_proxy=
```
## Use from the command line
## Accessing data
### Command Line
If this command fails it might be due to a proxy configuration in your environment. To temporarily turn off the proxy in the current window use this first:

```
export http_proxy=
```
Why is this the case? This appears to be a workaround that could become confusing if the user then attempts to download other packages in the same terminal. Users will likely run this without understanding what it does or that they would have to run other installation commands in a separate terminal.
I don't know if it's technically possible for the web proxy to be configured to pass traffic onto nsh-fs02. Maybe a question for Barry or similar. If so it would reduce confusion, but on the other hand I don't think it's a good idea to put the web proxy between the client and the S3 server because all it does is cause additional unnecessary load on the proxy and slow everything down; there's no benefit from authentication either.
Had a chat with Barry about this, he recommends the use of the NO_PROXY variable set in bashrc instead, so this would look like `export NO_PROXY="$NO_PROXY,nsh-fs02:7070"` (I tested and can confirm it works). I also asked Susan, and users are never told to go and edit their bashrcs, which we may want to avoid. The other option is for systems to add this to their bashrcs (either for all users or on-demand).
Last time I tested this the NO_PROXY variable was ignored!
NO_PROXY will work here, if properly configured.

Routes to the proxy:

```
rmacleod@nsh-rc-desktop01:~$ NO_PROXY=''; curl -LI nsh-fs02:7070
HTTP/1.1 503 Service Unavailable
Server: squid
...
```

Routes directly to nsh-fs02:

```
rmacleod@nsh-rc-desktop01:~$ NO_PROXY=nsh-fs02; curl -LI nsh-fs02:7070
HTTP/1.1 400 Bad Request
Server: VERSITYGW
...
```

A more generalised solution would be to set `NO_PROXY=localhost,127.0.0.1,.nsh.loc` and then to fully-qualify the server as `nsh-fs02.nsh.loc`.
NO_PROXY doesn't work consistently: R ignores both NO_PROXY and the lowercase no_proxy, Python ignores NO_PROXY but does obey no_proxy, and awscli obeys NO_PROXY.
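Given that inconsistency, the most portable in-session workaround is probably to clear the proxy variables from the client process itself before any connection is made. A minimal Python sketch, assuming the standard proxy variable names and that nothing else in that session needs the proxy:

```python
import os

# Clear both spellings of the proxy variables for this process only,
# so traffic to nsh-fs02 is not routed through the web proxy.
# Other commands in the same terminal are unaffected, but any package
# installs run from this same Python process would also bypass the proxy.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)
```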
At this point it will probably complain that it can't locate your credentials. In fact it requires a bit more information in order to find the bucket: the region, endpoint, access key and secret key:

```
export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL=http://nsh-fs02:7070
export AWS_ACCESS_KEY_ID=put_your_key_here
export AWS_SECRET_ACCESS_KEY=put_your_secret_here
export http_proxy=
aws s3 cp s3://extraction_5_CT/123/456/789.dcm copy_of_789.dcm
```
It should go without saying that the access key details are confidential and must never be shared or allowed to be seen by others. Note that all file accesses are logged.
Avoid repeating this by placing it in the environment setup section. Maybe show this as an example env file and source command.
## Performance tips
These apply to programmatic methods as well, so why not put them after access methods?
* consume the file directly in memory if possible, don't save it to disk. Saving to disk will waste disk space and it will take 3 times longer to do your processing. See the example code below.
Process S3 files directly in memory wherever possible. Saving files to disk is not recommended as this will harm performance (expected to be up to 3 times slower). If this cannot be avoided, please delete files when no longer needed to recover disk space.
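A minimal sketch of that in-memory pattern with boto3 (assuming the AWS_* environment variables from the setup section are already set; the bucket and object names here are placeholders):

```python
import io
import boto3

s3 = boto3.resource("s3", endpoint_url="http://nsh-fs02:7070")
bucket = s3.Bucket("dummydata")  # placeholder bucket name

# Download the object straight into a memory buffer; nothing is written to disk.
buffer = io.BytesIO()
bucket.download_fileobj("studyid/seriesid/instanceid.dcm", buffer)
buffer.seek(0)  # rewind so the next reader starts at the beginning
```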
```
pip install boto3
```
### Download an object
"Download a file" or introduce object
```
import boto3
resource = boto3.resource("s3")
bucket = resource.Bucket("epcc-test")
bucket.download_file("test.txt", "copy_of_test.txt")
```
Show how the env vars would be used to configure the connection, like you have in R.
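A sketch of how that could look. This assumes the environment variables from the setup section are already exported: boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION automatically, and the endpoint is passed explicitly here in case the installed boto3 predates support for the AWS_ENDPOINT_URL variable:

```python
import boto3

# Credentials and region come from the AWS_* environment variables
# set during environment setup; only the endpoint is given explicitly.
resource = boto3.resource("s3", endpoint_url="http://nsh-fs02:7070")
bucket = resource.Bucket("epcc-test")
bucket.download_file("test.txt", "copy_of_test.txt")
```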
### Load object into pydicom dataset
Generalise this to "load a file for further processing" or similar, and give as an example passing it to pydicom.
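One possible shape for that generalised example, downloading into memory and then handing the buffer to pydicom (bucket name and object path are placeholders):

```python
import io
import boto3
import pydicom

resource = boto3.resource("s3", endpoint_url="http://nsh-fs02:7070")
bucket = resource.Bucket("dummydata")  # placeholder bucket name

# Load the file into memory for further processing...
buffer = io.BytesIO()
bucket.download_fileobj("studyid/seriesid/instanceid.dcm", buffer)
buffer.seek(0)

# ...then, as one example, parse it as a DICOM dataset.
dataset = pydicom.dcmread(buffer)
print(dataset.SOPInstanceUID)
```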
```
my_endpoint_host <- "nsh-fs02:7070"
my_object_path <- "studyid/seriesid/instanceid-an.dcm"
save_object( my_object_path, file = "output.dcm", bucket = my_bucket, base_url=my_endpoint_host, region="", use_https=FALSE, key = my_access_key, secret = my_secret_key )
```
No examples of loading files given as with Python. Examples must be equivalent.
* A "bucket" name (similar to a folder name) | ||
* An "access key" (similar to a username) | ||
* A "private key" (similar to a password) |
The user only needs the access key ID (or bucket name) and secret access key. I think this is confusing: it implies there are two access keys, one of which is "similar to a username", despite this not being used in the configuration?
Actually it is as described, the bucket name is not the access key
When defining the variables below, and as defined in the file received from the RC:

```
AWS_ACCESS_KEY_ID=bucketname
AWS_SECRET_ACCESS_KEY=password
```
The bucket name is like the directory name, the access key is like the username and the secret key is like the password, I'm not sure where you're getting the information that the access key is the same as the bucket name.
As specified in my previous comment, I'm getting it from the documentation and the file received from the RC. Maybe an example would help clarify where I'm coming from.

I have a user `bprodan` in study `2024-0000` and was told by the RC that I will be given access to a bucket called `dummydata`. According to the user documentation, which follows the standard minio convention, the configuration should be:

| variable | value |
|---|---|
| BUCKET_NAME | dummydata |
| ACCESS_KEY_ID | bprodan |
| SECRET_ACCESS_KEY | 1234 |

According to the RC documentation, keys are created at study level, so this should be more like:

| variable | value |
|---|---|
| BUCKET_NAME | dummydata |
| ACCESS_KEY_ID | 2024-0000 |
| SECRET_ACCESS_KEY | 1234 |

But instead, in the file received from the RC, I have:

| variable | value |
|---|---|
| BUCKET_NAME | dummydata |
| ACCESS_KEY_ID | dummydata |
| SECRET_ACCESS_KEY | 1234 |

Currently, the user documentation assumes knowledge of how minio is configured, and calling the access key ID a username just to then receive the bucket name would make a user question which value maps to that variable and if that is their username or not.
> The bucket name is like the directory name, the access key is like the username and the secret key is like the password, I'm not sure where you're getting the information that the access key is the same as the bucket name.
I agree with this, so perhaps the RC documentation needs to be updated?
As an aside, we are not using MinIO.
The technology is not important in this case; this is about user guidance on how to configure access. I am not disagreeing with Andrew's statement; I'm saying the way the documentation presents this assumes knowledge about what the access key ID variable represents, and is unclear about what value is associated with it.
```
save_object( my_object_path, file = "downloaded.dcm", bucket = my_bucket, base_url=my_endpoint_host, region="", use_https=FALSE, key = my_access_key, secret = my_secret_key )
```
Note! You need to have the region set in the environment variable and pass `region=""` to the functions, otherwise you get a `cannot resolve host` error. The `base_url` is just the host and port, no `http://` prefix, and `use_https` is false.
Note formatting is inconsistent within this page as well as compared with the rest of the docs. For important pieces of text, this is probably the most visible.
```
os.environ["AWS_ENDPOINT_URL"] = "http://nsh-fs02:7070"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
os.environ["AWS_ACCESS_KEY_ID"] = "dummydata"
os.environ["AWS_SECRET_ACCESS_KEY"] = "put_your_secret_here"
os.environ["http_proxy"] = ""
```
The user was already told to set the environment variables. Does boto automatically pick these up or is something like this needed?
If the answer to the pip install comment is that it's not a problem, this is an approval.
Can anyone do pip install?
Minor points you may not care about just now:

- Standardise "Important" signposting; they are all the same (unless one wants to use an md construct), so:
    - Under "Access arrangements", add "-" after it, like the first instance
    - Under "How to use", replace "Note!" with "Important -"
    - Under R usage, replace "Note!" with "Important -" (needs done three times)
- Remove "n" from "recover then memory"
Yes, anyone can run pip install as part of access to PyPi via the web proxy.
EIDF Documentation Pull Request
Description
Add s3-service.md to document the S3 service inside Safe Haven Services
Fixes #237
What has to be reviewed
s3-service.md