This repository is a template for running Python projects on GPU nodes in NRP Nautilus. The `config_k8s.py` script automatically generates a K8s job file and secrets based on your inputs. The instructions below provide a workflow for building Docker images and pushing them to the NRP's GitLab container registry. Follow the steps below inside a Coder workspace or another environment where you can install the dependencies listed in the Prerequisites section. If your workspace has enough resources, run your code there directly instead of using this template.
- Fork this repository privately on the NRP's GitLab instance
- Optionally, create a new branch in your fork and follow the steps in the Getting started section on this branch
- Install `kubectl`, `git`, and `uv`
- Save the NRP-provided K8s config to `~/.kube/config`
- Create a Personal Access Token with the `read_repository` scope
- Create a deploy token for your fork with the `read_registry` scope
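These prerequisites can be set up roughly as follows (a minimal sketch assuming a Linux workspace with `curl`; install `kubectl` and `git` with your distribution's package manager if they are missing):

```bash
# Save the NRP-provided kubeconfig and restrict its permissions:
mkdir -p ~/.kube
# (paste the config contents into ~/.kube/config with your editor of choice)
chmod 600 ~/.kube/config

# Install uv with its official installer script:
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify everything is on the PATH:
kubectl version --client
git --version
uv --version
```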
In your terminal, clone your fork of this repository and `cd` into its directory. Next, follow these steps:
- Generate a K8s job file named `your_job.yml` with the following command:

  ```bash
  python config_k8s.py --netid NetID --output your_job.yml --pat GitLabPAT --dt-username DeployTokenUsername --dt-password DeployTokenPassword
  ```

  `--pat`, `--dt-username`, and `--dt-password` can be omitted if you already created the secrets `NetID-gitlab` and `NetID-RepoName-regcred`.
- Create a virtualenv for your project:
  - Update `pyproject.toml` to include your project's dependencies
  - Run `uv sync` to install them in a new virtualenv
- Commit and push your changes
  - This will automatically start a CI/CD pipeline on GitLab to build your image and push it to the NRP's container registry
  - Navigate to "Build" → "Jobs" in the sidebar of GitLab's web UI to monitor the build job's progress
- Add your project's run commands to `run.sh` and add your code to the repo.
- Modify `your_job.yml` as needed (see the annotated excerpt after this list):
  - The job name (line 7)
  - Environment variables inside your container's `env` section (line 16)
  - Your container's resource requests/limits (lines 24-34)
  - The branch your job will pull code from (line 48)
- Once your changes are complete, push them to the current branch of your fork.
- Once the CI/CD pipeline completes, run your job with `kubectl create -f your_job.yml` (the sketch after this list shows the full submit-and-monitor loop)
  - Run `kubectl get pods | grep <job-name>` to get the name of the pod associated with your job
  - Run `kubectl logs <pod-name>` to view the output of `run.sh`
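The fields called out in the "Modify `your_job.yml`" step sit in a standard K8s Job manifest. The excerpt below is a hedged sketch, not the template's actual file: the generated manifest, its line numbers, the example variable, and the way the branch is specified may all differ.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: netid-my-experiment          # the job name (line 7 in the generated file)
spec:
  template:
    spec:
      containers:
        - name: main
          env:                        # your container's env section (line 16)
            - name: WANDB_API_KEY     # hypothetical example variable
              value: "..."
          resources:                  # requests/limits (lines 24-34)
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
      # The branch your job pulls code from (line 48) is set further down in
      # the generated file; the exact field depends on the template.
```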
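Putting the final steps together, the submit-and-monitor loop looks like this (`<job-name>` and `<pod-name>` are placeholders; `kubectl delete` is standard kubectl rather than anything the template requires):

```bash
# Submit the job defined in your generated manifest:
kubectl create -f your_job.yml

# Find the pod the job spawned:
kubectl get pods | grep <job-name>

# Stream run.sh's output (-f follows the log as it grows):
kubectl logs -f <pod-name>

# Clean up once the job is done:
kubectl delete -f your_job.yml
```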
Modify the following files along with your Python code:
- `run.sh` runs your code when the container starts
- `pyproject.toml` contains Python dependencies
- `Dockerfile` is used to build the Docker image
- `your_job.yml` specifies the K8s job configuration
Avoid changing `entrypoint.sh`, as this requires rebuilding the image for changes to take effect. Add commands to `run.sh` instead.
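For instance, `run.sh` might contain nothing more than this (a hedged sketch; `main.py` and its flag are placeholders for your own entry point, and whether the virtualenv is already activated depends on what `entrypoint.sh` does):

```bash
#!/bin/bash
set -euo pipefail

# Launch the project's entry point (hypothetical module and flag):
python main.py --epochs 10
```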
Remove unnecessary dependencies from both `pyproject.toml` and the `Dockerfile`. If this is not enough, you may extend the timeout in `.gitlab-ci.yml`.
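For reference, a job-level `timeout` uses GitLab's standard CI keyword (the job name below is a placeholder; check the template's pipeline for the real one):

```yaml
build-image:      # hypothetical job name
  timeout: 2h     # raise the default (typically 1h) for slow image builds
```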
NRP-provided storage has usage restrictions. Notably, even accidentally storing Python dependencies in Ceph may result in a temporary ban from accessing Nautilus resources. Instead, use:
- Hugging Face Hub to efficiently store both datasets and model checkpoints
- wandb or Comet to log experiment results
Given these alternatives (which are not subject to the same usage restrictions), this template does not support NRP-provided storage.
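For example, a job can push a checkpoint straight to the Hub rather than writing it to Ceph (a sketch assuming `huggingface_hub` is among your dependencies, an `HF_TOKEN` environment variable is set in your job, and the model repo already exists; all names are placeholders):

```bash
# Upload a local checkpoint to a Hugging Face model repo:
huggingface-cli upload your-username/your-model checkpoint.pt checkpoint.pt
```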
Will I need to wait for the GitLab CI/CD job to finish after each pushed commit for my next K8s job to access new code?
You only need to wait for the CI/CD pipeline to complete if you've modified `pyproject.toml`, the `Dockerfile`, or `.gitlab-ci.yml`, since these changes require rebuilding the container image. You should avoid modifying `entrypoint.sh`, but if you must, you will need to wait for the pipeline to complete for your changes to take effect.
In the `Dockerfile`, change `base` to either `runtime` (for more CUDA libraries) or `devel` (for all CUDA development tools, including `nvcc`). Either option includes more CUDA binaries and libraries in your container. If you want to reduce the size of your final image, use a multi-stage build to select which CUDA binaries and libraries to copy into the final image.
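A multi-stage layout might look like the following. This is a sketch under assumptions: the CUDA image tags and the copied path are placeholders, and the template's actual `Dockerfile` may be organized differently.

```dockerfile
# Build stage: full CUDA toolchain, including nvcc, for compiling extensions.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
# ... build your CUDA extensions or wheels here ...

# Final stage: slim base image with only the CUDA essentials you need.
FROM nvidia/cuda:12.4.1-base-ubuntu22.04
# Copy only the artifacts the final image actually uses (placeholder path):
COPY --from=build /opt/artifacts /opt/artifacts
```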