This repository is a template for running Python projects on GPU nodes in NRP Nautilus. The `config_k8s.py` script automatically generates a K8s job file and secrets based on your inputs. The instructions below provide a workflow for building Docker images and pushing them to the NRP's GitLab container registry. Follow the steps below inside a Coder workspace or another environment where you can install the dependencies listed in the Prerequisites section. If your workspace has enough resources, run your code there directly instead of using this template.
- Fork this repository privately on the NRP's GitLab instance
- Optionally, create a new branch in your fork and follow the steps in the Getting started section on this branch
- Install `kubectl`, `git`, and `uv`
- Save the NRP-provided K8s config to `~/.kube/config`
- Create a Personal Access Token with the `read_repository` scope
- Create a deploy token for your fork with the `read_registry` scope
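These prerequisites can be set up roughly as follows (a minimal sketch assuming a Linux workspace with `curl`; install `kubectl` and `git` with your distribution's package manager if they are missing):

```bash
# Save the NRP-provided kubeconfig and restrict its permissions:
mkdir -p ~/.kube
# (paste the config contents into ~/.kube/config with your editor of choice)
chmod 600 ~/.kube/config

# Install uv with its official installer script:
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify everything is on the PATH:
kubectl version --client
git --version
uv --version
```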
In your terminal, clone your fork of this repository and `cd` into its directory. Next, follow these steps:
- Generate a K8s job file named `your_job.yml` with the following command:

  ```bash
  python config_k8s.py --netid NetID --output your_job.yml --pat GitLabPAT --dt-username DeployTokenUsername --dt-password DeployTokenPassword
  ```

  `--pat`, `--dt-username`, and `--dt-password` can be omitted if you already created the secrets `NetID-gitlab` and `NetID-RepoName-regcred`.
- Create a virtualenv for your project:
  - Update `pyproject.toml` to include your project's dependencies
  - Run `uv sync` to install them in a new virtualenv
- Commit and push your changes
  - This will automatically start a CI/CD pipeline on GitLab to build your image and push it to the NRP's container registry
  - Navigate to "Build" → "Jobs" in the sidebar of GitLab's web UI to monitor the build job's progress
- Add your project's run commands to `run.sh` and add your code to the repo.
- Modify `your_job.yml` as needed (see the annotated excerpt after this list):
  - The job name (line 7)
  - Environment variables inside your container's `env` section (line 16)
  - Your container's resource requests/limits (lines 24-34)
  - The branch your job will pull code from (line 48)
- Once your changes are complete, push them to the current branch of your fork.
- Once the CI/CD pipeline completes, run your job with `kubectl create -f your_job.yml` (the sketch after this list shows the full submit-and-monitor loop)
  - Run `kubectl get pods | grep <job-name>` to get the name of the pod associated with your job
  - Run `kubectl logs <pod-name>` to view the output of `run.sh`
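The fields called out in the "Modify `your_job.yml`" step sit in a standard K8s Job manifest. The excerpt below is a hedged sketch, not the template's actual file: the generated manifest, its line numbers, the example variable, and the way the branch is specified may all differ.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: netid-my-experiment          # the job name (line 7 in the generated file)
spec:
  template:
    spec:
      containers:
        - name: main
          env:                        # your container's env section (line 16)
            - name: WANDB_API_KEY     # hypothetical example variable
              value: "..."
          resources:                  # requests/limits (lines 24-34)
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
      # The branch your job pulls code from (line 48) is set further down in
      # the generated file; the exact field depends on the template.
```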
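Putting the final steps together, the submit-and-monitor loop looks like this (`<job-name>` and `<pod-name>` are placeholders; `kubectl delete` is standard kubectl rather than anything the template requires):

```bash
# Submit the job defined in your generated manifest:
kubectl create -f your_job.yml

# Find the pod the job spawned:
kubectl get pods | grep <job-name>

# Stream run.sh's output (-f follows the log as it grows):
kubectl logs -f <pod-name>

# Clean up once the job is done:
kubectl delete -f your_job.yml
```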
Modify the following files along with your Python code:
- `run.sh` runs your code when the container starts
- `pyproject.toml` contains Python dependencies
- `Dockerfile` is used to build the Docker image
- `your_job.yml` specifies the K8s job configuration
Avoid changing `entrypoint.sh`, as this requires rebuilding the image for changes to take effect. Add commands to `run.sh` instead.
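For instance, `run.sh` might contain nothing more than this (a hedged sketch; `main.py` and its flag are placeholders for your own entry point, and whether the virtualenv is already activated depends on what `entrypoint.sh` does):

```bash
#!/bin/bash
set -euo pipefail

# Launch the project's entry point (hypothetical module and flag):
python main.py --epochs 10
```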
Remove unnecessary dependencies from both `pyproject.toml` and the `Dockerfile`. If this is not enough, you may extend the timeout in `.gitlab-ci.yml`.
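For reference, a job-level `timeout` uses GitLab's standard CI keyword (the job name below is a placeholder; check the template's pipeline for the real one):

```yaml
build-image:      # hypothetical job name
  timeout: 2h     # raise the default (typically 1h) for slow image builds
```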
NRP-provided storage has usage restrictions. Notably, even accidentally storing Python dependencies in Ceph may result in a temporary ban from accessing Nautilus resources. Instead, use:
- Hugging Face Hub to efficiently store both datasets and model checkpoints
- wandb or Comet to log experiment results
Given these alternatives (which are not subject to the same usage restrictions), this template does not support NRP-provided storage.
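For example, a job can push a checkpoint straight to the Hub rather than writing it to Ceph (a sketch assuming `huggingface_hub` is among your dependencies, an `HF_TOKEN` environment variable is set in your job, and the model repo already exists; all names are placeholders):

```bash
# Upload a local checkpoint to a Hugging Face model repo:
huggingface-cli upload your-username/your-model checkpoint.pt checkpoint.pt
```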
Will I need to wait for the GitLab CI/CD job to finish after each pushed commit for my next K8s job to access new code?
You only need to wait for the CI/CD pipeline to complete if you've modified `pyproject.toml`, the `Dockerfile`, or `.gitlab-ci.yml`, since these changes require rebuilding the container image. You should avoid modifying `entrypoint.sh`, but if you must, you will need to wait for the pipeline to complete for your changes to take effect.
In the `Dockerfile`, change `base` to either `runtime` (for more CUDA libraries) or `devel` (for all CUDA development tools, including `nvcc`). Either option includes more CUDA binaries and libraries in your container. If you want to reduce the size of your final image, use a multi-stage build to select which CUDA binaries and libraries to copy into the final image.
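A multi-stage layout might look like the following. This is a sketch under assumptions: the CUDA image tags and the copied path are placeholders, and the template's actual `Dockerfile` may be organized differently.

```dockerfile
# Build stage: full CUDA toolchain, including nvcc, for compiling extensions.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
# ... build your CUDA extensions or wheels here ...

# Final stage: slim base image with only the CUDA essentials you need.
FROM nvidia/cuda:12.4.1-base-ubuntu22.04
# Copy only the artifacts the final image actually uses (placeholder path):
COPY --from=build /opt/artifacts /opt/artifacts
```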