Description
Name of Feature or Improvement
RDMA Networks
Description of Problem the Feature Should Solve
RDMA Networks for Efficient LLM Training
Describe the Solution You Would Like to See
Allow the worker group pod template to carry the k8s.v1.cni.cncf.io/networks annotation, so that RDMA-capable secondary networks can be attached to the worker pods. For example:
"workerGroupSpecs": [
    {
        "replicas": cluster.config.num_workers,
        "minReplicas": cluster.config.num_workers,
        "maxReplicas": cluster.config.num_workers,
        "groupName": f"small-group-{cluster.config.name}",
        "rayStartParams": {
            "block": "true",
            "num-gpus": str(worker_gpu_count),
            "resources": worker_resources,
        },
        "template": V1PodTemplateSpec(
            metadata=V1ObjectMeta(
                annotations={
                    "k8s.v1.cni.cncf.io/networks": [,,...]
                }
            ),
            spec=get_pod_spec(
                cluster,
                [get_worker_container_spec(cluster)],
                cluster.config.worker_tolerations,
            )
        ),
    }
],
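The annotation value in the snippet above is elided. As a minimal sketch of how it might be built, assuming the secondary networks are exposed as Multus NetworkAttachmentDefinition objects: Multus reads this annotation as a single string (comma-separated names, optionally written as namespace/name, or a JSON list), not as a Python list. The helper name `build_network_annotations` and the network names below are hypothetical and are not part of any existing SDK.

```python
# Annotation key read by Multus to attach secondary networks to a pod.
MULTUS_NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"


def build_network_annotations(networks):
    """Return pod annotations that attach the given secondary networks.

    `networks` is a list of NetworkAttachmentDefinition names, each
    optionally written as "namespace/name". Hypothetical helper.
    """
    # Drop empty entries and surrounding whitespace before joining.
    names = [n.strip() for n in networks if n and n.strip()]
    if not names:
        # No secondary networks requested: add no annotation at all.
        return {}
    # Multus accepts a comma-separated string of network names.
    return {MULTUS_NETWORKS_KEY: ",".join(names)}


# Hypothetical network names, for illustration only.
annotations = build_network_annotations(["rdma-net-0", "rdma-net-1"])
print(annotations)
```

The resulting dictionary could then be passed as the annotations argument of V1ObjectMeta in the worker pod template. Returning an empty dictionary when no networks are given keeps clusters without RDMA unchanged.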
Describe Alternatives You Have Considered
Description of any alternative solutions or features you have considered.
Additional Context
Add any other context, screenshots, console logs, etc. about the request here.