Using Lambda's Managed Kubernetes#
Introduction#
This guide walks you through getting started with Lambda's Managed Kubernetes (MK8s) on a 1-Click Cluster (1CC).
MK8s provides a Kubernetes environment with GPU and InfiniBand (RDMA) support, and shared persistent storage across all nodes in a 1CC. Clusters are preconfigured so you can deploy workloads without additional setup.
In this guide, you'll learn how to:
- Access MK8s using kubectl.
- Grant access to additional users.
- Organize workloads using namespaces.
- Deploy and manage applications.
- Expose services using Ingresses.
- Use shared and node-local persistent storage.
- Monitor GPU usage with the NVIDIA DCGM Grafana dashboard.
This guide includes two examples:
In the first, you'll deploy a vLLM server to serve the Nous Research Hermes 4 model. You'll:
- Create a namespace for the examples.
- Add a PersistentVolumeClaim (PVC) to cache model downloads.
- Deploy the vLLM server.
- Expose it with a Service.
- Configure an Ingress to make it accessible externally.
In the second example, you'll evaluate the multiplication-solving accuracy of the DeepSeek R1 Distill Qwen 7B model using vLLM. You'll:
- Run a batch job that performs the evaluation.
- Monitor GPU utilization during the run.
Prerequisites#
You need the Kubernetes command-line tool, kubectl, to interact with MK8s. Refer to the Kubernetes documentation for installation instructions.
You also need the kubelogin plugin for kubectl to authenticate to MK8s. Refer to the kubelogin README for installation instructions.
Accessing MK8s#
To access MK8s, you need to:
- Configure firewall rules to allow connections to MK8s.
- Configure kubectl to use the provided kubeconfig file.
- Authenticate to MK8s using your Lambda Cloud account.
Configure firewall rules#
To access MK8s, you must first create firewall rules for the MK8s API server and Ingress Controller:
- Navigate to the Global rules tab on the Firewall settings page in the Lambda Cloud console.
- In the Rules section, click Edit rules to begin creating a rule.
- Click Add rule, then set up the following rule:
  - Type: Custom TCP
  - Protocol: TCP
  - Port range: 6443
  - Source: 0.0.0.0/0
  - Description: MK8s API server
- Click Add rule again, then set up the following rule:
  - Type: Custom TCP
  - Protocol: TCP
  - Port range: 443
  - Source: 0.0.0.0/0
  - Description: MK8s Ingress Controller
- Click Update firewall rules.
Configure kubectl#
You're provided with a kubeconfig file when MK8s is provisioned. You need to set up kubectl to use this kubeconfig file:
- Save the file to ~/.kube/config. Alternatively, set the KUBECONFIG environment variable to the path of the file.
- (Optional) Restrict access to the file:
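For example, assuming you saved the file to ~/.kube/config, you can make it readable and writable only by your user:

chmod 600 ~/.kube/config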
Authenticate to MK8s#
- Run a kubectl command against the cluster, for example kubectl get nodes.
  A tab or new window opens in your default web browser, and you're prompted to log in to your Lambda Cloud account.
- Log in to your Lambda Cloud account.
  You're prompted to authorize Lambda Managed Kubernetes to access your Lambda account.
- Click Accept to authenticate to MK8s.
  In your terminal, you should see output similar to:
NAME STATUS ROLES AGE VERSION
mk8s-yns66blqnvvvffjc-mk8s--worker--gpu-8x-h100-sxm5gdr-cgkn46g Ready <none> 17d v1.32.3+rke2r1
mk8s-yns66blqnvvvffjc-mk8s--worker--gpu-8x-h100-sxm5gdr-cgwx5c7 Ready <none> 18d v1.32.3+rke2r1
mk8s-yns66blqnvvvffjc-ndcr6-728sh Ready control-plane,etcd,master 18d v1.32.3+rke2r1
mk8s-yns66blqnvvvffjc-ndcr6-p25c9 Ready control-plane,etcd,master 18d v1.32.3+rke2r1
mk8s-yns66blqnvvvffjc-ndcr6-vvrxc Ready control-plane,etcd,master 18d v1.32.3+rke2r1
Grant access to additional users#
You can grant additional users access to MK8s by using the Teams feature.
Teammates with the Member role have the edit ClusterRole in MK8s, which allows read/write access to most objects in a namespace. Teammates with the Admin role have the cluster-admin ClusterRole in MK8s, which allows full control over every resource in all namespaces.
See the Kubernetes documentation on user-facing ClusterRoles to learn more about the edit and cluster-admin ClusterRoles.
Creating a Pod with access to GPUs and InfiniBand (RDMA)#
Worker (GPU) nodes in MK8s are tainted to prevent non-GPU workloads from being scheduled on them by default. To run GPU-enabled or RDMA-enabled workloads on these nodes, your Pod spec must include the appropriate tolerations and resource requests.
To schedule a Pod on a GPU node using kubectl, include the following toleration in your Pod spec:
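The exact taint key and value are cluster-specific; you can inspect them with kubectl describe node <NODE-NAME>. A minimal sketch, assuming the GPU nodes carry the commonly used nvidia.com/gpu taint key:

tolerations:
- key: nvidia.com/gpu     # assumed taint key; verify against your node's actual taints
  operator: Exists
  effect: NoSchedule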
This toleration matches the taint applied to GPU nodes and allows the scheduler to place your Pod on them.
To allocate GPU resources to your container, specify them explicitly in the resources.limits section. For example, to request one GPU:
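For instance, reusing the nvidia.com/gpu resource name that appears in the full example below:

resources:
  limits:
    nvidia.com/gpu: "1"   # extended resources like GPUs are requested via limits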
If your container also requires InfiniBand (RDMA) support, you must request the RDMA device and include the following runtime configuration:
containers:
- name: <CONTAINER-NAME>
  image: <CONTAINER-IMAGE>
  resources:
    limits:
      nvidia.com/gpu: "8"
      rdma/rdma_shared_device_a: "1"
      memory: 16Gi
    requests:
      nvidia.com/gpu: "8"
      rdma/rdma_shared_device_a: "1"
  volumeMounts:
  - name: dshm
    mountPath: /dev/shm
  securityContext:
    capabilities:
      add:
      - IPC_LOCK
volumes:
- name: dshm
  emptyDir:
    medium: Memory
    sizeLimit: 16Gi
Note
Setting volumes.emptyDir.sizeLimit to 16Gi ensures that sufficient RAM-backed shared memory is available at /dev/shm for RDMA and communication libraries such as NCCL. (See, for example, NVIDIA's documentation on troubleshooting NCCL shared memory issues.)
If no memory limit is set on a container that mounts /dev/shm, the cluster rejects the Pod with an error:
ValidatingAdmissionPolicy 'workload-policy.lambda.com' with binding
'workload-policy-binding.lambda.com' denied request:
(requireDevShmMemoryLimit) Pods are not allowed to have containers that
mount /dev/shm and do not configure any memory resource limits (e.g.
spec.containers[*].resources.limits.memory=1536G).
Defining a memory limit ensures the kernel memory allocator can function efficiently, helping maintain overall node stability.
Creating Ingresses to access services#
MK8s comes preconfigured with the NGINX Ingress Controller and ExternalDNS.
MK8s includes a pre-provisioned wildcard TLS certificate for *.<CLUSTER-ZONE>.k8s.lambda.ai, which is used by the Ingress Controller by default. This setup allows you to expose services securely via public URLs in the format https://<SERVICE>.<CLUSTER-ZONE>.k8s.lambda.ai.
Obtain the CLUSTER-ZONE#
To obtain the <CLUSTER-ZONE> using kubectl:
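The exact command isn't reproduced here; one approach is to inspect the cluster-configuration ConfigMap referenced later in this guide (its namespace and data keys may vary, so adjust with -n <NAMESPACE> if needed):

# Look for the cluster zone value in the ConfigMap's data section.
kubectl get configmap cluster-configuration -o yaml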
Create an Ingress#
To create an Ingress:
- Create an Ingress manifest. For example:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: <NAME>
  namespace: <NAMESPACE>
spec:
  ingressClassName: nginx-public
  rules:
  - host: <SERVICE>.<CLUSTER-ZONE>.k8s.lambda.ai
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: <SERVICE>
            port:
              name: http
  tls:
  - hosts:
    - <SERVICE>.<CLUSTER-ZONE>.k8s.lambda.ai
- Apply the Ingress manifest with kubectl apply -f <INGRESS-MANIFEST>, replacing <INGRESS-MANIFEST> with the path to your manifest file.
  Example output:
- Verify the Ingress was created with kubectl describe ingress -n <NAMESPACE> <NAME>, replacing <NAMESPACE> and <NAME> with your values. You should see output similar to:
Name:             vllm-ingress
Labels:           <none>
Namespace:        mk8s-docs-examples
Address:          192.222.48.194,192.222.48.250,192.222.48.39
Ingress Class:    nginx-public
Default backend:  <default>
TLS:
  SNI routes vllm.mk8s-yns66blqnvvvffjc.us-east.k8s.lambda.ai
Rules:
  Host                                              Path  Backends
  ----                                              ----  --------
  vllm.mk8s-yns66blqnvvvffjc.us-east.k8s.lambda.ai  /     vllm-service:http (10.42.2.16:8000)
Annotations:      <none>
Events:
  Type    Reason  Age                From                      Message
  ----    ------  ----               ----                      -------
  Normal  Sync    22s (x2 over 55s)  nginx-ingress-controller  Scheduled for sync
  Normal  Sync    22s (x2 over 55s)  nginx-ingress-controller  Scheduled for sync
  Normal  Sync    22s (x2 over 55s)  nginx-ingress-controller  Scheduled for sync
Shared and node-local persistent storage#
MK8s provides two StorageClasses for persistent storage:
- lambda-shared: Shared storage backed by a Lambda Filesystem on a network-attached storage cluster. It's accessible from all nodes and provides robust, durable storage. The Lambda Filesystem can also be accessed externally via the Lambda S3 Adapter.
- lambda-local: Local NVMe-backed storage on the node. It's fast and useful for scratch space but isn't accessible from other nodes. Data is lost if the node or NVMe drive fails.
These StorageClasses let you persist data across pod restarts or rescheduling.
However, only lambda-shared persists across node failures.
To use persistent storage in MK8s, workloads must request a
PersistentVolumeClaim (PVC). You can specify the size, access mode, and
StorageClass in the PVC. By default, the lambda-shared
StorageClass is used.
Volumes created from PVCs bind immediately, rather than waiting for a pod to consume them. This ensures that volume provisioning and scheduling happen up front.
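You can list both StorageClasses, including their volume binding mode, with a standard kubectl command:

kubectl get storageclass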
Persistent storage is useful for:
- Saving model checkpoints.
- Storing datasets.
- Writing logs shared across workloads.
To create and manage PVCs:
- Create a PVC manifest file. For example, to create a PVC using the lambda-shared storage class and a capacity of 128 GiB, see the sketch after this procedure.
- Apply the manifest with kubectl apply -f <PVC-MANIFEST>, replacing <PVC-MANIFEST> with the path to your YAML file.
- Verify that the PVC was created with kubectl get pvc. Example output:
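For the first step, a minimal PVC manifest might look like the following. The name and namespace are placeholders, and the ReadWriteMany access mode is an assumption suited to shared, multi-node storage; adjust it to your workload:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <PVC-NAME>
  namespace: <NAMESPACE>
spec:
  accessModes:
  - ReadWriteMany              # assumed for lambda-shared; node-local volumes typically use ReadWriteOnce
  storageClassName: lambda-shared
  resources:
    requests:
      storage: 128Gi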
Tip
If your container only needs to read data from a PVC, such as loading a static dataset or pretrained model weights, you can mount the volume as read-only to prevent accidental writes:
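For example, a mount along these lines (the volume name and mount path are illustrative):

volumeMounts:
- name: model-cache            # illustrative volume name, backed by your PVC
  mountPath: /models           # illustrative mount path inside the container
  readOnly: true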
Example 1: Deploy a vLLM server to serve Hermes 4#
In this example, you'll deploy a vLLM server in MK8s to serve Nous Research's Hermes 4 LLM.
Before you begin, make sure you've set up kubectl access to the MK8s cluster.
Create a Namespace to group resources#
First, create a namespace for this example and the following example:
- Run kubectl create namespace mk8s-docs-examples.
  Expected output:
Create a PVC to cache downloaded models#
Next, create a PVC to cache model files downloaded from Hugging Face:
- Apply the huggingface-cache.yaml PVC manifest:
  Expected output:
- Confirm the PVC was created:
  Expected output:
Deploy a vLLM server in the cluster#
- Apply the vllm-deployment-lks.yaml manifest:
  Expected output:
Create a Service to expose the vLLM server#
- Apply the vllm-service.yaml manifest:
  Expected output:
Create the Ingress to expose the vLLM service publicly#
To expose the vLLM server over the internet:
- In the manifest file, replace <CLUSTER-ZONE> with your actual cluster zone.
- Apply the manifest:
- Confirm the Ingress was created:
  Expected output:
MK8s will automatically create a DNS record and obtain a TLS certificate, enabling secure access to the vLLM service.
Note
It can take up to an hour for the DNS record to propagate to your DNS servers. To check the propagation status, run:
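One way to check is with dig and the vLLM hostname used in this example (the original command may differ):

dig +short vllm.<CLUSTER-ZONE>.k8s.lambda.ai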
If the record has propagated, you'll see three IP addresses. These are the IPs of your 1CC head nodes.
Submit a prompt to the vLLM server#
- To verify that the vLLM server is working, submit a prompt using curl, as sketched after this step. Replace <CLUSTER-ZONE> with the value you obtained from the kubectl get configmap cluster-configuration command.
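The exact request isn't reproduced here, but because vLLM exposes an OpenAI-compatible API, a request along these lines should work, assuming the Service is published at the vllm hostname shown in the Ingress output earlier. <MODEL-ID> is a placeholder for whichever model ID the server reports:

# List the model IDs the server is serving (vLLM's OpenAI-compatible API).
curl https://vllm.<CLUSTER-ZONE>.k8s.lambda.ai/v1/models

# Submit a prompt, substituting a model ID returned by the previous command.
curl https://vllm.<CLUSTER-ZONE>.k8s.lambda.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<MODEL-ID>", "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]}'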
Clean up the example resources#
- To delete the Ingress, Service, and Deployment created in this example:
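One way to do this is by resource name, using the Ingress and Service names shown earlier. The Deployment name is a placeholder; you can list it with kubectl get deployments -n mk8s-docs-examples:

kubectl delete -n mk8s-docs-examples ingress/vllm-ingress service/vllm-service deployment/<DEPLOYMENT-NAME>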
Note
If you don't plan to continue to the next example or no longer need the cached model files, you can also delete the PVC using:
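Assuming the huggingface-cache PVC in the mk8s-docs-examples namespace created above:

kubectl delete pvc huggingface-cache -n mk8s-docs-examples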
Example 2: Evaluate multiplication-solving abilities of the DeepSeek R1 Distill Qwen 7B model#
This example assumes you've already completed the following steps in the previous example:
- Set up kubectl access to your MK8s cluster.
- Created the mk8s-docs-examples namespace.
- Created the huggingface-cache PVC.
Run a Job to evaluate the multiplication-solving accuracy of the model#
- Apply the multiplication-eval-deepseek-r1-distilll-qwen-7b.yaml manifest:
  kubectl apply -f https://docs.lambda.ai/assets/code/multiplication-eval-deepseek-r1-distilll-qwen-7b.yaml
  Expected output:
View the Job logs#
To follow the Job logs as the job runs:
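A typical way to follow the logs, assuming the Job runs in the mk8s-docs-examples namespace; replace <JOB-NAME> with the Job's name from the manifest (you can find it with kubectl get jobs -n mk8s-docs-examples):

kubectl logs -n mk8s-docs-examples -f job/<JOB-NAME>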
You should see output from vLLM as the job runs.
When the job finishes, you should see a line displaying the model's accuracy:
Processed prompts: 100%|██████████| 1000/1000 [00:10<00:00, 97.89it/s, est. speed input: 1652.09 toks/s, output: 28255.96 toks/s]
Model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B accuracy: 0.8900
Monitor 1CC utilization during evaluation#
To monitor 1CC utilization as the evaluation Job runs:
- Navigate to https://grafana.<CLUSTER-ZONE>.k8s.lambda.ai. Replace <CLUSTER-ZONE> with the actual cluster zone of your MK8s cluster.
- At the login prompt, click Sign in with lambda.
- In the left nav, click Dashboards.
- Click NVIDIA DCGM Exporter Dashboard.
Clean up the example resources#
The Job is configured to delete itself five minutes after it completes. If you want to delete it immediately, run:
kubectl delete -f https://docs.lambda.ai/assets/code/multiplication-eval-deepseek-r1-distilll-qwen-7b.yaml
If you're finished with the examples and no longer need the cached model data, you can also delete the huggingface-cache PVC:
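For example, using the PVC name and namespace from Example 1:

kubectl delete pvc huggingface-cache -n mk8s-docs-examples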