
Using the Lambda Inference API

The Lambda Inference API lets you use large language models (LLMs) without setting up a server, and no limits are placed on the rate of requests. Because it's compatible with the OpenAI API, it can serve as a drop-in replacement in applications that currently use the OpenAI API. See, for example, our guide on integrating the Lambda Inference API into VS Code.

To use the Lambda Inference API, first generate a Cloud API key from the dashboard. You can also use a Cloud API key that you've already generated.

In the examples below, you can replace llama-4-maverick-17b-128e-instruct-fp8 with any of the available models. You can obtain a list of the available models using the /models endpoint. Replace <API-KEY> with your actual Cloud API key.
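If you prefer not to hard-code the key in your scripts, you can read it from an environment variable instead. A minimal sketch, assuming you've exported the key under the (hypothetical) name LAMBDA_API_KEY:

import os

# Hypothetical variable name; set it first with: export LAMBDA_API_KEY=<API-KEY>
openai_api_key = os.environ["LAMBDA_API_KEY"]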

Creating chat completions

The /chat/completions endpoint takes a list of messages that make up a conversation, then outputs a response.

First, create and activate a Python virtual environment.
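A minimal setup on Linux or macOS looks like this (the environment name lambda-env is just an example):

python -m venv lambda-env
source lambda-env/bin/activate

Then, install the OpenAI Python API library by running: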

pip install openai

Run, for example:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambda.ai/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "llama-4-maverick-17b-128e-instruct-fp8"

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are an expert conversationalist who responds to the best of your ability."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role":
        "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print(chat_completion)

You should see output similar to:

ChatCompletion(id='chatcmpl-5dbe7101-bc9f-4f05-a6c2-16f9dc153203', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The 2020 World Series was played at Globe Life Field in Arlington, Texas. The Dodgers defeated the Tampa Bay Rays in the series, winning 4 games to 2. Globe Life Field was the home stadium of the Texas Rangers, but it was used as a neutral site due to COVID-19 pandemic restrictions.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False}, 'self_harm': {'filtered': False}, 'sexual': {'filtered': False}, 'violence': {'filtered': False}, 'jailbreak': {'filtered': False, 'detected': False}, 'profanity': {'filtered': False, 'detected': False}})], created=1743968155, model='llama-4-maverick-17b-128e-instruct-fp8', object='chat.completion', service_tier=None, system_fingerprint='', usage=CompletionUsage(completion_tokens=65, prompt_tokens=69, total_tokens=134, completion_tokens_details=None, prompt_tokens_details=None))
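You can also stream the response as it's generated. The sketch below uses the OpenAI client's standard streaming interface and assumes the endpoint honors stream=True, as OpenAI-compatible APIs typically do; client and model are the same as above:

# Minimal streaming sketch (assumes the endpoint supports stream=True).
stream = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": "Who won the World Series in 2020?"
    }],
    model=model,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta of the assistant's reply.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()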

Alternatively, to call the endpoint directly with curl, run:

curl -sS https://api.lambda.ai/v1/chat/completions \
  -H "Authorization: Bearer <API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-4-maverick-17b-128e-instruct-fp8",
        "messages": [
          {
            "role": "system",
            "content": "You are an expert conversationalist who responds to the best of your ability."
          },
          {
            "role": "user",
            "content": "Who won the world series in 2020?"
          },
          {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020."
          },
          {
            "role": "user",
            "content": "Where was it played?"
          }
        ]
      }' | jq .

You should see output similar to:

{
  "id": "chatcmpl-ff1c7776-e842-4d80-af86-2952d3b2d595",
  "object": "chat.completion",
  "created": 1743968336,
  "model": "llama-4-maverick-17b-128e-instruct-fp8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The 2020 World Series was played at Globe Life Field in Arlington, Texas. The Dodgers played against the Tampa Bay Rays, winning the series 4 games to 2. Globe Life Field was the home stadium of the Texas Rangers, but it was used as a neutral site due to COVID-19 pandemic restrictions."
      },
      "finish_reason": "stop",
      "content_filter_results": {
        "hate": {
          "filtered": false
        },
        "self_harm": {
          "filtered": false
        },
        "sexual": {
          "filtered": false
        },
        "violence": {
          "filtered": false
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "profanity": {
          "filtered": false,
          "detected": false
        }
      }
    }
  ],
  "usage": {
    "prompt_tokens": 69,
    "completion_tokens": 65,
    "total_tokens": 134,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  },
  "system_fingerprint": ""
}

Creating completions

The /completions endpoint takes a single text string (a prompt) as input and outputs a response. By contrast, the /chat/completions endpoint takes a list of messages as input.

To use the /completions endpoint:

First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run, for example:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambda.ai/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "llama-4-maverick-17b-128e-instruct-fp8"

response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    model=model,
)

print(response)

You should see output similar to:

Completion(id='cmpl-651020a3-26f2-47a9-9e73-6bc0f36e25f0', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=Logprobs(text_offset=None, token_logprobs=None, tokens=None, top_logprobs=None), text=' being used more and more in education - IELTS Writing Essay Sample\nComputers')], created=1743968497, model='llama-4-maverick-17b-128e-instruct-fp8', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=4, total_tokens=20, completion_tokens_details=None, prompt_tokens_details=None))
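The finish_reason='length' in this output means generation stopped at the default token limit. To allow a longer completion, pass max_tokens, a standard OpenAI-compatible parameter; a minimal sketch using the same client and model:

# Sketch: raise the generation limit with max_tokens (standard
# OpenAI-compatible parameter); client and model are the same as above.
response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    max_tokens=64,
    model=model,
)
print(response.choices[0].text)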

Alternatively, to call the endpoint directly with curl, run:

curl -sS https://api.lambda.ai/v1/completions \
  -H "Authorization: Bearer <API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-4-maverick-17b-128e-instruct-fp8",
        "prompt": "Computers are",
        "temperature": 0
      }' | jq .

You should see output similar to:

{
  "id": "cmpl-d8ea14eb-7398-43e6-82d1-8dec587e1e04",
  "object": "text_completion",
  "created": 1743968623,
  "model": "llama-4-maverick-17b-128e-instruct-fp8",
  "choices": [
    {
      "text": " being used more and more in education. Some people say that this is a positive",
      "index": 0,
      "finish_reason": "length",
      "logprobs": {
        "tokens": null,
        "token_logprobs": null,
        "top_logprobs": null,
        "text_offset": null
      }
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 16,
    "total_tokens": 20,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

Listing models

The /models endpoint lists the models available for use through the Lambda Inference API.

To use the /models endpoint:

First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambda.ai/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

print(client.models.list())

You should see output similar to:

SyncPage[Model](data=[Model(id='llama3.3-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama-4-maverick-17b-128e-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.2-3b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='llama-4-scout-17b-16e-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-8b', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-nemotron-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.2-11b-vision-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='lfm-40b', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-405b', created=1724347380, object='model', owned_by='lambda'), Model(id='qwen25-coder-32b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-8b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='deepseek-llama3.3-70b', created=1724347380, object='model', owned_by='lambda'), Model(id='deepseek-r1-671b', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-70b', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-405b-instruct-fp8', created=1724347380, object='model', owned_by='lambda')], object='list')
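The returned SyncPage is iterable, so you can print just the model IDs. A small sketch using the same client:

# Print only the model IDs; SyncPage iterates over its data entries.
for m in client.models.list():
    print(m.id)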

Alternatively, to call the endpoint directly with curl, run:

curl -sS https://api.lambda.ai/v1/models -H "Authorization: Bearer <API-KEY>" | jq .

You should see output similar to:

{
  "object": "list",
  "data": [
    {
      "id": "llama3.3-70b-instruct-fp8",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama-4-maverick-17b-128e-instruct-fp8",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama3.2-3b-instruct",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama-4-scout-17b-16e-instruct",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "hermes3-8b",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama3.1-nemotron-70b-instruct-fp8",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama3.1-70b-instruct-fp8",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama3.2-11b-vision-instruct",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "lfm-40b",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "hermes3-405b",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "qwen25-coder-32b-instruct",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama3.1-8b-instruct",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "deepseek-llama3.3-70b",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "deepseek-r1-671b",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "hermes3-70b",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    },
    {
      "id": "llama3.1-405b-instruct-fp8",
      "object": "model",
      "created": 1724347380,
      "owned_by": "lambda"
    }
  ]
}
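To print just the model IDs from the command line, you can filter the same response with jq:

curl -sS https://api.lambda.ai/v1/models -H "Authorization: Bearer <API-KEY>" | jq -r '.data[].id'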

Note

Currently, the following models are available:

  • deepseek-llama3.3-70b
  • deepseek-r1-671b
  • deepseek-v3-0324
  • hermes3-405b
  • hermes3-70b
  • hermes3-8b
  • lfm-40b
  • llama-4-maverick-17b-128e-instruct-fp8
  • llama-4-scout-17b-16e-instruct
  • llama3.1-405b-instruct-fp8
  • llama3.1-70b-instruct-fp8
  • llama3.1-8b-instruct
  • llama3.1-nemotron-70b-instruct-fp8
  • llama3.2-11b-vision-instruct
  • llama3.2-3b-instruct
  • llama3.3-70b-instruct-fp8
  • qwen25-coder-32b-instruct