
Generate an embedding

Set up your environment

Create an API key on the Credentials page under your account.

Set the following environment variables.

export OPENAI_API_KEY="esecret_YOUR_API_KEY"

You can find more details about authentication here.
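
As a minimal sketch of using the exported variable from Python, the snippet below reads the key from the environment instead of hard-coding it in source. The helper name `load_api_key` is hypothetical and isn't part of any SDK.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    `load_api_key` is a hypothetical helper used only for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError(f"{env_var} is not set; export it before running.")
    return key
```

The returned string can then be passed as the `api_key` argument when constructing the client.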

Select a model


Start with the 70B version, and then work your way down to the smaller models.

Anyscale supports the following models:

Query a model


If you are starting a project from scratch, use the OpenAI Python SDK rather than cURL or plain Python HTTP requests.

Embedding models

Here is an example of how to query the embedding model thenlper/gte-large.

import openai

client = openai.OpenAI(
    base_url = "",
    api_key = "esecret_YOUR_API_KEY"
)

# Note: some arguments aren't currently supported; the backend ignores them.
embedding = client.embeddings.create(
    model="thenlper/gte-large",
    input="Your text string goes here",
)
print(embedding.model_dump())

The output looks like the following:

{
    'data': [
        {'embedding': [...],
         'index': 0,
         'object': 'embedding'}
    ],
    'model': 'thenlper/gte-large',
    'object': 'list',
    'usage': {
        'prompt_tokens': 7,
        'total_tokens': 7
    },
    'id': 'thenlper/gte-large-UEpQEaduAoaC6rq5n1yxkYNalVukLBhMzkG7IV_GPgU',
    'created': 1701325873
}
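
To use the result, you usually only need the vectors inside 'data'. The following is a minimal sketch that assumes the response has already been converted to a dictionary with the shape shown above. `extract_vectors` is a hypothetical helper, and the sample values are shortened stand-ins, not real model output.

```python
from typing import Any


def extract_vectors(response: dict[str, Any]) -> list[list[float]]:
    """Return the embedding vectors, ordered by their 'index' field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Shortened stand-in for a real response; real vectors are much longer.
sample = {
    "data": [{"embedding": [0.1, 0.2, 0.3], "index": 0, "object": "embedding"}],
    "model": "thenlper/gte-large",
    "object": "list",
    "usage": {"prompt_tokens": 7, "total_tokens": 7},
}

vectors = extract_vectors(sample)
print(len(vectors), len(vectors[0]))  # prints: 1 3
```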

Rate limiting

Anyscale Endpoints rate limits work a little differently from those of comparable platforms. The limits are based on the number of concurrent requests in flight, not on the number of tokens or requests per second. This means that the total number of requests you send isn't limited, only how many you have in flight at the same time.

The current default limit is 30 concurrent requests. Contact Anyscale if you have a use case that needs a higher limit.
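
Because the limit counts requests in flight, one way to stay under it on the client side is to cap the number of worker threads. This is a minimal sketch under that assumption, not an official client pattern. `run_bounded` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

MAX_IN_FLIGHT = 30  # default concurrent-request limit quoted above


def run_bounded(
    fn: Callable[[T], R],
    inputs: Iterable[T],
    max_in_flight: int = MAX_IN_FLIGHT,
) -> list[R]:
    """Call fn on every input with at most max_in_flight calls running at once.

    Results come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(fn, inputs))


def fake_embed(text: str) -> int:
    # Stand-in for a real request; returns the text length instead of a vector.
    return len(text)


print(run_bounded(fake_embed, ["a", "bb", "ccc"]))  # prints: [1, 2, 3]
```

The pool never runs more than `max_in_flight` calls at the same time, so you can submit any number of inputs without exceeding the concurrency limit from a single process.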