
Query a model

Set up your environment

Create an API key on the Credentials page under your account.

Set the following environment variables.

export OPENAI_BASE_URL="https://api.endpoints.anyscale.com/v1"
export OPENAI_API_KEY="esecret_YOUR_API_KEY"

You can find more details about authentication here.
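Before creating a client, you can confirm the variables are visible to Python. This is a minimal sanity-check sketch; the fallback values below are placeholders, not real credentials.

```python
import os

# Check that the exported variables are visible to this process.
# The defaults are illustrative placeholders only.
base_url = os.environ.get("OPENAI_BASE_URL", "https://api.endpoints.anyscale.com/v1")
api_key = os.environ.get("OPENAI_API_KEY", "esecret_...")

print("base URL:", base_url)
print("key looks like an Anyscale key:", api_key.startswith("esecret_"))
```

If the base URL or key prefix looks wrong here, the client calls below will fail with an authentication error.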

Select a model

tip

If you are building an app on top of the models, start with the 70B version and then work your way down to the smaller models.

Anyscale supports the following models.

Query a model

tip

If you are starting a project from scratch, use the OpenAI Python SDK rather than raw cURL or hand-rolled HTTP requests in Python.

Chat models

Here is an example of how to query meta-llama/Llama-2-70b-chat-hf.

import openai

client = openai.OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",
    api_key="esecret_YOUR_API_KEY",
)

# Note: not all arguments are currently supported; unsupported ones are
# ignored by the backend.
chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say 'Test'."},
    ],
    temperature=0.7,
)
print(chat_completion.model_dump())

Example output:

{
    'id': 'meta-llama/Llama-2-70b-chat-hf-2e0b4e62c2d704f3c850f80662e530d9',
    'choices': [
        {
            'finish_reason': 'stop',
            'index': 0,
            'message': {
                'content': ' Sure! Test. Is there anything else I can assist you with?',
                'role': 'assistant',
                'function_call': None,
                'tool_calls': None
            }
        }
    ],
    'created': 1699982193,
    'model': 'meta-llama/Llama-2-70b-chat-hf',
    'object': 'text_completion',
    'system_fingerprint': None,
    'usage': {
        'completion_tokens': 16,
        'prompt_tokens': 32,
        'total_tokens': 48
    }
}
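The usage block in each response can be used to track token consumption across calls. A minimal sketch, using the field names from the response above (tally_usage is a hypothetical helper, not part of the SDK):

```python
# Tally token consumption across several responses, using the usage
# fields shown above (prompt_tokens, completion_tokens, total_tokens).
def tally_usage(responses):
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for response in responses:
        for field in totals:
            totals[field] += response["usage"][field]
    return totals

# A response dict shaped like chat_completion.model_dump() above.
example = {"usage": {"completion_tokens": 16, "prompt_tokens": 32, "total_tokens": 48}}
print(tally_usage([example, example]))
# -> {'prompt_tokens': 64, 'completion_tokens': 32, 'total_tokens': 96}
```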

Completion models

Here is an example of how to query a completion model. You can also query any chat model this way, as long as the prompt is formatted correctly.

import openai

client = openai.OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",
    api_key="esecret_YOUR_API_KEY",
)

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="<s>[INST] Tell me a joke. [/INST]",
    max_tokens=100,
)
print(completion.model_dump())

Note that the BOS token <s> and the [INST] and [/INST] tags must wrap the question when using the completions API.

Example output:

{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': {'text_offset': None,
                           'token_logprobs': None,
                           'tokens': None,
                           'top_logprobs': None},
              'text': " Sure, here's a short and simple one for you:\n"
                      '\n'
                      "Why don't scientists trust atoms?\n"
                      '\n'
                      'Because they make up everything!\n'
                      '\n'
                      'I hope you found that amusing! Do you have any other '
                      'topic or question you would like me to assist you '
                      'with?'}],
 'created': 1701975112,
 'id': 'meta-llama/Llama-2-7b-chat-hf-8dPyTvkTnZl5f2aBhKU-giErWT7RV5vXDQP20GBWx_k',
 'model': 'meta-llama/Llama-2-7b-chat-hf',
 'object': 'text_completion',
 'system_fingerprint': None,
 'usage': {'completion_tokens': 62, 'prompt_tokens': 15, 'total_tokens': 77}
}
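Building the tagged prompt by hand is error-prone, so it can help to wrap it in a small function. This is a sketch of the single-turn Llama-2 format; llama2_prompt is a hypothetical helper, and the <<SYS>> block for an optional system message follows Meta's published Llama-2 chat template.

```python
def llama2_prompt(user_message, system_message=None):
    # Single-turn Llama-2 chat format: the BOS token <s> and the
    # [INST]...[/INST] tags wrap the instruction, as in the example above.
    # The optional <<SYS>> block holds a system message (Meta's template).
    if system_message is not None:
        user_message = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message}"
    return f"<s>[INST] {user_message} [/INST]"

print(llama2_prompt("Tell me a joke."))
# -> <s>[INST] Tell me a joke. [/INST]
```

The result can be passed directly as the prompt argument of client.completions.create.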

Rate limiting

Anyscale Endpoints rate limits work a little differently than on comparable platforms: the limits depend on the number of concurrent requests in flight, not on tokens or requests per second. There is no cap on how many requests you send in total, only on how many are outstanding at once.

The current default limit is 30 concurrent requests. Reach out to endpoints-help@anyscale.com if you have a use case that needs more.
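One way to stay under a concurrency limit on the client side is to gate requests with a semaphore. A minimal asyncio sketch, assuming the default limit of 30; fake_query is a hypothetical stand-in for a real async client call.

```python
import asyncio

MAX_CONCURRENCY = 30  # current default concurrent-request limit noted above

async def fake_query(prompt):
    # Hypothetical stand-in for a real API call; swap in an async
    # client call (e.g. via the SDK's async client) in real code.
    await asyncio.sleep(0)
    return f"echo: {prompt}"

async def bounded_query(sem, prompt):
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        return await fake_query(prompt)

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    # 100 requests are queued, but at most 30 run concurrently.
    return await asyncio.gather(*(bounded_query(sem, f"q{i}") for i in range(100)))

results = asyncio.run(main())
print(len(results))
# -> 100
```

With this pattern you can submit an arbitrarily large batch of prompts without tripping the concurrency limit.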