
Data format

The design of the dataset format matches the OpenAI chat-based fine-tuning format.

info

Anyscale validates your dataset when you submit your fine-tuning job, and any errors are returned immediately in the API response.

File format

The input files must be in .jsonl (JSON Lines) format. Each line is a JSON object containing a list of messages.
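For illustration, the following sketch writes a small dataset in this format. The conversations and the output filename (train.jsonl) are placeholders, not part of the Anyscale API.

import json

# Two toy conversations; replace them with your own data.
conversations = [
    {"messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello, how can I help you?"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How is the weather today?"},
        {"role": "assistant", "content": "get_weather_status()"},
    ]},
]

# Write one JSON object per line.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for conversation in conversations:
        f.write(json.dumps(conversation) + "\n")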

Message format

Each training example is a dictionary with the key "messages", whose value is a list of message dictionaries.

Each dictionary in the list has the following keys:

  • role: Can take one of three possible values:

    1. system
    2. user
    3. assistant
  • content: Text corresponding to that role.

The system message is optional, but each conversation must have at least one pair of user and assistant messages.

For example, here is a message:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant" },
    { "role": "user", "content": "Hi" },
    { "role": "assistant", "content": "Hello, How can I help you?" },
    { "role": "user", "content": "How is the weather today?" },
    { "role": "assistant", "content": "get_weather_status()" }
  ]
}

Conversation patterns

Conversations can follow one of the following patterns:

  1. s/u/a/u/a/... pattern (system, user, assistant, user, assistant, ...)
  2. u/a/u/a/... pattern (user, assistant, user, assistant, ...)

Examples

s/u/a/u/a pattern

[
  { "role": "system", "content": "You are a helpful assistant" },
  { "role": "user", "content": "Hi" },
  { "role": "assistant", "content": "Hello, How can I help you?" },
  { "role": "user", "content": "How is the weather today?" },
  { "role": "assistant", "content": "get_weather_status()" }
]

s/u/a/u/a pattern with an empty system message

[
  { "role": "system", "content": "" },
  { "role": "user", "content": "Hi" },
  { "role": "assistant", "content": "Hello, How can I help you?" },
  { "role": "user", "content": "How is the weather today?" },
  { "role": "assistant", "content": "get_weather_status()" }
]

u/a/u/a pattern

[
  { "role": "user", "content": "Hi" },
  { "role": "assistant", "content": "Hello, How can I help you?" },
  { "role": "user", "content": "How is the weather today?" },
  { "role": "assistant", "content": "get_weather_status()" }
]
info

📌 Remember to query the model in the same way it was trained. For example, if your training dataset only includes single-turn conversations, the model may not generalize to multi-turn chat conversations.
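As a quick sanity check before submitting a job, you can verify that every example follows one of these patterns. The sketch below is written for this guide, not part of Anyscale's validation code, and it assumes a train.jsonl file in the format described above.

import json

def check_pattern(messages: list) -> None:
    # An optional system message must come first.
    if messages and messages[0]["role"] == "system":
        messages = messages[1:]
    assert messages, "Conversation needs at least one user/assistant pair"
    # Remaining messages must alternate user/assistant and end with assistant.
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        assert msg["role"] == expected, f"Expected '{expected}' at turn {i}, got '{msg['role']}'"
    assert messages[-1]["role"] == "assistant", "Conversation must end with an assistant message"

with open("train.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        check_pattern(example["messages"])

print("All conversations follow the s/u/a or u/a pattern.")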

Validation data

You can provide optional validation data to help guard against overfitting. If you supply validation data, you get access to metrics such as validation loss and perplexity after each epoch. The checkpoint with the lowest validation perplexity is selected as the best model and is served after fine-tuning completes.

If you don't provide validation data, the system serves the final model checkpoint created at the end of the training run.
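If your data lives in a single .jsonl file, one simple way to produce a validation set is a random split, as in the sketch below. The 95/5 split ratio and the file names are arbitrary choices for illustration.

import json
import random

random.seed(0)

# Load all examples from the full dataset.
with open("full_dataset.jsonl", "r", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)

# Keep the validation set small relative to the training set (roughly 5%).
n_valid = max(1, len(examples) // 20)
valid, train = examples[:n_valid], examples[n_valid:]

for path, split in [("train.jsonl", train), ("valid.jsonl", valid)]:
    with open(path, "w", encoding="utf-8") as f:
        for example in split:
            f.write(json.dumps(example) + "\n")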

info

📌 The validation dataset should be smaller than the training (fine-tuning) dataset so that validation stays efficient.

📌 As of the current version, there is no early stopping mechanism implemented in the fine-tuning process.

Number of epochs

The n_epochs parameter, which is optional, determines the number of iterations the learning algorithm performs across the entire training dataset.

When not specified, the system automatically sets this parameter according to the dataset size. Generally:

  • A larger number of epochs suits smaller datasets to achieve convergence.
  • Fewer epochs are typically adequate for training larger datasets.
info

📌 If you observe that the validation loss is still improving with continued training, you can manually specify n_epochs. To do this, set it to 1 or 2 epochs higher than the number automatically selected by the system.
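As a purely illustrative sketch of this behavior, the helper below picks more epochs for small datasets and fewer for large ones. The thresholds are hypothetical and chosen only for this example; they are not the values that Anyscale's automatic selection uses.

def suggest_n_epochs(num_examples: int) -> int:
    # Hypothetical thresholds for illustration only.
    if num_examples < 1_000:
        return 10
    if num_examples < 10_000:
        return 4
    return 1

print(suggest_n_epochs(500))     # small dataset: more epochs
print(suggest_n_epochs(50_000))  # large dataset: fewer epochs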

Context length

You can optionally specify the context length. If you don't, the system automatically selects the smallest supported length that exceeds 95% of your dataset's sequence lengths, which improves training efficiency. Sequences longer than this context length are truncated.

To inspect your dataset's sequence-length distribution and find the minimal supported context length, run the following Python script. It mirrors the logic the system uses to pick the most suitable context length when you don't specify one:

import json
import numpy as np
from collections import defaultdict


DATA_PATH = "<put_your_own_dataset.jsonl>"
SUPPORTED_CONTEXT_LENGTHS = [512, 1024, 2048, 4096, 8192, 16384, 32768]

# Import the tokenizer
from transformers import LlamaTokenizerFast
tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.pad_token = tokenizer.eos_token

# Load the dataset
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    items = [json.loads(line) for line in f]


# Utility function for proper formatting of the data
def convert_message_list_to_text(messages: list) -> str:
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    text = ""

    # Fold the system message into the first user turn
    if messages[0]["role"] == "system":
        messages = [
            {
                "role": messages[1]["role"],
                "content": B_SYS
                + messages[0]["content"]
                + E_SYS
                + messages[1]["content"],
            }
        ] + messages[2:]

    assert all([msg["role"] == "user" for msg in messages[::2]]) and all(
        [msg["role"] == "assistant" for msg in messages[1::2]]
    ), (
        "model only supports 'system','user' and 'assistant' roles, "
        "starting with user and alternating (u/a/u/a/u...)"
    )

    texts = []
    for prompt, answer in zip(messages[::2], messages[1::2]):
        texts.append(f"{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} ")

    text = "</s><s>".join(texts)
    # add the bos and eos token at the beginning of the first turn and the end of the last turn
    text = "<s>" + text + " </s>"
    # During training last message should be from assistant (not from a user)
    assert (
        messages[-1]["role"] == "assistant"
    ), f"Last message must be from assistant, got {messages[-1]['role']}"

    return text


# Utility functions for calculating the statistics of the number of tokens in the dataset
def print_token_statistics(stats) -> None:
    for key in stats:
        print(f"Statistics for {key}:")
        if isinstance(stats[key], dict):
            for stat_key, stat_value in stats[key].items():
                print(f"\t{stat_key}: {stat_value:.3f}")
        else:
            print(f"\t{stats[key]}")
        print("")


def get_tokenized_stats(items: list, print_stats: bool = True):

    counters = defaultdict(list)
    for batch in items:
        messages = batch["messages"]

        # add message count
        counters["message"].append(len(messages))

        # add the number of tokens of this message to the token counter
        text = convert_message_list_to_text(messages)
        tokens = tokenizer(text)['input_ids']
        counters["token"].append(len(tokens))

    stats = {}
    for key, value in counters.items():
        stats[key] = {
            "max": float(np.max(value)),
            "min": float(np.min(value)),
            "median": float(np.median(value)),
            "mean": float(np.mean(value)),
            "p95": float(np.percentile(value, 95)),
            "p5": float(np.percentile(value, 5)),
        }
    stats["ds_size"] = len(items)

    if print_stats:
        print_token_statistics(stats)

    return stats

# Auto calculate the context length
stats = get_tokenized_stats(items, print_stats=True)
for ctx_length in SUPPORTED_CONTEXT_LENGTHS:
    if ctx_length > stats["token"]["p95"]:
        break

print("Automatically selected context length: ", ctx_length)

Guidelines for dataset size

Minimum and maximum dataset requirements:

  • Minimum: You must include at least 8 examples. However, for effective fine-tuning, use a minimum of 100 examples.
  • Maximum: The maximum dataset size depends on your model type and context length, and the limit applies to the product dataset_size x n_epochs. Exceeding this maximum results in an error.

Determining your model's maximum dataset size: See the table below to find the appropriate maximum number of examples supported for your chosen model and context length.

Model                                  | 512 tokens | 1024 tokens | 2048 tokens | 4096 tokens | 8192 tokens | 16384 tokens | 32768 tokens
meta-llama/Meta-Llama-3-8B-Instruct    | 90k        | 32k         | 15k         | 5k          | 5k          | 5k           | 2.5k
meta-llama/Meta-Llama-3-70B-Instruct   | 25k        | 10k         | 5k          | 5k          | 3k          | 1.5k         | n/a
mistralai/Mistral-7B-Instruct-v0.1     | 150k       | 50k         | 25k         | 10k         | 5k          | n/a          | n/a
mistralai/Mixtral-8x7B-Instruct-v0.1   | 25k        | 10k         | 5k          | 5k          | 2.5k        | 1k           | 1k

To validate the size of your dataset, run the following Python script after running the preceding code snippet:

MODEL_SIZE = "llama-8b"  # one of: "mistral-7b", "llama-8b", "llama-70b", "mixtral-8x7b"
DS_MAX_SIZE_LIMITS = {
    "mistral-7b": {
        512: 150_000,
        1024: 50_000,
        2048: 25_000,
        4096: 10_000,
        8192: 5_000,
    },
    "llama-8b": {
        512: 90_000,
        1024: 32_000,
        2048: 15_000,
        4096: 5_000,
        8192: 5_000,
        16384: 5_000,
        32768: 2_500,
    },
    "llama-70b": {
        512: 25_000,
        1024: 10_000,
        2048: 5_000,
        4096: 5_000,
        8192: 3_000,
        16384: 1_500,
    },
    "mixtral-8x7b": {
        512: 25_000,
        1024: 10_000,
        2048: 5_000,
        4096: 5_000,
        8192: 2_500,
        16384: 1_000,
        32768: 1_000,
    },
}
CONTEXT_LENGTH = ctx_length

ds_max_size = DS_MAX_SIZE_LIMITS[MODEL_SIZE][CONTEXT_LENGTH]
if len(items) > ds_max_size:
    raise ValueError(
        f"Dataset size ({len(items)}) exceeds the maximum allowable size ({ds_max_size})"
    )

If you would like to fine-tune on a larger dataset, reach out to endpoints-help@anyscale.com.

Token counting

The actual number of trained tokens may exceed the theoretical count, calculated as the number of tokens in the dataset multiplied by the number of epochs. Variability in sequence length affects this number.

To estimate the number of trained tokens, a rule of thumb is:

trained_tokens = dataset_size x n_epochs x alpha

Here, dataset_size is the number of tokens in the dataset, and alpha is a factor that depends on the batch size and on how much sequence lengths vary. Less variation in sequence length reduces alpha, as do smaller batch sizes, although smaller batches also decrease throughput. For typical datasets, alpha falls between 1.0 and 3.0. For example, a dataset of 1 million tokens trained for 3 epochs with alpha = 1.5 results in roughly 4.5 million trained tokens.

📌 Anyscale picks the batch size per device to maximize throughput at the given context length, so that you don't have to think about the optimal setting. Anyscale doesn't expose the particular batch size to you.

To get a close approximation of the number of trained tokens for your dataset before running the fine-tuning job, run the following code snippet after the preceding one. By varying the batch size, you can see how it affects the number of trained tokens.

# We will use ray data for batched data iteration
import ray # pip install ray[data]
import pandas as pd

# You can change the batch size per device here
BSIZE_PER_DEVICE = 16

# Creating a ray dataset for easier processing
df = pd.DataFrame.from_dict(items)
ds = ray.data.from_pandas(df)


def batched_convert_messages_to_text(batch: pd.DataFrame) -> pd.DataFrame:
    """Converts a batch of messages (list of roles + content) to plain text."""
    df = []
    for _, b in batch.iterrows():
        text = convert_message_list_to_text(list(b["messages"]))
        df.append({"input": text})

    return pd.DataFrame(df)

def collate_fn(batch: dict):
    return tokenizer(
        list(batch["input"]),
        padding="longest",
        max_length=CONTEXT_LENGTH,
        truncation=True,
        return_tensors="pt",
    )


# Data preprocessing pipeline
flattened_ds = ds.map_batches(
    batched_convert_messages_to_text, batch_size=16, batch_format="pandas"
)

data_set_tokens_per_epoch = 0
trained_tokens_per_epoch = 0
for batch in flattened_ds.iter_torch_batches(
    batch_size=BSIZE_PER_DEVICE, collate_fn=collate_fn
):
    trained_tokens_per_epoch += batch["input_ids"].numel()
    data_set_tokens_per_epoch += batch["attention_mask"].sum().item()

print("Num tokens in dataset per epoch: ", data_set_tokens_per_epoch)
print("Num tokens trained per epoch: ", trained_tokens_per_epoch)
print("Padding inflation ratio: ", trained_tokens_per_epoch / data_set_tokens_per_epoch)