Fine-tune models for better results and efficiency.
Fine-tuning lets you get more out of the models available by providing:

- Higher quality results than prompting
- The ability to train on more examples than can fit in a prompt
- Token savings due to shorter prompts
- Lower-latency requests
Text generation models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called "few-shot learning."
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide range of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.
At a high level, fine-tuning involves the following steps:

1. Prepare and upload training data
2. Train a new fine-tuned model
3. Evaluate results and go back to step 1 if needed
4. Use your fine-tuned model
Visit our pricing page to learn more about how fine-tuned model training and usage are billed.
Fine-tuning is currently available for the following models:

- gpt-4o-2024-08-06
- gpt-4o-mini-2024-07-18
- gpt-3.5-turbo-0125
- gpt-3.5-turbo-1106
- gpt-3.5-turbo-0613
You can also fine-tune a fine-tuned model, which is useful if you acquire additional data and don't want to repeat the previous training steps.
We expect gpt-4o-mini to be the right model for most users in terms of performance, cost, and ease of use.
Fine-tuning text generation models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, with the key reasons being:

- There are many tasks at which our models may not initially appear to perform well, but results can be improved with the right prompts, so fine-tuning may not be necessary
- Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs
- In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted: we typically see the best results when using a good prompt in the fine-tuning data (or combining prompt chaining and tool use with fine-tuning)
Our prompt engineering guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.
Some common use cases where fine-tuning can improve results:

- Setting the style, tone, format, or other qualitative aspects
- Improving reliability at producing a desired output
- Correcting failures to follow complex prompts
- Handling many edge cases in specific ways
- Performing a new skill or task that's hard to articulate in a prompt
One high-level way to think about these cases is when it's easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves performance over the baseline model.
Another scenario where fine-tuning is effective is reducing cost and/or latency by replacing a more expensive model like gpt-4o with a fine-tuned gpt-4o-mini model. If you can achieve good results with gpt-4o, you can often reach similar quality with a fine-tuned gpt-4o-mini model by fine-tuning on the gpt-4o completions, possibly with a shortened instruction prompt.
Once you have determined that fine-tuning is the right solution (i.e. you've optimized your prompt as far as it can take you and identified problems that the model still has), you'll need to prepare data for training the model. You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.
Each example in the dataset should be a conversation in the same format as our Chat Completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. Here are three training examples (conversations) we could create for such a dataset.
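For instance, the three lines of the training file might look like this (the persona and phrasing here are illustrative):

```jsonl
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```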
Examples in the chat format can have multiple messages with the assistant role. The default behavior during fine-tuning is to train on all assistant messages within a single example. To skip fine-tuning on specific assistant messages, a weight key can be added to disable training on that message, allowing you to control which assistant messages are learned. The allowed values for weight are currently 0 and 1. Some examples using weight for the chat format are below.
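For instance, in the following hypothetical conversation, the terse first assistant reply is excluded from training while the sarcastic revision is kept (the JSON is expanded onto multiple lines for readability):

```json
{
  "messages": [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris", "weight": 0},
    {"role": "user", "content": "Can you be more sarcastic?"},
    {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}
  ]
}
```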
We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.
If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.
It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-4o-mini and gpt-3.5-turbo, but the right number varies greatly based on the exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.
After collecting the initial dataset, we recommend splitting it into a training and test portion. When submitting a fine-tuning job with both training and test files, we will provide statistics on both during the course of training. These statistics will be your initial signal of how much the model is improving. Additionally, constructing a test set early on will be useful in making sure you are able to evaluate the model after training, by generating samples on the test set.
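One simple approach is a random split. The sketch below (the file names and the 80/20 ratio are arbitrary choices) writes the two portions to separate JSONL files:

```python
import json
import random

random.seed(42)  # reproducible shuffle

with open("dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
split = int(0.8 * len(examples))  # 80% training, 20% test

for path, subset in [("train.jsonl", examples[:split]), ("test.jsonl", examples[split:])]:
    with open(path, "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")
```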
Token limits depend on the model you select. Here is an overview of the maximum inference context length and training examples context length for gpt-4o-mini and gpt-3.5-turbo models:
| Model | Inference context length | Training examples context length |
|---|---|---|
| gpt-4o-2024-08-06 | 128,000 tokens | 65,536 tokens (128k coming soon) |
| gpt-4o-mini-2024-07-18 | 128,000 tokens | 65,536 tokens (128k coming soon) |
| gpt-3.5-turbo-0125 | 16,385 tokens | 16,385 tokens |
| gpt-3.5-turbo-1106 | 16,385 tokens | 16,385 tokens |
| gpt-3.5-turbo-0613 | 16,385 tokens | 4,096 tokens |
Examples longer than the default will be truncated to the maximum context length, which removes tokens from the end of the training example(s). To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under the limit.
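One way to check is with the tiktoken library. The sketch below (assuming a recent tiktoken version, in which gpt-4o-family models use the o200k_base encoding) counts only message content tokens, so it slightly undercounts the true total, which also includes per-message formatting overhead:

```python
import json
import tiktoken

# gpt-4o and gpt-4o-mini use the o200k_base encoding
encoding = tiktoken.get_encoding("o200k_base")

MAX_CONTEXT = 65_536  # training example limit for gpt-4o-mini-2024-07-18

with open("train.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        tokens = sum(
            len(encoding.encode(m["content"]))
            for m in example["messages"]
            if isinstance(m.get("content"), str)
        )
        if tokens > MAX_CONTEXT:
            print(f"Example {i} has ~{tokens} content tokens and may be truncated")
```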
For detailed pricing on training costs, as well as input and output costs for a deployed fine-tuned model, visit our pricing page. Note that we don't charge for tokens used for training validation. To estimate the cost of a specific fine-tuning training job, use the following formula:
(base training cost per 1M input tokens ÷ 1M) × number of tokens in the input file × number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be:
- ~$0.90 USD with gpt-4o-mini-2024-07-18 (after the free period ends on October 31, 2024)
- ~$2.40 USD with gpt-3.5-turbo-0125
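In code, the estimate is a direct transcription of the formula (the $3.00 rate below is a placeholder; substitute the current rate from the pricing page):

```python
def estimate_training_cost(rate_per_1m_tokens: float, file_tokens: int, epochs: int) -> float:
    """(base training cost per 1M input tokens / 1M) x tokens in input file x epochs."""
    return rate_per_1m_tokens / 1_000_000 * file_tokens * epochs

# 100,000-token file trained for 3 epochs at a hypothetical $3.00 per 1M training tokens
print(f"${estimate_training_cost(3.00, 100_000, 3):.2f}")  # $0.90
```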
Once you have compiled a dataset and before you create a fine-tuning job, it is important to check the data formatting. To do this, we created a simple Python script which you can use to find potential errors, review token counts, and estimate the cost of a fine-tuning job.
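We don't reproduce that script here, but a minimal sketch of the kinds of checks involved (illustrative, not exhaustive) might look like this:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

with open("train.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            print(f"Line {i}: not valid JSON")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            print(f"Line {i}: missing or empty 'messages' list")
            continue
        if not any(m.get("role") == "assistant" for m in messages):
            print(f"Line {i}: no assistant message to train on")
        for m in messages:
            if m.get("role") not in ALLOWED_ROLES:
                print(f"Line {i}: unexpected role {m.get('role')!r}")
```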
The maximum file upload size is 1 GB, though we do not suggest fine-tuning with that amount of data, since you are unlikely to need that much to see improvements.
After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job. We support creating fine-tuning jobs via the fine-tuning UI or programmatically via the API.
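For example, using the OpenAI Python SDK (the file name and model snapshot below are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Upload the training file
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id)
```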
After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size. After your job is completed, the model should be available right away for inference use.
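While you wait, you can check the job's status and recent events; a sketch with the Python SDK, using a placeholder job ID:

```python
from openai import OpenAI

client = OpenAI()
job_id = "ftjob-abc123"  # placeholder: the ID returned when the job was created

job = client.fine_tuning.jobs.retrieve(job_id)
print(job.status)  # e.g. "validating_files", "running", "succeeded"

# Recent events give more detail on progress
for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10):
    print(event.message)
```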
Fine-tuning is also possible with images in your JSONL files. Just as you can send one or many image inputs to chat completions, you can include those same message types within your training data. Images can be provided either as HTTP URLs or data URLs containing base64-encoded images.

Here's an example of an image message on a line of your JSONL file. Below, the JSON object is expanded for readability, but typically this JSON would appear on a single line in your data file.
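The caption, URL, and answer in this hypothetical example are placeholders:

```json
{
  "messages": [
    {"role": "system", "content": "You are an assistant that identifies uncommon cheeses."},
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is this cheese?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/cheese.jpg",
            "detail": "low"
          }
        }
      ]
    },
    {"role": "assistant", "content": "Danbo"}
  ]
}
```

The optional detail field shown on image_url is discussed below.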
Note that images cannot be included in messages with the assistant role. Your images can also get skipped during training if they fail validation, for example because of an unsupported file format or content that is not allowed.
If you set the detail parameter for an image to low, the image is resized to 512 by 512 pixels and is only represented by 85 tokens regardless of its size. This will reduce the cost of training.
You can set the detail parameter of image_url to low, high, or auto for each image. This will also affect the number of tokens per image that the model sees during training time, and will affect the cost of training.

Direct Preference Optimization (DPO) fine-tuning allows you to fine-tune models based on prompts and pairs of responses. This approach enables the model to learn from human preferences, optimizing for outputs that are more likely to be favored. Note that we currently support text-only DPO fine-tuning.
Each example in your dataset should contain:

- A prompt, in the same message format as Chat Completions
- A preferred output (an ideal assistant response)
- A non-preferred output (a suboptimal assistant response)
The data should be formatted in JSONL format, with each line representing a complete example.
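A minimal illustration of the structure, expanded onto multiple lines for readability (the question and the two responses are placeholders):

```json
{
  "input": {
    "messages": [
      {"role": "user", "content": "Hello, can you tell me how cold San Francisco is today?"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "Today in San Francisco, it is not quite cold, with temperatures around 65°F."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "It is not cold in San Francisco."}
  ]
}
```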
Currently, we only train on one-turn conversations for each example, where the preferred and non-preferred messages need to be the last assistant message.
Now that we have explored the basics of fine-tuning, let's walk through the fine-tuning lifecycle for a few different use cases.