Fine-tuning

Fine-tune models for better results and efficiency.

Fine-tuning lets you get more out of the models available by providing:

  • Higher quality results than prompting
  • Ability to train on more examples than can fit in a prompt
  • Token savings due to shorter prompts
  • Lower latency requests

Text generation models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called "few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Evaluate results and go back to step 1 if needed
  4. Use your fine-tuned model

Visit our pricing page to learn more about how fine-tuned model training and usage are billed.

Which models can be fine-tuned?

Fine-tuning is currently available for the following models:

  • gpt-4o-2024-08-06
  • gpt-4o-mini-2024-07-18
  • gpt-4-0613
  • gpt-3.5-turbo-0125
  • gpt-3.5-turbo-1106
  • gpt-3.5-turbo-0613

You can also fine-tune a fine-tuned model, which is useful if you acquire additional data and don't want to repeat the previous training steps.

We expect gpt-4o-mini to be the right model for most users in terms of performance, cost, and ease of use.

When to use fine-tuning

Fine-tuning text generation models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, with the key reasons being:

  • There are many tasks at which our models may not initially appear to perform well, but results can be improved with the right prompts - thus fine-tuning may not be necessary
  • Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs
  • In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted - we typically see best results when using a good prompt in the fine-tuning data (or combining prompt chaining / tool use with fine-tuning)

Our prompt engineering guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.

Common use cases

Some common use cases where fine-tuning can improve results:

  • Setting the style, tone, format, or other qualitative aspects
  • Improving reliability at producing a desired output
  • Correcting failures to follow complex prompts
  • Handling many edge cases in specific ways
  • Performing a new skill or task that's hard to articulate in a prompt

One high-level way to think about these cases is when it's easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.

Another scenario where fine-tuning is effective is reducing cost and/or latency by replacing a more expensive model like gpt-4o with a fine-tuned gpt-4o-mini model. If you can achieve good results with gpt-4o, you can often reach similar quality with a fine-tuned gpt-4o-mini model by fine-tuning on the gpt-4o completions, possibly with a shortened instruction prompt. After your job is completed, the model should be available right away for inference use.

Preparing your dataset

Once you have determined that fine-tuning is the right solution (i.e. you've optimized your prompt as far as it can take you and identified problems that the model still has), you'll need to prepare data for training the model. You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.

Each example in the dataset should be a conversation in the same format as our Chat Completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.

Example format

In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. Here are three training examples (conversations) we could create for such a dataset:
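
(Each example is one line of JSONL; the "Marv" persona and the answers are illustrative.)

```jsonl
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```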

Multi-turn chat examples

Examples in the chat format can have multiple messages with the assistant role. The default behavior during fine-tuning is to train on all assistant messages within a single example. To skip fine-tuning on specific assistant messages, a weight key can be added to disable fine-tuning on those messages, allowing you to control which assistant messages are learned. The allowed values for weight are currently 0 and 1. An example using weight for the chat format is below.
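
(A sketch of one such example; "weight": 0 skips the first assistant reply, while "weight": 1 trains on the corrected one. The conversation content is illustrative.)

```jsonl
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
```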

Crafting prompts

We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.

If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.

It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.

Example count recommendations

To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-4o-mini and gpt-3.5-turbo, but the right number varies greatly based on the exact use case.

We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.

Train and test splits

After collecting the initial dataset, we recommend splitting it into a training and test portion. When submitting a fine-tuning job with both training and test files, we will provide statistics on both during the course of training. These statistics will be your initial signal of how much the model is improving. Additionally, constructing a test set early on will be useful in making sure you are able to evaluate the model after training, by generating samples on the test set.
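
As a sketch, a simple random split along these lines works for most datasets (the file names and 80/20 ratio are illustrative):

```python
import json
import random

random.seed(42)  # fix the seed so the split is reproducible

# Load the full dataset (one JSON object per line)
with open("dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
cutoff = int(0.8 * len(examples))  # e.g. an 80/20 train/test split

for path, subset in [("train.jsonl", examples[:cutoff]), ("test.jsonl", examples[cutoff:])]:
    with open(path, "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")
```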

Token limits

Token limits depend on the model you select. Here is an overview of the maximum inference context length and training example context length for the gpt-4o, gpt-4o-mini, and gpt-3.5-turbo models:

MODEL                     INFERENCE CONTEXT LENGTH    TRAINING EXAMPLES CONTEXT LENGTH
gpt-4o-2024-08-06         128,000 tokens              65,536 tokens (128k coming soon)
gpt-4o-mini-2024-07-18    128,000 tokens              65,536 tokens (128k coming soon)
gpt-3.5-turbo-0125        16,385 tokens               16,385 tokens
gpt-3.5-turbo-1106        16,385 tokens               16,385 tokens
gpt-3.5-turbo-0613        16,385 tokens               4,096 tokens

Examples longer than the default will be truncated to the maximum context length, which removes tokens from the end of the training example(s). To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under the limit.
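
One rough way to do that check is with tiktoken. This sketch counts only string message contents, not per-message formatting overhead, and the file name and limit shown are illustrative:

```python
import json

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by the gpt-4o family
LIMIT = 65536  # training example context length for gpt-4o-mini-2024-07-18

with open("train.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        # Approximate count: sum tokens over string message contents only
        total = sum(
            len(encoding.encode(message["content"]))
            for message in example["messages"]
            if isinstance(message.get("content"), str)
        )
        if total > LIMIT:
            print(f"example {i}: ~{total} tokens, may be truncated")
```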

Estimate costs

For detailed pricing on training costs, as well as input and output costs for a deployed fine-tuned model, visit our pricing page. Note that we don't charge for tokens used for training validation. To estimate the cost of a specific fine-tuning training job, use the following formula:

(base training cost per 1M input tokens ÷ 1M) × number of tokens in the input file × number of epochs trained

For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be:

  • ~$0.90 USD with gpt-4o-mini-2024-07-18 after the free period ends on October 31, 2024.
  • ~$2.40 USD with gpt-3.5-turbo-0125.
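
The same arithmetic as a sketch in code; the per-1M-token training prices below are illustrative and reproduce the two figures above, so always check the pricing page for current rates:

```python
# Illustrative training prices in USD per 1M tokens; check the pricing page.
TRAINING_PRICE_PER_1M = {
    "gpt-4o-mini-2024-07-18": 3.00,
    "gpt-3.5-turbo-0125": 8.00,
}

def estimate_training_cost(model: str, input_file_tokens: int, n_epochs: int) -> float:
    """(base training cost per 1M input tokens / 1M) x tokens x epochs"""
    return TRAINING_PRICE_PER_1M[model] / 1_000_000 * input_file_tokens * n_epochs

print(estimate_training_cost("gpt-4o-mini-2024-07-18", 100_000, 3))  # ~0.90
print(estimate_training_cost("gpt-3.5-turbo-0125", 100_000, 3))      # ~2.40
```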

Check data formatting

Once you have compiled a dataset and before you create a fine-tuning job, it is important to check the data formatting. To do this, we created a simple Python script which you can use to find potential errors, review token counts, and estimate the cost of a fine-tuning job.
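
As an illustration, here is a minimal sketch of the kinds of checks such a script performs (the file name is an assumption; the full script also reviews token counts and estimates cost):

```python
import json

format_errors = 0
with open("train.jsonl") as f:
    for i, line in enumerate(f):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            print(f"line {i}: not valid JSON")
            format_errors += 1
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            print(f"line {i}: missing messages list")
            format_errors += 1
        elif not any(m.get("role") == "assistant" for m in messages):
            print(f"line {i}: no assistant message to train on")
            format_errors += 1

print(f"{format_errors} formatting problem(s) found")
```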

The maximum file upload size is 1 GB, though we do not suggest fine-tuning with that amount of data since you are unlikely to need that large of an amount to see improvements.

Create a fine-tuned model

After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job. We support creating fine-tuning jobs via the fine-tuning UI or programmatically via the API.
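
For example, a minimal sketch using the official Python SDK (the training file name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file; purpose must be "fine-tune"
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job against a supported base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```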

After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size.

Vision fine-tuning

Fine-tuning is also possible with images in your JSONL files. Just as you can send one or many image inputs to chat completions, you can include those same message types within your training data. Images can be provided either as HTTP URLs or data URLs containing base64 encoded images.

Here's an example of an image message on a line of your JSONL file. Below, the JSON object is expanded for readability, but typically this JSON would appear on a single line in your data file:
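
(The URL, question, and answer below are illustrative.)

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What breed of dog is shown in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/images/dog.jpg"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "This appears to be a golden retriever."
    }
  ]
}
```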

Image dataset requirements

Size

  • Your training file can contain a maximum of 50,000 examples that contain images (not including text examples).
  • Each example can have at most 10 images.
  • Each image can be at most 10 MB.

Format

  • Images must be JPEG, PNG, or WEBP format.
  • Your images must be in the RGB or RGBA image mode.
  • You cannot include images as output from messages with the assistant role.

Help

What to do if your images get skipped

Your images can get skipped for the following reasons:

  • contains CAPTCHAs, contains people, contains faces, contains children
    • Remove the image. For now, we cannot fine-tune models with images containing these entities.
  • inaccessible URL
    • Ensure that the image URL is publicly accessible.
  • image too large
    • Ensure that your images fall within our dataset size limits.
  • invalid image format
    • Ensure that your images use one of our accepted formats.

Reducing training cost

If you set the detail parameter for an image to low, the image is resized to 512 by 512 pixels and is only represented by 85 tokens regardless of its size. This will reduce the cost of training.
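
For instance, a single image entry in your training data could be marked low detail like this (the URL is illustrative):

```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/images/dog.jpg",
    "detail": "low"
  }
}
```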

Other considerations for vision fine-tuning

  • To control the fidelity of image understanding, set the detail parameter of image_url to low, high, or auto for each image. This will also affect the number of tokens per image that the model sees during training time, and will affect the cost of training.

Preference fine-tuning

Direct Preference Optimization (DPO) fine-tuning allows you to fine-tune models based on prompts and pairs of responses. This approach enables the model to learn from human preferences, optimizing for outputs that are more likely to be favored. Note that we currently support text-only DPO fine-tuning.

Preparing your dataset for DPO

Each example in your dataset should contain:

  • A prompt, like a user message.
  • A preferred output (an ideal assistant response).
  • A non-preferred output (a suboptimal assistant response).

The data should be formatted as JSONL, with each line representing an example in the following structure:
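
(A sketch of that structure; the prompt and responses are illustrative.)

```json
{
  "input": {
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "The capital of France is Paris."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "I believe the capital of France is Lyon."}
  ]
}
```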

Currently, we only train on one-turn conversations for each example, where the preferred and non-preferred outputs each need to be the last assistant message.

Fine-tuning examples

Now that we have explored the basics of fine-tuning, let’s walk through the fine-tuning lifecycle for a few different use cases.