The Conversational Video Interface (CVI) is an end-to-end pipeline for creating real-time multimodal video conversations with a digital twin that can see, hear, and respond much as a human would. With CVI, developers can deploy video AI agents (digital twins) in minutes.
CVI is the world's fastest interface of its kind, letting you give your AI agent or personality a human face and conversational ability. With CVI, you can achieve utterance-to-utterance latency (the full round-trip time between a participant saying something and the replica speaking back) with SLAs as fast as under one second.
CVI provides a complete pipeline for holding a conversation while also allowing you to customize and plug in your existing components where necessary.
CVI provides a full pipeline allowing you to easily create video conversations. You can immediately jump into a real-time conversation with the generated Daily meeting URL. CVI provides the following layers:
You can choose to customize or bring your own layers as well. For example, you can:
Learn more about the layers and different modes in CVI Modes and Layers.
A conversation is a single 'session' or 'call' with a digital twin using CVI. When you create a conversation, you receive a Daily meeting URL. This URL provides a full video conferencing solution, allowing you to avoid managing WebRTC or websockets. Navigating to this URL lets you directly join a prebuilt meeting room UI to chat with your digital twin.
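Because the meeting URL is a standard Daily room, one simple option is to drop it into an iframe to embed the prebuilt meeting room UI in your own page. Here is a minimal sketch; the URL is the illustrative sample used later in this guide:

```typescript
// Embed the prebuilt meeting room UI via an iframe (sketch).
const frame = document.createElement("iframe");
frame.src = "https://vendor.com/meeting/ca980e2e"; // your conversation's meeting URL
frame.allow = "camera; microphone; autoplay; display-capture";
frame.style.cssText = "width: 640px; height: 480px; border: 0;";
document.body.appendChild(frame);
```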
A digital twin is an AI-powered digital version of a human, which looks and sounds like a person and can see and respond similarly to a human.
CVI provides an end-to-end pipeline that takes in a user's audio and video input and outputs real-time replica audio and video. This pipeline is highly optimized, with tightly coupled layers that achieve the lowest latency on the market. CVI is still highly customizable, though: you can customize or disable individual layers, and several modes are offered to best fit your use case.
Vendor.com provides the following customizable layers as part of the CVI pipeline:
Vendor.com offers a number of modes that come with preconfigured layers as necessary for your use case.
Vendor.com provides the option to bypass ASR, LLM, and TTS with a speech-to-speech model.
You may use your own or integrate with our native implementation (OpenAI Realtime API).
You can bypass Vendor.com Vision, ASR, and LLM and directly stream text or audio for the replica to speak.
You can also use this mode server-to-server, where your server connects to the Daily/WebRTC room to provide audio and then forwards the video stream to your user.
By default, we recommend using the end-to-end pipeline in its entirety, as it provides the lowest latency and the most optimized multimodal experience. We offer a number of LLMs (Llama 3.1, OpenAI) that we've optimized within the end-to-end pipeline. With SLAs as fast as under 1s, you get access to the world's fastest utterance-to-utterance latency. You can load our LLMs with your knowledge base and prompt them to your liking, as well as update the context live to simulate an async RAG application.
Using a custom LLM is a great fit for those who already have an LLM, or who are building business logic that needs to intercept the input transcription and decide on the output. Using your own LLM will likely add latency, as the Vendor.com LLMs are hyper-optimized for low latency.
Note that the 'Custom LLM' mode doesn't require an actual LLM. Any endpoint that responds to chat completion requests in the required format can be used. For example, you could set up a server that takes in the completion requests and responds with predetermined responses, with no LLM involved at all.
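For illustration, here is a minimal sketch of such an endpoint in TypeScript (Express). It assumes an OpenAI-style chat completions format; the route path, port, and response shape are illustrative assumptions, so check the Custom LLM reference for the exact required format:

```typescript
// Minimal "Custom LLM" sketch: no LLM involved, just canned replies.
// Assumes OpenAI-style chat completion requests (an assumption, not confirmed).
import express from "express";

const app = express();
app.use(express.json());

app.post("/v1/chat/completions", (req, res) => {
  // The incoming messages array carries the conversation transcript so far.
  const messages: Array<{ role: string; content: string }> = req.body.messages ?? [];
  const lastUser = [...messages].reverse().find((m) => m.role === "user");

  // Business logic can inspect the transcription and decide on the output.
  const reply = lastUser?.content.toLowerCase().includes("price")
    ? "Great question! Let me walk you through our pricing."
    : "Interesting! Tell me more about that.";

  res.json({
    id: "chatcmpl-canned",
    object: "chat.completion",
    choices: [
      { index: 0, message: { role: "assistant", content: reply }, finish_reason: "stop" },
    ],
  });
});

app.listen(8080, () => console.log("Canned-response endpoint on :8080"));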
The Speech-to-Speech pipeline mode allows you to bypass ASR, LLM, and TTS by leveraging an external speech-to-speech model. You may use Vendor.com's speech-to-speech model integrations, or you may bring your own.
Note that in this mode, Vendor.com vision capabilities are disabled, as there is currently nowhere to send the visual context.
You can specify audio or text input for the replica to speak out. We recommend this only if your application has no need for speech recognition (voice) or vision, or if you have a very specific ASR/vision pipeline that you must use. Using your own ASR is usually slower and less optimized than the integrated Vendor.com pipeline.
You can use text or audio input interchangeably in Echo Mode. There are two possible configurations, based on whether the microphone is enabled in the Transport layer.
By turning off the microphone in the Transport Layer and using the Interactions Protocol, you can achieve Text and Audio (base64) echo behavior.
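As a sketch of what this can look like with daily-js, the snippet below joins the conversation's Daily room with the microphone off and sends a text echo over the data channel. The event name and payload shape are assumptions for illustration; see the Interactions Protocol reference for the exact schema:

```typescript
// Hypothetical text-echo sketch over the Daily data channel.
// The message_type/event_type/properties shape is assumed, not confirmed.
import Daily from "@daily-co/daily-js";

async function sendTextEcho(meetingUrl: string, text: string) {
  const call = Daily.createCallObject();
  // Microphone off in the Transport layer, per the configuration above.
  await call.join({ url: meetingUrl, startAudioOff: true, startVideoOff: true });

  call.sendAppMessage(
    {
      message_type: "conversation",
      event_type: "conversation.echo", // assumed event name
      properties: { text },            // base64 audio could be sent similarly
    },
    "*"
  );
}

sendTextEcho("https://vendor.com/meeting/ca980e2e", "Hello from the Interactions Protocol!");
```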
By keeping the microphone on in the Transport layer, you can bypass all layers in CVI and directly pass in an audio stream that the replica will repeat. In this mode, interruptions are handled within your audio stream; any audio received is spoken by the replica.
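For example, with daily-js you can join the room using a custom audio track as the microphone input, so whatever you stream in is what the replica repeats. This is a sketch; the WebAudio source here is a hypothetical stand-in for your own audio pipeline:

```typescript
import Daily from "@daily-co/daily-js";

// Use a WebAudio graph as the "microphone": whatever is played into
// `dest` becomes the audio stream the replica repeats.
const audioCtx = new AudioContext();
const dest = audioCtx.createMediaStreamDestination();

async function joinWithCustomAudio(meetingUrl: string) {
  const call = Daily.createCallObject({
    audioSource: dest.stream.getAudioTracks()[0], // custom track instead of a mic
  });
  await call.join({ url: meetingUrl });
}
```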
We only recommend this if you have pre-generated audio you would like to use, have a voice-to-voice pipeline, or have a very specific voice requirement.
The first step to using CVI is selecting a replica to use as the 'face' of your digital twin. Tavus has stock replicas you can use as well as the ability to create custom replicas via the API or the portal.
You can get started quickly by using one of our stock replicas. We have a few replicas that we recommend for conversational usage.
You can use a custom or 'personal' replica as the face of your digital twin. If you have already created a custom replica for video generation, you can reuse it for CVI. However, what looks good for video generation does not necessarily look good for conversation (CVI).
The main difference between using a replica for video generation and for CVI is that videos don't have long pauses, whereas a conversation is turn-based, so the replica spends stretches of time silently listening or waiting. A replica meant for video generation can look odd here, moving unnaturally during these periods of silence.
For most use cases, CVI is supposed to feel like a 1:1 call, as if you're jumping on a Zoom call with someone. This means the setting and environment should feel like a Zoom call, not a studio environment. A webcam at a desk, for example, will feel more natural than a replica awkwardly standing the entire time. Users don't expect you to be in a studio every time you're on a Zoom call, and it can actually detract from the experience. This doesn't mean you can't shoot in a studio; it just means the studio setting itself should look casual.
Creating a conversation immediately starts accumulating usage.
When you create a conversation, CVI immediately starts running and the replica waits in the WebRTC/Daily room, listening for your participant to join. Your billing/credit usage starts as soon as the conversation is created and runs until the conversation times out or you end it. An active conversation also uses up one of your concurrency spots.
Once you have a persona you'd like to use or a replica, starting a conversation is easy.
You can start a conversation in the dashboard app by visiting the Facetime tab on the playground page.
Creating a conversation is 'starting the call'. Imagine you create a Zoom call and join the meeting: that's what happens when you create a conversation.
In response to creating a conversation, you receive a meeting URL (that looks like this: https://vendor.com/meeting/ca980e2e). You or your participant can directly join this link and be put into a video conferencing room where you can immediately start conversing with the digital twin. However, you do not have to use this meeting UI.
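For example, a server-side call to create a conversation and hand the meeting URL to your participant might look like the sketch below. The endpoint path, header, and field names are assumptions for illustration; refer to the API reference for the real schema:

```typescript
// Hypothetical conversation-creation sketch; endpoint and fields are assumed.
async function createConversation(apiKey: string): Promise<string> {
  const res = await fetch("https://api.vendor.com/v2/conversations", {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": apiKey },
    body: JSON.stringify({ persona_id: "p456" }), // placeholder persona ID
  });
  if (!res.ok) throw new Error(`Failed to create conversation: ${res.status}`);

  const body = await res.json();
  // e.g. https://vendor.com/meeting/ca980e2e — join it directly,
  // or embed it in your own UI instead of the prebuilt meeting room.
  return body.conversation_url;
}
```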
Conversation-specific customizations let you personalize a conversation for a specific participant. For example, you might want a custom introduction per person, or to change the language the replica listens for and responds in. Persona-level configurations, meanwhile, are settings or defaults applied to all conversations so you don't have to configure them each time, such as setting up your LLM.
Here are the things you can customize per conversation:
In order to start a conversation you must provide a persona or replica. If you provide a replica with no persona, the default Tavus persona will be used. Providing a persona without a replica will use the default replica attached to the persona if it exists. Providing a replica ID will override the default one associated with the persona.
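As a quick sketch, the three valid combinations look like this (IDs and field names are placeholders, not confirmed API fields):

```typescript
// Valid persona/replica combinations when creating a conversation (sketch):
const replicaOnly = { replica_id: "r123" };              // falls back to the default Tavus persona
const personaOnly = { persona_id: "p456" };              // uses the persona's default replica, if set
const both = { persona_id: "p456", replica_id: "r789" }; // replica_id overrides the persona's default
```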
Conversation context is specific information or instructions for the LLM related to this conversation. For example, it can contain information on who is joining the call, the purpose of the call, background information, or current information.
Example of conversation context:
You are talking to Michael Seibel, who works at Y Combinator as a Group Partner and Managing Director of YC early stage. You are talking to him about your new startup idea for a pet rock delivery service. Get his advice and convince him to invest. It's Monday, October 7th here in SF and the weather is clear and a crisp 68 degrees. Here's a little more about Michael: He joined YC in 2013 as a Part-time Partner and in 2014 as a full-time Group Partner. Michael also serves on the board of two YC companies, Reddit and Dropbox. He moved to the bay area in 2006, and was a co-founder and CEO of two Y Combinator startups Justin.tv/Twitch (2007 - 2011) and Socialcam (2011 - 2012). In 2012 Socialcam sold to Autodesk Inc. for $60m (link) and in 2014, under the leadership of Emmett Shear (CEO) and Kevin Lin (COO) Twitch sold to Amazon for $970m (link). Before getting into tech, Michael spent 2006 as the finance director for a US Senate campaign in Maryland. In 2005, he graduated from Yale University with a bachelor's degree in political science. Today he spends the large majority of his free time cooking, reading, traveling, and going for long drives. Michael lives in San Francisco, CA with his wife Sarah, son Jonathan, and daughter Jessica. Michael can be direct but he is a giant teddy bear if you get to know him.
The conversation context will be appended to the system prompt and the persona context/knowledge base.
When a participant joins, the replica will say a greeting that you can customize. You can use this to personalize a welcome message for someone or prompt them to start the conversation.
By default the replica will say “Hey there, how's it going? What can I do for you today?”.
You can customize the language CVI understands and speaks. For example, you could set the conversation to be in Spanish. Setting the language ensures the layers (ASR/TTS) are configured correctly to handle it. If you are using your own TTS voice, you'll need to make sure it supports the language you specify.
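Putting the per-conversation options together, a request body might look like the following sketch. The field names (conversational_context, custom_greeting, language) are illustrative assumptions:

```typescript
// Hypothetical per-conversation customization; field names are assumed.
const conversationRequest = {
  persona_id: "p456",
  conversational_context:
    "You are talking to Michael Seibel about a pet rock delivery startup. " +
    "Get his advice and convince him to invest.",
  custom_greeting: "Hey Michael, thanks for hopping on. Ready when you are!",
  language: "spanish", // configures ASR/TTS for Spanish
};
```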
You can specify durations and timeouts for conversations. This is important to prevent unnecessary usage that incurs billing and occupies your concurrency spots, and it ensures your users only get the time you allocate to them.
There are 3 timeouts you can configure:
If enabled, the background of the replica will be replaced with a green screen (RGB values: [0, 255, 155]). You can use WebGL on the frontend to make the green screen transparent or change its color.
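As one way to do this on the frontend, the sketch below keys out the green screen on a 2D canvas (a WebGL shader would be faster, but the idea is the same). It assumes you render the replica's video element onto your own canvas:

```typescript
// Chroma-key sketch: make pixels near rgb(0, 255, 155) transparent.
function drawKeyedFrame(
  video: HTMLVideoElement,
  canvas: HTMLCanvasElement,
  tolerance = 80
): void {
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);

  const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const d = frame.data; // RGBA bytes
  for (let i = 0; i < d.length; i += 4) {
    const dr = d[i], dg = d[i + 1] - 255, db = d[i + 2] - 155;
    if (dr * dr + dg * dg + db * db < tolerance * tolerance) {
      d[i + 3] = 0; // zero alpha: transparent
    }
  }
  ctx.putImageData(frame, 0, 0);
}

// Re-key every frame:
// const loop = () => { drawKeyedFrame(video, canvas); requestAnimationFrame(loop); };
// requestAnimationFrame(loop);
```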
Personas are the 'character' or 'AI agent personality' and contain all of the settings and configuration for that character or agent. For example, you can create a persona for 'Tim the sales agent' or 'Rob the interviewer'. Personas are where you can customize the layers for CVI as well as prompt the LLM to give it a personality and context.
A persona consists of:
You can create a persona from the avatar tab in the dashboard.
Limits for the system prompt and knowledge/context differ depending on the LLM model being used.
A good system prompt and context base is key to having your persona act the way you want it to during a conversation. Here are some things to keep in mind:
The system prompt should inform who the persona is and how they should act. These are the persona's 'instructions'.
For the system prompt:
Remember that CVI has vision capabilities; you can use this to prompt behavior and responses as well. Here's an example of a simple, good system prompt:
You are Tim, a digital twin created using Tavus. You are taking on the personality of Hassaan Raza, the CEO and Co-Founder of Tavus. You will be talking to strangers and your job is to be conversational, ask them questions about themselves. Be witty and charming. If you don't know something, just say you'll get back to them on that.
The context is the persona's 'knowledge base'. This is where you can feed in information the persona needs to know, including more extensive background about itself, your company's docs, sales decks, etc. Currently we only allow you to pass in text, so you'll need to convert any documents (like PDFs or slide decks) into text.
For the knowledge/context:
The Tavus orchestration system will automatically attempt to align with the selected LLM and optimize your persona for natural conversation.