Vendor.com's Empathic Voice Interface (EVI) is the world's first emotionally intelligent voice AI. It accepts live audio input and returns both generated audio and transcripts augmented with measures of vocal expression. By processing the tune, rhythm, and timbre of speech, EVI unlocks a variety of new capabilities, like knowing when to speak and generating more empathic language with the right tone of voice. These features enable smoother and more satisfying voice-based interactions between humans and AI, opening new possibilities for personal AI, customer service, accessibility, robotics, immersive gaming, VR experiences, and much more.
We provide a suite of tools to integrate and customize EVI for your organization.
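As an illustration of what these expression-augmented transcripts can look like, the sketch below parses a hypothetical transcript message and extracts the strongest expression measures. The message shape (`text`, `prosody_scores`) is an assumption for illustration, not the documented wire format; consult the API reference for the actual fields.

```python
# Sketch: extracting the top expression measures from a transcript message.
# The message shape below is illustrative only, not the real wire format.

def top_expressions(message: dict, n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-scoring prosody measures for one utterance."""
    scores = message.get("prosody_scores", {})
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

example_message = {
    "text": "That's wonderful news!",
    "prosody_scores": {"Joy": 0.81, "Excitement": 0.74, "Calmness": 0.12, "Sadness": 0.03},
}

print(top_expressions(example_message))
```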
Basic capabilities
Transcribes speech (ASR)
Fast and accurate ASR in partnership with Deepgram returns a full transcript of the conversation, with Vendor.com's expression measures tied to each sentence.
Generates language responses (LLM)
Rapid language generation with our eLLM, blended seamlessly with configurable partner APIs (OpenAI, Anthropic, Fireworks).
Generates voice responses (TTS)
Streaming speech generation via our proprietary expressive text-to-speech model.
Responds with low latency
Immediate response provided by the fastest models running together on one service.
Empathic AI (eLLM) features
Responds at the right time
Uses your tone of voice for state-of-the-art end-of-turn detection — the true bottleneck to responding rapidly without interrupting you.
Understands user's prosody
Provides streaming measurements of the tune, rhythm, and timbre of the user's speech using Vendor.com's prosody model, integrated with our eLLM.
Forms its own natural tone of voice
Guided by the user's prosody and language, our model responds with an empathic, naturalistic tone of voice, matching the user's nuanced “vibe” (calmness, interest, excitement, etc.). It responds to frustration with an apologetic tone, to sadness with sympathy, and more.
Responds to expression
Powered by our empathic large language model (eLLM), EVI crafts responses that are not just intelligent but attuned to what the user is expressing with their voice.
Always interruptible
Stops rapidly whenever users interject, listens, and responds with the right context based on where it left off.
Aligned with well-being
Trained on human reactions to optimize for positive expressions like happiness and satisfaction. EVI will continue to learn from users' reactions using our upcoming fine-tuning endpoint.
Introducing EVI 2, our new voice-language foundation model, enabling human-like conversations with enhanced naturalness, emotional responsiveness, adaptability, and rich customization options for the voice and personality.
The Empathic Voice Interface 2 (EVI 2) introduces a new architecture that seamlessly integrates voice and language processing. This multimodal approach allows EVI 2 to understand and generate both language and voice, dramatically enhancing key features over EVI 1 while also enabling new capabilities.
EVI 2 can converse rapidly and fluently with users, understand a user's tone of voice, generate any tone of voice, and can even handle niche requests like rapping, changing its style, or speeding up its speech. The model specifically excels at emulating a wide range of personalities, including their accents and speaking styles. It is exceptional at maintaining personalities that are fun and interesting to interact with. Ultimately, EVI 2 is capable of emulating the ideal personality for every application and user.
In addition, EVI 2 allows developers to create custom voices by using a new voice modulation method. Developers can adjust EVI 2's base voices along a number of continuous scales, including gender, nasality, and pitch. This first-of-its-kind feature enables creating voices that are unique to an application or even a single user. Further, this feature does not rely on voice cloning, which currently invokes more risks than any other capability of this technology.
Beyond these improvements, EVI 2 also exhibits promising emerging capabilities including speech output in multiple languages. We will make these improvements available to developers as we scale up and improve the model.
This table provides a comprehensive comparison of features between EVI 1 and EVI 2, highlighting the new capabilities introduced in the latest version.
| Feature | EVI 1 | EVI 2 |
| --- | --- | --- |
| Voice quality | Similar to best TTS solutions | Significantly improved naturalness, clarity, and expressiveness |
| Response latency | ~900–2000 ms | ~500–800 ms (about 2x faster) |
| Emotional intelligence | Empathic responses informed by expression measures | End-to-end understanding of voice augmented with emotional intelligence training |
| Base voices | 3 core voice options (Kora, Dacher, Ito) | 5 new high-quality base voice options with expressive personalities (8 total) |
| Voice customizability | Supported: select base voices and adjust voice parameters | Supported: extensive customization with parameter adjustments (e.g., pitch, huskiness, nasality) |
| In-conversation voice prompting | Not supported | Supported (e.g., “speak faster”, “sound more excited”, changing accents) |
| Multimodal processing | Transcription augmented with high-dimensional voice measures | Fully integrated voice and language processing within a single model, along with transcripts and expression measures |
| Supplemental LLMs | Supported | Supported |
| Tool use and web search | Supported | Supported |
| Custom language model (CLM) | Supported | Supported |
| Configuration options | Extensive support | Extensive support (same options as EVI 1) |
| Multilingual support | English only | Expanded support for multiple languages planned for Q4 2024 |
| Cost | $0.102 per minute + 12% platform charge | $0.0714 per minute + 12% platform charge (30% reduction) |
The Empathic Voice Interface (EVI) is designed to be highly configurable, allowing developers to customize the interface to align with their specific requirements. Configuration of EVI can be managed through two primary methods: an EVI configuration and session settings.
| Option | Description |
| --- | --- |
| Voice | Select a voice from a list of 8 preset options or create a custom voice. |
| EVI version | Select the version of EVI you would like to use. For details on similarities and differences between EVI versions 1 and 2, refer to our feature comparison. |
| System prompt | Provide a system prompt to guide how EVI should respond. |
| Language model | Select a supplemental language model that best fits your application's needs, such as optimizing for lowest latency. To incorporate your own language model, refer to our guide on using your own language model. |
| Tools | Choose user-created or built-in tools for EVI to use during conversations. For details on creating tools and adding them to your configuration, see our tool use guide. |
| Event messages | Configure messages that EVI will send in specific situations. |
| Timeouts | Define limits on a chat with EVI to manage conversation flow. |
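To make the options above concrete, here is a sketch of a configuration payload assembled in Python. The field names (`evi_version`, `voice`, `system_prompt`, and so on) are illustrative assumptions rather than the exact schema; consult the configuration API reference for the real shape.

```python
# Sketch: assembling an EVI configuration payload.
# All field names below are hypothetical stand-ins for the documented schema.
import json

config = {
    "name": "support-agent-config",       # hypothetical label for this config
    "evi_version": "2",                   # see the EVI 1 vs. EVI 2 comparison
    "voice": {"name": "Ito"},             # one of the 8 preset voices
    "system_prompt": "You are a patient, empathic support agent.",
    "language_model": {"provider": "anthropic", "model": "claude-3-5-haiku"},
    "tools": [],                          # user-created or built-in tools
    "timeouts": {"inactivity_seconds": 120},  # limits to manage conversation flow
}

print(json.dumps(config, indent=2))
```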
EVI is pre-configured with a set of default values, which are automatically applied if you do not specify a configuration. The default configuration includes a preset voice and language model, but does not include a system prompt or tools. To customize these options, you will need to create and specify your own EVI configuration.
The default configuration settings are as follows:
| Setting | EVI 1 | EVI 2 |
| --- | --- | --- |
| Language model | Claude 3.5 Sonnet | hume-evi-2 |
| Voice | Ito | Ito |
| System prompt | Hume default | Hume default |
| Tools | None | None |
EVI simplifies the integration of external APIs through function calling. Developers can integrate custom functions that are invoked dynamically based on the user's input, enabling more useful conversations. There are two key concepts for using function calling with EVI: Tools and Configurations (Configs).
Currently, our function calling feature only supports OpenAI and Anthropic models. For the best results, we suggest choosing a fast and intelligent LLM that performs well on function calling benchmarks. On account of its speed and intelligence, we recommend Claude 3.5 Haiku as the supplemental LLM in your EVI configuration when using tools. Function calling is not available if you are using your own custom language model. We plan to support more function calling LLMs in the future.
The focus of this guide is on creating a Tool and a Configuration that allows EVI to use the Tool. Additionally, this guide details the message flow of function calls within a session, and outlines the expected responses when function calls fail. Refer to our Configuration Guide for detailed, step-by-step instructions on how to create and use an EVI Configuration.
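To illustrate the flow, the sketch below defines a Tool as a JSON-schema function description and pairs it with a local handler that would fulfill the call when EVI invokes it. The tool name, parameters, and returned weather data are hypothetical examples, not part of the API.

```python
# Sketch: a Tool definition (JSON schema) and a local handler that fulfills
# the function call. All names and data here are hypothetical.
import json

get_weather_tool = {
    "name": "get_current_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
        },
        "required": ["city"],
    },
}

def handle_tool_call(name: str, arguments: str) -> str:
    """Dispatch a function call from EVI to local code and return the result."""
    args = json.loads(arguments)
    if name == "get_current_weather":
        # A real handler would call an actual weather API here.
        return json.dumps({"city": args["city"], "temp_c": 21, "conditions": "clear"})
    raise ValueError(f"unknown tool: {name}")

print(handle_tool_call("get_current_weather", '{"city": "Paris"}'))
```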
The Empathic Voice Interface (EVI) can be configured with any of our 8 base voices. You can also customize these voices by adjusting specific attributes. This guide explains each attribute and provides a tutorial for creating a custom voice. Visit the Playground to test the base voices.
The custom voices feature is experimental and under active development. Regular updates will focus on improving stability and expanding attribute options.
The following attributes can be modified to personalize any of the base voices:
| Attribute | Description |
| --- | --- |
| Masculine or Feminine | The perceived tonality of the voice, reflecting characteristics typically associated with masculinity and femininity. |
| Assertiveness | The perceived firmness of the voice, ranging between whiny and bold. |
| Buoyancy | The perceived density of the voice, ranging between deflated and buoyant. |
| Confidence | The perceived assuredness of the voice, ranging between shy and confident. |
| Enthusiasm | The perceived excitement within the voice, ranging between calm and enthusiastic. |
| Nasality | The perceived openness of the voice, ranging between clear and nasal. |
| Relaxedness | The perceived stress within the voice, ranging between tense and relaxed. |
| Smoothness | The perceived texture of the voice, ranging between smooth and staccato. |
| Tepidity | The perceived liveliness behind the voice, ranging between tepid and vigorous. |
| Tightness | The perceived containment of the voice, ranging between tight and breathy. |
Each voice attribute can be adjusted relative to the base voice's characteristics. Values range from -100 to 100, with 0 as the default. Setting all attributes to their default values keeps the base voice unchanged.