Empathic Voice Interface (EVI)

Vendor.com's Empathic Voice Interface (EVI) is the world's first emotionally intelligent voice AI. It accepts live audio input and returns both generated audio and transcripts augmented with measures of vocal expression. By processing the tune, rhythm, and timbre of speech, EVI unlocks a variety of new capabilities, like knowing when to speak and generating more empathic language with the right tone of voice. These features enable smoother and more satisfying voice-based interactions between humans and AI, opening new possibilities for personal AI, customer service, accessibility, robotics, immersive gaming, VR experiences, and much more.

We provide a suite of tools to integrate and customize EVI for your organization.
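
For orientation, here is a minimal sketch of a streaming session in Python, assuming a WebSocket API. The endpoint URL, query-string auth, and message shapes ("audio_input", "user_message", "audio_output") are illustrative assumptions, not the documented wire format; consult the API reference for the real schema.

```python
import base64
import json

import websockets  # pip install websockets


async def stream_conversation(audio_chunks, api_key: str):
    # Hypothetical endpoint and auth scheme.
    url = f"wss://api.vendor.com/v0/evi/chat?api_key={api_key}"
    async with websockets.connect(url) as ws:
        # Send live audio as base64-encoded chunks. (A real client would
        # interleave sending and receiving, e.g. with two asyncio tasks.)
        for chunk in audio_chunks:
            await ws.send(json.dumps({
                "type": "audio_input",
                "data": base64.b64encode(chunk).decode(),
            }))

        # Receive transcripts tagged with expression measures, plus audio.
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "user_message":
                print(event["transcript"], event["expressions"])
            elif event["type"] == "audio_output":
                audio = base64.b64decode(event["data"])
                # hand `audio` to your playback pipeline

# Usage: asyncio.run(stream_conversation(mic_chunks(), api_key="..."))
```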

(Diagram: overview of an EVI session)

Overview of EVI features

Basic capabilities

Transcribes speech (ASR)

Fast and accurate ASR in partnership with Deepgram returns a full transcript of the conversation, with Vendor.com's expression measures tied to each sentence.

Generates language responses (LLM)

Rapid language generation with our eLLM, blended seamlessly with configurable partner APIs (OpenAI, Anthropic, Fireworks).

Generates voice responses (TTS)

Streaming speech generation via our proprietary expressive text-to-speech model.

Responds with low latency

Immediate responses, delivered by our fastest models running together on a single service.

Empathic AI (eLLM) features

Responds at the right time

Uses your tone of voice for state-of-the-art end-of-turn detection — the true bottleneck to responding rapidly without interrupting you.

Understands user's prosody

Provides streaming measurements of the tune, rhythm, and timbre of the user's speech using Vendor.com's prosody model, integrated with our eLLM.

Forms its own natural tone of voice

Guided by the user's prosody and language, our model responds with an empathic, naturalistic tone of voice, matching the user's nuanced “vibe” (calmness, interest, excitement, etc.). It responds to frustration with an apologetic tone, to sadness with sympathy, and more.

Responds to expression

Powered by our empathic large language model (eLLM), EVI crafts responses that are not just intelligent but attuned to what the user is expressing with their voice.

Always interruptible

Stops rapidly whenever users interject, listens, and responds with the right context based on where it left off (see the client-side sketch at the end of this list).

Aligned with well-being

Trained on human reactions to optimize for positive expressions like happiness and satisfaction. EVI will continue to learn from users' reactions using our upcoming fine-tuning endpoint.
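
To make the “always interruptible” behavior above concrete, here is a minimal client-side sketch in Python. It assumes the server emits a "user_interruption" event when the user starts speaking over EVI; that event name, the "audio_output" shape, and the playback queue are illustrative assumptions, not the documented protocol.

```python
import base64
import queue

playback_queue: "queue.Queue[bytes]" = queue.Queue()


def on_event(event: dict) -> None:
    """Route one decoded server event to the local playback queue."""
    if event.get("type") == "user_interruption":
        # Drop any queued assistant audio so playback stops immediately;
        # the server resumes with the right context once the user finishes.
        while not playback_queue.empty():
            playback_queue.get_nowait()
    elif event.get("type") == "audio_output":
        playback_queue.put(base64.b64decode(event["data"]))
```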

Empathic Voice Interface 2 (EVI 2)

Introducing EVI 2, our new voice-language foundation model, enabling human-like conversations with enhanced naturalness, emotional responsiveness, adaptability, and rich customization options for the voice and personality.

The Empathic Voice Interface 2 (EVI 2) introduces a new architecture that seamlessly integrates voice and language processing. This multimodal approach allows EVI 2 to understand and generate both language and voice, dramatically enhancing key features over EVI 1 while also enabling new capabilities.

EVI 2 converses rapidly and fluently with users, understands a user's tone of voice, generates any tone of voice, and even handles niche requests like rapping, changing its style, or speeding up its speech. The model excels at emulating a wide range of personalities, including their accents and speaking styles, and at maintaining personalities that are fun and interesting to interact with. Ultimately, EVI 2 can emulate the ideal personality for every application and user.

In addition, EVI 2 allows developers to create custom voices using a new voice modulation method. Developers can adjust EVI 2's base voices along a number of continuous scales, including gender, nasality, and pitch. This first-of-its-kind feature enables creating voices that are unique to an application or even a single user. Further, this approach does not rely on voice cloning, which currently carries more risk than any other capability of this technology.

Key improvements

  1. Improved voice quality: EVI 2 uses an advanced voice generation model connected to our eLLM, which can process and generate both text and audio. This results in more natural-sounding speech with better word emphasis, higher expressiveness, and more consistent vocal output.
  2. Faster responses: The integrated architecture of EVI 2 reduces end-to-end latency by 40% vs EVI 1, now averaging around 500ms. This significant speed improvement enables more responsive and human-like conversations.
  3. Enhanced emotional intelligence: By processing voice and language in the same model, EVI 2 can better understand the emotional context of user inputs and generate more empathic responses, both in terms of content and vocal tone.
  4. Custom voices and personality: EVI 2 offers new control over the AI's voice characteristics. Developers can adjust various parameters to tailor EVI 2's voice to their specific application needs. EVI 2 also supports in-conversation voice prompting, allowing users to dynamically modify EVI's speaking style (e.g., “speak faster”, “sound excited”) during interactions.
  5. Cost-effectiveness: Despite its advanced capabilities, EVI 2 is 30% more cost-effective than its predecessor, with pricing reduced from $0.1020 to $0.0714 per minute.
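
As a check on the pricing arithmetic: $0.1020 × (1 − 0.30) = $0.0714 per minute.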

Beyond these improvements, EVI 2 also exhibits promising emerging capabilities, including speech output in multiple languages. We will make these capabilities available to developers as we scale up and improve the model.

Feature comparison: EVI 1 vs EVI 2

This table provides a comprehensive comparison of features between EVI 1 and EVI 2, highlighting the new capabilities introduced in the latest version.

| Feature | EVI 1 | EVI 2 |
| --- | --- | --- |
| Voice quality | Similar to best TTS solutions | Significantly improved naturalness, clarity, and expressiveness |
| Response latency | ~900-2,000 ms | ~500-800 ms (about 2x faster) |
| Emotional intelligence | Empathic responses informed by expression measures | End-to-end understanding of voice, augmented with emotional intelligence training |
| Base voices | 3 core voice options (Kora, Dacher, Ito) | 5 new high-quality base voice options with expressive personalities (8 total) |
| Voice customizability | Supported: select base voices and adjust voice parameters | Supported: extensive customization with parameter adjustments (e.g., pitch, huskiness, nasality) |
| In-conversation voice prompting | Not supported | Supported (e.g., “speak faster”, “sound more excited”, change accents) |
| Multimodal processing | Transcription augmented with high-dimensional voice measures | Fully integrated voice and language processing within a single model, along with transcripts and expression measures |
| Supplemental LLMs | Supported | Supported |
| Tool use and web search | Supported | Supported |
| Custom language model (CLM) | Supported | Supported |
| Configuration options | Extensive support | Extensive support (same options as EVI 1) |
| Multilingual support | English only | Expanded support for multiple languages planned for Q4 2024 |
| Cost | $0.102 per minute + 12% platform charge | $0.0714 per minute + 12% platform charge (30% reduction) |

Configuring EVI

The Empathic Voice Interface (EVI) is designed to be highly configurable, allowing developers to customize the interface to align with their specific requirements. Configuration of EVI can be managed through two primary methods: an EVI configuration and session settings.
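
For illustration, here is how the two methods might look in practice. The config_id query parameter, the "session_settings" message type, and the field names below are assumed placeholders rather than the documented API.

```python
import json

# 1) Persistent: reference a saved EVI configuration (config_id is
#    hypothetical) when opening the WebSocket connection.
CHAT_URL = "wss://api.vendor.com/v0/evi/chat?config_id=YOUR_CONFIG_ID"

# 2) Per-session: send a settings message on the open socket to adjust
#    options for the current chat only.
session_settings = json.dumps({
    "type": "session_settings",
    "system_prompt": "Keep responses under two sentences.",
})
# await ws.send(session_settings)  # on an open WebSocket connection
```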

Configuration options

| Option | Description |
| --- | --- |
| Voice | Select a voice from a list of 8 preset options or create a custom voice. |
| EVI version | Select the version of EVI you would like to use. For the similarities and differences between EVI versions 1 and 2, refer to our feature comparison. |
| System prompt | Provide a system prompt to guide how EVI should respond. |
| Language model | Select a supplemental language model that best fits your application's needs, such as optimizing for lowest latency. To incorporate your own language model, refer to our guide on using your own language model. |
| Tools | Choose user-created or built-in tools for EVI to use during conversations. For details on creating tools and adding them to your configuration, see our tool use guide. |
| Event messages | Configure messages that EVI sends in specific situations. |
| Timeouts | Define limits on a chat with EVI to manage conversation flow. |

Default configuration options

EVI is pre-configured with a set of default values, which are automatically applied if you do not specify a configuration. The default configuration includes a preset voice, a language model, and a default system prompt, but does not include any tools. To customize these options, you will need to create and specify your own EVI configuration.

The default configuration settings are as follows:

| Setting | EVI 1 | EVI 2 |
| --- | --- | --- |
| Language model | Claude 3.5 Sonnet | hume-evi-2 |
| Voice | Ito | Ito |
| System prompt | Hume default | Hume default |
| Tools | None | None |
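
A hedged sketch of creating a configuration over REST follows. The endpoint path, the X-API-Key header, and the payload field names are assumptions mapped from the configuration options above, not the documented schema.

```python
import requests

config = {
    "evi_version": "2",
    "voice": {"name": "Ito"},
    "language_model": {
        "model_provider": "ANTHROPIC",
        "model_resource": "claude-3-5-sonnet",
    },
    "prompt": {"text": "You are a patient, upbeat support assistant."},
    "tools": [],           # user-created or built-in tools (see tool use)
    "event_messages": {},  # messages EVI sends in specific situations
    "timeouts": {"inactivity": {"enabled": True, "duration_secs": 120}},
}

resp = requests.post(
    "https://api.vendor.com/v0/evi/configs",  # hypothetical endpoint
    headers={"X-API-Key": "YOUR_API_KEY"},
    json=config,
)
resp.raise_for_status()
print(resp.json())  # the returned config id can be referenced in sessions
```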

Tool use

EVI simplifies the integration of external APIs through function calling. Developers can integrate custom functions that are invoked dynamically based on the user's input, enabling more useful conversations. There are two key concepts for using function calling with EVI: Tools and Configurations (Configs):

  • Tools are resources that EVI uses to do things, like search the web or call external APIs. For example, tools can check the weather, update databases, schedule appointments, or take actions based on what occurs in the conversation. While the tools can be user-defined, Hume also offers natively implemented tools, like web search, which are labeled as “built-in” tools.
  • Configurations enable developers to customize an EVI's behavior and incorporate these custom tools. Setting up an EVI configuration allows developers to seamlessly integrate their tools into the voice interface. A configuration includes prompts, user-defined tools, and other settings.

(Diagram: tool call message flow)

Currently, our function calling feature supports only OpenAI and Anthropic models. For best results, choose a fast, intelligent LLM that performs well on function calling benchmarks; given its speed and intelligence, we recommend Claude 3.5 Haiku as the supplemental LLM in your EVI configuration when using tools. Function calling is not available if you are using your own custom language model. We plan to support more function calling LLMs in the future.

The focus of this guide is on creating a Tool and a Configuration that allows EVI to use the Tool. Additionally, this guide details the message flow of function calls within a session, and outlines the expected responses when function calls fail. Refer to our Configuration Guide for detailed, step-by-step instructions on how to create and use an EVI Configuration.
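
To make the two concepts concrete, here is a hedged sketch of the flow: a user-defined Tool described with a JSON Schema, and a handler that answers tool call events during a session. The message type names ("tool_call", "tool_response") and field names are assumptions, and lookup_weather is a stand-in for your own API call.

```python
import json


def lookup_weather(city: str) -> dict:
    # Stand-in for your real external API call.
    return {"city": city, "temp_c": 21, "conditions": "clear"}


# A user-defined Tool: a name, a description, and a JSON Schema describing
# the parameters the LLM should supply when it invokes the tool.
get_weather_tool = {
    "name": "get_current_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def handle_event(event: dict):
    """Answer a tool call event from EVI; returns None for other events."""
    if event.get("type") != "tool_call":
        return None
    args = json.loads(event["parameters"])  # assumed JSON-encoded arguments
    result = lookup_weather(args["city"])
    return {
        "type": "tool_response",
        "tool_call_id": event["tool_call_id"],
        "content": json.dumps(result),
    }
```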

Voices

The Empathic Voice Interface (EVI) can be configured with any of our 8 base voices. You can also customize these voices by adjusting specific attributes. This guide explains each attribute and provides a tutorial for creating a custom voice. Visit the Playground to test the base voices.

The custom voices feature is experimental and under active development. Regular updates will focus on improving stability and expanding attribute options.

Voice attributes

The following attributes can be modified to personalize any of the base voices:

| Attribute | Description |
| --- | --- |
| Masculine or Feminine | The perceived tonality of the voice, reflecting characteristics typically associated with masculinity and femininity. |
| Assertiveness | The perceived firmness of the voice, ranging between whiny and bold. |
| Buoyancy | The perceived density of the voice, ranging between deflated and buoyant. |
| Confidence | The perceived assuredness of the voice, ranging between shy and confident. |
| Enthusiasm | The perceived excitement within the voice, ranging between calm and enthusiastic. |
| Nasality | The perceived openness of the voice, ranging between clear and nasal. |
| Relaxedness | The perceived stress within the voice, ranging between tense and relaxed. |
| Smoothness | The perceived texture of the voice, ranging between smooth and staccato. |
| Tepidity | The perceived liveliness behind the voice, ranging between tepid and vigorous. |
| Tightness | The perceived containment of the voice, ranging between tight and breathy. |

Each voice attribute can be adjusted relative to the base voice's characteristics. Values range from -100 to 100, with 0 as the default. Setting all attributes to their default values will keep the base voice unchanged.
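
As an illustration, a custom voice might be expressed as a base voice plus attribute offsets. The snake_case field names below are assumptions, but the value range matches the -100 to 100 scale described above.

```python
# Hypothetical shape for a custom voice: a base voice plus attribute offsets.
custom_voice = {
    "base_voice": "Ito",
    "attributes": {
        "masculine_feminine": 0,  # 0 keeps the base voice's value
        "assertiveness": 35,      # firmer, toward "bold"
        "nasality": -20,          # clearer, less nasal
        "enthusiasm": 50,         # toward "enthusiastic"
        "relaxedness": 10,        # slightly more relaxed
    },
}

# Every offset must stay within the documented -100 to 100 range.
assert all(-100 <= v <= 100 for v in custom_voice["attributes"].values())
```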