So how does Orai work?

--

Human speech is incredibly complex to analyze. There are so many nuances that it is quite hard even for humans to keep track of all of them. This makes the task of machine-based feedback incredibly tricky. For Orai, our journey started by talking with speech coaches and linguistics experts about how they give feedback. We supplemented that with a healthy diet of research papers and data experiments. After a few months of this process, we outlined five core factors that we are confident help people improve when paired with a good feedback → practice loop. Fast forward two years, and these core factors have helped thousands of people across the globe improve their communication skills by 2–3x.

Let’s dig a bit deeper into what these factors are and our process in building them.

1) Pacing

Our first approach with pacing was straightforward. Our initial research suggested pacing on average should be between 120–145 words per minute. However, we tested this hypothesis on thousands of TED Talks and found that this approach was quite narrow. More research led us to the following questions:

  • How does the content impact the ideal pace of your speech?
  • How should you vary your pace to not sound monotonous?

After lots of brainstorming and data crunching, we devised a system which evaluates pacing on three factors:

  1. Pace Variation
  2. Speech Content Complexity
  3. Average Pace (based on speech type)

Using these factors to separate good from bad pacing gave us a much more representative result: 95% of the speech clips we had tagged as having good or bad pacing were classified correctly.
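To make this a bit more concrete, here is a minimal sketch of what a pacing scorer along these lines could look like, assuming a transcript with per-word timestamps. The WPM baselines, window size, and feedback strings are illustrative placeholders, and the content-complexity factor is left out; this is not Orai's production logic.

```python
# A minimal pacing sketch: average WPM against a per-speech-type baseline,
# plus a simple pace-variation measure over consecutive windows.
from dataclasses import dataclass
from statistics import pstdev

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

# Illustrative WPM baselines per speech type (not Orai's tuned values).
BASELINE_WPM = {"presentation": (120, 145), "storytelling": (100, 130)}

def words_per_minute(words: list[Word]) -> float:
    duration = words[-1].end - words[0].start
    return len(words) / duration * 60 if duration > 0 else 0.0

def pace_variation(words: list[Word], window: int = 20) -> float:
    """Std-dev of WPM across consecutive windows; near zero means monotonous pacing."""
    rates = [
        words_per_minute(words[i:i + window])
        for i in range(0, len(words) - window + 1, window)
    ]
    return pstdev(rates) if len(rates) > 1 else 0.0

def pacing_feedback(words: list[Word], speech_type: str = "presentation") -> str:
    lo, hi = BASELINE_WPM[speech_type]
    wpm = words_per_minute(words)
    if wpm < lo:
        return f"{wpm:.0f} WPM is on the slow side for a {speech_type}."
    if wpm > hi:
        return f"{wpm:.0f} WPM is fast for a {speech_type}; give listeners room to breathe."
    return f"Nice! {wpm:.0f} WPM sits in the {lo}-{hi} WPM sweet spot for a {speech_type}."
```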

2) Fillers

In the past year, we asked 2,000 users to compare speeches on perceived speaker confidence. 93% of users rated speeches with more fillers as sounding less confident. Intuitively this also makes sense. Humans use fillers instead of pauses when they need time to think. However, further research made us realize there are two types of fillers:

  1. Filled Pauses: Words such as “um” or “uh” used instead of pauses.
  2. Filler Words: Words such as “basically”, “actually”, “you know” or the pesky “like”. These are also known as crutch words. They are often overused to transition between points or when the speaker is not quite sure if they are communicating effectively.

We collected and tagged datasets of thousands of filler words and filled pauses from real-world speech. Using this, we built a system that detects filler words in real time with state-of-the-art accuracy of 92%! Now we can even figure out whether the word “like” is used appropriately or as a filler.
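For a flavor of what that distinction looks like in code, here is a simplified rule-based tagger over a transcript. Orai's real detector is a trained model; the word lists and the crude disambiguation of "like" below are illustrative assumptions only.

```python
# Toy rule-based tagger that separates filled pauses from filler words.
# Orai's production detector is a trained model; these lists and the
# "like" heuristic are simplified stand-ins for illustration.
import re

FILLED_PAUSES = {"um", "uh", "er", "ah"}
# Multi-word fillers like "you know" would need n-gram matching, skipped here.
FILLER_WORDS = {"basically", "actually", "literally"}

def tag_fillers(transcript: str) -> dict:
    tokens = re.findall(r"[a-z']+", transcript.lower())
    filled, fillers = [], []
    for i, tok in enumerate(tokens):
        if tok in FILLED_PAUSES:
            filled.append(i)
        elif tok == "like":
            # Crude disambiguation: "like" right after a pronoun or connective
            # is more likely a filler than a verb or preposition use.
            prev = tokens[i - 1] if i > 0 else ""
            if prev in {"", "i", "was", "and", "so", "it's"}:
                fillers.append(i)
        elif tok in FILLER_WORDS:
            fillers.append(i)
    return {"filled_pauses": filled, "filler_words": fillers}

print(tag_fillers("So, um, I was like, basically we should, uh, ship it."))
# -> {'filled_pauses': [1, 8], 'filler_words': [4, 5]}
```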

3) Facial Expressions

Along with verbal communication, we realized that an essential part of communication is your non-verbals: your hand gestures and body language play a very important role in how your message is perceived. However, we also saw day-to-day team communication increasingly shifting from in-person meetings to video conferencing, which led us to focus our efforts on facial expression feedback. There has been extensive research on the impact of positive facial expressions on the receptivity of your message, and quite often on web calls we end up not smiling enough.

We trained machine learning models to detect facial expressiveness with state-of-the-art 88% accuracy. We map the detected emotions to determine whether your expressions were positive, neutral, or negative.
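As a rough illustration of that mapping step, here is a sketch that collapses per-frame emotion predictions into a positive/neutral/negative summary. The emotion labels and the "mostly neutral" threshold are assumptions for the example, not what Orai actually ships.

```python
# Collapse per-frame emotion predictions into a simple expressiveness summary.
# Labels and the 0.5 "mostly neutral" threshold are illustrative assumptions.
from collections import Counter

VALENCE = {
    "happy": "positive", "surprised": "positive",
    "neutral": "neutral",
    "sad": "negative", "angry": "negative", "disgusted": "negative",
}

def summarize_expressions(frame_emotions: list[str]) -> str:
    buckets = Counter(VALENCE.get(label, "neutral") for label in frame_emotions)
    total = sum(buckets.values())
    if total == 0:
        return "No face detected in this recording."
    if buckets["neutral"] / total > 0.5:
        return "You looked mostly neutral; try smiling a bit more on video calls."
    if buckets["positive"] >= buckets["negative"]:
        return "Nice! Your expressions came across as mostly positive."
    return "Your expressions leaned negative; watch the replay to see where."
```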

4) Energy

We’ve all been to conferences where an incredibly dull speaker takes the podium and makes every minute seem like an hour. The speaker may have great content, but the inability to express it kills the speaker’s (and their organization’s) credibility instantly. So it was critical for us to give feedback on a speaker’s energy variation. Technically it was a fascinating problem to solve. How do we figure out if a person sounds monotone? And how do we provide feedback to help a person improve?

Our first approach led us to create a “pitch + volume” based system to detect the perceived energy of a speech. This allowed us to measure energy variation on a speech-by-speech basis and give both positive and negative feedback on a speaker’s energy variation.
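For the curious, here is roughly what a "pitch + volume" variation score could look like using the open-source librosa library. The way the two signals are combined and the monotone threshold are assumptions chosen for this sketch, not Orai's tuned system.

```python
# Sketch of a "pitch + volume" energy-variation score built on librosa.
# Averaging the two contours and the monotone threshold are illustrative
# assumptions, not Orai's tuned values.
import librosa
import numpy as np

def energy_variation(audio_path: str) -> float:
    y, sr = librosa.load(audio_path, sr=16000)
    # Pitch contour (fundamental frequency) over the voiced frames.
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    pitch = f0[voiced] if voiced.any() else np.array([0.0])
    # Loudness contour via frame-wise RMS energy.
    rms = librosa.feature.rms(y=y)[0]
    # Coefficient of variation for each contour, then a simple average.
    pitch_cv = np.nanstd(pitch) / (np.nanmean(pitch) + 1e-8)
    rms_cv = np.std(rms) / (np.mean(rms) + 1e-8)
    return float((pitch_cv + rms_cv) / 2)

def energy_feedback(audio_path: str, monotone_threshold: float = 0.15) -> str:
    if energy_variation(audio_path) < monotone_threshold:
        return "Your delivery sounds monotone; try varying your pitch and volume more."
    return "Nice energy! Your pitch and volume varied enough to keep listeners engaged."
```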

5) Feedback

With all of this data, we needed to figure out a way to give users digestible, bite-sized feedback. And for feedback to be effective, you have to give both positive and negative feedback. We thus built a unique natural language generation engine which crunches all of Orai’s AI-driven speech analysis and provides human-like feedback every time you practice on Orai. We also added support for 10 different English accents so that people from diverse parts of the world can get the most out of Orai. And along with that, we were able to package everything so that >95% of recordings on Orai returned results within 10 seconds of the user tapping the stop button.
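A toy version of the idea, with made-up metric names and thresholds, might look like the template-based composer below. The real engine is considerably more sophisticated; this just shows the "always pair a positive with something to work on" structure.

```python
# Toy template-based feedback composer that always pairs a positive point
# with one thing to work on. Metric names and thresholds are illustrative.
def compose_feedback(metrics: dict) -> str:
    positives, negatives = [], []
    if 120 <= metrics.get("wpm", 0) <= 145:
        positives.append("your pacing was right in the conversational range")
    else:
        negatives.append("try bringing your pace closer to 120-145 words per minute")
    if metrics.get("fillers_per_min", 0) < 3:
        positives.append("you kept filler words under control")
    else:
        negatives.append("watch out for fillers like 'um' and 'like'")
    if metrics.get("energy_variation", 0) >= 0.15:
        positives.append("your vocal energy kept things lively")
    else:
        negatives.append("add more pitch and volume variation to avoid sounding monotone")
    opener = positives[0] if positives else "you finished another practice session"
    action = negatives[0] if negatives else "keep practicing to stay sharp"
    return f"Nice work: {opener}. Next time, {action}."

print(compose_feedback({"wpm": 132, "fillers_per_min": 5, "energy_variation": 0.2}))
# -> "Nice work: your pacing was right in the conversational range.
#     Next time, watch out for fillers like 'um' and 'like'."
```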

P.S.: Over time we’ve also added feedback on spoken clarity and use of pauses. Both of these came as suggestions from users and are proving to be strong feedback metrics. We’re also working on a lot of exciting new feedback areas which we believe will help speakers around the world become even more confident in their communication.
