Why Can't My ASR Be More Like Siri?

There’s been a lot of buzz around the office about the recent advancements in our speech to text technology. As someone who works in marketing, I don’t always think about how the technology works. But I had a recent experience that gave me a whole new appreciation for all that goes into ensuring the accuracy of call transcriptions. First, let’s look at some examples of voice to text transcriptions that everyone can relate to – cell phone voicemails.

Why are Voicemail Transcriptions So Hilariously Bad?

If you’re an iPhone user, you’re probably familiar with that message, “Unable to Transcribe This Message,” at the top of your voicemail screen. That’s most likely due to the fact that the message was too short to transcribe. No big deal. So you have to listen to a 14-second voicemail.

Then there are the hilarious transcriptions that are the stuff of memes. Here’s a tamer example of a poorly transcribed voicemail that a friend left me recently.

Here’s what the abbreviated transcription said:

“Hey do you need to bed 15 on Monday…I’m a jingle in the works”

Here’s what he really said:

“Hey Dude, it’s about 8:15 on Monday…Give me a jingle when it works.”

Glancing at the transcription, I know that’s not what he said. And now I have to listen to the entire voicemail instead of being able to quickly scan the message for basic info. I don’t mind listening to a 30-second voicemail.

But I can’t imagine that a QA Manager in a call center has the time to listen to every recorded call, which could last anywhere from 60 seconds to 6 minutes, or even longer. That’s why QA Departments want access to call transcripts for every recorded call. And that’s exactly what speech to text technology provides.

However, this is where speech to text solutions get confused with other types of automatic speech recognition (ASR) technology, and why people don’t always understand what goes into the accuracy rates for each type of technology. Or, in short, it’s why call transcription ASRs don’t understand us like Siri does.

Hey Alexa, er, I mean, Siri

What is ASR technology, exactly? Techopedia explains it well:

“Automatic speech recognition is primarily used to convert spoken words into computer text. Additionally, automated speech recognition is used for…performing an action based on the instructions defined by the human.”

I don’t own a home device, such as Alexa or Google Home. But I house sit for some friends who use “Alexa” to turn on lights, or play NPR for the dog to listen to while they are at work. Even if I don’t speak loudly or clearly enough, the device recognizes the command, “Alexa, turn on the living room lamp.”

But Alexa only understands commands, or how to tell a terrible pre-loaded joke. So when I accidentally addressed Alexa as “Siri,” and my phone answered me from the dining room table, I realized that most people associate ASR technology with these devices. If you ask either Alexa or Siri a question that they are not programmed to answer, then you get silence, or a generic reply.

That’s very different from the way that speech to text technology works during a real conversation with another human being. That’s why home devices are not equipped to handle even a simulated “conversation,” and why ASR technology can’t be lumped into one, simple category.

Why Aren’t Call Transcriptions 100% Accurate?

While recent advances in speech recognition technology allow for the recognition and transcription of a wide range of words and phrases, many factors still affect the accuracy of a transcription, including audio quality, regional accents, and speaking pace. In my friend’s voicemail, for example, he speaks quickly, has a low voice, and is often in a hurry when he calls.

That’s why it’s important to remember that speech to text technology for call center transcripts offers more functionality and capabilities than voicemail transcriptions, including sentiment analysis and searchable transcriptions for longer calls.

While no speech analytics engine on the market is 100% accurate, it is possible to achieve industry-leading accuracy rates approaching 90% if you understand how the technology works and work with a dedicated vendor who offers managed client services.
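
For readers curious about what an “accuracy rate” can mean in practice, here is a rough sketch. This post doesn’t say how any vendor measures accuracy, so treat this as an assumption: one common yardstick for ASR is word error rate (WER), which counts the word substitutions, insertions, and deletions needed to turn the transcript into what was actually said, divided by the number of words actually spoken. The function names below (`tokenize`, `word_error_rate`) are made up for this illustration, and the sample text is the first half of my friend’s voicemail.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Note: punctuation is dropped, so "8:15" becomes two tokens, "8" and "15".
    """
    text = text.lower().replace("’", "'")
    return re.findall(r"[a-z0-9']+", text)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    reference  = what the speaker really said
    hypothesis = what the transcription engine produced
    """
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


said = "Hey Dude, it's about 8:15 on Monday"
transcribed = "Hey do you need to bed 15 on Monday"

wer = word_error_rate(said, transcribed)
print(f"Word error rate: {wer:.1%}")
```

A rough “accuracy” figure is often quoted as one minus the WER, though WER can exceed 100% when a transcript contains many extra words. This simple tokenizer also ignores details such as numbers spoken as words, so real evaluation tools normalize text more carefully.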