We explored many options before settling on Watson for this initial release of Lexicon.

Initially we were focused solely on HoloLens, so we started with Microsoft’s built-in dictation and keyword recognizers. The two can’t run simultaneously, which makes it difficult to give the user instant feedback while also matching keywords. We also looked at SRGS grammars, but those felt dated compared to newer machine learning models.
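
To give a sense of that juggling act, here’s a minimal sketch (not Lexicon code) using Unity’s built-in recognizers in UnityEngine.Windows.Speech. The keyword list and class name are just placeholders; the key point is that the phrase recognition system has to be shut down before dictation can start.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

public class SpeechModeSwitcher : MonoBehaviour
{
    KeywordRecognizer keywordRecognizer;
    DictationRecognizer dictationRecognizer;

    void Start()
    {
        // Keyword matching gives fast command recognition...
        keywordRecognizer = new KeywordRecognizer(new[] { "select", "move", "delete" });
        keywordRecognizer.OnPhraseRecognized += args => Debug.Log("Keyword: " + args.text);
        keywordRecognizer.Start();
    }

    public void BeginDictation()
    {
        // ...but to show live dictation feedback we first have to tear keyword
        // matching down, because the two recognizers can't run at the same time.
        keywordRecognizer.Stop();
        PhraseRecognitionSystem.Shutdown();

        dictationRecognizer = new DictationRecognizer();
        dictationRecognizer.DictationHypothesis += text => Debug.Log("Hearing: " + text);
        dictationRecognizer.DictationResult += (text, confidence) => Debug.Log("Heard: " + text);
        dictationRecognizer.Start();
    }
}
```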

We were also wrestling with our most interesting challenge: matching speech with gaze. Gaze moves very quickly, so any delay in the speech results means the user may have already looked away by the time we know what was said. We needed a way to align these two sources of input.
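
One way to make that alignment possible is to keep a short history of gaze samples and, once the transcription arrives, look up the sample nearest to each word’s timestamp. The sketch below isn’t Lexicon’s implementation, just a minimal illustration of the idea; the GazeHistory name and the ten-second window are arbitrary.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Records where the user was looking over time, so a word's timestamp
// (reported later by the speech service) can be matched back to the gaze
// direction at the moment the word was actually spoken.
public class GazeHistory : MonoBehaviour
{
    struct GazeSample
    {
        public float time;  // Time.time when the sample was taken
        public Ray ray;     // head position and forward direction
    }

    readonly List<GazeSample> samples = new List<GazeSample>();
    const float HistorySeconds = 10f;

    void Update()
    {
        samples.Add(new GazeSample
        {
            time = Time.time,
            ray = new Ray(Camera.main.transform.position, Camera.main.transform.forward)
        });

        // Drop samples older than the window we care about.
        while (samples.Count > 0 && Time.time - samples[0].time > HistorySeconds)
            samples.RemoveAt(0);
    }

    // Given the Time.time at which a word was spoken (the per-word timestamp
    // plus the time the utterance started), return the closest recorded gaze ray.
    public Ray GazeAt(float spokenTime)
    {
        if (samples.Count == 0)
            return new Ray(Camera.main.transform.position, Camera.main.transform.forward);

        GazeSample best = samples[0];
        foreach (var s in samples)
            if (Mathf.Abs(s.time - spokenTime) < Mathf.Abs(best.time - spokenTime))
                best = s;
        return best.ray;
    }
}
```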

So the search moved to speech-to-text services that provide per-word timestamps. When we started this project in the summer of 2017, IBM Watson was the only service we could find that offered this. Shortly after that, Google released support for per-word timestamps in its cloud service. We did take a look, but it quickly became clear that Google’s gRPC framework was not Unity-friendly.

Meanwhile, IBM Watson had an open-source Unity SDK and a VR Speech Sandbox sample project that showed us we were thinking along similar lines. The Unity SDK didn’t support HoloLens, but we added that support in a pull request, which was quickly merged into the official SDK. IBM has done a great job of embracing these technologies and the developer community, and they have been really fun to work with.

Watson Speech to Text also gave us something we didn’t realize we wanted until we tried it: custom language models. You can train the service on a collection of words and phrases that users might say, and this improves transcription for those phrases. Once you’ve used it, it’s hard to imagine how you ever got along without it. And while other services such as Microsoft Azure have a similar offering, Watson’s pricing (1,000 minutes free per month) lets a developer use this feature at almost zero cost for personal use.
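
Under the hood, the customization workflow is just a handful of REST calls: create a custom model on top of a base model, add your domain words and phrases, then train. The sketch below shows that flow with plain HttpClient calls; the endpoint host, Basic-auth credentials, and example words are assumptions based on the 2017-era API, and the Watson Unity SDK wraps these calls for you.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class CustomLanguageModelTrainer
{
    // Assumed 2017-era host; check the URL in your own service credentials.
    const string BaseUrl = "https://stream.watsonplatform.net/speech-to-text/api/v1";

    static async Task Main()
    {
        var http = new HttpClient();
        var credentials = Convert.ToBase64String(Encoding.UTF8.GetBytes("username:password"));
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", credentials);

        // 1. Create a custom model layered on top of a base model.
        var createResponse = await http.PostAsync($"{BaseUrl}/customizations",
            new StringContent("{\"name\":\"lexicon\",\"base_model_name\":\"en-US_BroadbandModel\"}",
                Encoding.UTF8, "application/json"));
        var createJson = await createResponse.Content.ReadAsStringAsync();
        var customizationId = JsonDocument.Parse(createJson)
            .RootElement.GetProperty("customization_id").GetString();

        // 2. Add the words and phrases your users are likely to say.
        await http.PostAsync($"{BaseUrl}/customizations/{customizationId}/words",
            new StringContent("{\"words\":[{\"word\":\"teleport\"},{\"word\":\"Lexicon\"}]}",
                Encoding.UTF8, "application/json"));

        // 3. Kick off training; recognition requests can then reference the
        //    trained model by its customization id.
        await http.PostAsync($"{BaseUrl}/customizations/{customizationId}/train", null);
    }
}
```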

With the issue of speech input solved, it was easy to pick Watson Conversation for the NLP side of this. We wanted to be able to design the entire lexicon inside Unity, and Watson’s APIs let us do this. Now, with Lexicon, we can build our intents and entities in the Unity editor and train the Conversation service, while simultaneously training the custom language model for speech to text.
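
As a rough illustration of what authoring a lexicon in the editor might look like, here’s a hypothetical data model (not Lexicon’s actual classes): intents and example phrases defined as serializable Unity objects that could be pushed to Conversation as a workspace, with the same phrases reused as the corpus for the custom language model.

```csharp
using System;
using System.Collections.Generic;
using UnityEngine;

// Hypothetical editor-side data model for a lexicon, authored in Unity and
// serialized into the JSON that the Conversation and Speech to Text
// customization APIs expect.
[Serializable]
public class LexiconIntent
{
    public string intent = "move_object";
    public List<string> examples = new List<string>
    {
        "move this over there",
        "put that next to the table",
        "drag this a little to the left",
        "place it where I'm looking"
    };
}

[Serializable]
public class LexiconEntity
{
    public string entity = "direction";
    public List<string> values = new List<string> { "left", "right", "up", "down" };
}

public class LexiconDefinition : ScriptableObject
{
    public List<LexiconIntent> intents = new List<LexiconIntent>();
    public List<LexiconEntity> entities = new List<LexiconEntity>();

    // The same example phrases can double as the corpus for the Speech to Text
    // custom language model, so one authoring step trains both services.
    public IEnumerable<string> AllPhrases()
    {
        foreach (var i in intents)
            foreach (var e in i.examples)
                yield return e;
    }
}
```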

Watson Conversation works with surprisingly little training data. In most of our experiments we’ve used 3-5 sample phrases per intent, and that’s generally a great start for handling a wide range of user input.

So now Lexicon is fully utilizing the cloud, making this a truly cross-platform input solution. Almost any device with a microphone and an internet connection should be supported. We’re definitely thinking ahead to new devices like Magic Leap One and a future Apple headset. It’s exciting that we could be designing interfaces right now that will work on this future class of devices.

We’ll be keeping a close eye on competing services in the coming years. It’s possible that Lexicon will eventually support multiple cloud services that you can train simultaneously, letting you test, mix, and match to meet your application’s needs. We also plan to open up the API so you can use local dictation solutions where appropriate. But for now, we think Watson is a fantastic starting point, and we think you should give it a try!