
Adding voice control in Russian to the VR project

The topic of virtual reality, augmented reality, and metaverses is gaining momentum. But no one really knows yet what it is, what it should look like, or how to use it. However, just as with the transition from desktop to mobile applications, the migration to VR will bring new patterns of user interaction. Yes, it is already possible to “touch” objects in virtual reality, but this is not enough to fully solve user tasks. It seems that voice control in VR will become even more relevant than on mobile devices, in the form of, for example, voice commands or text input.

Below I will describe step by step how you can add voice control in Russian to a VR project.

But first, let’s talk in more detail about how speech recognition technologies are useful in VR:

  • Six years ago, researchers came to the conclusion that voice input is three times faster and more accurate than manual input using a smartphone keyboard. This is especially true for VR devices.
  • In addition, voice input does not require holding a smartphone in your hand and raising it to your face. A VR headset is already on your head, so voice input feels more natural.
  • In VR projects, we are more immersed in the virtual world. Interaction with this world should be similar to our interaction with the real one. Voice control contributes to this more than selecting menu items using game controllers.
  • Unified patterns of interaction with the user have not yet been formed. For people who are just starting to use VR devices, voice interfaces can become the most intuitive.

Let’s go directly to practice.

Selection of tools

Most programmers use Unity for their VR projects. There is also a claim that 90% of AR/VR development companies use C#, but it is a strange statement: I could not find what it is based on. Perhaps it came from some ancient Unity press release. However, a recent Forbes article states that 72% of the top 1000 mobile apps are made with Unity, which seems plausible. The same article also mentions AR/VR/XR, the metaverse, Web3, and everything else that is fashionable right now.

We have settled on the development environment; now we need to choose a Russian speech recognition service for our voice control. What do we need from such a service? First of all, recognition should be streaming, so that we do not have to wait for the end of a phrase or solve the end-of-phrase detection problem ourselves. This is where SmartSpeech comes to our aid. An added advantage is that it also returns partial recognition results for an unfinished phrase.

What is SmartSpeech? It is a speech services platform developed by the SberDevices team. SmartSpeech speech recognition (ASR) and speech synthesis (TTS) technologies are used, for example, in our Salyut family of virtual assistants, as well as in third-party projects. You can try out SmartSpeech recognition with a Telegram bot, which we have already written about on Habr.

And finally, we will test our voice control using Oculus Quest 2, the best-selling VR headset at the moment.

Connecting the recognition service

First of all, you need to sign up for the SmartSpeech service. After receiving your Client ID and Client Secret, you can test how the service works.

Developer’s office in SmartMarket Studio

To send requests to the speech recognition service, we first get a token for authentication.

For this:

  • We will prepare the authorization data by Base64-encoding a string of the form <Client ID>:<Client Secret>.
  • Generate an RqUID, for example:
uuidgen | tr 'A-Z' 'a-z'
curl --location --request POST "<token request URL>" \
    --header "Authorization: Basic <Your Base64 encoded credentials>" \
    --header "RqUID: <Your RqUID>" \
    --header "Content-Type: application/x-www-form-urlencoded" \
    --data-urlencode "scope=SBER_SPEECH"
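The encoding step above can be done, for example, with the standard base64 utility (the credential values here are placeholders, not real ones):

```shell
# Base64-encode <Client ID>:<Client Secret> for the Authorization header.
# The values below are placeholders; substitute your real credentials.
printf '%s' "my-client-id:my-client-secret" | base64
```

printf is used instead of echo so that no trailing newline sneaks into the encoded string.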

A JSON response of the following form should come back: {"access_token":"<token>","expires_at":<expiry time>}
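If you script these calls, the token can be pulled out of the reply with a simple sed substitution (the response value below is a made-up illustration):

```shell
# Extract access_token from the service reply.
# The response below is an illustrative stand-in for the real one.
response='{"access_token":"example-token","expires_at":1654000000000}'
token=$(printf '%s' "$response" | sed -E 's/.*"access_token":"([^"]*)".*/\1/')
echo "$token"
```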

Everything is ready to work. To test speech recognition, you can send an audio recording file to the service. To do this, we will need a file prepared in advance and a newly received token for authentication.

curl -X POST "<recognition request URL>" \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: audio/x-pcm;bit=16;rate=16000" \
    --data-binary @./audio.pcm
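A suitable audio.pcm can be prepared from any recording, assuming ffmpeg is installed; here a generated test tone stands in for a real microphone recording:

```shell
# Generate a 1-second test tone as a stand-in for a real recording
ffmpeg -loglevel error -y -f lavfi -i "sine=frequency=440:duration=1" audio.wav
# Convert it to raw 16-bit little-endian 16 kHz mono PCM, matching the
# audio/x-pcm;bit=16;rate=16000 content type used in the request above
ffmpeg -loglevel error -y -i audio.wav -f s16le -acodec pcm_s16le -ar 16000 -ac 1 audio.pcm
```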

Recognition in Unity

But we are interested in streaming recognition in Unity. Interaction with the service is carried out over the gRPC protocol. gRPC is a cross-platform, language-agnostic Remote Procedure Call (RPC) framework. It works on top of HTTP/2 and uses Protobuf as its interface description language.

The .proto files themselves, as well as examples for working with SmartSpeech in other programming languages, can be found in the documentation.
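For orientation, a bidirectional streaming RPC in a .proto file looks roughly like this (a simplified sketch, not the actual SmartSpeech definition, which is in the files above):

```
// Simplified sketch; see the real .proto files in the documentation
service SmartSpeech {
  // the client streams audio chunks, the server streams recognition results
  rpc Recognize(stream RecognitionRequest) returns (stream RecognitionResponse);
}
```

It is this bidirectional streaming that lets partial results arrive while audio is still being sent.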

First you need to generate C# client code for gRPC, for example, using the protoc utility together with the grpc_csharp_plugin:

mkdir -p grpc/client
protoc -I ./smartspeech-master/recognition/v1/ \
       -I ./smartspeech-master/task/v1/ \
       --csharp_out=./grpc/client \
       --grpc_out=./grpc/client \
       --plugin=protoc-gen-grpc=$(which grpc_csharp_plugin) \
       recognition.proto {storage,task}.proto

Next, you need to download the necessary libraries for Unity, for example, from the project. Note, however, that Unity support there is marked as “experimental”.

On the downloads page, we find the package we need for Unity; in our case it is grpc_unity_package.2.47.0-dev202204190851. The library files, together with the generated gRPC client code, are copied into our Unity project.

And finally, you need to write client code that will transmit data to SmartSpeech via gRPC. The entire sample code can be downloaded from the link. Here we will note the main points.

Connecting the necessary libraries:

using Grpc.Core;
using Smartspeech.Recognition.V1;

Objects that are used for gRPC interaction:

// Secure channel to the SmartSpeech endpoint with the auth token attached
Channel _channel = 
  new Channel(address, 
              ChannelCredentials.Create(new SslCredentials(), credentials));

// gRPC client generated from the .proto files
SmartSpeech.SmartSpeechClient _client = 
  new SmartSpeech.SmartSpeechClient(_channel);

// Recognition options: sample rate and raw 16-bit little-endian PCM
RecognitionOptions _options = new RecognitionOptions();
_options.SampleRate = _sampleRate;
_options.AudioEncoding = RecognitionOptions.Types.AudioEncoding.PcmS16Le;

We obtain the authentication token via plain HTTP, as shown earlier. The audio for recognition is captured from the microphone and transmitted over gRPC. Note that before you start sending audio data, the first request must carry the recognition options:

var call = _client.Recognize();

var streamRequest = new RecognitionRequest();
streamRequest.Options = _options;
await call.RequestStream.WriteAsync(streamRequest);

And then, in the loop, the data from the microphone:

// Read the next chunk of samples from the microphone's looping AudioClip
audioSource.clip.GetData(samples, curDataIndex);

// Advance the read position, wrapping around at the end of the clip
curDataIndex += bufferSize;
if (curDataIndex >= audioSource.clip.samples)
    curDataIndex = 0;

// Convert float samples to 16-bit PCM and send them over the stream
buffer = GetSamplesWaveData(samples, bufferSize);
streamRequest.AudioChunk = Google.Protobuf.ByteString.CopyFrom(buffer);
await call.RequestStream.WriteAsync(streamRequest);

Then we asynchronously read the recognition results from the response stream:

var response = call.ResponseStream.Current;
bool eou = response.Eou;               // end-of-utterance flag
var results = response.Results;
string text = results[0].Text;         // recognized text as spoken
string normalizedText = results[0].NormalizedText;  // normalized form

Ready-made application

As a result, we should get something similar to the video. A demo Unity project for Oculus Quest 2 can be downloaded here.

Register in the SmartSpeech service, give it a try, and create your own applications. If you have any questions, leave them in the comments and we will try to answer them promptly.

Perhaps the attention that metaverses and virtual reality are attracting right now is overhyped. Nevertheless, more and more companies are entering this field. An international consulting company has released a report noting the great potential of the metaverse concept and predicting that by 2030 the market will be worth $5 trillion. So we can start moving in this direction right now.
