Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction

Where a person is looking is an important social cue in human-human interaction, allowing someone to address a particular person in conversation or indicate an area of interest. For several decades, human-computer interaction researchers have explored using gaze data to ease and enhance interactions with computing systems, ranging from social robots to smart environments. However, capturing gaze direction requires either special sensors worn on the head (unlikely to see consumer adoption) or external cameras (which can be privacy invasive).

In this research, we explored the use of speech as a directional communication modality. In addition to receiving and processing spoken content, we propose that devices also infer the Direction of Voice (DoV). Note that this is different from Direction of Arrival (DoA) algorithms, which calculate the direction from which a voice originated (i.e., where the speaker is relative to the device). In contrast, DoV calculates the direction along which a voice was projected (i.e., which way the speaker was facing while speaking).
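
To make the contrast concrete, below is a minimal sketch of conventional DoA estimation using the classic GCC-PHAT time-difference-of-arrival technique on a two-microphone array. This is illustrative background only, not our DoV method; the microphone spacing, sample rate, and function names are assumptions chosen for the example.

import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.1        # assumed 10 cm between the two microphones
FS = 16000               # assumed sample rate (Hz)

def gcc_phat(sig, ref, fs=FS, max_tau=MIC_SPACING / SPEED_OF_SOUND):
    """Estimate the time difference of arrival (TDOA) between two channels."""
    n = sig.shape[0] + ref.shape[0]
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def doa_angle(tau):
    """Convert a delay into a bearing relative to the array broadside."""
    return np.degrees(np.arcsin(np.clip(tau * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)))

# Toy example: the same noise burst reaches mic 2 three samples after mic 1.
rng = np.random.default_rng(0)
burst = rng.standard_normal(FS)
mic1 = burst
mic2 = np.concatenate((np.zeros(3), burst[:-3]))
print(f"Estimated DoA: {doa_angle(gcc_phat(mic1, mic2)):.1f} degrees")

DoA of this kind tells a device where the talker is, but not whether the talker is addressing it; DoV targets the latter question.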

Such DoV estimation innately enables voice commands with addressability, in a similar way to gaze, but without the need for cameras. This allows users to easily and naturally interact with diverse ecosystems of voice-enabled devices, whereas today’s voice interactions suffer from multi-device confusion. With DoV estimation providing a disambiguation mechanism, a user can speak to a particular device and have it respond; e.g., a user could ask their smartphone for the time, their laptop to play music, their smart speaker for the weather, and their TV to play a show. Another benefit of DoV estimation is the potential to dispense with wake words (e.g., “Hey Siri”, “OK Google”) when devices are confident that they are the intended target of a command. This would also enable general commands (e.g., “up”) to be innately device-context specific (e.g., for window blinds, a thermostat, or a television). One hypothetical arbitration rule is sketched below.
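
As an illustration of how such disambiguation might work, the sketch below has each device score how confident it is that the user is facing it, with only the most confident device above a threshold responding. The DovResult type, threshold value, and device names are hypothetical, not part of our system.

from dataclasses import dataclass

@dataclass
class DovResult:
    device: str
    facing_confidence: float  # model's belief (0..1) that the user is facing this device

FACING_THRESHOLD = 0.6  # assumed cutoff; a real system would tune this

def pick_responder(results):
    """Return the device the user is most likely addressing, or None if no device is confident."""
    best = max(results, key=lambda r: r.facing_confidence)
    return best.device if best.facing_confidence >= FACING_THRESHOLD else None

results = [
    DovResult("smartphone", 0.12),
    DovResult("laptop", 0.08),
    DovResult("smartspeaker", 0.87),
    DovResult("tv", 0.21),
]
print(pick_responder(results))  # -> "smartspeaker"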

Our approach relies on fundamental acoustic properties of both human speech and multipath effects in human environments. Our machine learning model leverages features derived from these phenomena to predict both the angular direction of a voice and, more coarsely, whether a user is facing or not facing a device. Our software is lightweight, able to run on a wide variety of consumer devices without sending audio to the cloud for processing, which helps preserve privacy.
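
As a rough illustration of this idea (not our published pipeline), the sketch below computes simple band-energy features from a single microphone channel, exploiting the general acoustic fact that high-frequency speech energy is radiated more directionally than low-frequency energy, and feeds them to a generic scikit-learn classifier for a facing / not-facing decision. The band edges, synthetic training data, and classifier choice are assumptions for the example.

import numpy as np
from scipy.signal import welch
from sklearn.ensemble import RandomForestClassifier

FS = 16000
BAND_EDGES = [(100, 1000), (1000, 3000), (3000, 8000)]  # assumed bands (Hz)

def band_energy_features(audio, fs=FS):
    """Log energy per band plus a high-minus-low log ratio; speaking away from
    the device tends to attenuate high frequencies more than low ones."""
    freqs, psd = welch(audio, fs=fs, nperseg=1024)
    energies = [psd[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BAND_EDGES]
    feats = np.log(np.asarray(energies) + 1e-12)
    return np.append(feats, feats[-1] - feats[0])

# Toy training data: synthetic "facing" clips keep more high-frequency energy.
rng = np.random.default_rng(1)
def synth_clip(facing):
    clip = rng.standard_normal(FS)
    if not facing:
        clip = np.convolve(clip, np.ones(8) / 8, mode="same")  # crude low-pass
    return clip

X = np.array([band_energy_features(synth_clip(f)) for f in [True, False] * 50])
y = np.array([1, 0] * 50)  # 1 = facing, 0 = not facing

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([band_energy_features(synth_clip(True))]))   # likely [1]

A real DoV estimator would use richer multi-channel and reverberation (multipath) cues and a model trained on recorded speech, but the feature-then-classifier structure is the same.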

Reference

Ahuja, K., Kong, A., Goel, M. and Harrison, C. 2020. Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction with Smart Devices Ecosystems. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology (October 20 - 23, 2020). UIST '20. ACM, New York, NY.
