It has been a few years since I participated in research in the domain of Data Vocalization. Although I have since graduated from Cornell and pursued a career in industry as a Software Engineer, I’m still inspired by the problems I investigated in this area and by its potential to change the way we interact with data. So I wanted to take a moment to reflect on this work and provide an introduction to the field of data vocalization, in hopes that others may be inspired to investigate this domain or to start thinking about how they interact with voice assistant technologies in a new way.

What is Data Vocalization?

In recent years, we’ve seen an increase in the number of computing devices that permeate our daily lives. Many of us now carry a voice assistant in our pockets in the form of Apple’s Siri or Amazon’s Alexa installed on our mobile phones. Furthermore, many of us now have smart devices with these voice interfaces in our homes, always ready to listen, fetch information, and convey it in a meaningful way. These interactions tell us the day’s weather forecast or how the S&P index has performed, and they give us a useful way to tap computing resources without even looking at a screen.

The area of data vocalization isolates this problem and asks the question, “how do we best process and convey data-related queries through voice interfaces?” It is quite an interesting area to explore, as it has the potential to make these voice interfaces more effective, efficient, and powerful. It’s also incredibly relevant, as voice interfaces can fill the gap in situations where visual interfaces are insufficient. For example, for those who are visually impaired, looking at a computer screen may not be an option. Yet the need to process and understand data to make informed decisions - whether in personal or business life - remains just as important. Additionally, voice interfaces can supplement visual ones or provide better descriptions of data in many scenarios.

In this sense, data vocalization is complementary to the well-studied problem of data visualization. However, it presents its own unique challenges that motivate its consideration as a separate field of study. During my time researching this problem with Immanuel Trummer at Cornell, we focused on developing approaches and validating methods of data vocalization with an emphasis on relational data and time series data - two very common types of data in the real world.

What is hard about Data Vocalization?

One prominent challenge of data vocalization is simply the limit of human cognition in processing voice output. For example, as we increase the length of a spoken output describing a data set, there is an increased likelihood that a user will simply forget what was previously spoken. So, voice interfaces must be concise in how they choose to describe data, so as not to drone on and lose the interest or attention of a user. This issue is also present in the field of data visualization (i.e. there are only so many pixels on a screen with which to summarize a time series plot), but it is especially prominent in data vocalization, where users have one shot to comprehend the spoken output (unless they ask the voice assistant to read back the response a second time).

Along the same lines, there are potentially many ways to represent a data set through voice output. So, we must develop models of human cognition that align with how humans process spoken communication in order to decide which output is best. A data vocalization backend can then evaluate different output plans, or representations, before settling on the single representation it delivers to the user. Similarly, users may have different preferences or cognitive capabilities, which may motivate customizing a voice description of a data set for specific users.
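
To make the idea concrete, here is a minimal sketch of plan selection: generate several candidate spoken descriptions, score each with a cost model, and keep the cheapest. The OutputPlan class, the info_loss scores, and the linear cost function are illustrative assumptions for this post, not the actual model from our research.

```python
from dataclasses import dataclass

@dataclass
class OutputPlan:
    text: str         # candidate spoken description
    info_loss: float  # detail the summary discards (0 = lossless); assumed given

def speaking_cost(plan: OutputPlan, words_per_penalty: int = 20) -> float:
    """Toy cost model: penalize long utterances (listener memory load)
    plus the information lost by summarizing."""
    length_penalty = len(plan.text.split()) / words_per_penalty
    return length_penalty + plan.info_loss

def best_plan(candidates: list[OutputPlan]) -> OutputPlan:
    # Score every candidate output plan and keep the cheapest one.
    return min(candidates, key=speaking_cost)

candidates = [
    OutputPlan("Prices rose steadily all day.", info_loss=0.9),
    OutputPlan("Prices opened at 4500, dipped 1% by noon, "
               "and closed up 2% at 4590.", info_loss=0.2),
]
print(best_plan(candidates).text)  # here the longer plan wins: its detail is worth the words
```

A per-user variant could simply vary words_per_penalty or the weighting of information loss to reflect an individual listener’s preferences.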

Similar to data visualization, we also have the issue of processing very large datasets. Suppose, for example, you asked a voice assistant to summarize the S&P’s minute-by-minute price ticks across the last 30 years (or more!). Users expect voice interfaces to produce a spoken result quickly, so the backend processors for these interfaces must deliver results quickly even over large data sets. To that end, we might try sampling approaches or pre-processing of data sets to assist with on-demand requests, or leverage database indices, which can be useful data structures for identifying larger groupings of data.
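
As a toy illustration of the sampling idea (and only that; a real backend might instead rely on pre-computed aggregates or indices), the sketch below summarizes a large synthetic tick series from a random sample rather than a full scan, trading a little accuracy for response time:

```python
import random
import statistics

# Synthetic stand-in for decades of minute-level price ticks.
ticks = [4500 + random.gauss(0, 5) for _ in range(1_000_000)]

def summarize(ticks: list[float], sample_size: int = 10_000) -> str:
    """Summarize from a random sample instead of scanning every tick,
    so the spoken response stays fast even on huge series."""
    sample = random.sample(ticks, min(sample_size, len(ticks)))
    return (f"Prices averaged about {statistics.mean(sample):.0f}, "
            f"ranging from roughly {min(sample):.0f} to {max(sample):.0f}.")

print(summarize(ticks))
```

Note the hedged wording (“about”, “roughly”) in the output: a sample’s minimum and maximum understate the true extremes, which is part of the accuracy-for-latency trade.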

Applications to developing voice-based systems

One of the interesting implications of our data vocalization research, which has become clearer to me as I’ve tackled software engineering problems in industry, relates to the DRY principle (i.e. “Don’t Repeat Yourself”). As new techniques are researched and developed to tackle the problem of data vocalization, we can build more generic models that represent more and more types of data without additional development effort. For example, after building the prototype for our experimental data vocalization system, I realized how easy it was to plug and play different data sets, allowing interaction with new and exciting data domains.

In our initial testing, we used an academic dataset from Yelp to evaluate how our system would work in a theoretical scenario where someone asked a voice assistant for an overview of nearby restaurants. However, there was nothing really unique about the properties of the restaurants table that required our system to limit itself to restaurant-only data. The only key feature was knowing which database table the voice query was targeting and which columns of the relational data were relevant to the output (i.e. rating, cuisine_type, price_range, etc.). One can then imagine a new, exciting dataset becoming available that users want to interact with. We could then use a generic approach to, say, import a dataset of home sales listings and let users explore new and unique voice queries to evaluate the market for homes in specific neighborhoods.
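
A rough sketch of what that plug-and-play shape might look like is below; the table names, columns, and describe_row helper are hypothetical, invented for illustration rather than taken from our prototype:

```python
# Hypothetical registry: adding a dataset is just declaring the target
# table and which columns are relevant to vocalize.
DATASETS = {
    "restaurants": {
        "table": "restaurants",
        "columns": ["rating", "cuisine_type", "price_range"],
    },
    "home_sales": {
        "table": "home_sales_listings",
        "columns": ["price", "bedrooms", "neighborhood"],
    },
}

def describe_row(dataset: str, row: dict) -> str:
    """Render a row from any registered dataset as a spoken sentence."""
    cols = DATASETS[dataset]["columns"]
    parts = [f"{col.replace('_', ' ')} is {row[col]}" for col in cols]
    return "Its " + ", its ".join(parts) + "."

print(describe_row("restaurants",
                   {"rating": 4.5, "cuisine_type": "Thai", "price_range": "$$"}))
# -> "Its rating is 4.5, its cuisine type is Thai, its price range is $$."
```

Onboarding the home sales dataset costs nothing beyond its registry entry; no vocalization logic changes.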

An alternate approach in the software engineering world would be to develop application-specific plugins that support specific use cases. For example, one could imagine a weather plugin that formats weather data into a voice template (i.e. “the high is 60 degrees at 1pm and then it drops to 52 degrees by 7pm”). However, whenever a new use case or dataset is identified, this approach requires designing a whole new plugin to handle it. The former, generic approach, backed by additional research in data vocalization, could lead to an explosion in the utility of these voice interfaces for handling real-world data.
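
For contrast, here is a sketch of what such a hard-coded plugin might look like; the weather_plugin function and its input shape are invented for illustration:

```python
# Hypothetical application-specific plugin: the template is tied to
# weather data, so a new dataset would mean writing a whole new plugin.
def weather_plugin(forecast: list[dict]) -> str:
    high = max(forecast, key=lambda h: h["temp"])
    last = forecast[-1]
    return (f"the high is {high['temp']} degrees at {high['time']} "
            f"and then it drops to {last['temp']} degrees by {last['time']}")

print(weather_plugin([
    {"time": "1pm", "temp": 60},
    {"time": "4pm", "temp": 57},
    {"time": "7pm", "temp": 52},
]))
```

Supporting home sales listings under this design would mean writing a home_sales_plugin from scratch, whereas the generic approach above only needs a new registry entry.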

In Closing

I am excited by the potential for these voice interfaces to become more commonplace and more useful in our daily lives. Hopefully, with additional research in data vocalization, we will see their utility grow dramatically.

Many thanks to Professor Immanuel Trummer, Jiancheng Zhu, and Ramya Narasimha, with whom I collaborated on two papers at Cornell on the topic of Data Vocalization. I was fortunate to join Professor Trummer’s team during my time at Cornell, and I am a much better engineer today for the experience of collaborating with this group.