Building Conversations

Matt Carroll Aug 20, 2017

TL;DR - Don’t teach your users how to talk.

The state of conversing with computers

Conversational interfaces are at an interesting point. Like VR, two things are happening in the space: 1) new technology has been developed that has the potential to turn the space from a gimmick into a major new surface of computing, and 2) the technology has begun to mature, and best practices and development trends have started to emerge. The difference between VR and conversational interfaces is that the barrier to entry for conversational interfaces is extremely low: no new hardware is required to use or develop for them. What conversational interfaces do have that VR doesn’t is a reputation for overpromising and under-delivering, but the tools to deliver on those promises are here now, and you can start fulfilling them today.

Speaking naturally

The promise of conversational interfaces has been, for a long time, a natural, universal and very efficient way of interacting with a computer: human language. The issue has been, how do you understand human language? Endless switch or else-if statements? A human-incomprehensible regex? String parsing until your eyes bleed?

So how do you do it? Machine learning.

Interpreting human language is a perfect problem for machine learning. There is a pattern to language, but it’s stuck with tens of thousands of years of legacy compatibility, conventions and history. Its nuances and structure are not directly compatible with 1s and 0s. Machine learning can take the complex signals in human speech, match the incomprehensibly complex set of variables involved, and determine the meaning behind it, translating the most powerful communication construct devised by man into something a computer can understand.

So what?

It’s still early and it’s far from perfect, but it’s here and you can start today. Right now, you can make something that lets someone talk to your app naturally, and your app can respond sensibly. I work on Dialogflow, but similar natural language understanding and processing tools work the same way: you provide a list of examples of the queries a user might say to your application, and a machine learning model is created for that user intention. For every type of request you’d like your app to respond to, you create an “intent” with examples of user queries. Then, when a user’s query actually comes in, it is run against every intent model in your agent, the model with the highest score is matched, and your app proceeds assuming the user’s intention is the one you defined for the matched intent.

It doesn’t sound like much, but this is what moves a conversation from “press 1 for customer service” to “Here is a link to reset your password.” Intent recognition is a key building block for maintaining the user’s illusion of a “real” conversation. It can be used to classify the user’s initial request, to change the topic mid-conversation, and to handle complex interactions where multiple intents are expressed over the course of the conversation. Natural language understanding tools can condense what would otherwise be tens or hundreds of user interactions into one conversation that ends with one answer.
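To make the idea concrete, here is a toy sketch of intent matching in Python. It scores a query against each intent’s example phrases with simple word overlap; real tools like Dialogflow use trained machine learning models rather than anything this naive, so treat it purely as an illustration of the concept (the intent names and examples are made up):

```python
# Toy illustration of intent matching: score a query against example
# phrases for each intent and pick the best match. Real NLU tools use
# machine learning models, not word overlap.
INTENTS = {
    "reset_password": ["I'd like to reset my password", "password reset",
                       "I forgot my password"],
    "get_price": ["How much for a haircut?", "what are your prices"],
}

def match_intent(query):
    query_words = set(query.lower().split())
    best_intent, best_score = None, 0.0
    for intent, examples in INTENTS.items():
        for example in examples:
            example_words = set(example.lower().split())
            # Jaccard similarity: shared words / all words
            score = len(query_words & example_words) / len(query_words | example_words)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score

print(match_intent("i forgot my password"))  # ('reset_password', 1.0)
```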

What do you build?

Conversational apps only work when they are the easiest way of accomplishing a task. Only go down this road if having a conversation is the best way of interacting with your users. Displaying data and trends cannot be done well with a conversation, but quick lookups can be. The difference between surfacing a menu of services and prices and answering “How much for a haircut?” is a very important one. For example, when using Kayak’s Google Assistant app to search for a flight, Kayak will detail the cheapest flight between two airports for given dates, but will not list every option or even the top few. Be sure your use case is best solved with a conversational interface before going down this path.

How do you start?

  1. Create a persona
    People expect the other end of a conversation to have a personality, because most, if not all, of the conversations they’ve ever had have had a human being on the other end. Meeting this conversational expectation is very important for the user’s perception of the conversation as natural. Think about how your app is perceived, and match the persona to the core values of your app or business. If it’s a friendly small business, maybe mirror the demeanor of the owner; if it’s a powerful analytics tool, maybe take on the personality of a sharp analyst. It may help to write a brief persona profile.
  2. Write “happy path” dialog
    Start with the happy path: what you want a conversation with a user to actually look like. Write out the ideal conversations you’d like a user to have with your app, for every type of query your app will respond to. There are many conversation mocking tools available for a wide variety of platforms.
  3. Create your intents
    Use your happy paths to create an intent that matches the user’s initial query, e.g. for a password reset, add examples like “I’d like to reset my password”, “password reset”, “I forgot my password”, etc. (see the sketch after this list).
  4. Fill out your conversation
    This is the hardest part: you need to fill out the other intents to include all follow-ups, clarifying questions and interactions. Make sure this dialog matches your persona. Your persona can also smooth out problems before they begin and allow for creative, friendly responses that help the user understand your app.
  5. Handle misunderstandings
    Error messages in conversations are not OK. In conversation you don’t say, “I can’t do that” unless you are sure. You first ask clarifying questions, fully understand what’s being communicated and then answer what’s being asked. Misunderstandings can be the end of a conversation, but they are often the beginning of one: clarify earlier statements and state the same ideas in different ways.
  6. Test and iterate
    Start by testing the app yourself. Try to break it with strange queries and nonsense, and emulate frustrated and unhappy users. Fix the issues you come across by changing, adding and removing intent examples to refine your conversation. Next, share it with your team and friends to improve and refine further. Make sure the correct intent is matched; if it isn’t, add or remove examples as necessary.
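As a sketch of steps 3 through 5, the intents for a hypothetical password-reset conversation might be laid out as data like this; the names and structure are illustrative, not any particular platform’s export format:

```python
# Hypothetical intent layout for a password-reset conversation.
# Each intent pairs example user queries with the agent's responses;
# names and structure are illustrative, not a real platform's format.
AGENT_INTENTS = {
    "reset.password": {
        "examples": ["I'd like to reset my password", "password reset",
                     "I forgot my password"],
        "response": "Sure, what's the email address on your account?",
    },
    "reset.password.email": {  # follow-up that collects the email
        "examples": ["it's name@example.com", "my email is name@example.com"],
        "response": "Thanks! I've sent a reset link to that address.",
    },
    "fallback": {  # handles misunderstandings with a clarifying question
        "examples": [],  # matched when nothing else scores high enough
        "response": "Sorry, did you want help with your password, or something else?",
    },
}
```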

Fulfilling the conversation

After you’ve got your conversation structured, you need to actually fulfill the action requested by your users. Most platforms allow you to define a URL, called a webhook, in the natural language tool. This webhook is called with structured information about the user’s request, including information about the intent and any relevant parameters configured in your natural language understanding tool. While natural language understanding tools remove the need to parse queries, you’ll still need to do some work to fulfill the user’s request based on their intention.
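A minimal webhook might look like the following sketch. It uses Flask, and the request and response field names (queryResult, intent.displayName, parameters, fulfillmentText) follow Dialogflow’s general JSON shape, but check your platform’s documentation since every platform structures these differently; send_reset_email is a hypothetical placeholder:

```python
# Minimal webhook sketch using Flask; field names follow Dialogflow's
# general request/response shape but vary by platform.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    intent = body["queryResult"]["intent"]["displayName"]
    params = body["queryResult"]["parameters"]

    if intent == "reset.password":
        # Do the real work here: database lookups, API calls, etc.
        send_reset_email(params.get("email"))
        reply = "I've sent a reset link to your email."
    else:
        reply = "Sorry, I didn't catch that."

    return jsonify({"fulfillmentText": reply})

def send_reset_email(email):
    pass  # placeholder for your real fulfillment logic
```

Everything here lives in one function; the sections below split this work into a more maintainable structure.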

Structuring your code

Most conversational platforms use a paradigm similar to web page requests. The first concept is request handlers. Most of the time you want to map the intent to a function which takes the structured input from the natural language understanding platform, does whatever is needed (database lookups, API calls, calculations, etc.) and then returns a conversational response, either as a string or as a data structure for the chat platform where the request originated (e.g. Google Assistant, Slack, Facebook, etc.). The ideal structure usually falls into four sections: intent-to-handler mapping, function handlers, response generators and response templates, as seen below:

[Figure: conversation design code organization]

This approach should work well for most small to medium sized applications. Depending on your language/platform and the size of the application, each section might be one file, a directory or a microservice.
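For a small Python service, one possible layout might look like this (the file names are illustrative, and are reused in the sketches below):

```
app/
├── router.py      # intent → handler mapping
├── handlers.py    # function handlers
├── generators.py  # response generators
└── templates.py   # response templates
```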

Intent to function mapping

This section can be very simple for smaller apps, and is where most of the boilerplate for importing platform-specific libraries occurs. This is where the initial request is handled and routed to the correct handler function, which is called with the correct parameters.
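A sketch of this layer, replacing the inline if/else from the webhook example above and assuming the hypothetical handlers module from the next section:

```python
# router.py: map intent names to handler functions and dispatch.
# Intent names and the handlers module are illustrative assumptions.
from handlers import reset_password, get_price, fallback

INTENT_HANDLERS = {
    "reset.password": reset_password,
    "get.price": get_price,
}

def route(intent_name, parameters):
    # Fall back to a clarifying response when no handler matches.
    handler = INTENT_HANDLERS.get(intent_name, fallback)
    return handler(parameters)
```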

Function Handlers

Function handlers receive the request parameters and gather all the information necessary to process the request, including API calls, database lookups and calculations. The function handler then calls the correct response generator with all the parameters the response generator needs to compose a response.
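Continuing the same hypothetical layout, handlers.py might look like this; create_reset_token and lookup_price stand in for whatever real database or API work your app needs:

```python
# handlers.py: gather what each request needs, then hand off to a
# response generator. All names here are illustrative.
from generators import generate_response

def reset_password(parameters):
    email = parameters.get("email")
    create_reset_token(email)  # e.g. a database write or API call
    return generate_response("reset_password", email=email)

def get_price(parameters):
    service = parameters.get("service", "a haircut")
    price = lookup_price(service)  # e.g. a database lookup
    return generate_response("price", service=service, price=price)

def fallback(parameters):
    return generate_response("fallback")

def create_reset_token(email):
    return "hypothetical-token"  # placeholder for real logic

def lookup_price(service):
    return 25  # placeholder for real logic
```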

Response Generators

The response generator uses the parameters to determine the correct response template and all the parameters needed to fill in that template (like product names, the temperature or whatever information the user requested). This can be useful when you’d like to randomize which response a user receives or change the response template based on a parameter.

These generators should use data structures that allow for different languages (English, Japanese, German, etc.) as well as different responses based on where the request is coming from: a phone, a voice-only interface, a platform requiring a specific response structure, or other requests that need a specifically structured response.
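In the same sketch, the generator can stay very small: it picks one of the templates for a given key and locale (randomizing for variety) and fills in the parameters:

```python
# generators.py: choose a template for the key and locale, then fill
# in the parameters. TEMPLATES is defined in templates.py (below).
import random

from templates import TEMPLATES

def generate_response(key, locale="en-US", **params):
    options = TEMPLATES[locale][key]
    template = random.choice(options)  # vary responses for a natural feel
    return template.format(**params)
```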

Response Templates

The response templates include all the possible responses your application can return to the user. Template strings with placeholders are used so that the response generator can fill in the required information.

The response then bubbles back up through the stack, through the natural language understanding provider to the conversational platform, and back to the user to read or hear. Remember to design these templates with different languages in mind (English, Japanese, German, etc.). If you are starting a multilingual agent from scratch, you may want to consider having different response template files for each language and region (en-US, en-GB, zh-CN, de-CH, etc.).
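Closing out the sketch, templates.py might keep responses keyed by locale and then by response key, so adding a language means adding one more top-level entry (the strings here are illustrative):

```python
# templates.py: every response the app can return, keyed by locale
# and then by response key; {placeholders} are filled by the generator.
TEMPLATES = {
    "en-US": {
        "reset_password": ["I've sent a reset link to {email}.",
                           "Done! Check {email} for a reset link."],
        "price": ["{service} is ${price}."],
        "fallback": ["Sorry, I didn't catch that. Could you rephrase?"],
    },
    "de-DE": {
        "reset_password": ["Ich habe einen Link zum Zurücksetzen an {email} gesendet."],
        "price": ["{service} kostet {price} Dollar."],
        "fallback": ["Das habe ich leider nicht verstanden."],
    },
}
```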

Considering Voice and Graphical Elements in Conversation

If you are targeting a voice platform (Google Home, Alexa, a telephone call, etc.), special care has to be taken when crafting your responses. Users can be easily frustrated by long answers to their questions and can quickly become impatient if your responses are not laser-focused on answering their query. The following guidelines can be useful when developing for voice-based interfaces:

  • Use contractions (is not → isn’t, can not → can’t, etc.); they go a long way in making a response feel like talking to a human.
  • Include support for no-input responses (platform dependent).
  • Test your app on the target devices; speech-to-text quality can depend on the device.

Most messaging platforms also include the ability to add graphical elements to chat, like cards, lists, suggestion chips and many others. The most important guideline to follow for visual interfaces is to keep related visual and dialog elements close together. For example: if you are both answering a question and asking one of your own in a single response, display the answer to their question first (keeping their question and your answer visually close together) and then ask your own question (putting your question closest to their response). Most designs are very straightforward, with few elements to work with, but some of these guidelines can be helpful:

  • Displayed text should be a subset of the spoken (TTS) response, when it makes sense
  • Don’t unnecessarily repeat information
  • Take advantage of information that is already present

Go build something!

Developing conversations isn’t a simple task, but machine learning has made it possible to have a much more natural conversation with your user than before, and it can be the best way to accomplish some tasks. In that light, now is the time to start building conversational interfaces to let users interact with your app more easily and to open up your app to new use cases and audiences.