CHATBOT: Architecture, Design, & Development
By Jack Cahn
Thesis Advisor: Dr. Boon Thau Loo
Engineering Advisor: Dr. Jean Gallier
Senior Thesis (EAS499)
University of Pennsylvania
School of Engineering and Applied Science
Department of Computer and Information Science
April 26, 2017
Table of Contents
1. Introduction
2. Chatbot Overview
2.1. Background & History
2.1.1. What is a Chatbot?
2.1.2. Chatbot History
2.2. Evaluation
2.2.1. Evaluation Perspectives
2.2.2. PARADISE Framework
2.2.3. Other Evaluation Methods
3. Architecture & Design
3.1. Speech-to-Text Conversion
3.1.1. Large Vocabulary Speech Recognition
3.1.2. ASR Process Model
3.1.3. Restricted Boltzmann Machine (RBM) Implementation
3.2. Natural Language Processing
3.2.1. Dialogue Act (DA) Recognition
3.2.2. Bayesian Approaches to DA Models
3.2.3. Non-Bayesian Approaches to DA Models
3.2.4. Intent Identification
3.2.5. Information Extraction
3.2.6. Statistical Methods for Information Extraction
3.3. Response Generation
3.3.1. Rule-Based Models
3.3.2. Information Retrieval (IR)-Based Models
3.3.3. Statistical Machine Translation Generative Models
3.3.4. Sequence to Sequence (Seq2Seq) Model
3.3.5. Reinforcement Learning with Seq2Seq bots
3.4. Knowledge Base Creation
3.4.1. Human-Annotated Corpora
3.4.2. Discussion Forums
3.4.3. Email Conversations
3.5. Dialogue Management
3.5.1. Communication Strategies
3.5.2. Language Tricks
3.5.3. Dialogue Design Principles
3.5.4. Human Imitation Strategies
3.5.5. Personality Development
3.6. Text to Speech
3.6.1. Text Analysis
3.6.2. Waveform Synthesis
4. Applications & Development
4.1. IBM Watson Case Study
4.1.1. Natural Language Processing
4.1.2. Response Generation
4.1.3. Knowledge Base
4.2. Security Considerations
4.2.1. Security Flaws in Chatbot Platforms
4.2.2. Malicious Chatbots
4.3. Applications
4.3.1. Virtual Personal Assistants (VPAs)
4.3.2. Consumer Domain-Specific Bots
5. References
1. Introduction
Chatbots are “online human-computer dialog system[s] with natural language.” [1] The first
conceptualization of the chatbot is attributed to Alan Turing, who asked “Can machines think?” in
1950. [3] Since Turing, chatbot technology has improved with advances in natural language
processing and machine learning. Likewise, chatbot adoption has also increased, especially with
the launch of chatbot platforms by Facebook [93], Kik [94], Slack [95], Skype [96], WeChat [97],
Line [98], and Telegram [99]. By September 2016, Facebook Messenger hosted 30,000 bots and
had 34,000 developers on its platform. [100] The Kik Bot Shop announced in August 2016 that
the 20,000 bots created on its platform had “exchanged over 1.8 million messages.” [101]
This paper is a literature review of the design choices, architecture, and algorithms used in
chatbots. Section 2 will describe chatbot function and history in more detail and discuss the
methods used to evaluate chatbots. Section 3 will walk through chatbot functionality step-by-step,
beginning with automatic speech recognition (ASR) algorithms, continuing through natural language
processing (NLP) functionality, response generation approaches, knowledge base creation strategies,
and dialogue management (DM) algorithms, and concluding with a discussion of text to speech algorithms.
Section 4 will focus on chatbot development, beginning with a case study of IBM Watson’s chatbot
functionality and concluding with a discussion of security considerations and chatbot applications.
2. Chatbot Overview
2.1. Background & History
2.1.1. What is a Chatbot?
A chatbot is an “online human-computer dialog system with natural language.” [1] Sansonnet
et al. [2] provide a basic framework which outlines the functions expected from modern chatbots:
A. Dialogic Agent: must understand the user, i.e. provide the function of comprehension. Bots
are given textual (or oral, see Section 3.1) input, which is analyzed with natural
language processing tools and used to generate appropriate responses. [2]
B. Rational Agent: must have access to an external base of knowledge and common sense
(e.g. via corpora of data) such that it can provide the function of competence, answering
user questions. Should store context-specific information (e.g. user’s name, etc.). [2]
C. Embodied Agent: should “provide the function of presence…once regarded as very
optional…this function proves to be crucial in the case of ordinary users.” Even the earliest
bots were given names (ELIZA, ALICE, CHARLIE, etc.) in order to satisfy this condition.
Today, developers are focused on the use of language tricks to create personas for chatbots
in order to build trust with users and give the impression of an embodied agent. [2]
2.1.2. Chatbot History
In 1950, Alan Turing asked the question “Can machines think?” Turing conceptualized the
problem as an “imitation game” (now called the Turing Test), in which an “interrogator” asked
questions to human and machine subjects, with the goal of identifying the human. If the human
and machine are indistinguishable, we say the machine can think. [3] In 1966, Joseph
Weizenbaum at MIT created the first chatbot that, arguably, came close to imitating a human:
ELIZA. Given an input sentence, ELIZA would identify keywords and pattern match those
keywords against a set of pre-programmed rules to generate appropriate responses. [4]
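The keyword-and-rule matching described above can be illustrated with a minimal sketch. This is not Weizenbaum's original script: the two rules, the `RULES` table, the `respond` function, and the fallback reply are all hypothetical, chosen only to show how a keyword pattern selects a pre-programmed response template.

```python
import re

# Hypothetical rule table: (keyword pattern, response template).
# Real ELIZA scripts were far larger and ranked keywords by priority.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_RESPONSE = "Please go on."  # assumed fallback when no keyword matches


def respond(sentence: str) -> str:
    """Return the template of the first rule whose pattern matches the input."""
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            # Strip trailing punctuation from captured text before reuse.
            captured = [group.rstrip(".!?") for group in match.groups()]
            return template.format(*captured)
    return DEFAULT_RESPONSE
```

Under these assumed rules, `respond("I am sad.")` reuses the captured text in its reply, while an input with no known keyword falls back to the default response.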
Since ELIZA, there has been progress in the development of increasingly intelligent
chatbots. In 1972, Kenneth Colby at Stanford created PARRY, a bot that impersonated a
paranoid schizophrenic. [5] In 1995, Richard Wallace created A.L.I.C.E, a significantly more
complex bot that generated responses by pattern matching inputs against <pattern> (input)
<template> (output) pairs stored in documents in a knowledge base. These documents were
written in Artificial Intelligence Markup Language (AIML), an extension of XML, which is
still in use today. ALICE is a three-time winner of the Loebner prize, a competition held each
year which attempts to run the Turing Test, and awards the most intelligent chatbot. [6]
Modern chatbots include: Amazon’s Echo and Alexa, Apple’s Siri, and Microsoft’s
Cortana. [7] The architectures and retrieval processes of these bots take advantage of advances
in machine learning to provide advanced “information retrieval” processes, in which responses
are generated based on analysis of the results of web searches. Others have adopted “generative”
models to respond; they use statistical machine translation (SMT) techniques to “translate”
input phrases into output responses. Seq2Seq, an SMT algorithm that uses recurrent neural
networks (RNNs) to encode and decode inputs into responses, is a current best practice. [8]
2.2. Evaluation
2.2.1. Evaluation Perspectives
There are a number of different perspectives on how to evaluate chatbot performance. From
an information retrieval (IR) perspective, chatbots have specific functions: there are virtual
assistants, question-answering bots, and domain-specific bots. Evaluators should ask questions and make
requests of the chatbot, evaluating effectiveness by measuring accuracy, precision, recall, and F-score relative to the correct chatbot response. [9]
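As a rough illustration of the IR metrics named above, the sketch below computes precision, recall, and the balanced F-score (F1) for a single query. It assumes that the bot's answers and the correct answers can each be represented as a set of items. The function name and this set-based representation are assumptions made for illustration, not part of the cited evaluation procedure.

```python
def precision_recall_f1(retrieved: set, relevant: set) -> tuple:
    """Compute (precision, recall, F1) for one query.

    retrieved: items the bot returned (assumed representation).
    relevant:  items judged correct for the query.
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a bot that returns four items of which two are among three correct ones has precision 1/2 and recall 2/3.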
From a user experience perspective, the goal of the bot is, arguably, to maximize user
satisfaction. Evaluators should survey users (typically, measured through questionnaires on
platforms such as Amazon Mechanical Turk), who will rank bots based on usability and
satisfaction. From a linguistic perspective, bots should approximate speech, and be evaluated by
linguistic experts on their ability to generate full, grammatical, and meaningful sentences. [9]
Finally, from an artificial intelligence perspective, the bot that appears most convincingly
human (e.g. passes the Turing Test best) is the most effective. [9]
2.2.2. PARADISE Framework
One of the most widely used frameworks which fuses these perspectives is the PARAdigm for
DIalogue System Evaluation (PARADISE). First and foremost, PARADISE estimates subjective
factors such as: (i) ease of usage, (ii) clarity, (iii) naturalness, (iv) friendliness, (v) robustness
regarding misunderstandings, and (vi) willingness to use the system again. It does so by collecting
user ratings through the distribution of questionnaires. Second, PARADISE seeks to objectively
quantify bot effectiveness by maximizing task success and minimizing dialogue costs. [10]
Maximizing Task Effectiveness: To maximize task effectiveness, bots must deliver correct
results (i.e. the information retrieval perspective). Walker et al., the creators of PARADISE,
introduce the concept of an attribute value matrix (AVM) to measure effectiveness. They propose
human subjects follow a script with a bot, which is supposed to achieve certain desired outcomes
(e.g. the generation of informative responses). These “correct” responses are coded into a scenario
key. Meanwhile, the results of the bot being tested are coded into the AVM. [10]
From the AVM and scenario key, a confusion matrix M is created, which encodes the number
of times the bot retrieved the right vs. wrong answer. [10] To measure task success, Walker et al.
compute the kappa coefficient of the bot based on the confusion matrix:
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$ [10]
P(A) is the proportion of time where the AVM agrees with the correct result from the scenario
key, and P(E) is the probability of agreement by chance; a bot generating random answers would
have $\kappa = 0$ whereas a human would have $\kappa = 1$. [10]
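To make the kappa computation concrete, here is a minimal sketch that derives P(A) and P(E) from a square confusion matrix. It assumes rows index the scenario key and columns index the bot's values, and it estimates chance agreement from the squared column proportions, which is one common estimate. The function name and the example matrices are hypothetical.

```python
def kappa(confusion):
    """Kappa coefficient from a square confusion matrix (list of rows).

    P(A): fraction of counts on the diagonal (bot agrees with the key).
    P(E): chance agreement, estimated here as the sum of squared
          column proportions (an assumption of this sketch).
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    column_totals = [sum(row[j] for row in confusion) for j in range(size)]
    p_chance = sum((t / total) ** 2 for t in column_totals)
    return (p_agree - p_chance) / (1 - p_chance)
```

With this estimate, a matrix whose counts all lie on the diagonal yields a kappa of 1, matching the perfect-agreement case described in the text.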
Bates et al. argue that the task effectiveness metric is too punitive: it considers only one correct
answer, and reduces the score for bots that do not reach this answer. Instead a “Dialogue Breadth
Test” should be run, in which 10 answers are accepted to the first utterance, and 10 answers
accepted at the next turn (e.g. 100 overall acceptable second utterances). Such a modification
would allow the system to better capture acceptable variability. [11]
Minimizing Dialogue Costs: Walker et al. define two types of dialogue costs: efficiency costs
and qualitative costs. Efficiency cost metrics include: (i) total elapsed time, (ii) total number of
system turns, (iii) total number of system turns per task, and (iv) total elapsed time per turn.
Qualitative cost metrics include the: (i) number of re-prompts, (ii) number of user barge-ins, (iii)
number of inappropriate system responses, (iv) concept accuracy, and (v) turn correction ratio. [12]
Overall performance is the difference between task effectiveness and costs. Specifically:
$$\text{Performance} = \alpha \cdot N(\kappa) - \sum_{i=1}^{n} w_i \cdot N(c_i)$$ [10]
$\alpha$ is a weight on $\kappa$, each cost function $c_i$ is weighted by $w_i$, and $N$ is a Z score normalization
function. [10]
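The performance function can be sketched as follows for a batch of dialogues. This is an illustration under stated assumptions: Z-score normalization is taken over the dialogues in the batch using the population standard deviation, and the names `z_normalize`, `paradise_performance`, `alpha`, and `weights` are hypothetical. In practice the weights would be fitted to user satisfaction ratings rather than chosen by hand.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Z-score normalization N(x) = (x - mean) / std over a batch of values."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # no variation: every score is average
    return [(v - mu) / sigma for v in values]


def paradise_performance(kappas, costs, alpha, weights):
    """Per-dialogue performance: alpha * N(kappa) - sum_i w_i * N(c_i).

    kappas:  kappa value per dialogue.
    costs:   dict mapping a cost name to its per-dialogue values.
    weights: dict mapping the same cost names to their weights w_i.
    """
    norm_kappa = z_normalize(kappas)
    norm_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    return [
        alpha * norm_kappa[d]
        - sum(weights[name] * norm_costs[name][d] for name in costs)
        for d in range(len(kappas))
    ]
```

Normalizing each term first lets task success and costs measured in different units (e.g. seconds vs. number of turns) be combined on a common scale.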
2.2.3. Other Evaluation Methods
There exist a wide variety of evaluation methods outside of the PARADISE framework which
are used to evaluate the performance of commercially-available chatbots. Kuligowska et al., for
example, propose an evaluation framework which they apply to 29 Polish-speaking chatbots. [102]
First, Kuligowska et al. measure the “visual look” of the chatbot, rewarding bots that resemble
living people (e.g. fulfill the embodied agent requirement). [102] Van Vugt et al. found in 2010
that bots that appear human increase user engagement and the willingness of individuals to interact
with the bot. [103] Using this approach, bodiless chatbots such as PayU’s Wirtualny Doradca and
cartoon-like bots such as IKEA’s Ania are penalized. [102]
Second, Kuligowska et al. evaluate the implementation of the chatbot system on its host
website. To receive a 5/5 rating, websites had to implement both a built-in window users could use
to interact with the bot, and a pullout side tab which users could toggle when they wanted to talk
to the bot. Websites with only a pullout tab received a rating of 4/5, while those with only a fixed
built-in window received a rating of 3/5. [102]
Third, Kuligowska et al. evaluate the ability of each bot to produce speech. Bots with unique
custom voices were rated as 5/5, chatbots with standard text-to-speech modules but without a
custom voice module received a 4/5, those without shutdown options received a 3/5, and those
chatbots that could only communicate via text not speech received a 1/5 rating. [102]
Fourth, Kuligowska et al.’s framework sought to assess the knowledge base of each bot, both
in terms of general knowledge, and domain-specific knowledge relevant to the bot’s website (e.g.
IKEA’s chatbot should have knowledge of IKEA products). To assess general knowledge,
Kuligowska et al. asked each bot a set of standard general knowledge questions, and assigned
binary scores to each bot depending on whether the bot answered appropriately. The scores on
each question were summed to generate an overall knowledge score. To assess specific knowledge,
Kuligowska et al. asked a series of domain-specific questions that could be relevant across
domains (e.g. What product are you selling? What is the price of your product?) [102]
Fifth, the researchers assessed the chatbots’ ability to present their knowledge along five
dimensions: (i) whether the bot opens click-through links in a new tab or window so as not to disrupt
the dialogue, (ii) whether it provides an “interactive connection to an external knowledge database,”
(iii) whether it provides a back button or the ability to scroll through past dialogue, (iv) whether
it can explain its functioning after receiving inputs such as “?”, “info”,
and “help,” and (v) whether it has a Home button that links back to the Main Menu. [102]
Sixth, Kuligowska et al. evaluate conversational abilities. Chatbots that are able to lead a
coherent dialogue, understand context, and repair ‘broken’ conversations receive the highest 5/5
ratings. Those bots that may not be able to fully understand context but that can lead a coherent
dialogue receive a rating of 4/5. Finally, those bots that can only understand a small number of
utterances, and will frequently respond with the same blanket phrases, receive 3/5 ratings. [102]
Seventh, Kuligowska et al. assess whether chatbots have consistent and rich personalities,
based on the assumption that for chatbots to act like believable humans, they must be able to
simulate having a unique personality. Those bots such as Orange’s Ewa with rich personalities
received ratings of 5/5, while those chatbots with only a “rudimentary outline of personality”
received ratings of 3/5. [102]
Eighth, Kuligowska et al. evaluate the depth of personalization options for the chatbot, along
five dimensions: (i) is the user empowered to select the gender of the bot, (ii) can the bot recall the
name of the user, (iii) can the user view the conversation history during the talk, (iv) can the bot
send the conversation history via email or print to the user, and (v) can the bot recognize the subpage of the website that the user is browsing. [102]
3. Architecture & Design
3.1. Speech-to-Text Conversion
3.1.1. Large Vocabulary Speech Recognition
Speech is one of the most natural and powerful modes of communication, and is “widely accepted
as the future of interaction with computer and mobile applications.” [13] Speech is so integral to
the human condition that people with IQs of less than 50, and with brains one third the typical
human size, can speak. Neurological research indicates that speech activates more of the brain than
any other processing function. By incorporating speech processing, chatbots will be able to
interface over phones and radios. [14]
Speech-to-text conversion begins with a process called automatic speech recognition (ASR).
The goal of ASR is to achieve speaker-independent large vocabulary speech recognition (LVCSR).
[15] Improvements in LVCSR can be measured along a number of dimensions:
A. Vocabulary size: early vocabularies were minuscule and only included basic words (e.g.
yes, no, digits, etc.), whereas vocabularies now include millions of words in many languages. [15]
B. Speaker independence: ability to recognize specific speakers. This is only relevant if
chatbots use the speaker’s identity to generate user-specific responses. [15]
C. Co-articulation: ability to process a continuous stream of words, which do not necessarily
contain breaks between words. Requires proper tokenization and segmentation of the input
stream, discussed in the next section. [15]
D. Noise handling: ability to filter out noise (e.g. traffic, background music or speech). [15]
E. Microphone: ability to process speech at varying distances from microphone. [15]
There is not a direct one-to-one matching between a sound and a corresponding phoneme or
word: each time a speaker attempts to communicate a word, the resulting sound will be
different because of: (i) environmental noise, (ii) emotional state, (iii) tiredness, and (iv) distance
from microphone, among other factors. The matching process in ASR is therefore not deterministic,
but rather can be modeled as a stochastic process. Given a sound X, the model generates the most
likely phoneme, word (W), phrase, or sentence from all possible words in the language L. [15]
3.1.2. ASR Process Model
The first step in ASR is pre-processing. Speech is recorded into a microphone; Apple’s Siri,
for example, uses iPhone hardware, whereas hardware (including microphone) for Amazon’s
Alexa must be purchased independently of a mobile phone. The signal is discretized by the
microphone at a given sampling frequency (16 kHz is empirically optimal for speech). [15]
The second step is speech/non-speech segmentation. The ASR system must distinguish
between the phonemes (basic units of speech) that should be recorded for translation vs. the
background noise, which can also be recorded as phonemes but is not relevant to translation.
Likewise, the system should remove phonemes before the actual speech began, and after the
desired recording took place; this is called end point detection and removes noise. Signal energy-based
algorithms, which set energy thresholds that are crossed when speech begins and ends, as
well as Gaussian mixture models can be used to solve this problem. [15]
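A signal energy-based end point detector of the kind mentioned above might look like the sketch below. It is deliberately simplified and rests on several assumptions: frames are non-overlapping, frame energy is the mean squared amplitude, and a single fixed threshold marks both the start and the end of speech. The function names and the threshold value are hypothetical. At a 16 kHz sampling rate, a 25 ms frame would correspond to 400 samples.

```python
def frame_energies(samples, frame_length):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        energies.append(sum(s * s for s in frame) / frame_length)
    return energies


def detect_endpoints(samples, frame_length, threshold):
    """Return (start_sample, end_sample) of the region between the first
    and last frames whose energy exceeds the threshold, or None if no
    frame crosses it (i.e. the recording is treated as silence)."""
    energies = frame_energies(samples, frame_length)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_length, (active[-1] + 1) * frame_length
```

A production system would typically smooth the energy contour and use separate onset and offset thresholds, so that brief pauses inside an utterance do not end the recording.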
The third step is feature extraction. Acoustic observations are “extracted over time frames of
uniform length,” typically 25 milliseconds. The fourth step is decoding, in which the acoustic
feature vectors are mapped to the most likely corresponding words. This requires three tools. [15]
First, we need an acoustic model which will give us P(X|W) – given a word, the probability
that we hear a sound. We will find the argmax over all words W in our language L, i.e. the word with
the highest likelihood of representing this sound. Acoustic models are typically trained on sound
recordings and accompanying transcripts, which can be used to empirically find these probabilities.
The statistical representation of each word (or phoneme) generated from analysis of the sound
corpus is typically represented as a Hidden Markov Model (HMM). HMMs are Markov processes
where the states are unobserved/hidden (we do not know the actual phonemes/words used, and are
approximating this information). [15]
Second, we need a language model which can tell us the probability of hearing a given word
in our language. Third, we need a dictionary with a list of words and their phonemes. [15] Given
a sound, we can decode the word using Bayes rule:
$$\hat{W} = \arg\max_{W \in L} P(W \mid X) = \arg\max_{W \in L} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W \in L} P(X \mid W)\,P(W)$$ [16]
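The Bayes-rule decoding step can be illustrated with a toy example. The sketch below assumes the acoustic model and language model are available as plain dictionaries of probabilities. That is a simplification for illustration: real decoders search over HMM state sequences rather than enumerating candidates. The function name and the probabilities used in the usage note are invented.

```python
import math


def decode(acoustic_likelihood, language_prior):
    """Return the candidate W maximizing P(X|W) * P(W).

    acoustic_likelihood: dict W -> P(X|W) for the observed sound X.
    language_prior:      dict W -> P(W) from the language model.
    P(X) is omitted because it is constant across candidates.
    """
    best_word, best_score = None, -math.inf
    for word, prior in language_prior.items():
        likelihood = acoustic_likelihood.get(word, 0.0)
        if likelihood <= 0.0 or prior <= 0.0:
            continue  # impossible under one of the two models
        # Sum of logs avoids underflow when probabilities are tiny.
        score = math.log(likelihood) + math.log(prior)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

With invented values P(X|W) = 0.4 and P(W) = 0.02 for "recognize speech", against 0.5 and 0.001 for "wreck a nice beach", the language model prior outweighs the slightly better acoustic match and "recognize speech" is selected.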
The fifth step is post-processing; Bayes' rule gives us a list of the probable word sequences
with their ranks, and we choose the most probable one as the word sequence we’ve heard. The
other top hypotheses are stored and can be used in reinforcement learning algorithms later, where
they will be used to learn from and correct mistakes in the ASR phase. [15]
Figure 1: Automatic Speech Recognition (ASR) Process [15]
3.1.3. Restricted Boltzmann Machine (RBM) Implementation
Deep learning and the use of Restricted Boltzmann Machines (RBMs) have led to the
development of even more effective methods for decoding sounds into phonemes. RBMs are
neural networks with one layer of stochastic visible units and N layers of stochastic hidden units;
there are no connections within each layer, but there are typically connections between each unit in the
visible layer and every unit in the hidden layer. The weights on each edge are determined by an
activation function, and are altered and optimized during training via back-propagation. [17]
Rather than include an HMM to represent the sequence of sounds that make up a phoneme,
RBM-based models represent each phoneme with an RBM, where each sound frame that composes
the phoneme is a visible unit. Mohamed et al. tested three variants of the RBM in acoustic
modeling: the unconditional RBM, a conditional RBM (CRBM), which considers the visible
variables from previous time frames as conditional inputs, and the interpolating CRBM (ICRBM),
which considers both previous and future time frames as conditional inputs. Of these, they found that the
ICRBM generated the best results and outperformed standard feed-forward neural nets. [17]
3.2. Natural Language Processing
The goal of natural language processing (NLP) is to take the unstructured output of the ASR
and produce a structured representation of the text that contains spoken language understanding
(SLU) or, in the case of text input, natural language understanding (NLU). In this section, we
explore a number of methods for extracting semantic information and meaning from spoken and
written language in order to create grammatical data structures that can be processed by the
Dialogue Management unit in the next step. This is non-trivial because speech may contain: (i)
identity-specific encodings (e.g. pitch, tone, etc.) in addition to meaning-encodings and (ii) noise
from the environment. Likewise, both speech and text inputs to a chatbot may contain (iii)
grammatical mistakes, (iv) disfluencies, (v) interruptions, and (vi) self-corrections. [18]
3.2.1. Dialogue Act (DA) Recognition
One way to extract meaning from natural language is to determine the function of the
text/sentence (e.g. is this a question, suggestion, offer, or command); this is called dialogue act
recognition. In dialogue act recognition systems, a corpus of sentences (training data) is labeled
with the function of the sentence, and a statistical machine learning model is built which takes in
a sentence and outputs its function. The model uses a number of different features to classify the
sentences including: (i) words and phrases such as “please” (function=request) and “are you”
(function=yes/no question), and (ii) syntactic and semantic information. [16]
Figure 2: Dialogue Act Recognition Example [19]
The first step in creating a dialogue act recognition system is defining the relevant functions,
or the DA tag-set. This involves choosing labels that are general enough to be re-used in multiple
tasks, specific enough to remain relevant for the target task, and clear/separable enough that there
is little confusion for humans in labeling the functions of sentences in the training set. A number
of tag-sets have gained prominence and are the most frequently used in chatbots: Dialogue Act
Markup in Several Layers (DAMSL), Switchboard SWBD-DAMSL, Meeting Recorder,
VERBMOBIL, and Map-Task. [19]
The DAMSL annotation scheme labels a sentence in four dimensions: communicative status,
information level, forward-looking function, and backward-looking function. Communicative
status labels a sentence as: (i) uninterpretable, (ii) abandoned, or (iii) self-talk. Information level
labels a sentence as: (i) task, (ii) task management, (iii) communication-management, or (iv) other.
Forward-looking functions encode any information that will affect future conversation and label
sentences along eight sub-dimensions: (i) statement – assert, reassert, or other, (ii) influencing-addressee-future-action – open-option or action directive, (iii) info-request, (iv) committing-speaker-future-action – offer or commitment, (v) conventional opening-closing, (vi) explicit-performative,
(vii) exclamation, or (viii) other. Backward-looking functions encode the
relationship between current and previous speech, such as (i) agreement, (ii) understanding, (iii)
answer, or (iv) information relation. [20]
Switchboard is an adaptation of DAMSL for automated telephone conversations. [21] Meeting
Recorder DA (MRDA) is similar to Switchboard; its training corpus, however, consists of 72 hours
of labeled recordings of conversations from meetings. MRDA is designed to handle the
complications that often occur during meetings, such as speaker overlap, frequently abandoned
comments, and complicated turn-taking interactions. [22] Finally, Map Task is a hierarchy of
levels: the first level labels transactions, the second is conversational games, which labels patterns
like question and answer pairs, and the third includes 19 conversational moves. [23]
3.2.2. Bayesian Approaches to DA Models
The idea behind using a Bayesian approach to DA models is to find the probability of every
possible sequence of dialogue acts DA that could represent a sentence or utterance (U), and find
the dialogue act sequence $DA_{\max}$ with the highest probability of occurring. [16]
$$DA_{\max} = \arg\max_{DA} P(DA \mid U) = \arg\max_{DA} \frac{P(DA)\,P(U \mid DA)}{P(U)}$$ [16]
$$DA_{\max} = \arg\max_{DA} P(DA) \cdot P(U \mid DA)$$ [16]
Assuming the N individual words $w_1, \ldots, w_N$ of the utterance are known, and the Naïve Bayes assumption
of independence of subsequent words holds, this leads to:
$$DA_{\max} = \arg\max_{DA} P(DA) \cdot \prod_{i=1}^{N} P(w_i \mid DA)$$ [16]
This is the unigram model, where P(DA) and $P(w_i \mid DA)$ can be quantified from empirical data.
Using this Naïve Bayes classifier, Reithinger et al. find a 74% recognition/accuracy rate for
classifying the correct dialogue act from a given sentence. [24] N-gram models are frequently used
to include dialogue history in the model. These models estimate P(DA | N historical DAs) rather
than P(DA). Assuming N = 3, we would estimate $P(DA_t \mid DA_{t-1}, DA_{t-2}, DA_{t-3})$. [25] Hidden
Markov Models can also be used to model dialogue history, where each state represents a dialogue
act in the conversation history of the chatbot. [26] Likewise, neural network classifiers can be
trained as well. A combination of HMMs and neural networks has achieved 76% accuracy. [27]
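The unigram Naïve Bayes classifier described in this section can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the training pairs, the label names,
and the add-one (Laplace) smoothing are hypothetical choices made here and are not taken from
Reithinger et al. Log-probabilities are summed instead of multiplying raw probabilities to avoid
numerical underflow.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (utterance, dialogue-act label) pairs.
TRAIN = [
    ("what time is it", "question"),
    ("where is the station", "question"),
    ("is it raining", "question"),
    ("it is five o'clock", "statement"),
    ("the station is closed", "statement"),
    ("thank you very much", "thanking"),
    ("thanks a lot", "thanking"),
]

def train_unigram_nb(pairs):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in pairs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the DA maximizing log P(DA) + sum_i log P(w_i | DA)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior P(DA)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing (an assumption of this sketch) avoids log(0)
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_unigram_nb(TRAIN)
print(classify("where is it", model))
print(classify("thanks very much", model))
```

Replacing the prior P(DA) with an estimate conditioned on the preceding dialogue acts would turn
this sketch into the n-gram variant mentioned above.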
3.2.3. Non-Bayesian Approaches to DA Models
DA classification is a classic machine learning problem. A number of non-NB approaches
have been used, including neural networks, multi-layer perceptrons, and decision trees. [19]
Wright, 2015, for example, implemented a multi-layer perceptron (MLP). The input layer of the MLP
used suprasegmental features (e.g. stress and intonation) extracted from the
utterance, durational features (i.e. the time taken to voice a word), and prosodic features (e.g.
root mean square frame energy). The neural network used one hidden layer with 60 neurons, and an
output layer with 12 nodes, one for each of the 12 possible DA configurations/labels being
used. [28] The network was trained using back-propagation with a cross-entropy cost function:
$$C = -\frac{1}{N} \sum_{x} \left[ y \ln(a) + (1 - y) \ln(1 - a) \right]$$ [29]
N is the number of training data items, x ranges over the training inputs/features, y is the
desired output for x, and a is the network's actual output in each round of back-propagation. [29]
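As a worked illustration of the cost function, the sketch below evaluates cross-entropy for a
handful of outputs. It assumes the standard form C = -(1/N) Σ [y ln(a) + (1 - y) ln(1 - a)], summed
over training items and output nodes, with y the desired output and a the network's output; the
function name and the example values are hypothetical.

```python
import math

def cross_entropy_cost(targets, activations):
    """C = -(1/N) * sum over items and output nodes of [y ln(a) + (1 - y) ln(1 - a)]."""
    n = len(targets)  # N: number of training items
    total = 0.0
    for y_vec, a_vec in zip(targets, activations):
        for y, a in zip(y_vec, a_vec):
            total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / n

# Two training items, three output nodes each (one-hot desired outputs).
targets = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

print(cross_entropy_cost(targets, confident))
print(cross_entropy_cost(targets, unsure))
```

Outputs that sit closer to the desired labels yield a lower cost, which is the quantity that
back-propagation drives down.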
Figure 3: MLP Example [19]
Andernach et al., by contrast, used a different type of neural network: the Kohonen network
or Self-Organizing Map (SOM), which maps from a set of data (the input features) onto a
two-dimensional output grid representing the possible DAs. Seven input features were used,
including the speaker, use of a “wh” question word, use of a question mark, and other features.
These networks use unsupervised clustering to create the labels used to classify input sentences. [30]
A third non-NB approach to DA classification is memory-based learning (MBL) using
K-nearest-neighbor. The idea behind MBL is to store in memory all instances that have been seen
and classified (i.e. at first, the training data). When an unseen instance (a new utterance) is
encountered, the system retrieves the k nearest neighbors (i.e. the most similar utterances from
the training set) and the unseen instance is classified based on the majority or “dominant” class
in this sample. [19] Rotaru achieves an optimal accuracy of 72% using the 3-nearest-neighbor
approach. [31] A fourth approach is transformation-based learning, in which a simple classifier is
used to get a generally-correct solution, and transformations are applied to get a
specifically-correct solution. [16]
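The memory-based (k-nearest-neighbor) approach described above can be sketched as follows. The
stored instances, the label names, and the use of word-overlap similarity as the distance function
are assumptions made for this illustration; they are not the features used by Rotaru.

```python
from collections import Counter

# Hypothetical memory of previously classified utterances.
MEMORY = [
    ("what time is it", "question"),
    ("where is the station", "question"),
    ("is the station open", "question"),
    ("the station is closed", "statement"),
    ("it is five o'clock", "statement"),
    ("thanks a lot", "thanking"),
]

def similarity(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def knn_classify(utterance, memory, k=3):
    """Label an utterance by majority vote among its k most similar stored instances."""
    ranked = sorted(memory, key=lambda item: similarity(utterance, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify("where is the exit", MEMORY))
```

Once the new utterance has been labeled it can itself be appended to the memory, which is how the
store grows beyond the initial training data.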
DA classification is critical to chatbots in that it provides necessary context to the bot and
directs future communication. This, however, is somewhat of a circular problem in that DA
classifiers would be much improved if they themselves had context information. DA classifier
development is focused on providing more relevant context information, including social roles of
users, relationships, emotions, context, and history of interaction as classification features. [19]
3.2.4. Intent Identification
In SLU, the functions / dialogue acts (DAs) are often domain specific. In other words, instead
of asking whether the function of the user’s utterance is a question or answer, we ask whether the
function is to, for example, find flights or cancel a reservation in a flight reservation program. [32]
Domain-specific dialogue acts are called intents. Intent identification has been most prominently
used by call center bots, which ask the user “how can I help you?” and subsequently use intent
identification to redirect the user to one of N pre-defined redirection options. [18] Many of the
same machine learning algorithms used for DA classification are used for intent identification. [16]
3.2.5. Information Extraction
The primary responsibility of the SLU is not just to understand phrase function, but to
understand the meaning of the text itself. To extract meaning from text, we convert unstructured
text – either the output of the ASR, or text written into a text-only chatbot – into structured
grammatical data objects, which will be further processed by the Dialogue Manager.
The first step in this process is breaking down a sentence into tokens that represent each of its
component parts: words, punctuation marks, numbers, etc. Tokenization is difficult because of the
frequency of ambiguous or malformed inputs, including (i) phrases (e.g. New York), (ii)
contractions (e.g. aren’t), (iii) abbreviations (e.g. Dr.), and (iv) periods (e.g. distinguishing
those used in “Mr.” from those at the end of a sentence). These tokens can be analyzed using a
number of techniques, described below, to create a number of different data structures that can be
processed by the dialogue manager. [16]
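The tokenization difficulties listed above can be made concrete with a small sketch. It is only an
illustration: the abbreviation and multi-word phrase lists are hypothetical and deliberately tiny,
and a production tokenizer would need much richer resources.

```python
import re

# Assumed, deliberately small lookup lists (hypothetical).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms."}
MULTIWORD = {("new", "york")}

# A word (optionally with a contraction and trailing period), a number, or one punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?\.?|\d+(?:\.\d+)?|[^\sA-Za-z\d]")

def tokenize(text):
    tokens = []
    for raw in TOKEN_RE.findall(text):
        # Keep a trailing period only when it belongs to a known abbreviation
        if raw.endswith(".") and len(raw) > 1 and raw.lower() not in ABBREVIATIONS:
            tokens.extend([raw[:-1], "."])
        else:
            tokens.append(raw)
    # Merge known multi-word phrases into a single token
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize("Dr. Smith isn't in New York."))
```

The period after "Dr" stays attached to the abbreviation, while the sentence-final period becomes
its own token.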
Bag of Words: We ignore sentence structure, order, and syntax, and count the number of
occurrences of each word. We use this to form a vector space model, in which stop words (e.g. a,
the, etc.) are removed, and morphological variants (e.g. talk, talks, talked, etc.) go through a
process called lemmatization and are stored as instances of the basic lemma (e.g. talk). In the dialogue
manager phase, assuming a rule-based bot, these resulting words will be matched against
documents stored in the bot’s knowledge database to find the documents with inputs containing
similar keywords. The bag of words approach is simple because it does not require knowledge of
syntax, but, for this same reason, is not precise enough to solve more complex problems. [16]
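A minimal sketch of the bag of words representation follows. The stop-word list and the lemma
lookup table are hypothetical and abbreviated; real systems rely on a full lemmatizer rather than a
hand-written dictionary.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "to", "is", "and"}  # assumed, abbreviated list
LEMMAS = {"talks": "talk", "talked": "talk", "talking": "talk"}  # hypothetical lookup table

def bag_of_words(text):
    """Count lemmatized, non-stop-word tokens, ignoring order and syntax."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [LEMMAS.get(w, w) for w in words if w and w not in STOP_WORDS]
    return Counter(words)

def overlap(query, document):
    """Number of word occurrences shared by the two bags."""
    return sum((bag_of_words(query) & bag_of_words(document)).values())

print(bag_of_words("She talks and he talked to the dog."))
print(overlap("the dog talks", "She talks and he talked to the dog."))
```

The `overlap` score is one simple way a rule-based bot could compare an input against stored
documents; because word order is discarded, "dog bites man" and "man bites dog" receive the same
bag.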
Latent Semantic Analysis (LSA): This approach is similar to the bag of words, with two
differences. First, meanings / concepts, not words, are the basic unit of comparison parsed from a
given sentence or utterance. Second, groups of words that co-occur frequently are grouped together.
In LSA, we create a matrix where each row represents a unique word, each column represents a
document, and the value of each cell is the frequency of the word in the document. We use singular
value decomposition to reduce the dimensionality of the matrix, compute the distance between the
vector representing the utterance and the vector representing each document, and determine the
closest document. [16]
Regular Expressions: Sentences / utterances can be treated as regular expressions, and can be
pattern matched against the documents in the bot’s knowledge database. For example, imagine
that one of the documents in the bot’s knowledge database handles the case where the user inputs
the phrase: “my name is *”. “*” is the wildcard character, and indicates that this regular expression
should be triggered whenever the bot hears the phrase “my name is” followed by anything. If the
user says “my name is Jack”, this phrase will be parsed into a number of regular expressions,
including “my name is *” and will trigger the retrieval of that document. [16]
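The wildcard example above can be sketched with Python's `re` module. The helper name, the
knowledge-base entry, and the response template are hypothetical; the sketch only shows how a
"my name is *" pattern could be translated into a regular expression and matched.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a '*' wildcard pattern into a compiled, case-insensitive regex."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "(.+)".join(parts) + "$", re.IGNORECASE)

# Hypothetical knowledge base of (pattern, response template) documents.
DOCUMENTS = [("my name is *", "Nice to meet you, {0}.")]

def respond(utterance):
    for pattern, template in DOCUMENTS:
        match = wildcard_to_regex(pattern).match(utterance.strip())
        if match:
            # Whatever the wildcard captured is available to the template
            return template.format(*match.groups())
    return None

print(respond("My name is Jack"))
```

Capturing the wildcard text also lets the bot store the user's name in the dialogue history for
later turns.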
Part of Speech (POS) Tagging: POS tagging labels each word in the input string with its part
of speech (e.g. noun, verb, adjective, etc.). Taggers can be rule-based, in which case a
manually-created set of rules specifies the part of speech for ambiguous words given their context.
They can also be stochastic models trained on sentences labeled with the correct POS. In the
dialogue manager, POS can be used to store relevant information in the dialogue history. POS is
also used in response generation to indicate the POS object type of the desired response. [16]
Figure 4: Sentence labeled with POS [16]
Named Entity Recognition / Relation Extraction: In named entity recognition (NER), the names of
people, groups, and locations are extracted and labeled accordingly. NER-name pairs can
be stored by the dialogue manager in the dialogue history to keep track of the context of the bot’s
conversation. Relation extraction goes one step further to identify relations (e.g. “who did what to
whom”) and label each word in these phrases. [16]
Figure 5: Sentence labeled with Information Extraction [16]
Semantic Role Labeling: The arguments of a verb are labeled based on their semantic role
(e.g. subject, theme, etc.). In this process, the predicate is labeled first followed by its arguments.
Prominent classifiers for semantic role labeling have been trained on FrameNet and PropBank,
databases with sentences already labeled with their semantic roles. These semantic role-word pairs
can be stored by the dialogue manager in the dialogue history to keep track of context. [16]
Creation of Grammatical Data Structures: Sentences and utterances can be stored in a
structured way in grammar formalisms such as context-free grammars (CFGs) and dependency
grammars (DGs). Context-free grammars are tree-like data structures that represent sentences as
containing noun phrases and verb phrases, each of which contain nouns, verbs, subjects, and other
grammatical constructs. Dependency grammars, by contrast, focus on the relationships between
words. [16]
Figure 6: Example of Dependency Grammar parsing [16]
3.2.6. Statistical Methods for Information Extraction
Historically, the above hand-crafted models, created based on knowledge of the specific
situation at hand, were used to extract structured meaning from a sentence or utterance. [16]
However, these models have two problems. First, hand-crafted models lead to high
development costs, because new models must be built for each new system. Second, these models
are often not robust to diverse user input, because users do not know which grammatical structures
are handled by a specific system. To solve these problems, data-driven, statistical models have
arisen. These models derive the meaning M of an utterance U that maximizes P(M|U). [16]
Hidden Vector State (HVS) Model: He and Young offer a statistical hidden vector state
model. Given a sentence, our goal is to automatically produce an accurate structured meaning.
Consider the sentence “I want to return to Dallas on Thursday.” The parse tree below represents
one way of representing the structured meaning of the sentence. SS represents the initial node
sent_start, and SE represents the end node sent_end. We view each leaf node as a vector state,
described by its parent nodes: the vector state of Dallas is [CITY, TOLOC, RETURN, SS]. [33]
Figure 7: Parse tree of HVS [33]
The whole parse tree can then be thought of as a sequence of vector states, represented by the
sequence of squares above. If each vector state is thought of as a hidden variable, then the sequence
of vector states (e.g. squares above) can be thought of as a Hidden Markov Model: we start at SS,
and have certain probabilities of reaching a number of possible hidden states as the next state. Each
vector state can be thought of as a “push-down automaton” or stack. This way, the first vector state
is SS and after that, each transition is a combination of push and pop operations that produces the
next vector state. At each step, we can make up to n stack shifts as our transition. If we do not
restrict n, the state space will grow exponentially (e.g. n possible states from SS, n^2 possible
states in the next step, etc.). To prevent this from happening, we limit the maximum depth of the
stack. [33]
With this model in mind, He and Young argue that the probability of a set of words W
occurring with a corresponding set of concepts C and set of transition operations N is the product
of (i) the probability of the t-th stack operation $n_t$ occurring given the last concept $c_{t-1}$,
(ii) the probability of the t-th concept from the root node $c_t[1]$ occurring given the sequence of
previous concepts $c_t[2..D_t]$, and (iii) the probability of the t-th word $w_t$ occurring given the
t-th concept $c_t$, for all t=1…T, where T is the final state of the model: [33]
$$P(W, C, N) = \prod_{t=1}^{T} P(n_t \mid c_{t-1})\, P(c_t[1] \mid c_t[2..D_t])\, P(w_t \mid c_t)$$ [33]
To train the model, we create a set of pairs of sentences and their structured concept annotations,
and fit the model parameters $\lambda$ that allow us to convert from sentences to concepts. In
training, we seek to maximize:
$$L(\lambda) = \log P(W, C, N \mid \lambda)$$ [33]
When we run the model on unseen sentences (e.g. in prediction / classification), we apply the
model parameters derived in training to generate our desired output: the appropriate structured
concept annotations. [33]
Other similar models include the stochastic finite state transducer model, which uses finite
state machines to determine the appropriate grammatical concept with which to label a given word,
and dynamic Bayesian network models. All of these models are generative models in that they
seek to find the joint probability distribution P(X,Y), where X is the sentence inputs and Y is the
structured grammatical concept outputs. Next, we discuss discriminative models, which calculate the
conditional probability P(Y|X) in order to map sentences to concepts. [16]
Support Vector Machine (SVM) Model: Support Vector Machines are a supervised machine
learning tool. Given a set of labeled training data, the algorithm generates the optimal hyperplane
that divides the samples according to their labels. Traditionally, SVMs are thought of as solving
binary classification problems; however, multiple hyperplanes can be used to divide the data into
more than two label categories. The optimal hyperplane is defined as the hyperplane that creates
the maximum margin, or distance, between different-labeled data point sets. [34]
Traditionally, a hyperplane is represented as H(X) = W*X + B = 0, where X is the input points,
W is a weight vector, and B is the bias term. SVMs are used to achieve SLU by treating SLU as a
sequence of binary classification problems mapping words to concepts. [16]
Conditional Random Field Models: CRFs are log-linear statistical models often applied for
structured prediction. Unlike a typical classifier, which predicts a label for a single object and
ignores context, CRFs take into account previous features of the input sequence through the use
of conditional probabilities. A number of different features can be used to train the model,
including lexical information, prefixes and suffixes, capitalization, and other features. [16]
Deep Learning: The most recent advancement in the use of statistical models for concept/
structure prediction is deep learning for natural language processing, or deep NLP. Deep learning
neural network architectures differ from traditional neural networks in that they use more hidden
layers, with each layer handling increasingly complex features. As a result, the networks can learn
from patterns and unlabeled data, and deep learning can be used for unsupervised learning. Deep
learning methods have been used to generate POS tags for sentences, to chunk text into noun phrases,
verb phrases, etc., and for named-entity recognition and semantic role labeling. [16]
3.3. Response Generation
Response generation is arguably the most central component of the chatbot architecture. As
input, the Response Generator (RG) receives a structured representation of the spoken text. This
conveys information about who is speaking, the dialogue history, and the context. As output, the
RG generates a response for the user, which it passes to the Dialogue Manager (DM).
The RG has access to three key components that it uses to decide what to say: (i) a knowledge
database / data corpus, which will differ in content based on implementation, (ii) a dialogue
history corpus, which will only exist in more complex models, and (iii) an external data source,
which provides the bot with intelligence (e.g. a dog is an animal). This latter “common sense
intelligence” is often achieved by allowing bots to access and retrieve documents from search
engines.
In this section, we will discuss in detail the ways in which a bot can retrieve a response.
3.3.1. Rule-Based Models
The principal idea underlying rule-based RG is that a bot contains a knowledge base with
documents, where each document contains a <pattern> and <template>. When the bot receives an
input that matches the <pattern>, it sends the message stored in the <template> as a response. The
<pattern> could either be a phrase such as “What’s your name?” or a pattern such as “My name is *”,
where the star is a wildcard. Typically, these <pattern> <template> pairs are hand-crafted. [6]
The programmer’s main challenge is to choose an algorithm to use for pattern matching
between the input sentence and the documents in the data corpus. We can think of this as a
nearest-neighbor problem, where our goal is to define the distance function, and retrieve the
closest document to the input sentence. Below, we include some of the most common methods.
ELIZA Incremental Parsing: ELIZA, an early chatbot created at MIT between 1964 and
1966, used an incremental parsing or “direct match” approach for pattern matching. ELIZA parsed
the input text word by word from left to right, looking each word up in the dictionary, giving it a
rank based on importance, and storing it on a keyword stack. The keyword with the highest rank
of importance would be treated as the <pattern> and pattern matched against the documents in the
corpus to find the appropriate template response. If no document corresponding to the input existed,
ELIZA would say “I see” or “Please go on” in the DOCTOR script. ELIZA was created with
multiple “scripts”, which specify different <template> responses to <pattern> inputs. [35]
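The keyword-ranking idea described above can be sketched as follows. The miniature script, its
ranks, and its templates are hypothetical and are not taken from the original DOCTOR script; only
the two fallback responses come from the text.

```python
# Hypothetical miniature script: keyword -> (rank, response template).
SCRIPT = {
    "mother": (3, "Tell me more about your family."),
    "dream": (2, "What does that dream suggest to you?"),
    "always": (1, "Can you think of a specific example?"),
}
DEFAULT_RESPONSES = ["I see.", "Please go on."]

def eliza_respond(text, turn=0):
    """Scan left to right, stack the keywords found, and answer using the top-ranked one."""
    keyword_stack = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in SCRIPT:
            keyword_stack.append(word)
    if not keyword_stack:
        # No keyword matched: fall back to a content-free response
        return DEFAULT_RESPONSES[turn % len(DEFAULT_RESPONSES)]
    best = max(keyword_stack, key=lambda w: SCRIPT[w][0])
    return SCRIPT[best][1]

print(eliza_respond("I always dream about my mother."))
print(eliza_respond("Nice weather today"))
```

Swapping in a different `SCRIPT` dictionary corresponds to loading a different ELIZA script.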
ALICE/AIML Direct Match Technique: ALICE-bot is a later chatbot implementation and
three-time winner of the Loebner prize. The ALICE architecture includes a knowledge base called
Graphmaster, which is a graph with nodes (“nodemappers”) and edges. Graphmaster can be
thought of like a file system, with a root that contains files and directories. The name of the path
to every leaf node is that leaf node’s <pattern> sentence. We conduct pattern matching using
depth-first search on the knowledge base. Let’s assume our sentence starts with the word “Hello”
and we start in Folder. First, check for a sub-folder named “_”, and scan through it to look for
matches with any remaining suffixes of the input sentence. If no match is found, return to Folder,
look for the sub-folder Hello, and scan through it to look for matches with any remaining suffixes,
minus the word Hello. If no match is found, return to Folder, look for the sub-folder *, and scan
through it to look for matches with any remaining suffixes, minus the word Hello. If no match is
found, retrieve the second word and repeat this process. Recursively continue until a match is
found or the sentence is finished. [36]
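The folder-style search can be sketched as a depth-first search over a nested dictionary. This is a
simplified illustration, not the ALICE implementation: the patterns and templates are hypothetical,
and the sketch assumes that at every node the "_" wildcard is tried first, then the exact word, then
the "*" wildcard, with each wildcard absorbing one or more words.

```python
def add_pattern(root, pattern, template):
    """Insert a pattern path (one node per word) into the nested-dict graph."""
    node = root
    for word in pattern.upper().split():
        node = node.setdefault(word, {})
    node["<template>"] = template

def match(node, words):
    """Depth-first search that tries '_', then the exact word, then '*'."""
    if not words:
        return node.get("<template>")
    first, rest = words[0], words[1:]
    for key in ("_", first, "*"):
        child = node.get(key)
        if child is None:
            continue
        if key == first:
            found = match(child, rest)
            if found is not None:
                return found
        else:
            # A wildcard absorbs one or more words before matching continues
            for i in range(1, len(words) + 1):
                found = match(child, words[i:])
                if found is not None:
                    return found
    return None

def respond(root, text):
    return match(root, text.upper().split())

root = {}
add_pattern(root, "HELLO", "Hi there!")
add_pattern(root, "HELLO *", "Hello to you too.")
add_pattern(root, "MY NAME IS *", "Nice to meet you.")
add_pattern(root, "*", "I am not sure I understand.")

print(respond(root, "hello my friend"))
```

Because exact words are tried before "*", the catch-all "*" pattern at the root is reached only when
no more specific path matches.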
VPBot Keyword Set Matching Technique: Incremental parsing and direct match both
require only one word to match in order to retrieve a response (e.g. one keyword match in the
former, and one direct match in the latter). This flexibility could lead to unexpected behavior.
VPBot modified this technique by allowing developers to assign one to three keywords to a
keyword set, and required that all keywords in the set be present to trigger a response. [35]
ViDi One-Match and All-Match Categories (OMAMC) Technique: OMAMC combines
the ideas of the ALICE and VPBot matching techniques by creating two matching mechanisms.
The first mechanism is one-match, which is an exact-match technique. If there is an exact match
between keywords (which can be a single word or a phrase) in the input sentence and a document’s
one-match category in the knowledge base, this document will be retrieved. The second mechanism is
all-match, which allows the programmer to create keyword sets that are larger than three (the limit
set by VPBot). If all keywords in a document’s all-match category appear in the input sentence,
regardless of the ordering, this document is retrieved. If two different documents are retrieved via
one-match and all-match, one-match has precedence (in practice, all-match is not run unless
one-match does not retrieve a document). OMAMC has two benefits. First, it allows for varied
keyword arrangements in documents (order does not matter in all-match). Second, it allows for
keyword variety (no limit of three, like in VPBot, in all-match). [35]
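The two OMAMC mechanisms and their precedence can be sketched as follows. The documents, keywords,
and responses are hypothetical, and the substring test used for one-match is a simplification of
this sketch (it could also fire inside longer words).

```python
# Hypothetical documents: optional one-match keywords/phrases and an all-match keyword set.
DOCUMENTS = [
    {"one_match": ["opening hours"], "all_match": [], "response": "We open at 9am."},
    {"one_match": [], "all_match": ["refund", "order", "damaged", "receipt"],
     "response": "Please send a photo of the damaged item and your receipt."},
]

def omamc_respond(text, documents=DOCUMENTS):
    normalized = " ".join(text.lower().split())
    words = set(w.strip(".,!?") for w in normalized.split())
    # One-match runs first: a single keyword or phrase found in the input is enough
    for doc in documents:
        if any(phrase in normalized for phrase in doc["one_match"]):
            return doc["response"]
    # All-match runs only if one-match found nothing; word order is ignored
    for doc in documents:
        if doc["all_match"] and all(k in words for k in doc["all_match"]):
            return doc["response"]
    return None

print(omamc_respond("What are your opening hours?"))
print(omamc_respond("My receipt shows the order arrived damaged, I want a refund"))
```

The second call matches a four-keyword set in an order different from the one stored, which
illustrates both stated benefits of all-match.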
3.3.2. Information Retrieval (IR)-Based Models
The main problem with rule-based systems is that the programmer must specify all “rules” or
<pattern> <template> pairs. With the advent of large datasets such as dialogues on Reddit and
Twitter, this is no longer necessary. Instead, developers can analyze millions of
conversations and store these conversations in documents as pairs of <status> (first speaker) and
<response>. The principle behind retrieval-based systems is, given an input sentence, to pattern
match this against the set of <status> <response> pairs and select a response.
The main challenge in Information Retrieval (IR) algorithms is determining how to conduct
pattern matching. First, given an input sentence, we must decide whether to pattern match against
the <status> or the <response>. The former seems more intuitive, and would
involve finding the most similar <status> in the data corpus. The latter, however, is used more
often in practice because it is more effective, and involves comparing the input sentence against
the <responses> in the data corpus. Matching against <responses> is more effective because
if words co-occur in the input sentence and a <response>, that <response> is likely to be
appropriate. By contrast, if words co-occur in the input sentence and a <status> but the two are not
an exact match, we do not know whether the <response> will be appropriate. Likewise, if the system
pattern matches against responses, not inputs, it can simply use search engines to query for a set
of responses that are similar to the input phrase (e.g. search engines retrieve response sets). [37]
Second, regardless of whether we are pattern matching against <status> or <response>, we
must choose a pattern matching algorithm. Here, we could use a basic keyword approach, in which
we look for keywords that co-occur in the input sentence and document, and this is occasionally
used. More complex IR models, however, typically use an ensemble of score functions, calculate
the overall rank of each document as a weighted average of those scores, and return the document
with the highest score. [37]
Given a status (S) – the input sentence – and a corpus of documents (D) in the knowledge base,
these algorithms return:
$$\mathrm{rank}(S, D) = \sum_{k} \lambda_k \cdot \mathrm{score}_k(S, D)$$ [37]
In this formula, the output of the k-th score function $\mathrm{score}_k(S, D)$ is multiplied by the
weight $\lambda_k$ assigned to that score function. The sum across all score functions gives the
overall rank. [37]
Yan et al. use a number of score functions. First, they count the number of common words in
S and D, excluding stop-words (“a”, “the”, etc.), weighted by inverse document frequency (IDF).
IDF values indicate how rare a word is in the data corpus. For example, octopus typically
receives a higher IDF value than school, because the term is used less frequently in most corpora
of text. By weighting by IDF score, we raise the score for documents with rare words that co-occur
with the input sentence. Second, they measure the average cosine distance between the word
embeddings of the words in the input sentence and those of the words in the document, excluding
stop words. [37] Word embeddings are mappings of words to mathematical vectors. [38]
A weighted average of these and other scores gives each document a rank, and the highest
ranked document is returned. Other more complex score functions used by Yan et al. include
statistical machine learning phrase tables and convolutional neural networks. [37]
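The weighted ranking scheme can be sketched with two score functions. Only the IDF-weighted word
overlap follows the description above; the second score (a length comparison), the weights, the
stop-word list, and the corpus are hypothetical stand-ins chosen to keep the example
self-contained, and are not the functions used by Yan et al.

```python
import math

STOP_WORDS = {"a", "the", "is", "in", "at", "of"}  # assumed, abbreviated list
CORPUS = [
    "the octopus lives in the sea",
    "the school is closed today",
    "the school bus is late",
    "school starts at nine",
]

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def idf_table(corpus):
    """IDF(w) = log(N / number of documents containing w)."""
    n = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(tokens(doc)):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def idf_overlap(status, doc, idf):
    """Sum of IDF weights of the non-stop words shared by status and document."""
    shared = set(tokens(status)) & set(tokens(doc))
    return sum(idf.get(word, 0.0) for word in shared)

def length_score(status, doc):
    """Hypothetical second score: prefer documents of similar length."""
    return 1.0 / (1 + abs(len(tokens(status)) - len(tokens(doc))))

def rank(status, doc, idf, weights=(0.8, 0.2)):
    """Weighted sum of the individual score functions."""
    scores = (idf_overlap(status, doc, idf), length_score(status, doc))
    return sum(w * s for w, s in zip(weights, scores))

def retrieve(status, corpus):
    idf = idf_table(corpus)
    return max(corpus, key=lambda doc: rank(status, doc, idf))

print(retrieve("is there an octopus at school", CORPUS))
```

Although "school" appears in three documents, the rarer word "octopus" carries a higher IDF weight
and decides which document is retrieved.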
Use of neural networks to rank response choices is quite common. Wu et al.’s chatbot
response selection model, for example, uses neural networks to produce a score for each potential
document. Their model takes as inputs an utterance u and a response candidate r. At the first layer
of the model, we break the utterance and response down into word embeddings.
$$u = [e_{u,1}, \dots, e_{u,n_u}] \quad \text{and} \quad r = [e_{r,1}, \dots, e_{r,n_r}]$$ [39]
From these word embeddings, we construct a word-word similarity matrix $M_1$ and a
sequence-sequence similarity matrix $M_2$. Constructing $M_1$ is relatively simple: we set every
element $m_{i,j} \in M_1$ to:
$$m_{i,j} = e_{u,i}^{\top} \cdot e_{r,j}$$ [39]
Constructing $M_2$ is significantly more complex, and requires the use of a recurrent neural
network with gated recurrent units (GRU). The GRU models “the sequential relationship and dependency
among words” in a sequence of words, from the first word up to the last word of the response and of
the utterance. [39] Chung et al. provide a full description of the GRU’s functionality. [40] Simply
put, the GRU allows the model to track relationships between the input sentence and response
candidate on a segment or phrase level, rather than simply on a word-by-word level. [39]
These matrices are input channels to a convolutional neural network (CNN), which outputs a
vector v that encodes the similarities between u and r based on the data from $M_1$ and $M_2$. The
CNN learns from training data which phrases in a given input utterance, and subsequently which GRU
outputs, are useful in identifying responses with high similarity. During prediction, these learned
rules are applied to select specific features of the input channels by convolution and pooling, move
these into the output vector, generate a score for each response, and choose the best one. [39]
Figure 8: RNN used in IR [39]
This approach is too complex to be run on every response candidate. Wu et al. therefore use a
basic rule-based pattern-matching approach to select the top five response candidates. These
responses are run through the RNN above and re-ranked to select the top response. [39]
3.3.3. Statistical Machine Translation Generative Models
Information retrieval models require a database of possible responses to choose from.
Generative models, by contrast, build responses on the fly. Most do so using machine learning
techniques: classifiers are trained based on real dialogues, and used to generate responses to users
by “translating” inputs into responses. Statistical Machine Translation (SMT) models are some of the
most recent, and effective, models used to generate chatbot responses.
Statistical Machine Translation is a field of machine translation which analyzes bilingual
corpora of text, and uses statistical analysis to derive translations between individual words,
phrases, and features of the texts. The benefit of machine translation over human translation is
first and foremost resource savings. Moreover, SMT has access to many parallel corpora of languages,
allowing for analyses that generate translations between multiple languages. SMT also generates cost
savings relative to other automated approaches, in that rule-based translation systems require the
manual creation of rules. This is costly, and creates rules that typically only work for one language
pair because they do not generalize well to other language translations. [41]
Ritter et al. proposed in 2011 the first use of SMT in chatbot response generation. They
proposed the creation of a machine translation system that could translate between an input in one
language and a response to that input in the same language, based on corpora of Twitter dialogue.
This has two benefits. First, chatbots have historically been limited to specific domains (e.g.
ELIZA is a Rogerian psychotherapist, and cannot answer questions about cooking). Ritter’s chatbot is
restricted only to the social media domain, and because social media covers nearly every conceivable
topic, the chatbot has extensive knowledge. Second, while chatbots have historically used canned
responses and templates (e.g. all retrieval-based systems), a generative bot model creates
utterances as a direct function of specific input words, and therefore has the potential to generate
more natural responses. [42]
SMT models applied to conversation generation take the previous user’s comment C as input,
and generate a response R using a log-linear model based on a number of features. Features include
P(C|R), the probability of a comment C given the response R; P(R|C), the probability of a
response R given a comment C; and the language model P(R), which ensures that R is a reasonable
phrase that could be generated in conversation. The language model is built using n-gram statistics
(we count the appearance of n-grams in our training data, and estimate whether the n-grams that
appear in R exist in or are well-predicted by the training data). During training, a translation
table is also created based on the training data (e.g. Twitter dialogue conversations). [42]
In prediction, Ritter et al.’s model uses the Moses phrase-based decoder to choose the best
response given the input sentence. [43] The decoder goes left to right and translates phrases from
the input sentence, generating a number of possible hypotheses at each step. Without pruning, the
number of hypotheses would grow exponentially with the length of the input sentence, because all
hypotheses for the translation of phrase n+1 are added to the hypotheses for phrase n, and so on. [44]
Two mechanisms prevent this from happening. First, hypotheses that are deemed similar are combined
throughout this process. Second, pruning occurs. As words are translated, and these hypotheses are
added to previous hypothesis sets, we calculate the cost (i.e. weakness) of each set of hypotheses
and estimate their future cost. Current translation costs are estimated using the language model,
which accounts for the probability of these words occurring in the target language (in our case, the
response set). Current costs also include distortion costs, which account for the likelihood of
combinations of phrases co-occurring. The highest-cost hypotheses are discarded until we reach
the end of the input sentence, at which point we choose the lowest-cost option as our selected
output. This prediction process is called beam search. [44]
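The pruning step of beam search can be illustrated with a toy sketch. This is not the Moses decoder:
it omits hypothesis recombination and future-cost estimation, and the phrase options, the costs, and
the `transition_cost` function standing in for the language model are hypothetical.

```python
def beam_search(phrase_options, transition_cost, beam_width=2):
    """Expand hypotheses phrase by phrase, keeping only the cheapest ones at each step.

    phrase_options: one list per input phrase of (output phrase, cost) candidates.
    transition_cost: function (previous output phrase, next output phrase) -> extra cost.
    """
    beam = [([], 0.0)]  # each hypothesis: (output phrases so far, accumulated cost)
    for options in phrase_options:
        expanded = []
        for output, cost in beam:
            prev = output[-1] if output else "<s>"
            for phrase, phrase_cost in options:
                step = phrase_cost + transition_cost(prev, phrase)
                expanded.append((output + [phrase], cost + step))
        # Prune: discard all but the beam_width lowest-cost hypotheses
        beam = sorted(expanded, key=lambda h: h[1])[:beam_width]
    return beam[0]

# Hypothetical candidates for a two-phrase input, and penalties for unlikely phrase pairs.
OPTIONS = [[("i", 1.0), ("me", 1.2)], [("am", 0.5), ("is", 0.4)]]
PENALTIES = {("i", "is"): 2.0, ("me", "am"): 2.0, ("me", "is"): 1.0}

best_output, best_cost = beam_search(OPTIONS, lambda p, q: PENALTIES.get((p, q), 0.0))
print(best_output, best_cost)
```

With a beam width of 2, at most two hypotheses survive each step, so the work grows linearly with
the number of input phrases instead of exponentially.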
Figure 9: Top input-output phrase pairs [42]
Figure 10: Example output of SMT Model [42]
Challenge of Lexical Repetition: When SMT is applied to response generation, a problem
occurs: given an input sentence, the model generates the same, or a very similar, sentence as the
response. This occurs because in dialogue, the responder typically uses the same language as the
initiator. To solve this challenge, Ritter et al. filter out any translations where the translation
is a substring of the input and penalize responses that are similar to input phrases by
calculating the Jaccard similarity between the input and response. [42] The Jaccard similarity index
is typically calculated as the number of words shared by the input and response bags of words,
divided by the number of distinct words across both. [45]
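The repetition filter can be sketched as follows. The Jaccard index is computed as the size of the
intersection of the two word sets divided by the size of their union. Note one substitution: Ritter
et al. penalize similar responses, whereas this sketch uses a hard cut-off, and the `max_similarity`
threshold of 0.5 is a hypothetical value chosen for illustration.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of words in two sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def filter_candidates(status, candidates, max_similarity=0.5):
    """Drop responses that are substrings of the input or too similar to it."""
    kept = []
    for response in candidates:
        if response.lower() in status.lower():
            continue  # the response merely repeats part of the input
        if jaccard_similarity(status, response) > max_similarity:
            continue  # near-paraphrase of the input
        kept.append(response)
    return kept

status = "i love my new phone"
candidates = ["i love my phone", "my new phone", "which model did you get"]
print(jaccard_similarity(status, "i love my phone"))
print(filter_candidates(status, candidates))
```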
Challenge of Word Alignment: The challenge in word alignment involves long input
sentences in which there is not a 1-1 correspondence between words/phrases in the input sentence
and words/phrases in the output sentence. [42] Ritter et al. use GIZA++, the refined alignment
model created by Och and Ney, a widely used tool which empirically outperformed most existing
alignment tools, to pair phrases in input sentences and responses. [46] Still, they achieve an
alignment score of 1.7 compared with the standard score of 14 for translation between French and
English phrase pairs, indicating that these sentences may not align well. This problem has not been
sufficiently solved. As a result, Ritter et al. use Fisher’s Exact Test, a test of statistical
significance, to remove pairs of input-output phrases that do not align. [42]
Despite these challenges, SMT achieves relatively strong results: in a Mechanical Turk study,
65% of crowd-workers preferred SMT to Information Retrieval where input is compared against
document <status>, and 59% preferred SMT to Information Retrieval where input is compared
against document <response>. SMT was preferred over human response 15% of the time. [42]
3.3.4. Sequence to Sequence (Seq2Seq) Model
Cho et al. in 2014 proposed a variation of Ritter’s generative model that uses advances in deep
learning to achieve higher accuracy. Their Encoder-Decoder model uses two recurrent neural
networks (RNNs). The first encodes the <status>, or input sentence, and the second decodes the
encoded <status> and generates the desired <response>. In doing so, the model is able to translate between
languages (e.g. for statistical machine translation) and can also be used to enable chatbots to convert
between document <status> and document <response>. The Sequence to Sequence model
(Seq2Seq) is the current industry best practice for response generation, and is widely used. [46]
Figure 11: Example of Seq2Seq RNN [47]
Recurrent neural networks (RNNs) are neural networks with a hidden state $h$ and output $y$,
which is a function of a sequence $x_1 \dots x_T$. In our case, the sequence is the input string to the
chatbot. At each time step $t$, the RNN’s hidden state is updated by the below function. [46]
$$h_t = f(h_{t-1}, x_t)$$ [46]
$f$ is an activation function such as the sigmoid function or a long short-term memory (LSTM)
unit. The goal of the encoder-decoder in Cho et al.’s model is to learn the following conditional probability. [46]
$$p(y_1 \dots y_{t_y} \mid x_1 \dots x_{t_x})$$ [46]
The lengths $t_x$ and $t_y$ (i.e. the lengths of the input and response) may differ, and $x_1 \dots x_{t_x}$ is the input
sequence. At each step, the encoder’s hidden state is updated until it reaches $x_{t_x}$ and carries
information pertaining to the entire string, which we’ll call $c$. The goal of the decoder RNN is the
reverse: to predict the response string based on the decoder’s previous hidden state, the
previously generated symbol, and the encoder’s summary $c$ of the original string. [46] The
hidden state of the decoder is then given by the function:
$$h_t = f(h_{t-1}, y_{t-1}, c)$$ [46]
Both RNNs are trained jointly to maximize the conditional log-likelihood that the model outputs the
response sequence $y$, given the input sequence $x$, as seen in the equation below. [46]
$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$ [46]
$\theta$ denotes the model’s parameters and $N$ is the number of training pairs. The model is trained on real dialogues with <input> <response>
pairs. The resulting RNNs can be used for two functions. First, as in SMT, they can be used to generate
a <response> translation of an input, which is how they are used here. Second, they can be used
to score how well a response matches an input, which is how Seq2Seq is used in information
retrieval processes. [46]
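The recurrences above can be traced with a deliberately tiny example. The sketch below is not Cho et al.'s network: it uses a scalar hidden state, made-up fixed weights, and plain floats standing in for word vectors, and it initializes the decoder state directly from the encoder summary. It only illustrates how the encoder update, the decoder update, and the training objective fit together.

```python
import math


def f(h_prev: float, *inputs: float) -> float:
    """Toy recurrence: tanh of a weighted sum (the weights are made up)."""
    return math.tanh(0.5 * h_prev + sum(0.3 * v for v in inputs))


def encode(xs):
    """Apply h_t = f(h_{t-1}, x_t) over the input; the final state is c."""
    h = 0.0
    for x in xs:
        h = f(h, x)
    return h


def decode_states(c, previous_outputs):
    """Decoder states h_t = f(h_{t-1}, y_{t-1}, c), starting from h_0 = c."""
    h, states = c, []
    for y_prev in previous_outputs:
        h = f(h, y_prev, c)
        states.append(h)
    return states


def avg_log_likelihood(pair_probabilities):
    """Training objective: mean of log p(y_n | x_n) over N training pairs."""
    return sum(math.log(p) for p in pair_probabilities) / len(pair_probabilities)
```

Because the loop in `encode` runs once per input element, the same code handles inputs of any length, which is the property that distinguishes this approach from fixed-window feed-forward models.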
Cho et al.’s approach is now the basis for most chatbot SMT response generation, but it is
built on a long history of neural network-based SMT techniques. Schwenk et al. in 2012 used
feed-forward neural networks with an input capped at 7 words to generate responses. [48] Devlin
et al. in 2014 advanced the use of feed-forward neural networks, but maintained the necessity of
capping the input length a priori. Seq2Seq is widely used in response generation because it does
not require a set input length, and achieves stronger performance than earlier models. [49]
While the model is effective, Seq2Seq has a number of serious drawbacks. First,
Seq2Seq trains the encoder-decoder complex to generate the best response to a single <input> by
maximizing the conditional log-likelihood: [46]
$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$ [46]
While this function offers an approximation of a good response, it does not achieve the
chatbot’s true goal: to simulate human-human interaction by “providing interesting, diverse, and
informative feedback that keeps users engaged.” For example, bots will frequently respond to
inputs with the phrase “I don’t know” because of the frequency with which this response is used
in training sets. Second, Seq2Seq often gets stuck in infinite loops, as shown by Li et al. in
simulations of conversations between two Seq2Seq bots. [50]
3.3.5. Reinforcement Learning with Seq2Seq bots
To solve these problems, we must refine the objective of our chatbots by defining long-term
reward functions that encourage bots to take steps that will lead to good conversations – those that
are “forward-looking…interactive, informative, and coherent.” [51]
We can do so using reinforcement learning. Reinforcement Learning (RL) approaches define
dialogue as a Markov Decision Process with a state space S, transitions T, action set A, and reward
function R. Specifically, because of the possibility of errors in ASR and throughout the process,
we consider a Partially Observable MDP or POMDP since we are uncertain of the state space. In
POMDPs, we encode states as b(s), our belief about the state, rather than state itself. This allows
the system to recover from errors by shifting its beliefs between different states throughout the
dialogue. [51]
In this case, the bot’s belief state b(s) is based on the previous dialogue the bot has heard and
produced in its past engagement. Our action set A is a set of possible responses: the RNN
generates a number of candidate responses and, rather than selecting only the best-fit response as is
typical, we keep the top N responses and treat those as the actions the system can select. The
chatbot’s goal is to define the optimal response strategy, where the number of possible responses
is theoretically infinite (we allow for arbitrary length strings). [50]
The reward function is a weighted average of reward functions: $r = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$,
where $r_1$, $r_2$, and $r_3$ are defined below and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are 0.25, 0.25, and 0.5 in Li et al.’s approach. [50]
Ease of Answering: To facilitate extended dialogue, we would like to reward the bot for
statements that are easy to respond to. Li et al. create a forward-looking reward function that
takes this into account by manually creating a list of dull responses (e.g. “I don’t know what you
are talking about”, “I have no idea”, etc.). [50] They define:
$$r_1 = -\frac{1}{N_S} \sum_{s \in S} \frac{1}{N_s} \log p_{seq2seq}(s \mid a)$$ [50]
$S$ is the list of dull responses, $N_S$ is the number of dull responses, $N_s$ is the number of tokens in
dull response $s$, and $p_{seq2seq}(s \mid a)$ is the probability of generating dull response $s$ in reply to
action $a$ (the idea is that if the probability of generating these
representative dull responses is low, the chance of generating dull responses overall is low). [50]
Information Flow: The second problem Li et al. identified was repetition in answers: this
reduces the flow of information between bot and user. To solve this problem, they propose
penalizing the bot for generating answers that are too semantically similar to previous responses:
the reward function is the negative log of the cosine similarity between two consecutive responses
$h_{p_i}$ and $h_{p_{i+1}}$. [50]
$$r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\lVert h_{p_i} \rVert \, \lVert h_{p_{i+1}} \rVert}$$ [50]
Semantic Coherence: To ensure that the responses are grammatical and coherent, Li et al.
design a third and final reward function, which is the log-probability of generating this response given
the previous dialogue turns $q_i, p_i$, which we presume are coherent, plus the log-probability of
generating the previous turn $q_i$ when a reverse (backward) Seq2Seq model is run on our proposed response. [50] They
define, where $a$ is the proposed response and $N_a$ and $N_{q_i}$ are the lengths of $a$ and $q_i$:
$$r_3 = \frac{1}{N_a} \log p_{seq2seq}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{backward}_{seq2seq}(q_i \mid a)$$ [50]
To find the optimal policy, reinforcement learning is run in a simulation between two Seq2Seq
bots. The resulting bots (now running the optimal policy) are compared against Seq2Seq bots
without reinforcement learning, and judged by crowdsourced workers. The results indicate that in
single-turn evaluation, reinforcement learning does not improve performance (40% wins for RL-based, 36%
for non-RL-based bots). However, in multi-turn measurements, for which the RL algorithm was
built, RL bots win 72% of the time. [50]
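The three reward terms in this section, and their weighted combination, can be sketched as plain functions. This is an illustration under assumptions rather than Li et al.'s code: the log-probabilities and turn encodings are taken as precomputed inputs (a real system would obtain them from trained forward and backward Seq2Seq models), and the function names are hypothetical. The default weights are the 0.25, 0.25, and 0.5 quoted in the text.

```python
import math


def ease_of_answering(dull_log_probs, dull_lengths):
    """r1: negative mean of length-normalised log p(s | a) over dull replies."""
    normalised = [lp / n for lp, n in zip(dull_log_probs, dull_lengths)]
    return -sum(normalised) / len(normalised)


def information_flow(h_prev, h_next):
    """r2: negative log cosine similarity of two consecutive turn encodings.

    Assumes the cosine is positive; otherwise the logarithm is undefined.
    """
    dot = sum(x * y for x, y in zip(h_prev, h_next))
    norm_prev = math.sqrt(sum(x * x for x in h_prev))
    norm_next = math.sqrt(sum(y * y for y in h_next))
    return -math.log(dot / (norm_prev * norm_next))


def semantic_coherence(log_p_forward, len_a, log_p_backward, len_q):
    """r3: length-normalised forward plus backward log-probabilities."""
    return log_p_forward / len_a + log_p_backward / len_q


def total_reward(r1, r2, r3, weights=(0.25, 0.25, 0.5)):
    """Weighted combination r = l1*r1 + l2*r2 + l3*r3."""
    return weights[0] * r1 + weights[1] * r2 + weights[2] * r3
```

Note that identical consecutive encodings give a cosine of 1 and therefore an information-flow reward of 0, the lowest value this term takes when the cosine is positive, so repetition is never rewarded.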
3.4. Knowledge Base Creation
Chatbots are only as intelligent as the knowledge they have access to. Collecting training data
used to train machine learning classifiers used in generative bot models or building corpuses of
data used by information retrieval bots is critical to achieving human-like interactions.
Advancing the corpora of data used by bots is a complement to algorithm development in
improving bot accuracy. This section focuses on methods for data collection for chatbots. Initially,
for bots like ALICE developed in the 20th century, rule-based approaches were complemented with
manually created knowledge bases. Later, developers began to use large human-annotated
dialogue data-sets. Recent years have seen a rise in the scraping of millions of online dialogues
and automated creation of knowledge-bases used by chatbots.
3.4.1. Human-Annotated Corpora
Recognizing the value of already-existing dialogue corpora, Shawar et al. and others have
developed algorithms to convert human-annotated dialogues into AIML and other chatbot-compatible
formats. Their programs have four basic steps: (i) read the dialogues from the existing
corpora, (ii) remove all overlapping text (e.g. two speakers talking at once) and non-verbal fillers, (iii)
process the text to treat each first line in a dialogue as the <pattern> and the second line as
the <template>, and (iv) turn these dialogues into AIML files, or another chatbot-compatible
format, so that they can be used to form the basis of knowledge bases. [52]
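The four steps above can be sketched in a few lines. This is a simplified illustration, not Shawar et al.'s program: the filler list, the bracketed-annotation convention, the choice to pair every turn with the one that follows it, and the bare `<aiml>` wrapper are all assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical list of spoken fillers to strip (step ii).
FILLER = re.compile(r"\b(?:um+|uh+|er+|hmm+)\b[,.]?\s*", re.IGNORECASE)


def clean_turn(text: str) -> str:
    """Drop bracketed annotations such as [LAUGH] and a few spoken fillers."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    text = FILLER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def to_pattern(text: str) -> str:
    """AIML patterns are conventionally upper-case without punctuation."""
    return re.sub(r"[^A-Z0-9 ]", "", text.upper()).strip()


def dialogue_to_aiml(turns) -> str:
    """Pair each cleaned turn with the next one as pattern and template."""
    cleaned = [t for t in (clean_turn(turn) for turn in turns) if t]
    categories = [
        f"<category><pattern>{to_pattern(p)}</pattern>"
        f"<template>{escape(t)}</template></category>"
        for p, t in zip(cleaned, cleaned[1:])
    ]
    return "<aiml>\n" + "\n".join(categories) + "\n</aiml>"
```

For example, `dialogue_to_aiml(["Hi there!", "Hello.", "How are you?"])` yields two categories, one per adjacent pair of turns.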
Shawar et al. used the Dialogue Diversity Corpus (DDC) as the basis for their information
extraction. The DDC is composed of six corpora. (i) MICASE is a collection of dialogues recorded
and transcribed from academic speech events at the University of Michigan; it includes long
monologues which are ineffective for bot-training and non-verbal cues (e.g. “LAUGH”). (ii)
CIRCLE was collected by the Center for Interdisciplinary Research on Constructive Learning
Environments and contains dialogues from tutorials on physics, algebra, and geometry; it does not
contain consistent formatting and is difficult to parse. (iii) CSPA is a set of transcripts from
conversations recorded between 1994 and 1998 across a variety of topics; like MICASE, it contains
extensive monologues. (iv) TRAINS is a collection of task-oriented spoken dialogues; it is
well-formatted, but contains non-verbal annotations. (v) ICE-Singapore is a collection of Singapore
English; it also contains non-verbal annotation. (vi) Finally, Mishler Book is a transcription of
medical interviews; it is in image-format. [52]
The problems with each of these corpora necessitated the use of hand-crafted filtering and
processing by Shawar et al. This is generally an issue with human-annotated corpora: most were
not designed for the purpose of use by chatbots, and are not formatted in useful ways. The corpora
discussed are English language-based because English is the predominant language used in NLP
research; however, similar corpora exist in other languages. Conversion processes to AIML have
been implemented or attempted in French, Afrikaans, Spanish, and Arabic. [52]
3.4.2. Discussion Forums
Huang et al. generated the first chatbot knowledge base scraped from online discussion
forums. The first benefit of using forums is that they contain different discussion theme pages, giving
bot developers access to dialogue on a wide range of topics. Second, these dialogues may involve
multiple responses to the same question, which the bot can treat as separate <input><response>
pairs. The first disadvantage is that discussion forums are a type of asynchronous communication;
users do not respond in real-time to each other’s queries as required by bots, and this may affect
communication style. Second, quality control is a serious issue: replies on forums are edited, short,
and may be filled with errors. Third, there is no guarantee that the <response> is a reply to the
previous <input>, and not a response to an earlier <input>. [53]
To address these deficiencies, Huang et al. define a high-quality direct reply on a thread
(called an RR). They use quality-labeled threads as training data to train a support vector machine
(SVM) classifier that can, after training, be used to predict the quality of postings. [53] SVMs are
appropriate for the task because (i) this is a binary classification problem identifying quality vs.
non-quality RRs and (ii) SVMs are robust to over-fitting. [54]
Huang et al. use the following features in the SVM model: (i) whether the reply quotes the root
message, (ii) whether the reply quotes other replies, (iii) whether the reply is posted by the root poster,
(iv) the number of replies between this reply and the previous reply by the same author, (v) the number
of words, (vi) the number of content words, and (vii) the number of overlapping words between the reply
and the thread title, among other features. Huang et al. also
use author-related features, extracting information from users’ forum profiles such as the number of threads
started by the author, to assess the credibility of the reply based on its author. [53]
Figure 12: Model for SVM Classification of RR Replies [53]
After classification, three filters were applied to: (i) remove replies that use words on an
obscenity list, (ii) remove personal replies that use the word “my” (these will not apply to the
bot), and (iii) remove forum-specific terms. Based on the SVM classifier and after filtering, 53%
of replies were classified as high-quality RRs, and the resulting <thread title> <response> pairs were
used to train the chatbot in question. [53]
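The three post-classification filters might look like the sketch below. The word lists are hypothetical placeholders, and treating filter (iii) as deleting the offending words rather than the whole reply is an assumption, since the text does not say which was done.

```python
import re

OBSCENITY_LIST = {"darn", "heck"}           # hypothetical placeholder entries
FORUM_TERMS = {"thread", "repost", "bump"}  # hypothetical forum-specific terms


def tokens(text: str):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def filter_reply(reply: str):
    """Return the cleaned reply, or None if the reply should be discarded."""
    words = tokens(reply)
    if any(w in OBSCENITY_LIST for w in words):  # filter (i)
        return None
    if "my" in words:                            # filter (ii)
        return None
    kept = [w for w in reply.split()             # filter (iii)
            if w.lower().strip(".,!?") not in FORUM_TERMS]
    return " ".join(kept)
```

Replies that survive are returned with forum-specific words stripped, ready to be paired with their thread titles.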
3.4.3. Email Conversations
Shrestha et al. use email conversations to extract question-answer pairs. This approach
benefits from the prevalence of email conversations, but has a number of drawbacks including: (i)
the use of long monologues, (ii) emails are asynchronous unlike chat, (iii) stylistic differences between chat
and email communication, and (iv) lack of clarity as to the specific question-answer pairs within
longer emails. Only the last of these challenges is prohibitive, and solving this challenge is the
focus of Shrestha et al.’s research. [55]
Shrestha et al. start by identifying questions in emails by building a machine learning classifier
trained on sentences from the SWITCHBOARD corpus annotated with DAMSL tags, which
indicate whether sentences are “statement-opinion”, “statement-non-opinion”, “yes-no-question”,
or “rhetorical-question” (the corpus and tags are discussed in the Dialogue Act section). [55]
In prediction, a number of features are used including: (i) part of speech (POS) for the first five
terms, (ii) POS for the last five terms, (iii) length of utterance, and (iv) POS-bigrams in the sentence
from a list of 100 discriminating POS-bi-grams (POS-identification is discussed in the Information
Extraction section). The model used for classification is the Ripper program, which takes in output
classes, features, and training data, and outputs a set of if-then rules to be used in prediction. The
results were 72% recall, 96% precision, and an $F_1$ score of 0.82. Question extraction from emails achieves
better results than from speech because of written indicators (e.g. question mark). [55]
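As a quick arithmetic check, the F1 score is the harmonic mean of precision and recall, and the two reported figures of 72% and 96% do combine to roughly 0.82. Because the formula is symmetric, the check holds whichever figure is precision and whichever is recall. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# The two figures reported for question detection.
reported = f1_score(0.96, 0.72)
print(round(reported, 2))
```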
Given an email, we can now find questions. Step two is, given a reply, to identify the answer.
In training, human subjects label the Q-A pairs. These are used to train a classifier, which keeps track
of features including: (i) the number of non-stop words in question and answer segments and (ii) cosine
similarity and Euclidean distance between question and answer segments, among other features. In
testing, the classifier takes in a question and the <reply> email, and is able to identify the answer
based on its knowledge of the relationship between questions and answers, learned during training.
Overall, this leads to ~70% precision and 62% recall. [55]
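Two of the features named above, cosine similarity and Euclidean distance between a question segment and a candidate answer segment, can be computed over simple term-count vectors. The whitespace tokenizer and raw-count weighting are assumptions for this sketch; the text does not specify how Shrestha et al. built their vectors.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))
```

A segment that reuses the question's vocabulary scores a high cosine similarity and a small distance, which is the signal the classifier can exploit.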
3.5. Dialogue Management
Once the chatbot has selected a response, the Dialogue Manager (DM) must choose a number
of communication strategies, use language tricks to make it seem human, and deliver the message.
3.5.1. Communication Strategies
The first step in the DM system design is to choose an interaction strategy, which determines
who is leading the conversation: the user, the system, or both. When user-directed initiative is used,
the user initiates dialogue, and the system returns responses. The problem with this approach is
that the user may provide ill-defined inputs, to which the system does not know how to respond.
When system-directed initiative is used, the system directs the dialogue and user responds to
queries. This significantly reduces the amount of undefined responses from the user, but reduces
the flexibility of the system and restricts the behavior of the user. The mixed-initiative strategy
allows the user and system to lead dialogue in different contexts. [16]
The second step is to choose an error handling and confirmation strategy. With explicit
confirmation, the system repeats back to the user the query it thinks it heard. McTear, Callejas,
and Griol (2016) provide the following example:
User: “I want to go to New York.”
System: “You want directions to New York?”
This is time-consuming, but ensures the correct response. With implicit confirmation, the
system includes what it thinks it heard as part of its next question to the user.
User: “I want to go to New York.”
System: “What type of transportation do you want to use to get to New York?” [16]
Reinforcement learning, discussed at length in the response generation section, can be used in
a similar way to train bots to find optimal interaction and confirmation strategies. Singh et al., for
example, use reinforcement learning to find the optimal policy for communication style. They
define the possible actions in their action set as: (i) user directive – open question with
non-restrictive grammar in the ASR, (ii) system directive – directive question with restrictive grammar,
or (iii) mixed initiative – directive question with non-restrictive grammar. Second, Singh et al. use
reinforcement learning to find the optimal policy for confirmation: explicit confirmation or no
confirmation. The reward functions for these learning tasks were part of the experimental design:
subjects were given scripts, and these scripts corresponded to desired behavior from the system. If
this behavior occurred, the system received a reward of 1. Otherwise, it received a reward of 0.
Singh et al. found that reinforcement learning enabled the system to improve over time, and
converge on the optimal policies for communication style and confirmation. [57]
Likewise, active learning can be used to improve the bot’s ability to generate effective
responses. As discussed, response selection typically generates a list of ranked responses, the
highest ranked of which is selected. In active learning, the user is shown a number of responses
(e.g. the top five) at each turn, and selects the preferred response. This allows the system to learn
in real-time, and adapt to the specific needs of each user. There are two costs, however. First, this
is time-consuming and distracting for the user, especially if the bot is being used for an important
application (e.g. mHealth application). Second, the average bot user is not an expert in DM design
strategy, and therefore may not actually be able to choose the appropriate response. [58]
3.5.2. Language Tricks
Many of the response generation strategies previously discussed produce responses alongside
confidence scores that indicate how confident the bot is that its response is appropriate. For responses
that have low probabilities of being appropriate, bot developers employ a number of “language
tricks” that improve conversation outcomes.
Yu et al. [56] discuss a number of these strategies used by the bots they develop, including:
A. Switch Topic: Chatbot proposes a new topic, for example talking about sports, the
weather, or a number of other pre-programmed, buffer topics. While this can generate
non-sequiturs, users are more likely to rate these responses as appropriate than the
low-confidence responses otherwise generated.
B. Ask Open Questions: Rather than simply saying “I don’t know” when the bot cannot
understand the input, it says “Sorry, I don’t know. Could you tell me something
interesting?” This is effectively the same answer, but gives the user a next step to
continue the conversation.
C. Tell a Joke: This may be a non-sequitur, but continues the conversation.
D. Elicit more information: If the chatbot doesn’t have an adequate response, it may
simply ask the user to provide more information. This is an open-ended question that
continues the conversation and gives the bot another chance at interpreting the new
input it will receive.
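The strategies listed above amount to a fallback policy keyed on the confidence score. The sketch below shows one way this could be wired up; the threshold of 0.5, the random choice among strategies, and the canned sentences (other than the open question quoted in the text) are assumptions, not details from Yu et al.

```python
import random

# One canned utterance per strategy; all but "open_question" are invented.
FALLBACKS = {
    "switch_topic": "By the way, have you been following any sports lately?",
    "open_question": "Sorry, I don't know. Could you tell me something interesting?",
    "joke": "Why did the chatbot cross the road? To get to the other site.",
    "more_info": "Could you tell me a bit more about that?",
}


def choose_reply(candidate: str, confidence: float,
                 threshold: float = 0.5, rng=random) -> str:
    """Return the generated reply, or a language trick when confidence is low."""
    if confidence >= threshold:
        return candidate
    strategy = rng.choice(sorted(FALLBACKS))
    return FALLBACKS[strategy]
```

Passing a seeded `random.Random` instance as `rng` makes the fallback choice reproducible, which is convenient when testing a dialogue manager.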
3.5.3. Dialogue Design Principles
Chatbots provide these functions through a dialogue interface, and are developed to adhere to
a number of dialogue design principles including: (i) disambiguation – bots clarify user inputs
when they could have multiple meanings (e.g. input: “Was Amadeus nominated for an academy
award?” and response: “Is Amadeus the movie's title?”), (ii) relaxation – ability to drop constraints
to properly handle user inputs (e.g. input: “Are there any flights today?” and response: “No. Would
you like me to find one tomorrow?”), (iii) confirmation – checking user details when a task could
not be carried out or for important decisions (e.g. response: “You want me to book you the 7am flight
from Philadelphia to San Francisco?”), and (iv) completion – asking for necessary details left out
in a user request (e.g. input: “I would like to book the 7am flight to SF” and response: “Which
airport are you flying out of?”). [59]
3.5.4. Human Imitation Strategies
There are a number of different strategies that can be employed to make bots appear more human.
Pereira et al. [60] suggest a few common approaches:
A. Personality Development: Most chatbots from ELIZA to ALICE and CHARLIE have
names, as well as related personalities in the hopes of making them appear more human.
ELIZA’s DOCTOR script took on the personality of a Rogerian psychologist, while
PARRY took on the personality of a paranoid mental patient (which enabled it to get
away with non-sequitur answers).
B. Direct the Conversation: ELIZA commonly asked the user questions to elicit user
participation and minimize the chance for error. This type of approach is frequently
used in chatbots to make them appear more human and was used by Converse in 1997
to successfully convince a Loebner prize judge that it was a human, for the first five
minutes of conversation.
C. Small Talk: Faced with task-specific situations, humans revert to “neutral, non-task
oriented conversation about safe topic, where no specific goals need to be achieved.”
For bots to appear human, developers build in small-talk functionality which builds
rapport with the human, and avoids silence (if the human is not initiating dialogue).
D. Human-like Failures: Chatbots often contain flaws, and while too great a propensity
to make mistakes does not bode well, flawless communication may also indicate that
the bot is not human. Including disfluencies in dialogue makes bots appear human.
Hung et al. introduce the notion of goal-orientation as a mechanism for making bots appear
more human. They argue that to truly mimic humans machines must be able to understand goals,
and propose a framework in which chatbots engage in goal recognition, goal bookkeeping, and
context topology. Using many of the NLP methods previously discussed including keyword
extraction and POS tagging, bots use linguistic analysis to infer the goal of the human with which
they are speaking at each turn of speech, and add these goals to a goal stack. Goals are associated
with specific contexts, and bots use context-based reasoning (CxBR) to direct their responses; as
goals change, contexts change, and bots’ communication styles may change. [61]
3.5.5. Personality Development
Personality infusion into chatbots, at its most basic level, can be accomplished directly by
manually creating the <pattern><template> pairs, and incorporating personality features across the
<templates>. This was the strategy employed by ELIZA (e.g. its DOCTOR script). This makes the
personality hard to change, however, unless an entirely different script is used.
Galvão et al. propose an alternate model in which a “personality component” exists outside
the knowledge base. Particular personality elements can be selected for each different bot (or
instance of the same bot), or based on the dialogue using natural language processing. Galvão et
al. propose tying rules to personality elements, so that as personality changes so do the bot’s
selections of responses. Their bot, for example, “only answers personal questions in case it is happy”
and “only searches the Internet for answering users’ doubts in case he likes the user.” [62]
Figure 13: Example bot response based on emotions [62]
Figure 14: Persona-AIML Model [62]
In practice, emotions are tied to responses in two ways. First, the system adopts a different set
of rules depending on its emotions or mood. This can be modeled as separate knowledge bases
accessed for each emotion, or as <category> <template1> <template2>… <templateN> sets, where
each template corresponds to one of N emotions. Second, each <template> response contains one
or more emotion associated with it, each with a probability, such that depending on where the
conversation goes, the bot will change its emotions. When emotional tags are selected per the
probability distribution, they are passed to a personality module, which can be implemented as a
Bayesian Belief Network (BBN). The personality module will, given a current emotion and a new
emotional tag, change the bot’s “mood.” [63]
In Thalmann’s implementation, changing personality/emotion changes the facial expression of
the avatars of bots. For bots without physical representations, use of punctuation (e.g. exclamation)
and emoticons may be used to express emotion as well. [63] Personality/emotion could also change
the speed or volume of speech in a speech-based chatbot system. [64]
Responses do not have to change radically based on emotions, as in the previous
implementations. Ball et al. propose using paraphrases for existing speech based on emotional
state, such as saying “if you insist,” rather than “yes” when frustrated. Likewise, they propose
using different paraphrases based on the specific bot (or bot instance’s) personality, such as
choosing between “you should definitely” or “perhaps you might like to.” While the discussion of
emotion-based response has so far been focused on entertainment, chatbots can also incorporate
task orientation to improve the user experience. For example, if speech is rushed, the bot could
shift its personality to terse, and give short replies. [64]
3.6. Text to Speech
Text-to-Speech (TTS) is the final step of the process, and converts the response generated to
speech, which is returned back to the user. The first step of TTS is text analysis, in which text is
converted into phonemes with pitch and duration. The second step is waveform synthesis, in which
segments of recorded speech corresponding to each phoneme are concatenated to form speech.
3.6.1. Text Analysis
Text Analysis begins with a normalization process in which text is broken up into its
component parts; chunks of text are broken into sentences, sentences into words and punctuations,
and words into their respective phonemes. Sentence tokenization involves searching for typical
sentence break punctuation, such as the exclamation mark, question mark, colon, semi-colon, or
period. The challenge is that these punctuations do not always mark the end of a sentence; to solve
this problem, machine learning classifiers are trained on text to learn the difference between
sentence-ending and non-sentence-ending punctuations. [65]
Normalization also includes the conversion of non-standard words into natural language.
Examples include numbers, abbreviations, and acronyms which can be pronounced in a number
of different ways. Finally, normalization involves the disambiguation of homographs, or words
that are pronounced differently and contain different meaning based on context. For these words
(e.g. “I used my car” vs. “I put my car to use”), part-of-speech (POS) analysis can be run to tag the
part of speech of the homograph, and in doing so, disambiguate its pronunciation. [65]
The second step in text analysis is pronunciation. This involves access to a pronunciation
lexicon which includes mappings between words/phonemes and their pronunciations, as well as
name lexicons, which map names to their pronunciations. The character-to-phoneme mapping
process was initially rule-based, but has since progressed and now uses advanced algorithms to
generate the most accurate mappings based on context (there is no one-to-one mapping). [65]
Pagel et al. used decision trees for the process [66] and Sejnowski and Rosenberg trained a
feed-forward neural network. [67] Pronunciation alone would lead speech to sound like a string of
words. Prosody information is needed to add proper “phrasing, prominence, and intonation,” and a
number of machine learning classifiers have been developed and trained to provide proper
mapping between characters and prosody information. [65]
3.6.2. Waveform Synthesis
Waveform synthesis is the voicing of the sounds generated in text analysis, and can occur
using a number of different processes. Formant synthesis involves a number of resonators
connected either in series or in parallel that pass the output of frequency production from one to the next
(typically three are required for understandable sound, and five for high-quality sound production).
Articulatory synthesis theoretically is intended to generate “speech by direct modeling of the
human articulator behavior,” but does not produce high-quality results in practice.
Concatenative synthesis, by contrast, generates speech by connecting a number of pre-recorded
speech units. The most common unit used is the di-phone which starts at the middle of one
phoneme and extends to the middle of a second, capturing both sounds and the transition. The
problem with concatenative synthesis is that the prosody of units may not align. To solve this problem,
unit selection synthesis stores multiple instances of each unit, each with a different prosody, and
selects the unit to deliver based on the context. Storing multiple instances of each unit can be
space-inefficient. Statistical models (e.g. HMM-synthesis models) exist to solve this problem by
mapping from one instance onto the variant with the correct prosody. [65]
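The di-phone and unit-selection ideas can be illustrated with a short sketch. The silence padding at the utterance edges, the dictionary layout of stored units, and the use of pitch alone as the selection criterion are assumptions made for the example; real unit-selection systems weigh several prosodic and join costs.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into di-phone units.

    Each unit spans from the middle of one phoneme to the middle of the
    next; a silence marker "sil" pads both ends of the utterance.
    """
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def select_unit(instances, target_pitch: float):
    """Pick the stored instance whose pitch is closest to the target."""
    return min(instances, key=lambda unit: abs(unit["pitch"] - target_pitch))


units = to_diphones(["k", "ae", "t"])
# units == ["sil-k", "k-ae", "ae-t", "t-sil"]
```

A sequence of n phonemes yields n + 1 di-phones under this padding convention, each of which would then be looked up in the recorded inventory.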
4. Applications & Development
4.1. IBM Watson Case Study
IBM’s Watson gained popular recognition for its Jeopardy! win. The architecture behind
Watson is a Question-Answer chatbot; in this section we take a deep dive into its workings.
4.1.1. Natural Language Processing
The first step is Question Analysis, in which Watson runs natural language processing
techniques to classify the problem, including shallow parsing, deep parsing, semantic role labeling,
coreference relations, and named-entity recognition, most of which are discussed in the section on
natural language processing (NLP). [68] Deep parsing techniques in Watson involve taking any
input sentence and performing tokenization and segmentation, followed by a number of linguistic
analyses (e.g. morpholexical and syntactic analysis). Tokens are converted into one-word phrases,
and then further processed into sequences of multiple word phrases that carry meaning. Through
this process, Watson is able to break down a sentence into its component parts, label each, and
understand the functional and grammatical significance of each for use in response generation. [69]
4.1.2. Response Generation
The second step is Hypothesis Generation. Based on its understanding of the question, Watson
searches its external knowledge bases for an answer, and generates all possible content that could
contain the answer. This “primary search” phase is focused on maximizing recall, and assumes
later analytical work will be completed to glean the correct response. Watson’s goal is to achieve
85% recall for the top 250 candidates (i.e. for 85% of questions, the correct answer will be
contained in one of the top 250 candidate responses retrieved). [68]
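The recall target above can be made concrete with a small calculation. The Python sketch below computes the fraction of questions whose correct answer appears among the top k retrieved candidates; the function name `recall_at_k` and the toy data are illustrative assumptions, not part of Watson.

```python
def recall_at_k(candidates_per_question, gold_answers, k=250):
    """Fraction of questions whose gold answer is among the top-k candidates."""
    hits = 0
    for candidates, gold in zip(candidates_per_question, gold_answers):
        if gold in candidates[:k]:
            hits += 1
    return hits / len(gold_answers)


# Toy data: ranked candidate lists for four questions, and the correct answers.
candidates = [
    ["Paris", "Lyon", "Nice"],
    ["Mars", "Venus", "Jupiter"],
    ["Nile", "Amazon", "Congo"],
    ["Everest", "K2", "Denali"],
]
gold = ["Lyon", "Saturn", "Nile", "Denali"]

recall_top2 = recall_at_k(candidates, gold, k=2)
recall_top3 = recall_at_k(candidates, gold, k=3)
```

Widening the candidate window can only raise recall, which is why the primary search phase retrieves generously and leaves precision to the later scoring steps.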
Search techniques include text search engines, knowledge base search, document search, and
passage search, and multiple search queries are issued per question. In document search, Watson
extracts the titles of documents that contain the text of the search query as hypotheses. In passage
search, named-entity recognition techniques are used on passages with the text of the search query,
and these named-entities are generated as hypotheses. Passage search and document search use
existing platforms such as Lucene and Indri to find relevant documents that may contain the answer
to the search query. [68]
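The difference between document search and passage search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the tiny in-memory corpus stands in for an index served by Lucene or Indri, and a capitalized-word pattern stands in for a real named-entity recognizer. The function names are hypothetical.

```python
import re

# Hypothetical corpus; a real system would query a search index instead.
DOCS = [
    {"title": "Canberra", "text": "Canberra is the capital city of Australia."},
    {"title": "Sydney", "text": "Sydney is the largest city in Australia, but Canberra is the capital."},
    {"title": "Wellington", "text": "Wellington is the capital of New Zealand."},
]


def contains_all(text, query_terms):
    # True when every query term occurs as a word in the text.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(term in words for term in query_terms)


def document_search(docs, query_terms):
    # Hypotheses are the titles of the matching documents.
    return [d["title"] for d in docs if contains_all(d["text"], query_terms)]


def passage_search(docs, query_terms):
    # Hypotheses are the entities found inside the matching passages.
    # Runs of capitalized words are a crude stand-in for named-entity recognition.
    hypotheses = []
    for d in docs:
        if contains_all(d["text"], query_terms):
            for entity in re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", d["text"]):
                if entity not in hypotheses:
                    hypotheses.append(entity)
    return hypotheses


query = ["capital", "australia"]
title_hypotheses = document_search(DOCS, query)
entity_hypotheses = passage_search(DOCS, query)
```

Both searches return wrong candidates alongside the right one (for instance "Sydney"), which is consistent with a recall-oriented phase that defers filtering to evidence scoring.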
The third step is hypothesis and evidence scoring, in which the system collects evidence on
each hypothesis, and applies a ranking algorithm to determine the most likely response. Scoring
functions measure: “the degree of match between a passage’s predicate-argument structure and the
question, passage source reliability, geospatial location, temporal relationships, taxonomic
classification, the lexical and semantic relations the candidate is known to participate in, the
candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on.”
For example, Watson uses a passage score that calculates IDF-weighted terms in common between
the input and hypothesis, one that measures the longest common subsequence between input and
hypothesis, and a third that measures the alignment of logical forms in input and hypothesis. [68]
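Two of the passage scorers mentioned above can be sketched as follows. This is a minimal Python illustration, not Watson's implementation: the smoothed IDF formula, the normalization by the question's total IDF weight, and the function names are assumptions chosen for clarity.

```python
import math


def idf(term, corpus):
    # Smoothed inverse document frequency; rarer terms receive higher weight.
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def idf_overlap_score(question, passage, corpus):
    # IDF weight of the shared terms, normalized by the question's total weight.
    question_terms = set(question)
    shared = question_terms & set(passage)
    total = sum(idf(t, corpus) for t in question_terms)
    return sum(idf(t, corpus) for t in shared) / total if total else 0.0


def lcs_length(a, b):
    # Length of the longest common subsequence of two token lists,
    # computed by dynamic programming one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


question = "the capital of australia".split()
passage_1 = "canberra is the capital city of australia".split()
passage_2 = "sydney is in australia".split()
corpus = [passage_1, passage_2]

score_1 = idf_overlap_score(question, passage_1, corpus)
score_2 = idf_overlap_score(question, passage_2, corpus)
```

The first passage shares every question term and preserves their order, so it scores higher on both measures than the second, which shares only "australia".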
Watson also includes reasoning capabilities. For example, temporal reasoning allows it to detect
inconsistencies in dates, and geospatial reasoning allows it to detect inconsistencies in
geographies (e.g. Sydney is not an Asian city). All of the aforementioned scorers are aggregated
into dimensions on which each
response is scored, such as “Taxonomic, Geospatial (location), Temporal, Source Reliability,
Gender, Name Consistency, Relational, Passage Support, Theory Consistency, and so on.” [68]
The fourth step is ranking and confidence estimation. Watson uses a machine learning
approach to achieve this step; it takes in the above dimensions as features and trains a classifier on
training data with known correct answers, thus creating weights for each dimension. Given the
variety of dimensions, different classifiers are applicable to each. To solve this problem, Watson
trains separate classifiers to handle sub-sets of features, and then uses an ensemble of these
classifiers to generate an overall ranking. [68]
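The ranking step can be illustrated with a small Python sketch. The source does not specify which classifiers Watson uses, so logistic regression trained by gradient descent is used here purely as a stand-in; the feature names, the training rows, and the split into two feature subsets are likewise hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=500, lr=0.5):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Assumed dimensions: [taxonomic, temporal, passage_support, source_reliability]
train_x = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.9, 0.9, 0.6],
    [0.2, 0.1, 0.3, 0.5],
    [0.1, 0.3, 0.2, 0.4],
]
train_y = [1, 1, 0, 0]  # 1 = known correct answer, 0 = known incorrect answer

# One classifier per subset of features, combined by averaging.
SUBSETS = [(0, 1), (2, 3)]


def project(x, idx):
    return [x[i] for i in idx]


models = [train_logistic([project(x, s) for x in train_x], train_y) for s in SUBSETS]


def confidence(x):
    # Ensemble confidence: mean of the per-subset classifier outputs.
    return sum(predict(m, project(x, s)) for m, s in zip(models, SUBSETS)) / len(models)


def rank(candidates):
    return sorted(candidates, key=lambda c: confidence(c[1]), reverse=True)


candidates = [
    ("Sydney", [0.3, 0.2, 0.4, 0.5]),
    ("Canberra", [0.9, 0.9, 0.8, 0.8]),
]
ranked = rank(candidates)
```

The learned weights play the role described above: they determine how much each dimension contributes to the final confidence, and the averaged output doubles as the confidence estimate for the top-ranked answer.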
4.1.3. Knowledge Base
Watson’s knowledge, primarily, is generated through access to a collection of online
documents through two platforms: Indri and Lucene. Accessing these platforms, Watson is able to
pair queries with document sets, and extract information containing the answer either from the
titles of the documents or from the contents of the documents. Typical bots run this same procedure
using the Google search engine, Bing search engine and Wikipedia; Watson did not in this case
because it was run offline as part of the Jeopardy! rules. [68]
Indri is an open source search engine developed out of a collaboration between the University
of Massachusetts and Carnegie Mellon University. Indri was built on top of the Lemur project, a
toolkit built for language modeling and information retrieval. [70] Indri is widely used because it
uniquely allows for “complex phrase matching, synonyms, weighted expressions, Boolean
filtering, numeric (and dated) fields, and the extensive use of document structure (fields).” [71]
Lucene is likewise a text search engine library. It is developed under the Apache Software
Foundation and is available as open source. [70] Lucene provides ranked searching,
“phrase queries, wildcard queries, proximity queries, range queries, fielded searching, multiple-index searching with merged results, [and a]… configurable storage engine.” [72]
Figure 15: Watson Retrieval Process [68]
4.2. Security Considerations
Chatbots are widely used in a variety of applications, including mobile health (mHealth). For both
regulatory and ethical reasons, then, the privacy of chatbots (or lack thereof) is a significant design
consideration and a potential architectural vulnerability. [74]
4.2.1. Security Flaws in Chatbot Platforms
Chatbots are primarily accessed through chat/messenger applications (e.g. Facebook, Skype,
etc.). This is problematic from a security perspective given the vulnerabilities of these systems.
Based on the Electronic Frontier Foundation (EFF) Secure Messaging Scorecard, Facebook
Messenger, for example, is not secure in five of seven tested dimensions. First, communication is
not encrypted by default, which means that Facebook, or a government agency, can read these
messages. More importantly, failure to use end-to-end encryption could allow individuals that
share a local wireless network to sniff each other’s credentials, access their accounts, and read
their private conversations. Second, Messenger does not guarantee identity confirmation. Third,
past communications are not hidden, and can be stolen if a malicious attacker steals user credentials.
Fourth, the code is not open to independent review. Fifth, the security design is not properly
documented. These vulnerabilities extend to bots utilizing the Messenger platform. [75]
Chatbots themselves have serious privacy implications as well. Users can order an Uber, buy
plane tickets, and make payments through Messenger bots. Personal data is stored on the
Messenger platform, and often sent, unencrypted, to external tools such as wit.ai, api.ai, and IBM
Watson for NLP analysis. More importantly, bot creators themselves have access to personal
information inputted into applications, including pictures and forms of identification. [75] Speech-based bots such as Amazon Echo collect even more information: whenever the “wake word” (e.g.
Alexa) is used, Echo records the conversation and stores it on the cloud. This again is problematic
as consumers are exposing themselves to the possibility of a privacy breach. [76]
4.2.2. Malicious Chatbots
While the above privacy violations are unintended by the bot creators, chatbot users also expose
themselves to the possibility of intentionally malicious chatbots, created to exploit their users.
Already, a large number of spam-bots exist to send spam to internet users.
While these bots are detectable, and often shut down, the improving conversation abilities of bots
will make it easier for malicious bots to disguise themselves. These chatbots, able to build trust
with users, may be programmed to exploit that trust, sell products, infect computers with malware,
and steal identity information. [77] Lauinger et al.’s chatbot HoneyBot, for example, achieved a
click-through rate of 37% when sending phishing messages using a man-in-the-middle attack. [78]
4.3. Applications
4.3.1. Virtual Personal Assistants (VPAs)
Virtual Personal Assistants (VPAs) are among the most popular applications of chatbot
technology. In particular, Apple Siri, Microsoft Cortana, Google Assistant, and Amazon Alexa
have emerged to offer users services over speech and text interfaces.
Apple Siri: Siri is a virtual assistant available exclusively on Apple products, and has access
to Apple applications including Mail, Contacts, Messages, Maps, and Safari. Siri can read users’
email, text contacts, change the music playing, make calls, find restaurants, find books, set alarms,
and give directions. Siri’s gender, accent, and language are configurable. [79]
Microsoft Cortana: Cortana can be used for basic search functionality (e.g. search for the
weather), can access users’ Google calendars, read their Outlook emails, give time estimations for
travel, give directions, and integrate with OneNote to show users their notes. [80]
Google Assistant: Google Assistant is an extension of the basic “OK Google” functionality
that allows users to conduct searches and control their mobile devices through voice commands.
Assistant is built into Google phones and Android OS, and will be integrated into some cars. [81]
Amazon Alexa: Alexa can access the weather, connect to radio and television stations, and
has partnerships with a number of services, including: JustEat, Uber, FitBit, The Telegraph,
Spotify and Nest, among others. These services can be accessed with the Alexa interface. [82]
4.3.2. Consumer Domain-Specific Bots
Transportation Bots: Instalocate provides real-time flight tracking, security wait times,
ability to file for compensation for delayed flights, ability to push notifications to WhatsApp, gate
and terminal information, weather updates, airport directions, baggage information, and an ability
to book cabs or Uber vehicles at the airport. The bot uses rule-based heuristics for responses;
conversation is restricted to system-directed initiatives. [83]
Dating: Foxsy suggests new friends to users based on user-specified settings; it is effectively a
Tinder-like application embedded within the Messenger platform. The bot uses a very basic rule-based heuristic for responses; it allows for user-directed conversation, but replies with “Sorry, I
didn’t get it” to most queries. [84]
Meditation: Meditate Bot provides users with information about breathing exercises and body
scans and allows users to schedule daily meditations. The bot notifies users when it is time for
their daily meditations. [85] Peaceful Habit offers similar functionality in scheduling and timing
meditation routines for its users. [86]
Fitness Bots: GymBot allows users to keep track of the number of exercises they’ve completed
and track their improvement over time. [87] FitCircle generates five-minute workouts for users.
Forksy asks you about what food you’re eating, and scolds you for unhealthy choices. Whole
Foods bot allows you to search for healthy recipes with text or emoji requests. [88]
Weather Bots: Poncho is a weather bot with a distinctive personality that allows users to
request the current weather conditions. [89] The Weather Channel offers a more traditional
weather bot, powered by IBM Watson. [90] Other weather bots include: HippoBot, Weatherman,
and Yahoo weather; they provide similar functionality with differing personality settings. [91]
Medical Bots: The symptoms defining medical conditions are often well-defined, and chatbots
have the ability to improve rates of accurate self-diagnosis. One of the main problems with self-diagnosis is that most individuals do not know what questions to ask in order to diagnose a medical
condition, and therefore are unable to use platforms like WebMD properly. Chatbots, which are
able to ask follow-up questions until they are ready to make a diagnosis, solve that problem. Baidu,
the Chinese search engine, for example, launched a diagnosis app called Melody in 2016, which
responds to medical questions from users. [92]
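Several of the bots in this section (e.g. Instalocate and Foxsy) are described as using rule-based heuristics with a generic reply for unrecognized input. The Python sketch below shows one way such a bot could be structured; the patterns and replies are invented for illustration and are not taken from any of the bots listed.

```python
import re

# Hypothetical rules: each pairs a keyword pattern with a canned, system-directed reply.
RULES = [
    (re.compile(r"\b(flight|gate|terminal)\b", re.I), "Which flight number should I track?"),
    (re.compile(r"\b(weather|forecast)\b", re.I), "Which city do you want the weather for?"),
]
FALLBACK = "Sorry, I didn't get it"


def reply(message):
    # Return the reply of the first matching rule, or the fallback if none match.
    for pattern, response in RULES:
        if pattern.search(message):
            return response
    return FALLBACK


flight_reply = reply("What gate is my flight at?")
unknown_reply = reply("Tell me a joke")
```

Because every reply is tied to a fixed pattern, coverage is limited to what the designer anticipated, which is consistent with the frequent fallback responses noted for Foxsy.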
5. References
[1] Jia, J. (2003). The Study of the Application of a Keywords-based Chatbot System on the
Teaching of Foreign Languages. ArXiv preprint cs/0310018. Retrieved from
https://arxiv.org/abs/cs/0310018
[2] Sansonnet, J.-P., Leray, D., & Martin, J.-C. (2006). Architecture of a Framework for Generic
Assisting Conversational Agents. Intelligent Virtual Agents Lecture Notes in Computer Science,
145–156. http://doi.org/10.1007/11821830_12
[3] Turing, A. M. (1950). Computing Machinery And Intelligence. Mind, LIX(236), 433–460.
http://doi.org/10.1093/mind/lix.236.433
[4] Weizenbaum, J. (1966). ELIZA---a computer program for the study of natural language
communication between man and machine. Communications of the ACM, 9(1), 36–45.
http://doi.org/10.1145/365153.365168
[5] Colby, K. M. (1981). Modeling a paranoid mind. Behavioral and Brain Sciences, 4(04), 515.
http://doi.org/10.1017/s0140525x00000030
[6] Wallace, R. S. The Anatomy of A.L.I.C.E. Retrieved from
http://www.alicebot.org/anatomy.html
[7] Weinberger, M. (2017, January 14). Why Amazon's Echo is totally dominating - and what
Google, Microsoft, and Apple have to do to catch up. Retrieved from
http://www.businessinsider.com/amazon-echo-google-home-microsoft-cortana-apple-siri-2017-1
[8] Jurafsky, D. Conversational Agents. Retrieved from
http://web.stanford.edu/class/cs124/lec/chatbot.pdf
[9] A framework for chatbot evaluation. (2017, January 24). Retrieved from
https://isquared.wordpress.com/2017/01/24/a-framework-for-chatbot-evaluation/
[10] Walker, M. A., Litman, D. J., Kamm, C. A., & Abella, A. (1997). Paradise. Proceedings of
the 35th annual meeting on Association for Computational Linguistics -.
http://doi.org/10.3115/976909.979652
[11] Bates, M., & Ayuso, D. (1991). A proposal for incremental dialogue evaluation.
Proceedings of the workshop on Speech and Natural Language - HLT '91.
http://doi.org/10.3115/112405.112472
[12] Hung, V., Elvir, M., Gonzalez, A., & Demara, R. (2009). Towards a method for evaluating
naturalness in conversational dialog systems. 2009 IEEE International Conference on Systems,
Man and Cybernetics. http://doi.org/10.1109/icsmc.2009.5345904
[13] Abdul-Kader, S., & Woods, J. (2015). Survey on Chatbot Design Techniques in Speech
Conversation Systems. International Journal of Advanced Computer Science and Applications,
6(7). http://doi.org/10.14569/ijacsa.2015.060712
[14] Nass, C., & Brave, S. (2007). Wired for speech: how voice activates and advances the
human-computer relationship. Cambridge, MA: MIT Press.
[15] Gruhn, R. E., Minker, W., & Nakamura, S. (2013). Statistical Pronunciation Modeling for
Non-Native Speech Processing. Berlin: Springer Berlin.
[16] McTear, M., Callejas, Z., & Griol, D. (2016). The conversational interface: talking to smart
devices. Cham: Springer.
[17] Mohamed, A.-R., & Hinton, G. (2010). Phone recognition using Restricted Boltzmann
Machines. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.
http://doi.org/10.1109/icassp.2010.5495651
[18] Tur, G., & Mori, R. D. (2011). Spoken language understanding: systems for extracting
semantic information from speech. Hoboken, NJ: Wiley.
[19] Kral, P., & Cerisara, C. (2010). Dialogue Act Recognition Approaches. Computing and
Informatics, 29, 227–250. http://doi.org/10.1109/mlsp.2008.4685529
[20] Draft of DAMSL: Dialog Act Markup in Several Layers*. Retrieved from
http://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/
[21] WS-97 Switchboard DAMSL Coders Manual. Retrieved from
https://web.stanford.edu/~jurafsky/ws97/manual.august1.html
[22] Dhillon, R., Bhagat, S., Carvey, H., & Shriberg, E. (2004, February 9). Meeting Recorder
Project: Dialog Act Labeling Guide. Retrieved from
http://www1.icsi.berkeley.edu/ftp/pub/speech/papers/MRDA-manual.pdf
[23] Carletta, J., Isard, S., Doherty-Sneddon, G., Isard, A., Kowtko, J. C., & Anderson, A. H.
(1997). The Reliability of a Dialogue Structure Coding Scheme . Association for Computational
Linguistics, 23(1), 13–31. Retrieved from http://www.aclweb.org/anthology/J97-1002
[24] Reithinger, N., & Klesen, M. (1997). Dialogue Act Classification Using Language Models.
In Proceedings of EuroSpeech, 2235–2238.
[25] Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., … Meteer, M.
(2000). Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational
Speech. Computational Linguistics, 26(3), 339–373. http://doi.org/10.1162/089120100561737
[26] Stolcke, A., & Shriberg, E. (1998). Dialog Act Modeling for Conversational Speech. AAAI
Technical Report SS-98-01, 98–105. Retrieved from
https://pdfs.semanticscholar.org/d7de/4acec556524a8ab2b00aa4414768bc1617e9.pdf
[27] Ries, K. (1999). HMM and neural network based speech act detection. 1999 IEEE
International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99
(Cat. No.99CH36258). http://doi.org/10.1109/icassp.1999.758171
[28] Wright, H. (1998). Automatic Utterance Type Detection Using Suprasegmental Features.
International Conference on Spoken Language Processing. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.5099&rep=rep1&type=pdf
[29] Nielsen, M. A. (2015). Chapter 3: Improving the way neural networks learn. In Neural
Networks and Deep Learning. Determination Press.
[30] Andernach, T., Poel, M., & Salomons, E. (1997). Finding Classes of Dialogue Utterances
with Kohonen Networks. Workshop Notes of the ECML/MLnet Workshop on Empirical
Learning of Natural Language Processing Tasks, 85–94. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.5080&rep=rep1&type=pdf
[31] Rotaru, M. (2002). Dialog Act Tagging using Memory-Based Learning. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.7922&rep=rep1&type=pdf
[32] Jeong, M., & Lee, G. (2006). Jointly Predicting Dialog Act And Named Entity For Spoken
Language Understanding. 2006 IEEE Spoken Language Technology Workshop.
http://doi.org/10.1109/slt.2006.326818
[33] He, Y., & Young, S. (2005). Semantic processing using the Hidden Vector State model.
Computer Speech & Language, 19(1), 85–106. http://doi.org/10.1016/j.csl.2004.03.001
[34] Introduction to Support Vector Machines. Retrieved from
http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
[35] Lokman, A. (2010). One-Match and All-Match Categories for Keywords Matching in
Chatbot. American Journal of Applied Sciences, 7(10), 1406–1411.
http://doi.org/10.3844/ajassp.2010.1406.1411
[36] AbuShawar, B., & Atwell, E. ALICE Chatbot: Trials and Outputs. Retrieved from
http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1405-55462015000400625
[37] Yan, Z., Duan, N., Bao, J., Chen, P., Zhou, M., Li, Z., & Zhou, J. (2016). DocChat: An
Information Retrieval Approach for Chatbot Engines Using Unstructured Documents.
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). http://doi.org/10.18653/v1/p16-1049
[38] Ganguly, D., Roy, D., Mitra, M., & Jones, G. J. (2015). Word Embedding based
Generalized Language Model for Information Retrieval. Proceedings of the 38th International
ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15.
http://doi.org/10.1145/2766462.2767780
[39] Wu, Y., Wu, W., L, Z., & Zhou, M. (2016). Sequential Match Network: A New
Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. ArXiv preprint
1612.01627. Retrieved from https://arxiv.org/pdf/1612.01627.pdf
[40] Chung, J., Gulcehre, C., Cho, K. H., & Bengio, Y. (2014). Empirical Evaluation of Gated
Recurrent Neural Networks on Sequence Modeling. ArXiv:1412.3555v1. Retrieved from
https://arxiv.org/pdf/1412.3555.pdf
[41] Statistical Machine Translation. Retrieved from http://www.statmt.org/
[42] Ritter, A. (2011). Data-Driven Response Generation in Social Media. EMNLP '11
Proceedings of the Conference on Empirical Methods in Natural Language Processing, 583–593.
Retrieved from https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/mt_chat.pdf
[43] Moses. Retrieved from http://www.statmt.org/moses/?n=Moses.Tutorial
[44] Koehn, P. Phrase-Based Models. Statistical Machine Translation, 127–154.
http://doi.org/10.1017/cbo9780511815829.006
[45] Jaccard Similarity and Shingling. Retrieved from
https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard Shingle.pdf
[46] Cho, K., Merrienboer, B. V., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP). http://doi.org/10.3115/v1/d14-1179
[47] Ramamoorthy, S. Chatbots with Seq2Seq. Retrieved from
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
[48] Schwenk, H. (2012). Continuous Space Translation Models for Phrase-Based Statistical
Machine Translation. Proceedings of COLING 2012: Posters, 1071–1080. Retrieved from
https://aclweb.org/anthology/C/C12/C12-2104.pdf
[49] Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., & Makhoul, J. (2014). Fast and
Robust Neural Network Joint Models for Statistical Machine Translation. Proceedings of the
52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). http://doi.org/10.3115/v1/p14-1129
[50] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., & Jurafsky, D. (2016). Deep
Reinforcement Learning for Dialogue Generation. ArXiv:1606.01541. Retrieved from
https://aclweb.org/anthology/D16-1127
[51] Rieser, V., & Lemon, O. (2011). Reinforcement Learning for Adaptive Dialogue Systems. Edinburgh:
Springer.
[52] Shawar, B. A., & Atwell, E. (2003). Machine learning from dialogue corpora to generate
chatbots. School of Computing, University of Leeds. Retrieved from
https://pdfs.semanticscholar.org/8bf9/391940801b762de98991c0fb5a426777d104.pdf
[53] Huang, J., Zhou, M., & Yang, D. (2007). Extracting Chatbot Knowledge from Online
Discussion Forums. IJCAI'07 Proceedings of the 20th international joint conference on Artifical
intelligence, 423–428. Retrieved from http://www.ijcai.org/Proceedings/07/Papers/066.pdf
[54] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing
Surveys, 34(1), 1–47. http://doi.org/10.1145/505282.505283
[55] Shrestha, L., & Mckeown, K. (2004). Detection of question-answer pairs in email
conversations. Proceedings of the 20th international conference on Computational Linguistics - COLING '04. http://doi.org/10.3115/1220355.1220483
[56] Yu, Z., Xu, Z., Black, A. W., & Rudnicky, A. (2016). Strategy and Policy Learning for
Non-Task-Oriented Conversational Systems. Proceedings of the 17th Annual Meeting of the
Special Interest Group on Discourse and Dialogue. http://doi.org/10.18653/v1/w16-3649
[57] Singh, S., Kearns, M., Litman, D., & Walker, M. (2000). Reinforcement Learning for
Spoken Dialogue Systems. Advances in Neural Information Processing Systems, 956–962.
Retrieved from https://papers.nips.cc/paper/1775-reinforcement-learning-for-spoken-dialogue-systems.pdf
[58] Hiraoka, T., Neubig, G., Yoshino, K., Toda, T., & Nakamura, S. (2016). Active Learning
for Example-Based Dialog Systems. Lecture Notes in Electrical Engineering Dialogues with
Social Robots, 67–78. http://doi.org/10.1007/978-981-10-2585-3_5
[59] Abella, A., Brown, M. K., & Buntschuh, B. (1997). Development principles for dialog-based interfaces. Dialogue Processing in Spoken Language Systems Lecture Notes in Computer
Science, 141–155. http://doi.org/10.1007/3-540-63175-5_43
[60] Pereira, M., & Coheur, L. (2013). Just.Chat - a platform for processing information to be
used in chatbots. Retrieved from
https://fenix.tecnico.ulisboa.pt/downloadFile/395145485809/ExtendedAbstract.pdf
[61] Hung, V., Gonzalez, A., & Demara, R. (2009). Towards a Context-Based Dialog
Management Layer for Expert Systems. 2009 International Conference on Information, Process,
and Knowledge Management. http://doi.org/10.1109/eknow.2009.10
[62] Galvão, A. M., Barros, F. A., Neves, A. M. M., & Ramalho, G. L. (2004). Adding
Personality to Chatterbots Using the Persona-AIML Architecture. Advances in Artificial
Intelligence – IBERAMIA 2004 Lecture Notes in Computer Science, 963–973.
http://doi.org/10.1007/978-3-540-30498-2_96
[63] Kshirsagar, S., & Magnenat-Thalmann, N. (2002). Virtual humans personified. Proceedings
of the first international joint conference on Autonomous agents and multiagent systems part 1 - AAMAS '02. http://doi.org/10.1145/544818.544826
[64] Ball, G., & Breese, J. (2000). Emotion and personality in a conversational agent. In
Embodied conversational agents (pp. 189–219). Cambridge: MIT Press.
[65] Rashad, M. Z., El-Bakry, H. M., & Isma'il, I. R. (2010). An Overview of Text-To-Speech
Synthesis Techniques. CIT'10 Proceedings of the 4th international conference on
Communications and information technology, 84–89. Retrieved from
https://pdfs.semanticscholar.org/5ee7/c71a362cc01441c27ced052a70ee6e0732dc.pdf
[66] Che, H., Tao, J., & Pan, S. (2012). Letter-to-sound conversion using coupled Hidden
Markov Models for lexicon compression. 2012 International Conference on Speech Database
and Assessments. http://doi.org/10.1109/icsda.2012.6422464
[67] Sejnowski, T. J., & Rosenberg, C. R. (1988). NETtalk: a parallel network that learns to read
aloud. In Neurocomputing: foundations of research (pp. 661–672). Cambridge, MA: MIT
Press.
[68] Ferrucci, D. (2010). Building Watson: An Overview of the DeepQA Project. Association for
the Advancement of Artificial Intelligence, 31(3), 59–79. Retrieved from
http://www.aaai.org/ojs/index.php/aimagazine/article/view/2303/2165
[69] Mccord, M. C., Murdock, J. W., & Boguraev, B. K. (2012). Deep parsing in Watson. IBM
Journal of Research and Development, 56(3.4). http://doi.org/10.1147/jrd.2012.2185409
[70] Middleton, C., & Baeza-yates, R. A Comparison of Open Source Search Engines. Retrieved
from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.6955&rep=rep1&type=pdf
[71] Indri Query Language Quick Reference. Retrieved from
https://www.lemurproject.org/lemur/IndriQueryLanguage.php
[72] Apache Lucene Core. Retrieved from https://lucene.apache.org/core/
[73] Secure Messaging Scorecard. (2016, September 28). Retrieved from
https://www.eff.org/node/82654
[74] Yuan, M. (2016, April 29). Chatbots made for healthcare. Retrieved from
https://tincture.io/chatbots-made-for-healthcare-fec631bc8462#.5bmzw2edo
[75] Why you shouldn’t talk to your chatbot about everything. Retrieved from
http://venturebeat.com/2016/11/17/why-you-shouldnt-talk-to-your-chatbot-about-everything/
[76] Your Security Resource. Retrieved from
http://www.yoursecurityresource.com/expertqa/how-private-is-new-amazon-echo/
[77] Narang, S. (2014, August 4). Are You Talking to a Chatbot? How Spam Chatbots Will Get
Smarter. Retrieved from https://www.linkedin.com/pulse/20140804184616-7077712-are-you-talking-to-a-chatbot-how-spam-chatbots-will-get-smarter
[78] Lauinger, T., Pankakoski, V., Balzarotti, D., & Kirda, E. (2010). Honeybot, Your Man in
the Middle for Automated Social Engineering. LEET'10 Proceedings of the 3rd USENIX
conference on Large-Scale exploits and emergent threats: botnets, spyware, worms, and more,
11–11. Retrieved from https://www.sba-research.org/wp-content/uploads/publications/autosoc-leet2010.pdf
[79] O'Boyle, B. (2015, October 12). What is Siri? Apple's personal voice assistant explained.
Retrieved from http://www.pocket-lint.com/news/112346-what-is-siri-apple-s-personal-voice-assistant-explained
[80] Ravenscraft, E. (2015, August 3). Everything You Can Ask Cortana to Do in Windows 10.
Retrieved from http://lifehacker.com/everything-you-can-ask-cortana-to-do-in-windows-10-1721725525
[81] Betters, E. (2017, March 2). What is Google Assistant, how does it work, and which devices
offer it? Retrieved from http://www.pocket-lint.com/news/137722-what-is-google-assistant-how-does-it-work-and-which-devices-offer-it
[82] O'Boyle, B. (2016, December 26). Amazon Echo: What can Alexa do and what services
are ... Retrieved from http://www.pocket-lint.com/news/138846-amazon-echo-what-can-alexa-do-and-what-services-are-compatible
[83] Instalocate. Retrieved from https://www.messenger.com/t/instalocate/
[84] Foxsybot. Retrieved from https://www.messenger.com/t/foxsybot
[85] MeditateBot. Retrieved from https://botlist.co/bots/2021-meditatebot
[86] Peaceful Habit. Retrieved from https://botlist.co/bots/1381-peaceful-habit
[87] GymBot. Retrieved from https://gymbot.io/
[88] Top 5 Bots To Get You Fit. (2016, October 22). Retrieved from
http://www.topbots.com/top-5-best-fitness-bots-fitness-apps/
[89] Poncho. Retrieved from https://poncho.is/
[90] The Weather Channel Launches Bot For Facebook Messenger, Powered By IBM Watson.
Retrieved from http://www.theweathercompany.com/newsroom/2016/10/25/weather-channel-launches-bot-facebook-messenger-powered-ibm-watson
[91] 12 Weather bots for Facebook Messenger, Slack, Skype and Telegram. Retrieved from
https://chatbottle.co/bots/tagged/weather
[92] Vincent, J. (2016, October 11). Baidu launches medical chatbot to help Chinese doctors
diagnose patients. Retrieved from http://www.theverge.com/2016/10/11/13240434/baidu-medical-chatbot-china-melody
[93] How To Build Bots for Messenger. Retrieved from
https://developers.facebook.com/blog/post/2016/04/12/bots-for-messenger/
[94] Bot Shop. Retrieved from https://bots.kik.com/#/
[95] Bot Users. Retrieved from https://api.slack.com/bot-users
[96] What are Skype bots and how do I add them as contacts? Retrieved from
https://support.skype.com/en/faq/FA34646/what-are-skype-bots-and-how-do-i-add-them-as-contacts
[97] How WeChat bots are running amok. (2017, January 18). Retrieved from
https://venturebeat.com/2017/01/18/how-wechat-bots-are-running-amok/
[98] LINE Developers. Retrieved from https://developers.line.me/bot-api/overview
[99] Telegram Bot Platform. (2015, June 24). Retrieved from https://telegram.org/blog/bot-revolution
[100] Constine, J., & Perez, S. (2016, September 12). Facebook Messenger now allows payments
in its 30,000 chat bots. Retrieved from https://techcrunch.com/2016/09/12/messenger-bot-payments/
[101] Matney, L. (2016, August 03). Kik users have exchanged over 1.8 billion messages with
the platform’s 20,000 chatbots. Retrieved from https://techcrunch.com/2016/08/03/kik-users-have-exchanged-over-1-8-billion-messages-with-the-platforms-20000-chatbots/
[102] Kuligowska, K. (2015). Commercial Chatbot: Performance Evaluation, Usability Metrics
and Quality Standards of Embodied Conversational Agents. Professionals Center for Business
Research, 2(02), 1-16. doi:10.18483/pcbr.22
[103] Vugt, H. C., Bailenson, J. N., Hoorn, J. F., & Konijn, E. A. (2010). Effects of facial
similarity on user responses to embodied agents. ACM Transactions on Computer-Human
Interaction, 17(2), 1-27. doi:10.1145/1746259.1746261