1 Introduction

In recent years, dialog systems have become ubiquitous in various aspects of our lives and have gained increasing attention [1]. Dialog systems can be broadly classified as open-domain and task-oriented. Open-domain systems can converse on various topics without domain restrictions, while task-oriented systems assist users in completing specific tasks in vertical fields to reduce manual workload. Leading travel industry companies such as Expedia.com, KLM, and Booking.com are racing to introduce their own online chatbots. Researchers and industry experts agree that a strong and effective task-oriented dialog system can greatly improve the user experience.

Although dialog systems in the travel domain offer significant business potential, building a satisfactory dialog system that can fully serve the domain remains an arduous task. The tourism dialog system faces the following challenges.

  1. Multimodal and fine-grained understanding for tourism. Parsing the user’s input and obtaining semantic information are essential requirements for any dialog system. However, in the field of tourism, most research has overlooked the visual information in images. Visual input of attractions can help a dialog system identify the intent of a person who lacks sufficient knowledge to describe their desired attractions in words. This requires multimodal understanding that extends beyond natural language understanding. Furthermore, traditional natural language understanding parses a user’s natural language input into predefined semantic slots, which may vary across domains. In tourism, users may pay particular attention to specific aspects such as price, distance, and attraction type. Mining domain terms and building fine-grained semantic slots automatically is another challenge. Both multimodal information and fine-grained slots can enhance the system.

  2. Travel route recommendation. The recommendation task in the tourism domain differs from that in other domains. Rather than a single item, users seek a satisfactory travel route consisting of multiple target attractions through multiple rounds of dialog interaction. This requires the dialog system to fully consider the explicit and implicit constraints expressed during the dialog and recommend suitable attractions. The dialog system needs sufficient background knowledge to assist with these tasks.

To address the above problems, we propose using the Multimodal Chinese Tourism Knowledge Graph (MCTKG) as an external knowledge supplement to construct a multimodal dialog system for the tourism domain. In general, MCTKG extracts a large number of attraction entities, attributes, and images from various heterogeneous data sources, such as visitbeijing.com.cn, Baidu Encyclopedia, and tripadvisor.com. The main contributions of our work are as follows.

First, we improve the conventional understanding module in three ways, providing a comprehensive understanding of multimodal and domain-specific inputs. (1) To facilitate visual comprehension, we introduce pre-trained visual models and define image-related user intents. (2) By analyzing the attribute values of entities extracted from MCTKG, we construct a fine-grained slot ontology for the tourism domain to acquire more comprehensive semantic information. (3) To recognize explicit or implicit user requirements, we add the ability to identify domain-related constraints to the dialog system.

Second, we leverage the multimodal information and the knowledge graph to improve travel route recommendation. We use an automatic route generation algorithm to generate initial travel routes from the user’s utterances, which are further refined by visual information. The recommendation module uses the user’s historical dialog information to identify the most similar candidate entities in the knowledge graph, which are then filtered using explicit and implicit constraints. Users can also modify the initial route based on the recommended results. During the dialog, the system provides information assistance to the user through MCTKG.

2 Related Work

2.1 Task-Oriented Dialog Systems

Task-oriented dialog system design has long been a central concern for the research community and industry. Such systems have achieved clear success in enterprise commerce and education and have significantly reduced manual workload. Most traditional task-oriented dialog systems are text-based and modular [14, 25], comprising the following components: (1) natural language understanding, which usually models intent extraction and slot filling jointly [13, 16]; (2) dialog state tracking, which tracks the user’s goals and constraints during the dialog and determines the values of the predefined slots; (3) a policy network, which receives information from the dialog state tracking module and decides what action to take next; (4) a Natural Language Generation (NLG) model, which converts the action chosen by the policy network into natural language output, using either predefined sentence templates [6] or generation-based methods [4]. In addition, with the development of deep learning, some end-to-end task-oriented dialog systems have emerged [18, 23].

Recently, some dialog systems have introduced external knowledge bases to better answer user’s questions and significantly improve the system’s performance [9, 27]. In regard to complex tasks with multimodal information, these text-based task-oriented systems are greatly restricted and cannot effectively meet user needs.

The demand for multimodal travel dialog systems is increasing due to the rich visual semantics in attraction images. However, research remains limited because large-scale multimodal dialog datasets are scarce. Saha et al. [17] constructed a multimodal dialog (MMD) dataset. Later, Liao et al. [11] proposed a knowledge-aware multimodal dialog (KMD) model. Nevertheless, pertinent large-scale multimodal dialog datasets are still lacking in the tourism field. We therefore adopt a modular design method to build the travel dialog system.

Fig. 1. Framework of multimodal travel dialog system based on modular design.

2.2 Conversational Recommendation

The goal of a recommendation system is to select, from the entire collection of items, a subset that meets the user’s needs. In particular, knowledge graphs are widely used to improve recommendation performance and interpretability [8, 21]. A conversational recommendation system provides a more natural method of interaction for recommendation services. Intuitively, integrating recommendation technology into the dialog system will significantly enhance the user’s experience, especially in the tourism industry. For dialog systems, introducing recommendation technology can better meet the user’s needs, for example by recommending suitable tourist attractions. For recommendation systems, relying on historical dialog information instead of historical interaction data can effectively alleviate the cold start problem [28].

In our work, we make full use of text and image data, and introduce knowledge graphs as external knowledge supplements. First, we use an automatic route generation algorithm to generate the initial travel route. Then the recommendation module and the information assistant module help users to choose their favorite attractions to complete route planning tasks.

3 Multimodal Travel Dialog System

As shown in Fig. 1, our framework has four modules. The multimodal understanding module recognizes user’s intent and fills semantic slots. The dialog management module stores historical conversation information and decides which actions to take. The route planning module is responsible for helping users solve some travel-related tasks. The response generation module generates text or image responses. MCTKG serves as an external knowledge base to improve the performance of the entire dialog system.

3.1 Multimodal Chinese Tourism Knowledge Graph

The Multimodal Chinese Tourism Knowledge Graph (MCTKG) [24] obtains and integrates heterogeneous data on Beijing tourist attractions from major tourism websites, including visitbeijing.com.cn, Baidu Encyclopedia, tripadvisor.com, meituan.com, etc. MCTKG constructs an ontology classification and organizes concepts in a hierarchical structure. It contains nine top concepts, each of which is further divided into a set of low-level concepts. For example, Attraction can be further divided into natural landscape, park, religious places, etc. Table 1 shows the number of instances, image links, properties, and triples of each top concept.

Table 1. Statistics of each concept

3.2 Multimodal Understanding

The correct understanding of user’s intention is an integral part of the dialog system. As a multimodal dialog system, the user’s input may contain text, images or both. Given a multimodal utterance u, the multimodal utterance understanding module should map u to the user’s intention and the corresponding slot value. This process is followed by generating the corresponding representation, denoted as U, for the multimodal utterance:

$$\begin{aligned} U = {<}I,A,C,V{>} \end{aligned}$$
(1)

where I denotes the user’s intention of the utterance (Table 2 shows some intentions), A denotes the attributes or slot values extracted from u, C denotes constraints contained in u, and V indicates the visual representation of the images in u.
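As an illustration of Eq. (1), the utterance representation can be held in a small record type. This is a minimal sketch, not the authors' implementation: the class name `Utterance`, the field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Utterance:
    """Hypothetical container for U = <I, A, C, V> of one user turn."""
    intent: str                                                  # I: recognized intention
    attributes: Dict[str, str] = field(default_factory=dict)     # A: slot values
    constraints: Dict[str, str] = field(default_factory=dict)    # C: constraints
    visual: Optional[List[float]] = None                         # V: image feature, if any


# A text-only turn: no image was supplied, so V stays empty.
u = Utterance(
    intent="Recommendation",
    attributes={"attraction type": "park"},
    constraints={"price": "<= 50"},
)
```

Keeping V optional reflects that the user's input may contain text, images, or both.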

Slot Ontology Construction. Developing structured ontologies for language understanding is challenging in the cold start phase because domain experts must define the slots and their possible values. We extend the traditional slot ontology to make the dialog system more applicable to the tourism field.

First, we extract the different attribute names {\(a_1\),\(\cdot \cdot \cdot \), \(a_n\)} of all entities in MCTKG, and the number of attributes is 8,883. Our experience has shown that attributes sharing a common suffix typically have similar properties. We merged the attributes with the same suffix into the same set, treating the suffix as the central word of all the attributes in the set. Finally we obtained 4,490 central words, and recorded the number of attributes in the corresponding set.

Second, we analyzed central words with set length less than k to determine important tourism attributes. We found that most central words have set length less than 10, so k was set to 15, resulting in identification of 240 central words. After manual review, we merged identical attributes and obtained 548 attributes to expand slot ontology, including Opening hours, Floor space, and uncommon words such as Snack provided, Premium Steak for attractions.
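The suffix-merging step can be sketched as follows. The text does not say how a suffix is determined, so this sketch assumes a fixed suffix length of two characters; the function name `group_by_suffix` and the sample attribute names are hypothetical.

```python
from collections import defaultdict


def group_by_suffix(attribute_names, suffix_len=2):
    """Group attribute names by their last `suffix_len` characters.

    The shared suffix plays the role of the central word of each set.
    A fixed suffix length is an assumption made for this sketch.
    """
    groups = defaultdict(set)
    for name in attribute_names:
        groups[name[-suffix_len:]].add(name)
    return dict(groups)


# Hypothetical attribute names: opening/business/visiting time, ticket/reference price, address.
names = ["开放时间", "营业时间", "游玩时间", "门票价格", "参考价格", "地址"]
groups = group_by_suffix(names)

# Set length recorded per central word, as described in the text.
set_sizes = {word: len(members) for word, members in groups.items()}
```

The recorded set lengths are what the threshold k is then applied to.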

Table 2. Intention types and examples

Intent Classification and Slot Filling. Intent classification is a classification problem and slot filling is a sequence labeling task. Recent studies have highlighted the effectiveness of joint learning methods in addressing both tasks simultaneously [5]. We adopt a joint intent classification and slot filling model based on Bidirectional Encoder Representations from Transformers (BERT), as in [2]. Additionally, we use large-scale tourism-specific text data to fine-tune a BERT model for the tourism domain. During fine-tuning, for intent classification, we feed the output corresponding to the [CLS] token into a softmax classifier:

$$\begin{aligned} y^i = softmax(W^ih_1 + b^i) \end{aligned}$$
(2)

where \(h_1\) is the output corresponding to the special identifier [CLS], \(W^i\) and \(b^i\) represent the weight coefficient and bias in the feedforward neural network respectively. For slot filling, the other hidden states {\(h_2\),\(\cdot \cdot \cdot \), \(h_N\)} are used to feed into the feedforward neural network, and passed through the softmax classifier to obtain the probability distribution of each token on the slot label:

$$\begin{aligned} y^s_n = softmax(W^sh_n + b^s), \quad n \in \{2,\cdots ,N\} \end{aligned}$$
(3)

To model the two tasks of intent classification and slot filling together, the training objectives of the model can be formalized as follows:

$$\begin{aligned} p(y^i,y^s|x) = p(y^i|x)\prod _{n=2}^Np(y_n^s|x) \end{aligned}$$
(4)

The training goal of the model is to maximize the conditional probability \(p(y^i,y^s|x)\) in the above formula, and the model is fine-tuned end-to-end by minimizing the cross-entropy loss function.
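To make the joint objective concrete, the sketch below evaluates the negative log of the factorized probability in Eq. (4) from raw logits, using the softmax of Eqs. (2) and (3). It is a pure-Python illustration under simplifying assumptions: the logits stand in for \(W h + b\), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_nll(intent_logits, intent_label, slot_logits, slot_labels):
    """Negative log of p(y^i|x) * prod_n p(y^s_n|x).

    Minimizing this sum of cross-entropy terms is equivalent to
    maximizing the joint conditional probability.
    """
    loss = -math.log(softmax(intent_logits)[intent_label])
    for logits, label in zip(slot_logits, slot_labels):
        loss -= math.log(softmax(logits)[label])
    return loss


# Two intents, two tokens with four slot labels each, all logits equal:
# every distribution is uniform, so the loss is log(2) + 2 * log(4).
loss = joint_nll([0.0, 0.0], 0, [[0.0] * 4, [0.0] * 4], [1, 3])
```

Because the intent term and the slot terms are simply added, one backward pass updates the shared encoder for both tasks.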

Constraints Acquisition. Constraints play a crucial role as they allow the system to quickly filter out a large number of potential recommendations. Our system presently incorporates location, price, attraction level, attraction type, distance, time, and travel style as constraints. We employed two methods to get user constraints from user’s input: option boxes for fixed categories, and regular expressions or template matching for numeric constraints like price or time. The system ultimately filters out unqualified attractions based on the given constraints.
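A minimal sketch of regular-expression matching for numeric constraints, followed by filtering, is given below. The patterns, the English phrasing, the constraint keys `max_price` and `max_hours`, and the attraction records are all hypothetical; the deployed system handles Chinese input and its actual templates are not given in the text.

```python
import re

# Hypothetical patterns for illustration only.
PRICE_PATTERN = re.compile(r"(?:under|below|less than)\s*(\d+)\s*(?:yuan|rmb)", re.IGNORECASE)
TIME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:hours?|hrs?)", re.IGNORECASE)


def extract_constraints(text):
    """Pull numeric price and time constraints out of a user utterance."""
    constraints = {}
    match = PRICE_PATTERN.search(text)
    if match:
        constraints["max_price"] = int(match.group(1))
    match = TIME_PATTERN.search(text)
    if match:
        constraints["max_hours"] = float(match.group(1))
    return constraints


def filter_attractions(attractions, constraints):
    """Drop attractions that violate any extracted constraint."""
    kept = []
    for item in attractions:
        if "max_price" in constraints and item["price"] > constraints["max_price"]:
            continue
        if "max_hours" in constraints and item["hours"] > constraints["max_hours"]:
            continue
        kept.append(item)
    return kept


constraints = extract_constraints("I want a park under 50 yuan that takes 3 hours")
candidates = [
    {"name": "A", "price": 30, "hours": 2},
    {"name": "B", "price": 80, "hours": 2},
    {"name": "C", "price": 40, "hours": 5},
]
remaining = filter_attractions(candidates, constraints)
```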

Visual Extraction. Since every entity in MCTKG is accompanied by a precise multilevel classification and high-quality images, the user’s image inputs can be used effectively for location recognition and image-based attraction recommendation. First, to speed up computation, we use ResNet [7] in advance to extract the visual features \(V_{all}\) = {\(v_1\),\(\cdot \cdot \cdot \), \(v_n\)} of the existing pictures in the graph and save them offline, where n is the number of images. When the user’s input \(u_i\) contains an image, we use ResNet to extract its visual feature \(V_i\) and use cosine similarity to select the 5 images closest to \(V_i\). If \(I_i\) = Location Recognition, the system selects the image with the highest score and its corresponding entities as the response. If \(I_i\) = Image Recommend, the system selects the 5 images together with their corresponding entities and entity categories as the response. Users can freely choose the type of attractions they are interested in for further inquiry. Moreover, visual features are also used later in the recommendation module.
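The nearest-image lookup can be sketched with plain cosine similarity over pre-computed feature vectors. The two-dimensional vectors and image identifiers below are toy stand-ins for ResNet features, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(query, features, k=5):
    """Return the k (image_id, score) pairs most similar to the query feature."""
    scored = [(cosine(query, vec), image_id) for image_id, vec in features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:k]]


# Toy offline feature store standing in for V_all.
features = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [1.0, 1.0]}
nearest = top_k_similar([1.0, 0.1], features, k=2)
```

For Location Recognition only the first result would be used, while Image Recommend would return the whole list.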

3.3 Travel Route Recommendation

The route planning module consists of three parts: initial route generation, attraction recommendation and information assistance. The initial route generation part obtains the user’s initial needs and uses the automatic route generation algorithm to plan the initial route for the user.

The attraction recommendation module leverages contextual dialogue information to suggest relevant attractions to users. We also use a multimodal approach combining textual and image information to improve recommendation effectiveness.

The information assistant incorporates a knowledge graph to provide users with a comprehensive understanding of attractions and employs diverse query methods to offer informed responses. Users can modify their initial travel route based on recommendations and relevant information provided by the assistant.

Initial Route Generation. To fulfill the user’s requirements and generate a customized travel route, we predefine four initial needs slots: travel style, starting location, expected duration of stay, and attraction preferences. Our system employs active questioning to gather the necessary value for each slot. The automatic route generation algorithm then creates the initial travel route based on the user’s input. The pseudocode of the algorithm is shown in Algorithm 1.

Algorithm 1. Automatic generation algorithm of initial routes
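The active questioning step that fills the four initial needs slots before route generation runs could look like the following sketch. It covers only slot collection, not the route generation algorithm itself; the slot keys, the question wording (apart from the attraction-type question quoted later in the paper), and the function name are hypothetical.

```python
# The four initial needs slots named in the text, in an assumed asking order.
NEED_SLOTS = ["travel_style", "start_location", "duration", "attraction_preferences"]

# Hypothetical question templates, one per slot.
QUESTIONS = {
    "travel_style": "What travel style do you prefer?",
    "start_location": "Where will you start from?",
    "duration": "How long do you plan to stay?",
    "attraction_preferences": "What type of attractions do you want to visit?",
}


def next_question(filled):
    """Return the question for the first empty slot, or None when all are filled."""
    for slot in NEED_SLOTS:
        if not filled.get(slot):
            return QUESTIONS[slot]
    return None


partial = {"travel_style": "relaxed", "start_location": "Beijing Railway Station"}
question = next_question(partial)
```

Once `next_question` returns `None`, all four values are available as input to route generation.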

Attractions Recommendation. The traditional search method of constructing a query for recommendation is limited by natural language expression. To make the interaction between users and the dialog system more natural and to overcome this limitation, we introduce conversational recommendation techniques [28] to improve recommendation quality. Given an attraction entity \(e_n\), \(e_n^{text}\) represents its valuable text content, and \(e_n^{img}\) represents its corresponding picture. We use BERT and ResNet to extract its text feature \(t_n\) and image feature \(v_n\) respectively, and introduce an MFB layer, whose effectiveness in combining multimodal features has been proven in many visual tasks [26], to obtain their fusion representation \(f_n\).

$$\begin{aligned} t_n = BERT(e_n^{text}) \end{aligned}$$
(5)
$$\begin{aligned} v_n = ResNet(e_n^{img}) \end{aligned}$$
(6)
$$\begin{aligned} f_n = SumPooling(M_1^Tt_n \circ M_2^Tv_n, k) \end{aligned}$$
(7)

where \(M_1^T\) and \(M_2^T\) are two transform matrices used to map \(t_n\) and \(v_n\) to the common high-dimensional space, and \(\circ \) denotes the element-wise product. For the user’s input \(U = {<}I,A,C,V{>}\), when the system detects that the user’s intention is to request a recommendation, it uses the included A, C, and V to make the recommendation. A represents the user’s text content and V represents the user’s image content. We use the same method to fuse the two kinds of content and obtain the user’s fusion representation \(f_{user}\). Then, \(f_{user}\) and \(f_n\) are used to calculate a matching score for ranking. C represents the constraints that filter the recommended candidates.
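Eq. (7) can be traced with a small pure-Python sketch: both features are projected, multiplied element-wise, and then summed over non-overlapping windows of size k. The tiny integer matrices are illustrative only, the helper names are hypothetical, and any normalization applied after pooling is left out because the text does not describe it.

```python
def project(matrix, vector):
    """Compute M^T x, where `matrix` lists the rows of M (one row per input dimension)."""
    n_cols = len(matrix[0])
    return [
        sum(matrix[r][c] * vector[r] for r in range(len(vector)))
        for c in range(n_cols)
    ]


def mfb_fuse(t, v, m1, m2, k):
    """Fuse text feature t and image feature v as SumPooling(M1^T t o M2^T v, k)."""
    projected_t = project(m1, t)
    projected_v = project(m2, v)
    joint = [a * b for a, b in zip(projected_t, projected_v)]
    # Sum pooling over consecutive, non-overlapping windows of length k.
    return [sum(joint[i:i + k]) for i in range(0, len(joint), k)]


t_n = [1, 2]                          # toy text feature
v_n = [3]                             # toy image feature
m1 = [[1, 0, 1, 0], [0, 1, 0, 1]]     # 2 x 4 projection for text
m2 = [[1, 1, 1, 1]]                   # 1 x 4 projection for image
f_n = mfb_fuse(t_n, v_n, m1, m2, k=2)
```

Here the projections are [1, 2, 1, 2] and [3, 3, 3, 3], their product is [3, 6, 3, 6], and pooling with k = 2 gives a fused vector of length 2.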

Information Assistant. Recognizing that users may require additional information about specific attractions before modifying their route, we have developed an information assistant to cater to user’s needs. This assistant incorporates three information retrieval methods, including the following:

  (1) Knowledge graph query based on the SPARQL language. The attraction name and attribute name extracted from the user’s input are mapped to the corresponding SPARQL statement, which is then executed against MCTKG;

  (2) Information query based on open information extraction results. Following the work of Wen [22], we use open relation extraction techniques based on dependency syntactic analysis to generate triples from texts about attractions, such as attraction summaries. We save all triples together with their source text, and when no answer is retrieved from the knowledge graph, we search for the relevant answer in these triples and return it to the user together with the original text;

  (3) Information retrieval (IR) based on BERT. Extensive experiments have shown that BERT also performs well on information retrieval tasks [3]. We format the BERT input as ([CLS] Query [SEP] Paragraph), where Query is the user’s question and Paragraph is a paragraph describing the attractions. The model scores the relevance of each paragraph to the query and selects the paragraph with the highest score as the answer. To improve efficiency and reduce the computational load, we first use the term frequency-inverse document frequency (TF-IDF) algorithm to retrieve the 20 most relevant paragraphs and then use the BERT model to determine the correct answer from those paragraphs.

The three methods are executed sequentially, and if one method retrieves an answer, the subsequent methods are not called.
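The sequential fallback and the TF-IDF shortlist can be sketched as below. This is an illustration under several assumptions: the SPARQL and open-IE lookups are stubs that find nothing, BERT re-ranking is replaced by simply taking the top TF-IDF paragraph, tokenization is whitespace-based (Chinese text would need a word segmenter), and all names and sample paragraphs are hypothetical.

```python
import math
from collections import Counter


def tfidf_shortlist(query, paragraphs, top_n=20):
    """Rank paragraphs by a simple TF-IDF score against the query terms."""
    docs = [p.lower().split() for p in paragraphs]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(query.lower().split())

    def score(doc):
        counts = Counter(doc)
        return sum(
            (counts[t] / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[t]))
            for t in query_terms if t in counts
        )

    ranked = sorted(range(n_docs), key=lambda i: score(docs[i]), reverse=True)
    return [paragraphs[i] for i in ranked[:top_n]]


def first_answer(question, methods):
    """Run the methods in order and stop at the first one that returns an answer."""
    for name, method in methods:
        result = method(question)
        if result:
            return name, result
    return None, None


paragraphs = [
    "the summer palace is a royal garden",
    "the great wall is long",
    "the zoo has pandas",
]
methods = [
    ("sparql", lambda q: None),    # stub: knowledge graph query found nothing
    ("open_ie", lambda q: None),   # stub: no matching extracted triple
    ("ir", lambda q: tfidf_shortlist(q, paragraphs, top_n=1)[0]),
]
source, reply = first_answer("summer palace garden", methods)
```

Ordering the methods from the most precise to the most permissive means the costlier retrieval step only runs when structured lookups fail.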

3.4 Dialog Management

Given the absence of large-scale dialog datasets for specific tasks in the tourism field, supervised methods are not feasible to train dialog management models. As an alternative, we designed a set of rules based on different dialog scenarios to ensure a smooth dialog process and enhance the dialog system’s performance.

Dialog State Tracking. \(S_t\) represents the dialog state at time t, which is jointly determined by the state \(S_{t-1}\) at the previous time and the current sentence representation \(U_t\).

$$\begin{aligned} S_t = \varUpsilon (S_{t-1}, {<}I_t, A_t, C_t, V_t{>}) \end{aligned}$$
(8)

where \(\varUpsilon \) represents our predefined dialog rules, summarized as follows:

  (1) if \(I_t\) = Information Service, update \(S_t\) based on \(A_t\) and \(V_t\);

  (2) if \(I_t\) = Location Recognition, update \(S_t\) based on \(V_t\);

  (3) if \(I_t\) = Chitchat, \(S_t = S_{t-1}\);

  (4) if \(I_t\) = Add Attraction, update \(S_t\) based on \(A_t\) and \(C_t\);

  (5) if \(I_t\) = Negation, \(S_t\) inherits the information stored in \(S_{t-1}\) and updates the affected parts according to \(U_t\);

  (6) if the user has not responded for more than timelimit minutes, the system clears \(S_t\), where timelimit is a predefined constant.
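The state update rules can be read as a single function from the previous state and the current utterance to the new state. The sketch below is one possible reading, not the authors' code: the state layout, the default `time_limit` of 10 minutes, and the choice to return an empty state after a timeout are assumptions.

```python
import copy

EMPTY_STATE = {"attributes": {}, "constraints": {}, "visual": None}


def update_state(prev_state, intent, attributes=None, constraints=None,
                 visual=None, idle_minutes=0.0, time_limit=10.0):
    """Rule-based dialog state update following rules (1) to (6)."""
    # Rule 6: clear the state after a long silence (time_limit is hypothetical).
    if idle_minutes > time_limit:
        return copy.deepcopy(EMPTY_STATE)

    state = copy.deepcopy(prev_state)
    attributes = attributes or {}
    constraints = constraints or {}

    if intent == "Information Service":        # Rule 1: use A_t and V_t
        state["attributes"].update(attributes)
        if visual is not None:
            state["visual"] = visual
    elif intent == "Location Recognition":     # Rule 2: use V_t
        if visual is not None:
            state["visual"] = visual
    elif intent == "Add Attraction":           # Rule 4: use A_t and C_t
        state["attributes"].update(attributes)
        state["constraints"].update(constraints)
    elif intent == "Negation":                 # Rule 5: inherit, then overwrite
        state["attributes"].update(attributes)
        state["constraints"].update(constraints)
        if visual is not None:
            state["visual"] = visual
    # Rule 3 (Chitchat) and any unlisted intent leave the state unchanged.
    return state


s0 = copy.deepcopy(EMPTY_STATE)
s1 = update_state(s0, "Add Attraction", {"attraction": "Summer Palace"}, {"max_price": 50})
s2 = update_state(s1, "Chitchat")
s3 = update_state(s2, "Negation", constraints={"max_price": 100})
s4 = update_state(s3, "Chitchat", idle_minutes=30)
```

Copying the previous state before each update keeps earlier states intact, which makes the rule set easy to test turn by turn.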

Action Decision. Similar to dialog state tracking, we also use a series of rules to convert the obtained dialog state into the corresponding action.

  (1) Question Answering. It is triggered when \(I_t\) is Information Service and the name of an attraction is detected. The system calls the Attractions Information Assistant to return the query result.

  (2) Attractions Recommendation. It is triggered when \(I_t\) is Recommendation and \(A_t\), \(C_t\), \(V_t\) are not empty. The system calls Attractions Recommendation to return the results.

  (3) Route Planning. It is triggered when \(I_t\) is Route Planning, and there are several situations: (a) If the initial route has not been generated, the system collects information by actively asking questions; (b) If there is an initial route and the user intends to delete or add attractions, the distance and time judgment algorithm is invoked to determine whether the modified route meets the requirements. For example, if the modified route exceeds the constraints set for total tour time or travel distance, the system prompts the user to adjust the constraints or choose different attractions.

  (4) Image Reply. It is triggered when \(I_t\) is Location Recognition or Image Recommend. The system returns the result of location recognition and the corresponding picture.

  (5) Chitchat. It is triggered when \(I_t\) is Chitchat. If the system cannot respond to questions that are unrelated to the task, the conversation may not continue. Therefore, the system replies to chitchat according to a predefined conversation template.

  (6) Route Navigation. It is triggered when \(I_t\) is Affirmation about the final route, and the Baidu Map API is called to generate navigation, including walking, cycling, public transportation, etc.
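The action rules above can be sketched as a simple dispatch function. Several details are assumptions made for illustration: the recommendation rule is read literally (all of \(A_t\), \(C_t\), \(V_t\) non-empty), the two route planning sub-actions are given hypothetical names, and unmatched inputs return `None`.

```python
def decide_action(intent, attributes=None, constraints=None, visual=None,
                  has_initial_route=False):
    """Map the recognized intent and dialog state to a system action name."""
    attributes = attributes or {}
    constraints = constraints or {}

    if intent == "Information Service" and attributes.get("attraction_name"):
        return "Question Answering"
    # Literal reading of the rule: A_t, C_t and V_t must all be non-empty.
    if intent == "Recommendation" and attributes and constraints and visual:
        return "Attractions Recommendation"
    if intent == "Route Planning":
        # (a) no route yet: ask questions; (b) route exists: check the modification.
        return "Check Modified Route" if has_initial_route else "Ask Initial Needs"
    if intent in ("Location Recognition", "Image Recommend"):
        return "Image Reply"
    if intent == "Chitchat":
        return "Chitchat"
    if intent == "Affirmation" and has_initial_route:
        return "Route Navigation"
    return None  # no rule matched; fallback behaviour is not specified in the text


action = decide_action("Information Service", {"attraction_name": "Summer Palace"})
```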

3.5 Response Generation

Response generation is the final step of the entire dialogue process. Our system employs two methods: text reply and picture reply. The text method is further divided into IR-based response generation and template-based response generation.

Text Response Generation. Most of the responses are generated by the template-based method, as it is easier to control. For example, if the detailed action (\(TravelStyle=S\)) is required during initial route generation, the system may respond with the question “What type of attractions do you want to visit?”. When the information retrieval module is invoked, we directly retrieve the paragraph that is most relevant to the user’s query and use it as the response. For the question “颐和园和哪些园林并称为中国四大名园?” (Which other gardens are commonly known as the Four Famous Gardens of China alongside the Summer Palace?), the system’s response may be “颐和园、承德避暑山庄、拙政园、留园并称为中国四大名园.” (The Summer Palace, Chengde Summer Resort, Humble Administration Garden and Liuyuan Garden are known as the four famous gardens in China.).

Image Response Generation. Image response generation is used for location recognition and recommendation. When the user inputs a picture, we typically search for similar pictures in the knowledge graph and return them, or we return the corresponding pictures based on the generated list of recommended attractions.

4 Experiments

4.1 Evaluation of Language Understanding

To have a sufficient amount of data for training our BERT model, we initially use a templated method to create different questions for each intent. Placeholders are used to replace specific positions in the questions. When generating questions, the entities, attributes, or concepts in the MCTKG are used to replace the placeholders. Then, we merge our data with CrossWoz [29] data. CrossWoz is a large-scale Chinese dialog dataset. We divide the dataset into a training set, validation set and test set at ratios of 0.6, 0.2 and 0.2. The evaluation metric of intent classification is accuracy, while the metric for slot filling is F1 score. We employ Adam as our optimizer with a learning rate of \(5\textrm{e}{-5}\) and gradually reduce it to maintain the stability of training. The number of training epochs is 50, and the batch size is 64.
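The template-based data generation and the 0.6/0.2/0.2 split can be sketched as follows. The template strings, placeholder names, and the seeded shuffle are assumptions; the text does not state whether the split is randomized.

```python
import random


def fill_templates(templates, entities, attributes):
    """Instantiate every template with every (entity, attribute) pair."""
    return [
        template.format(entity=entity, attribute=attribute)
        for template in templates
        for entity in entities
        for attribute in attributes
    ]


def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split samples into training, validation and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# Hypothetical template with placeholders filled from knowledge graph entries.
questions = fill_templates(
    ["What is the {attribute} of {entity}?"],
    ["Summer Palace", "Forbidden City"],
    ["opening hours", "ticket price", "address", "level", "floor space"],
)
train, val, test = split_dataset(questions)
```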

We compare against several baselines: Boosting [20], Boosting+Simplified sentences [19], RNN-EM [15], Encoder-labeler Deep LSTM [10], Attn.BiRNN [12], and Slot-Gated [5]. As seen in Table 3, our joint model generally outperforms the independently trained models. Compared to standard RNN models, the BERT model has stronger semantic capture and generalization ability due to its pre-training on large-scale corpora. The joint BERT model significantly outperforms the baseline joint models on both tasks, demonstrating the model’s strong capabilities. Additionally, using a joint model for both intent classification and slot filling simplifies the dialog system, as only one model needs to be trained and deployed.

Table 3. NLU performance
Table 4. Recommendation performance

4.2 Evaluation of Attraction Recommendations

To evaluate the efficacy of our recommendation approach, we retrieved popular tourist attractions and their recommended attractions from professional travel websites and applied our method to generate recommended results for these attractions, which were then compared with the actual recommended attractions. We use hit ratio@5 and hit ratio@10 for evaluation. As shown in Table 4, the results of text recommendation and image recommendation are similar in the field of tourism, as attractions can be related based on their background stories or appearances. Combining both methods can improve recommendation performance. Furthermore, constraints collected in the dialog can help users filter candidate results, improving the recommendation performance of the dialog system.
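The evaluation metric can be computed as in the sketch below, which assumes the common definition of hit ratio@k: the fraction of query attractions for which at least one ground-truth recommendation appears among the top k results. The paper does not spell out its exact definition, and the attraction lists here are made up for illustration.

```python
def hit_ratio_at_k(recommended, relevant, k):
    """Fraction of queries with at least one relevant item in the top-k list."""
    hits = sum(
        1
        for query, ranked in recommended.items()
        if any(item in relevant[query] for item in ranked[:k])
    )
    return hits / len(recommended)


# Hypothetical system output (ranked) and ground truth from a travel website.
recommended = {
    "Summer Palace": ["Old Summer Palace", "Beihai Park", "Beijing Zoo"],
    "Great Wall": ["Ming Tombs", "Beijing Zoo", "Bird's Nest"],
}
relevant = {
    "Summer Palace": {"Beihai Park"},
    "Great Wall": {"Bird's Nest"},
}
hr_at_2 = hit_ratio_at_k(recommended, relevant, k=2)
hr_at_3 = hit_ratio_at_k(recommended, relevant, k=3)
```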

4.3 Evaluation of the Information Assistant

In tourism, providing users with adequate knowledge information is crucial for informed decision-making about attractions. The Information Assistant uses three methods to provide knowledgeable answers. We conducted ablation experiments to analyze the effectiveness of these methods for knowledge questioning. We generated 100 questions from users, used the Information Assistant to provide responses, and evaluated the accuracy of the answers using user feedback.

Table 5 presents the results of the Information Assistant for knowledge question answering. Using only SPARQL for knowledge graph queries performs worst, and combining the three methods performs best, which is in line with our expectations. Open information extraction and information retrieval techniques make full use of the large amount of unstructured text available in the knowledge graph to generate more answers for users. Of these two, information retrieval has the more noticeable impact, as most of the relevant triples are already integrated into the knowledge graph.

Table 5. Ablation study on knowledge question answering
Table 6. Human evaluation results

4.4 Evaluation of the System

We developed a demo of a dialog system for human evaluation. Five participants were recruited to evaluate the system independently, in which the system performed route planning tasks and provided knowledge questions and answers. Participants were asked to rate the conversation’s fluency, informativeness, and planning satisfaction. To reduce personal bias, we administered 50 sets of conversation tests to each participant based on different travel goals. They were asked to rate each quality on a Likert scale ranging from 1 (low quality) to 5 (high quality).

Table 6 presents the results of the human evaluation of the dialog system. Compared to the other aspects, planning satisfaction with the system’s recommendations is rated more favorably. This is probably because the system’s dialog contains a significant amount of informative content from various resources, such as textual information, image data, and external APIs. We use this information to construct a more comprehensive multimodal travel dialog system.

5 Discussion and Conclusions

In this work, our objective was to construct an intelligent dialog system to address diverse user requirements in the field of tourism. To achieve this, we adopted a modular approach and integrated MCTKG as a knowledge supplement in the system. Our experiments indicate that the modular construction approach mitigates the scarcity of multimodal dialog datasets in the tourism domain. Furthermore, the dialog process can be effectively controlled through a set of rules. Additionally, incorporating image information and recommendation technology has substantially enhanced the dialog system’s performance. Our framework can be used as a template to build new dialog systems in other domains using a modular design approach.

In future work, we plan to construct a large-scale multimodal dialog dataset in the tourism field and develop an end-to-end multimodal dialog system. The lack of large-scale domain-specific conversation datasets makes it difficult for deep learning models to improve the dialog management and response generation modules. Additionally, enhancing the interpretability of recommended results is an important task that we aim to focus on.