
1 Introduction

Category systems play a crucial role in building knowledge bases, and the constant emergence of new entities leads to the need for constructing new categories. In contrast to coarse-grained categories, we focus on fine-grained categories, which have longer and more complex names (such as Minesweepers of the United Kingdom vs. Minesweeper). These categories reflect human classifications and entity relations, and have irreplaceable value for various tasks, including knowledge acquisition [19] and entity set expansion [34]. However, manually updating a category system is costly for knowledge editors. There is therefore a clear need for automatic generation of category suggestions.

The fine-grained category generation task aims to produce category suggestions for a set of entities along with their surrounding context. Currently, a large portion of relevant research is based on fine-grained entity typing [5, 7, 21], which assumes that suitable categories already exist. However, this assumption does not always hold true in practical scenarios. The problem of creating new categories for existing category systems has not received sufficient attention.

Inspired by recent Seq2Seq models [13, 23, 32], the task of generating new categories can be viewed as an abstractive summarization task [33]. Figure 1 shows an example of fine-grained category generation. However, this process suffers from three challenges: (1) Noise - Category creation should not be based solely on individual entities; directly feeding raw data into the model introduces much noisy information about individual entities. (2) Drift - Entities and context are fed into the model as a mixed input, which can lead to ineffective utilization of entity features. As a result, the category suggestions may deviate from the entities. (3) Repetition - The model relies on a search algorithm to rank the candidates. Sampling strategies like beam search [24] can result in repetition, where the same category is generated multiple times.

Fig. 1. Given a set of entities along with context (surrounding text), we aim to generate category suggestions. The blue labels indicate the categories of entities in Wikipedia. (Color figure online)

To address the above problem, we design a two-stage category generation framework that includes three new modules:

1) Conceptual Text Mining: This module aims to reduce noise by extracting common features, making it easier for the model to capture the essential parts of the input. The extracted conceptual text is then fed into the Seq2Seq model to generate candidate categories.

2) Prompt Ranking: To address the drift problem caused by entity mixing, we design a prompt ranking module that associates entities with categories. Inspired by the prompt paradigm, we construct patterns and use prompts [4, 15] to rank the categories by perplexity [10]. This helps to ensure that the category suggestions are more closely related to the entities.

3) Ensemble Ranking: This module alleviates repetition problems by integrating multiple models’ parsing abilities. Ensemble ranking [27] combines the results of candidate ranking and prompt ranking, employing a dynamic rule design that preserves diversity while alleviating repetition.

For evaluation, we construct a new dataset from Wikipedia, which consists of 23.8M entities with context, 1.1M categories, and 5.3K fine-grained categories. We compare our method with five abstractive summarization methods. Additionally, we conduct ablation experiments to evaluate the contribution of the three modules.

Contributions. Our contributions include: 1) proposing a two-stage framework that uses a multi-model ensemble for fine-grained category generation; 2) constructing a new dataset based on Wikipedia and conducting five baseline evaluations and ablation experiments; 3) systematically investigating the performance of the Seq2Seq model for the fine-grained category generation task.

Fig. 2. Our framework for fine-grained category generation, which adds three modules: (1) Conceptual Text Mining; (2) Prompt Ranking; (3) Ensemble Ranking. Y is the candidate category, and X is the entity selected after clustering.

2 Problem Formulation

In this section, we give some necessary definitions and then formulate the problem of fine-grained category generation.

Entity. An entity \(e_{i}\) is a term or phrase that refers to a real-world instance, e.g. entity HMS Peterhead (J59) is a minesweeper built for the Royal Navy during the Second World War. A set of entities of the same category constitutes an entity set, denoted as \(E=\left\{ e_{i} \right\} _{i=1}^{n} \).

Context. The context \(a_{i}\) refers to a piece of text that contains the entity \(e_{i}\). A set of context text constitutes a text set, denoted as \(\mathcal {A} =\left\{ a_{i} \right\} _{i=1}^{n} \).

Category. A category \(c_{i}\) is a set of entities with particular shared characteristics. Compared to coarse-grained categories, fine-grained categories usually have longer and more complex names that carry more features, such as people, items, attributes, and semantic relations.

Fine-Grained Category Generation. Given a set of entities E and its contexts \(\mathcal {A}\), this task is to generate a ranked list of fine-grained category suggestions \(\mathcal {C}^{e}\).

3 Method

To generate fine-grained categories, we need to address two crucial problems: 1) How to effectively generate a large number of fine-grained candidate categories (for recall)? 2) How to choose high-quality categories with language model probing (for precision)? As shown in Fig. 2, we introduce our two-stage framework:

Category Generation: To generate a large number of candidate categories, we first filter data by extracting conceptual texts among context \(\mathcal {A}\) and then employ the Seq2Seq language model to generate a candidate list \(\mathcal {C}\).

Category Selection: To select high-quality categories from the candidate list, we design discrete templates to obtain the prompt ranking \(\mathcal {C}^{p}\). These results are further combined by ensemble ranking to produce the final results \(\mathcal {C}^{e}\).

3.1 Category Generation

The category generation stage can be considered a type of text summarization. Recent studies have shown that proper extraction of input texts can greatly enhance the final generation results [17]. Therefore, we use a pre-processing technique called conceptual text mining to filter the input texts. The resulting conceptual text is then fed into a Seq2Seq model to generate the categories.

Conceptual Text Mining. This pre-processing module aims to reduce noise in the contextual text. The given context set \(\mathcal {A}\) may contain descriptive text that pertains only to specific entities, which can introduce noise when generating categories. Meanwhile, the large amount of texts may exceed the input capacity of existing language models. To address these issues, we propose a method for filtering out low-value text using unsupervised metrics.

We follow the idea of Information Density [1], an evaluation metric from pedagogy, to preserve the text \(\mathcal {T}\) with valuable information. Specifically, we count all the keywords in the context set \(\mathcal {A}\) and retain only the sentences that contain high-frequency keywords, which can be formally presented by the following metric:

$$\begin{aligned} {\mathcal {T}=\left( \sum s_{i, j}\mid k_{l} \in s_{i, j} , f_{k_{l},\mathcal {A}}> \varTheta \right) }, \end{aligned}$$
(1)

where \(k_{l}\) is a keyword of \(\mathcal {A}\), \(s_{i, j}\) denotes the j-th sentence in the i-th context, and \(k_{l} \in s_{i, j}\) means that \(k_{l}\) occurs in the sentence \(s_{i, j}\). \(\varTheta \) is a predefined threshold on keyword occurrence frequency used for filtering, and \(f_{k_{l},\mathcal {A}}\) denotes the frequency of keyword \(k_{l}\) in \(\mathcal {A}\). In particular, we conduct this conceptual text mining stage in two steps:

1) Keyword extraction: Given a set of context text \(\mathcal {A}\), we employ a keyword extraction algorithm, such as TF-IDF [25], to extract the keywords of each context. Then we count the number of occurrences for each keyword and filter out the ones that appear less frequently.

2) Sentence search: We search the given context texts for sentences that contain the retained keywords. We merge these sentences into a text \(\mathcal {T}\), which we call the conceptual text.
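The two steps above can be realized with standard tooling. Below is a minimal sketch, assuming scikit-learn's TF-IDF implementation; the per-context top-k cutoff, the frequency threshold, and the sentence splitter are illustrative choices rather than the paper's exact configuration.

```python
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def mine_conceptual_text(contexts, top_k=10, theta=3):
    """Extract the conceptual text T from a list of context strings."""
    # 1) Keyword extraction: take the top-k TF-IDF terms of each context,
    #    then keep only keywords whose frequency across contexts exceeds theta.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(contexts)
    vocab = vectorizer.get_feature_names_out()

    counts = Counter()
    for row in tfidf:
        top = row.toarray().ravel().argsort()[::-1][:top_k]
        counts.update(vocab[i] for i in top)
    keywords = {w for w, f in counts.items() if f > theta}

    # 2) Sentence search: keep sentences mentioning a high-frequency keyword
    #    and merge them into the conceptual text T.
    kept = []
    for context in contexts:
        for sentence in re.split(r"(?<=[.!?])\s+", context):
            tokens = set(re.findall(r"[a-z]+", sentence.lower()))
            if tokens & keywords:
                kept.append(sentence)
    return " ".join(kept)
```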

Candidate Generation. In the task of summarization, the dataset typically consists of (document, summary) pairs, while for category generation training, we use a dataset where each example (x, y) is a (conceptual text \(\mathcal {T}\), category) pair. During the Seq2Seq fine-tuning process, the model is trained using the standard cross-entropy loss [30]:

$$\begin{aligned} L_{Data} = - \sum _{t=1}^{T} \log p(y_{t} \mid y_{<t}, x), \end{aligned}$$
(2)

where T is the target sequence length and p is the model’s predicted probability for the correct word.
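As a concrete illustration, the loss in Eq. (2) is what an off-the-shelf encoder-decoder returns when target labels are supplied. The following is a minimal fine-tuning sketch, assuming a HuggingFace Bart checkpoint; the model name, learning rate, and example pair are placeholders rather than the paper's settings.

```python
# Minimal fine-tuning sketch on one (conceptual text, category) pair.
# Passing `labels` makes the model return the token-level cross-entropy
# loss of Eq. (2).
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

conceptual_text = "HMS Peterhead (J59) was a minesweeper built for the Royal Navy ..."
category = "Minesweepers of the United Kingdom"

inputs = tokenizer(conceptual_text, return_tensors="pt", truncation=True)
labels = tokenizer(text_target=category, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss   # cross-entropy over target tokens
loss.backward()
optimizer.step()
```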

For the generation of candidate categories, we employ beam search as the search algorithm, which is used to find the most probable textual sequence in the set of candidate sequences. In the algorithm, each candidate sequence has a corresponding probability value, referred to as the “confidence” of the sequence. Specifically, given an input sequence x and a target sequence y, the sequence confidence s is defined as:

$$\begin{aligned} s = \log P(y\mid x) = \sum _{t=1}^{T} \log p(y^{<t>}\mid x, y^{<1>}, \ldots , y^{<t-1>}), \end{aligned}$$
(3)

where T is the target sequence length and \(y^{<t>}\) is the candidate category’s t-th word. In this stage, we record the confidence s of all candidate categories \(\mathcal {C} =\left\{ c_{i} \right\} _{i=1}^{n} \) for use in the final ensemble ranking.
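Continuing the sketch above, beam search decoding with HuggingFace's `generate` can return both the candidate categories and their sequence scores, which we take as the confidence s of Eq. (3); note that the library's scores are length-normalized log-probabilities. The beam size and length settings below are illustrative.

```python
# Candidate generation sketch (reusing `model`, `tokenizer`, `inputs` above):
# beam search over the conceptual text, recording each candidate's score.
outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,
    max_length=16,
    no_repeat_ngram_size=3,
    return_dict_in_generate=True,
    output_scores=True,
)
candidates = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)
confidences = outputs.sequences_scores.tolist()  # one (length-normalized) log-prob per candidate
for category, score in sorted(zip(candidates, confidences), key=lambda p: -p[1]):
    print(f"{score:.3f}\t{category}")
```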

3.2 Category Selection

We consider the category selection stage as a reranking process and aim to improve precision. To achieve this, we propose a lightweight approach that leverages the parsing capabilities of multiple models. First, we create a template to rerank category suggestions based on perplexity using another pre-trained model. Next, we use ensemble ranking to combine candidate and prompt ranking results through a special rule design.

Prompt Ranking. This module is designed to address the drift problem where the entity name included in the context is not effectively utilized. We propose a process that involves constructing discrete patterns and computing perplexity for ranking. This helps to ensure that the suggested categories are more closely related to the entities.

  • Entity Clustering: Entity clustering [18] is a key step in our work. The purpose of clustering is to select the most appropriate entity X.

    In particular, we use the K-means [8] algorithm to select X from the entity set E. We randomly initialize K cluster centers, and select the entity closest to the center of the cluster with the maximum cluster distance as the representative entity X.

  • Template: The key to the prompt paradigm is to construct a template that is adapted to the downstream task. Based on the work of [9], we construct five hand-built Hearst templates (as illustrated in Table 1):

    where X is an entity \(e_{i}\) and Y is a category \(c_{i}\) in \(\mathcal {C}\). We manually build templates with the same X and different Y to enhance the connection between entity and category.

  • Perplexity (ppl): Perplexity is one of the most common metrics for evaluating language models. A large pre-trained model can judge sentence fluency from its ppl value. If we have a tokenized sequence \(\mathcal {X} =\left\{ x_{i} \right\} _{i=1}^{n} \), then the perplexity of \(\mathcal {X}\) is:

    $$\begin{aligned} PPL(\mathcal {X}) = \exp \left\{ -\frac{1}{n}\sum _{i=1}^{n}\log p_{\theta }(x_{i}\mid x_{<i} ) \right\} , \end{aligned}$$
    (4)

    where \(\log p_{\theta }(x_{i}\mid x_{<i} ) \) is the log-likelihood of the i-th token conditioned on the preceding tokens \(x_{<i}\).

We feed the template into a pre-trained model and calculate the perplexity for each category Y, which is then used to re-rank the candidate categories and obtain the prompt ranking, denoted as \(\mathcal {C}^{p} =\left\{ c_{i}^{p} \right\} _{i=1}^{n} \). The smaller the perplexity of the template, the higher the category ranks in the prompt ranking.
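A minimal sketch of prompt ranking is given below, assuming GPT-2 as the probing language model and one illustrative "Y such as X" Hearst-style pattern (the actual five templates are listed in Table 1); the representative entity X is assumed to have been chosen by the clustering step.

```python
# Prompt-ranking sketch: fill a Hearst-style template with the representative
# entity X and each candidate category Y, then score it by GPT-2 perplexity
# (Eq. (4)) and rank candidates by ascending perplexity.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm.eval()

def perplexity(text):
    ids = lm_tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean negative log-likelihood
    return torch.exp(loss).item()

def prompt_rank(entity_x, candidates, template="{y} such as {x}."):
    scored = [(c, perplexity(template.format(x=entity_x, y=c))) for c in candidates]
    return [c for c, _ in sorted(scored, key=lambda pair: pair[1])]  # lower ppl ranks higher
```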

Table 1. Hearst templates.

Ensemble Ranking. This module is designed to alleviate the problem of repetition. In the previous stages, we have already obtained the candidate generation and prompt ranking results. We aim to combine them into the ensemble ranking \(\mathcal {C}^{e} =\left\{ c_{i}^{e} \right\} _{i=1}^{n} \), where \(c_{i}^{e}\) represents a category.

We set a threshold \(\lambda \) and compare it with the beam search confidence s to dynamically adjust the contribution of the two ranking groups to the ensemble ranking \(\mathcal {C}^{e}\). If the beam search score s of a category suggestion \(c_{i}\) in the candidate generation results is greater than the threshold, it is directly included in the ensemble ranking. Otherwise, we take the corresponding category \(c_{i}^{p}\) from the prompt ranking to obtain the final ensemble ranking \(\mathcal {C}^{e}\):

$$\begin{aligned} \mathcal {C}^{e} = \sum _{i=1}^{n}(\lceil (s-\lambda ) \rfloor c_{i} + \lceil (\lambda -s) \rfloor c_{i}^{p}), \end{aligned}$$
(5)

where \((s- \lambda )\) varies in the range \((-1,1)\), and \(\lceil \cdot \rfloor \) denotes the round-up function (Fig. 3).
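The rule in Eq. (5) can be read as a per-position switch between the two rankings. The sketch below illustrates it, under the assumption that candidates are processed in candidate-ranking order and duplicates are skipped; the threshold value is illustrative.

```python
# Ensemble-ranking sketch for Eq. (5): a candidate whose beam-search
# confidence s exceeds the threshold lambda keeps its slot; otherwise the
# slot is filled by the next unused category from the prompt ranking.
def ensemble_rank(candidates, confidences, prompt_ranked, lam=-1.0):
    final, used = [], set()
    for cand, s in zip(candidates, confidences):
        if s > lam:
            pick = cand                                   # high confidence: keep candidate
        else:
            # low confidence: fall back to the best unused prompt-ranked category
            pick = next((c for c in prompt_ranked if c not in used), cand)
        if pick not in used:
            final.append(pick)
            used.add(pick)
    return final
```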

Fig. 3. Ensemble ranking. Ed Saugestad is the name of an individual. The top-2 results of the ensemble ranking come from confidence, while the rest come from perplexity. The blue labels indicate the categories of entities in Wikipedia. (Color figure online)

4 Experimental Evaluation

4.1 Experiment Settings

Datasets. Since there is no publicly available dataset for fine-grained category generation, we collect data from Wikipedia and construct a new dataset called Wiki-category. Wikipedia is a multilingual online encyclopedia and the largest reference work in history. Our dataset is constructed through a three-stage process.

In the first stage, we gather information from Wikipedia on entities, context texts, categories, and their relationships. We normalize this information into three sets: 1) the entity and its category; 2) the entity and its corresponding context; 3) the category and its parent categories. In the second stage, we use the category relationships to construct a hierarchical tree of categories and select specific layers as fine-grained categories. Finally, we filter out all subcategories under the fine-grained categories, retaining only entities.

We collect 23.8M entities with context and 1.1M categories from Wikipedia, and construct a dataset of 5.3K fine-grained categories. The average length of a category is 28 characters, and each category contains 21 entities on average. In our task, we use the 5.3K fine-grained categories as the dataset, divided into training, validation, and test sets with an 80/10/10 split.

Baselines. We compare our model with five typical methods:

  • Pointer-generator [26] proposes an architecture that augments the Seq2Seq attention model. It can copy words from the source text via pointing and uses coverage to discourage repetition.

  • NATS [28] is an abstractive text summarization approach. It uses the pointer-generator network as the base model and equips it with the coverage mechanism and unknown word replacement.

  • Bart [13] uses a standard Seq2Seq translation architecture with a bidirectional encoder (like BERT [11]) and a left-to-right decoder (like GPT [22]). BART is particularly effective when fine-tuned for text generation.

  • T5 [23] is an encoder-decoder model in which each task is converted to a text-to-text format, using the same model, loss function, and training process for all NLP tasks.

  • Pegasus [32] is a large Transformer-based encoder-decoder model in which important sentences are removed/masked from an input document and are generated from the remaining sentences, similar to an extractive summary.

Basic Settings. The experiments are conducted on a Linux server equipped with an AMD EPYC 7402 CPU, 256 GB RAM, and an NVIDIA GeForce RTX 3090. The proposed models are implemented using PyTorch 1.10.1 in Python 3.7. We fine-tune the five typical baseline methods. The pointer-generator method follows the configuration of the original paper. For NATS, we use their publicly released toolkit. For the Seq2Seq models (T5, PEGASUS, Bart), we use the tokenizers and models from HuggingFace.

Evaluation Metric. We use ROUGE and text similarity as our evaluation metrics for the generated ranked lists.

1) ROUGE [14] is a set of metrics for evaluating automatic summarization and machine translation. It primarily relies on recall to evaluate the similarity between the generated summary and the reference summary. We report ROUGE-1, ROUGE-2, and ROUGE-L scores, selecting the result with the highest ROUGE score among the top-5 results.

2) Text similarity is used to compare sequence differences. This metric primarily relies on the concept of the longest common subsequence (LCS):

$$\begin{aligned} ratio = 2*M / T \end{aligned}$$
(6)

where M is the number of matches and T is the total number of elements in both sequences. To reduce noise, we retain only English characters, converted to lowercase, before comparing the texts.
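This ratio is the one computed by Python's difflib SequenceMatcher; whether that exact implementation is used here is an assumption, but the formula matches Eq. (6). A small sketch:

```python
# Text-similarity sketch: the 2*M/T ratio of Eq. (6), computed after keeping
# only lowercased English letters in both strings.
import re
from difflib import SequenceMatcher

def normalize(text):
    return re.sub(r"[^a-z]", "", text.lower())   # keep lowercase English letters only

def similarity(generated, reference):
    return SequenceMatcher(None, normalize(generated), normalize(reference)).ratio()

# Example with hypothetical strings:
print(similarity("Minesweepers of the Royal Navy", "Minesweepers of the United Kingdom"))
```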

4.2 Overall Evaluation

As shown in Table 2, we conduct a comprehensive evaluation of the performance of commonly used Seq2Seq models on the fine-grained category generation task. To ensure a fair comparison, all models are trained and tested using the conceptual text \(\mathcal {T}\). Our method, optimized and improved on top of the Bart-large model, achieves state-of-the-art performance. Moreover, we make several interesting observations in our experiments.

For Different Evaluation Metrics. We find that the ROUGE evaluation cannot effectively distinguish between similar fine-grained categories, which prompts us to design a text similarity evaluation. Furthermore, since we aim to generate categories that match the titles of the corresponding Wikipedia categories, the text similarity evaluation better assesses our models' performance in this regard. E.g., T5-large performs well in the ROUGE evaluation but only achieves mediocre results in the text similarity evaluation.

For Pre-trained Models. We compare the pre-trained models with Seq2Seq methods such as PG and NATS, which perform poorly. Our hypothesis is that the latter rely on pointwise copying of words from the context. However, conceptual texts \(\mathcal {T}\) do not contain explicit category information, so the generated categories are likely derived from model pre-training.

For Different Baselines. We compare our method with the five fine-tuned models and a prompt-based zero-shot model and observe that Bart achieves the best performance among the baselines. Our method is based on Bart and has the same Prec@1 and Dist@1, but it significantly improves Prec@5. Furthermore, we observe that larger models with the same structure tend to produce better results.

Table 2. ROUGE and text similarity evaluation of different methods. GPT-2\(^{\ddagger }\) indicates the zero-shot prompting result. Results in bold are optimal, and underlined results are suboptimal.

4.3 Result Analysis

Ablation Study. We conduct ablation experiments to verify the efficacy of conceptual text mining, prompt ranking, and ensemble ranking. As can be seen from Table 3, the use of conceptual text mining successfully extracts relevant information from context, effectively reducing noise.

Regarding the prompt and ensemble ranking methods, they do not alter the top-1 result, hence yielding the same values for Prec@1 and Dist@1. For prompt ranking, the ROUGE evaluation remains basically the same while text similarity improves, demonstrating that entity name information can help category ranking. For ensemble ranking, both ROUGE and text similarity improve. Prompt ranking integrates the relationships between entities and categories, while ensemble ranking combines the parsing capabilities of multiple models.

Table 3. Ablation study.

Implementation Observation. We feed in the dataset to generate the top-10 results and record the beam search scores of the categories. We comprehensively consider the performance of Prec@1 and Prec@5. Figure 4 shows a series of hyperparameter settings:

1) For the maximum generation length constraint, we set the value to 16 or more; 2) The no-repeat size is the repetition constraint: it refers to the size of the n-gram that is considered when generating new text to ensure that no repeated sequence of words is produced. We set the value to 3; 3) For the ensemble ranking scale, we compare the effects of fixed and dynamic scales, varying the model-to-prompt scale from 4:1 to 1:4. Prec@5 is highest when using the dynamic scale setting.

Fig. 4. Hyperparameter settings.

Case Study. This section presents two real cases comparing our approach with Bart, as shown in Table 4.

1) Case 1 demonstrates the effectiveness of the prompt method in reducing semantic drift by using entity information. E.g., we input the names of three persons, but Bart preferentially generates categories related to an organization, whereas prompt ranking selects a category related to people based on the given entity information. This helps generate categories that are more relevant to and aligned with the input entities.

2) Case 2 illustrates how ensemble ranking can alleviate the problem of repetition in specific situations. Specifically, we observe that the Seq2Seq model tends to generate similar results due to the constraints of the search algorithm. Furthermore, when using the Bart model to generate categories, we find that the term “backup” is frequently included in the generated results. However, by combining the two sets of ranked results via ensemble ranking, we increase the diversity of the generated categories and reduce repetition.

Table 4. Case studies between Bart and Ours. The ground truth category is marked in blue, and the entity is marked in red.

5 Related Work

Abstract Text Summarization. Our work is closely related to the abstractive text summarization task, which preserves the salient information of source articles. There are two broad approaches to summarization: extraction and abstraction. Extractive summarization extracts key sentences and keywords from source documents, while abstractive summarization allows the generation of new words and phrases to compose summaries.

The pointer-generator network [26] can copy words from the source text via pointing and uses coverage to discourage repetition. NATS [28] provides a toolkit of RNN-based Seq2Seq models. BART [13] is an encoder-decoder model and has shown superior performance in abstractive summarization. T5 [23] is a unified framework that converts all text-based language problems into a text-to-text format and adapts well to text summarization. Pegasus [32] is a pre-trained model tailored for abstractive text summarization. However, these works do not perform well on the fine-grained category generation task, where severe noise and drift problems arise.

Fine-Grained Entity Typing. For category generation, a large amount of previous work has focused on assigning entities to categories, usually called fine-grained entity typing, which aims to classify marked entities from input sequences into specific types in a pre-defined label set. Ultra-fine entity typing [5] introduces a new entity typing task to predict a set of free-form phrases that describe the entity's role in a given sentence. [20] utilizes the type descriptions available from Wikipedia to build a distributed semantic representation of the types. [35] requires no annotated data and can flexibly identify newly defined types. [6] investigates the application of prompt-learning to fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. However, this line of work assigns pre-defined categories and cannot create new ones.

Prompt-Based Learning. Our work also benefits from prompt-learning, which is based on language models that model the probability of text directly. In this paradigm, a model with a fixed architecture is pre-trained as a language model (LM), predicting the probability of observed textual data. GPT-2 [22] proposes to use unsupervised pre-trained models for supervised tasks. The seminal work that stimulated the development of prompt-learning is GPT-3 [3], which uses hand-crafted prompts for tuning and achieves very impressive performance on various tasks. Pattern design can be divided into discrete [29, 31] and continuous [12]. Inspired by Hearst patterns [9], we construct different templates and observe stable performance. There are related works [2, 16] on prompting that are very similar to the standard pre-training and fine-tuning paradigm, but adding prompts provides additional guidance at the beginning of model training. In our work, we mainly use prompt-learning to model the probability of templates and rerank candidate categories.

6 Conclusion and Future Work

The category systems present in large-scale knowledge bases are unique and valuable resources that can be leveraged to accomplish a wide range of information access tasks. However, these systems are currently created and maintained manually by editors. To address this issue, this paper proposes an approach that utilizes the Seq2Seq model to generate categories based on a set of entities and their context as input. However, this task faces several challenges, including noise, drift, and repetition.

To overcome these challenges, we propose a novel approach that utilizes a two-stage category generation framework. We first utilize the Seq2Seq model, which produces a large number of fine-grained categories. Next, we apply language model probing to select high-quality categories. To reduce the noise in the original text, we use the conceptual text mining approach in the first stage to extract common text from context. The second stage includes two modules, prompt ranking and ensemble ranking. Prompt ranking is based on a manually constructed template, which includes the names of categories and entities. We utilize perplexity to re-rank categories, which helps to improve the connection between entities and categories and reduce the drift phenomenon. Ensemble ranking is designed to alleviate repetition issues. In this module, we use dynamic rule algorithms combined with candidate categories and prompt ranking results to improve the diversity of generated categories.

To evaluate the effectiveness of our proposed framework, we develop a dataset dedicated to fine-grained category generation based on Wikipedia. We systematically investigate the performance of the Seq2Seq model for the fine-grained category generation task. Our experimental results, based on five baseline evaluations, demonstrate the efficacy of our approach, while the ablation study validates the effectiveness of the three modules. Moving forward, we plan to explore the construction of continuous patterns for category generation and apply fine-grained categories to downstream tasks.