Keywords

1 Introduction

The integration of new technologies, such as artificial intelligence, cloud computing, and big data analysis, in the era of big data has led to increasingly closer connections between humans and various objects in society. Real-world networks are similar to interwoven webs, where data serves as a medium of communication and networks function as a means of transmission through various forms of mutual interaction and influence. Complex networks serve as an abstraction of real networks [1], where people and various objects are represented as nodes, and their connections form edges. Complex networks can be classified into homogeneous networks (HON) and heterogeneous networks (HEN) based on the characteristics of nodes and edges. Homogeneous networks are networks composed of nodes and connections of the same type, whereas heterogeneous networks contain a range of nodes and connections of different types [1]. Homogeneous networks are a simplified representation of real-world networks, while heterogeneous networks better reflect the dynamics and complex interactions of real-world systems.

An increasing number of scholars are studying heterogeneous networks, which encompasses research in link prediction, community detection, node classification and node importance evaluation. Node importance evaluation is the method of quantifying the influence of nodes within a network. Node influence can be assessed from two viewpoints. Firstly, the information dissemination ability of a node in the network, which measures whether information acquired by a node can be efficiently transmitted to a significant fraction of nodes in a timely manner. Secondly, the importance of a node in maintaining network stability, which assesses the impact of a node’s disruption on the network connectivity in light of the network topology. The assessment of node importance in networks has a broad range of real-world applications. For instance, it is applicable in securing critical facilities in communication or power networks and responding promptly to emergencies. In academic communities, it can be used to gauge the impact of publications or researchers. The evaluation of node importance in online social networks allows the identification of influential users and promotes efficient management of public opinion while curbing the propagation of rumors.As a result, assessing node importance in a network is a valuable approach to ensure network stability and enhance operational efficiency.

This study focuses on the topological and semantic information in heterogeneous networks and utilizes graph autoencoders to encode various semantic subgraphs and reconstruct their corresponding adjacency matrices. By optimizing the reconstruction loss, this study obtains node embedded vectors and calculates the overall impact of nodes by combining topological potential [2] and centrality measures. The main contributions of this study are described as follows:

  1. (1)

    The topology adaptive graph convolutional networks (TAGCN) [3] are utilized to enhance the graph autoencoder, which allows the encoder to have an expanded receptive field and aggregate more comprehensive node features in a single convolution. Additionally, TAGCN enables the efficient processing of large-scale networks with less computation cost.

  2. (2)

    This paper proposes a node importance evaluation method TAGCN Auto-Encoder Comprehensive Influence (TAE-CI), which comprehensively assesses the importance of nodes from both local and global perspectives. Local influence is calculated based on node degree, while global influence is computed by combining network topological potential with the similarity between embedding vectors.

2 Related Work

The assessment of node importance is a fundamental challenge in complex network analysis. While research on homogeneous networks has reached a considerable level of maturity, research on the evaluation of heterogenous networks remains relatively limited. This paper presents a comprehensive overview of related work on node importance evaluation in complex networks, focusing on two primary approaches: indicator-based evaluation methods and node embedding-based evaluation methods.

2.1 Indicator-Based Evaluation Methods

The indicator-based evaluation method defines nodal importance indicators on either a local or global scale and employs one or more indicators to assess the significance of nodes. In homogeneous networks, traditional centrality indicators used in both local and global contexts include degree centrality (DC) [4], betweenness centrality [5], and PageRank [6]. Reference [7] proposes the k-shell based on gravity centrality (KSGC) method to rank the importance of nodes based on global indicators. This method considers the K-shell value of the node in the network and the topological structure and edge importance between its neighboring nodes. By using a gravity model to synthesize local and global information about the nodes, this method produces nodal importance rankings that are more precise than those obtained using the K-shell [8] method. On the other hand, reference [9] proposes the Global Structure Model (GSM), which integrates k-shell and path-based evaluation methods. In [10], the Local-and-Global Centrality (LGC) centrality index is proposed as a method that combines both local and global influence. To calculate local influence, LGC uses degree centrality, and to calculate global influence, it incorporates shortest paths between nodes and node degrees. The product of these two factors determines the importance score for each node. Although this key node evaluation method has demonstrated its effectiveness, applying it directly to heterogeneous networks would significantly reduce its performance. Furthermore, such methods tend to concentrate on the structural information of the network and overlook the semantic information.

To evaluate heterogeneous networks, in [11] the Entropy Rank Method (ERM) is proposed, an algorithm for sorting based on information entropy. ERM defines three types of probability entropies based on meta-paths and linearly combines them to determine the importance of a node within that meta-path. Reference [12] calculates the centrality of nodes on each layer using global indices K-shell and betweenness centrality on multilayer networks. By quantifying the neighbor’s weights in multilayer heterogeneous networks, the method utilizes the multi-relationship coupling information and transmission mechanism to obtain the importance of nodes. Reference [13] collects information from multiple levels of neighborhoods, evaluating the super spreaders in the epidemic propagation network by considering different types of networks, including social networks, highly heterogeneous interpersonal contact networks, and epidemiological networks.

2.2 Node Embedding-Based Evaluation Methods

In recent years, researchers have leveraged machine learning and deep learning techniques, such as graph neural networks and reinforcement learning, to extract the essential features of nodes in a network. These methods are used to form evaluation methods based on node embeddings. In [14], node importance labels are generated using the Susceptible Infected Recovered (SIR) model, where the top 5% of nodes are ranked as the most influential, and the other nodes are classified as less influential. The node sampling is performed using a breadth-first search algorithm, and the graph convolutional network is utilized to aggregate features and obtain node embedding vectors. Finally, node importance evaluation is transformed into a node classification task. Reference [5] employed betweenness centrality to label nodes on a small-scale network. Node embedding vectors were then generated using an encoder, transformed into scalars using a decoder, and used to fit the betweenness centrality. This approach was subsequently applied to evaluate crucial nodes in a large-scale network. Reference [15] starts with the robustness of the network, using graph resistance as the node importance label, and applies GraphSAGE to obtain node embedding vectors. The embedding vectors are input into a multi-layer perceptron to obtain node importance scores for evaluation. The aforementioned methods are all evaluation methods in homogeneous networks, and most of them require labels. Reference [16] applies metapath2vec to construct a node importance evaluation method for heterogeneous networks. Based on a series of metapaths with specific semantics, this method extracts heterogeneous features from different metapaths and integrates different structural features using a weighted mechanism. By adopting a relevance threshold, it identifies the most influential set of nodes. This kind of method relies heavily on prior knowledge because it highly depends on the discovery and selection of metapaths.

3 Preliminaries

3.1 Heterogeneous Network

Heterogeneous network is defined as a directed graph \({\text{G}} = ({\text{V}},{\text{E}},\upphi , \,\uppsi )\) with two mapping functions [17], one for object types \(\phi {\text{:V}} \to {\text{A}}\) and one for relationship types \(\uppsi :{\text{E}} \to {\text{R}}\). Every node \( {\text{v}} \in {\text{V}}\) is categorized by a specific node type \({\text{A:}}\phi \left( {\text{v}} \right) \in {\text{A}}\) and every edge \({\text{e}} \in {\text{E}}\) is categorized by a specific relationship type \({\text{R}}:\uppsi \left( e \right) \in {\text{R}}\), in which the sum of | in which the sum of || in which the sum of |A|and |R| greater than 2.

3.2 Metapath

Meta-path is defined as a sequence of alternating node types and relationships [17], denoted as \({\text{P:A}}_{{1}} \to {\text{A}}_{{2}} \to \ldots \to {\text{A}}_{{\text{l}}}\), it defines the composite relationship between node type \({\text{A}}_{{1}}\) and node type \({\text{A}}_{{\text{l}}}\), each meta-path can be regarded as Semantic information channel [1].

3.3 Semantic Subgraph

To extract semantic information in heterogeneous networks, meta-paths are utilized to divide the original network into several semantic subgraphs.The set of meta-paths is denoted as \( {\text{MP}} = \{ {\text{MP}}_{{\text{1}}} ,{\text{MP}}_{{\text{2}}} , \ldots {\text{MP}}_{{\text{S}}} \} \), a heterogeneous network with s meta-paths is partitioned into s semantic subgraphs which can be denoted as \({\text{G}} = \{ {\text{G}}_{{1}} ,{\text{G}}_{{2}} , \ldots ,{\text{G}}_{{\text{s}}} \}\), the i-th subgraph is constructed by traversing nodes and edges based on metapath \({\text{MP}}_{{\text{i}}}\). The adjacency matrix \({\text{A}}_{{{\text{G}}_{{\text{i}}} }}\) of subgraph \({\text{G}}_{{\text{i}}}\) is defined such that its elements represent the existence of edges between nodes in \({\text{G}}_{{\text{i}}}\). The initial feature matrix \({\text{X}}_{{{\text{G}}_{{\text{i}}} }}\) of subgraph \({\text{G}}_{{\text{i}}}\) is defined such that node features are randomly initialized.

4 The Proposed Method

This study proposes TAE-CI, a node importance evaluation method that combines graph neural networks, centrality measures, and topological potential. To improve the graph autoencoder, the study employs TAGCN, which encodes variant semantic subgraphs and gets embedded matrices of nodes. Graph convolutional network (GCN) is then used to decode and reconstruct them. Node embedded vectors are obtained by minimizing the reconstruction loss, The framework of node embedding model is depicted in Fig. 1. The global importance of nodes is calculated by providing the embedded vectors of nodes to the topological potential function [2], and their importance is estimated by combining them with local importance. TAE-CI includes two parts: the node embedding model and the node importance evaluation model.

4.1 Node Embedding Model

Fig. 1.
figure 1

Framework of node embedding model

Node Feature Encoding

This study utilizes TAGCN to encode the node features of diverse semantic subgraphs. As a variant of GCN, TAGCN employs k graph convolution kernels at each layer to extract local features of various sizes. This method can fully capture the graph’s information and enhance the model’s performance [3]. For a semantic subgraph \({\text{ G}}_{{\text{i}}}\), its adjacency matrix is normalized using the degree matrix \({\text{D}}_{{{\text{G}}_{{\text{i}}} }}\), as shown in Eq. (1):

$$ {\tilde{\text{A}}}_{{{\text{G}}_{{\text{i}}} }} = {\text{D}}_{{{\text{G}}_{{\text{i}}} }}^{ - 0.5} ({\text{A}}_{{{\text{G}}_{{\text{i}}} }} + {\text{I}}){\text{D}}_{{{\text{G}}_{{\text{i}}} }}^{ - 0.5} $$
(1)

Then, the polynomial convolution kernel of the l-th layer for a given metapath is defined as \({\text{H}}_{{{\text{P}}_{{\text{i}}} }}^{{\text{l}}}\), which can expand the receptive field and adaptively adjust the size of node neighborhoods for different metapaths, as shown in Eq. (2):

$$ {\text{H}}_{{{\text{P}}_{{\text{i}}} }} { = }\mathop \sum\nolimits_{{\text{k = 0}}}^{{\text{K}}} {\text{h}}_{{{\text{k,p}}_{{\text{i}}} }} {\text{A}}_{{{\text{G}}_{{\text{i}}} }}^{{\text{k}}} $$
(2)

where \({\text{h}}_{{{\text{k,p}}_{{\text{i}}} }}^{{\text{l}}}\) is the polynomial coefficient and K is the kernel size. Finally, K convolutional kernels extract features on the graph structure, which are linearly combined, as shown in Eq. (3):

$$ {\text{Z}} =\upsigma \left( {{\text{H}}_{{{\text{P}}_{{\text{i}}} }} {\text{X}}_{{{\text{G}}_{{\text{i}}} }} + {\text{b}}_{{{\text{P}}_{{\text{i}}} }} } \right) $$
(3)

where Z is the embedded matrix of nodes, \({\text{ b}}_{{{\text{P}}_{{\text{i}}} }}\) is the bias and \(\sigma\) is the activation function.

Node Feature Decoding

Once the graph structure and initial feature matrix are fed into the encoder, the node embedded matrix can be generated. Given that node importance evaluation is an unsupervised task, this paper utilizes GCN [18] to decode the embedded vectors for reconstructing the semantic subgraph adjacency matrix and optimizing the reconstruction loss, thereby enhancing the representation power of the node embeddings. The decoding process of the embedded vectors of nodes is shown in Eqs. (4) and (5):

$$ {\hat{\text{Z}}} = {\tilde{\text{A}}}_{{{\text{G}}_{{\text{i}}} }} {\text{Z}} $$
(4)
$$ {\hat{\text{A}}}_{{{\text{G}}_{{\text{i}}} }} =\upsigma ({\hat{\text{Z}}\text{W}} + {\text{b}}_{{{\text{de}}}} ) $$
(5)

where \({\hat{\text{Z}}}\) is the node feature matrix after aggregation by GCN, \({\hat{\text{A}}}_{{{\text{G}}_{{\text{i}}} }} \) is the adjacency matrix of the reconstructed semantic subgraph \({\text{G}}_{{\text{i}}}\), W is the weight matrix, and \({\text{b}}_{{{\text{de}}}}\) is the bias in the decoder. The loss function during network reconstruction is defined as Eq. (6):

$$ {\text{L = }}\mathop \sum\nolimits_{{\text{i = 1}}}^{{\text{s}}} \left| {\left| {{\hat{\text{A}}}_{{{\text{G}}_{{\text{i}}} }} {\text{ - A}}_{{{\text{G}}_{{\text{i}}} }} } \right|} \right|_{{2}} {\text{ + L}}_{{2}} $$
(6)

where s is the number of semantic subgraphs, and \({\text{L}}_{{2}}\) is the regularization term to prevent overfitting.

4.2 Node Importance Evaluation Model

The network itself is an abstract system composed of nodes and edges, where the influence of a node itself is closely related to its neighboring nodes. In this paper, the local influence of a node is expressed by its degree. The local influence is defined as LI, and its calculation method is shown in Eq. (7):

$$ {\text{LI}}\left( {{\text{v}}_{{\text{i}}} } \right){ = }\frac{{{\text{d}}\left( {{\text{v}}_{{\text{i}}} } \right)}}{{\text{N}}} $$
(7)

where \({\text{d(v}}_{{\text{i}}} {)}\) is the degree of node \({\text{v}}_{{\text{i}}}\) and N is the total number of nodes in the graph.

Moreover, a node’s global influence is usually associated with the shortest path between nodes. Normally, to determine the shortest path among all possible pairs of nodes, Dijkstra algorithm or Floyd algorithm is usually required. Nonetheless, for large-scale networks, these methods are impractical because of their time-consuming nature. In this paper, the global influence of a node is calculated by combining its embedded vector with topological potential. Similar to an electromagnetic field, a node itself forms a potential field that affects other nodes, and the effect of the influence is related to the similarity between nodes. The calculation of similarity between two nodes is shown in Eq. (8):

$$ {\text{c}}\left( {{\text{v}}_{{\text{i}}} {\text{, v}}_{{\text{j}}} } \right){ = }\frac{{{\text{Z}}_{{{\text{v}}_{{\text{i}}} }} \cdot {\text{Z}}_{{{\text{v}}_{{\text{j}}} }} }}{{\left| {\left| {{\text{Z}}_{{{\text{v}}_{{\text{i}}} }} } \right|} \right|{\text{||Z}}_{{{\text{v}}_{{\text{j}}} }} {||}}} $$
(8)

where \({\text{Z}}_{{{\text{v}}_{{\text{i}}} }}\), \({\text{ Z}}_{{{\text{v}}_{{\text{j}}} }}\) is the embedded vector of node \({\text{v}}_{{\text{i}}}\), \({\text{v}}_{{\text{j}}}\). Then, the similarity between nodes is incorporated into the topological potential function to calculate the global influence of a node using Eqs. (9) and (10):

$$ {\text{m}}_{{{\text{v}}_{{\text{i}}} }} { = }\mathop \sum\nolimits_{{\text{j = 1}}}^{{\phi \left( {{\text{v}}_{{\text{i}}} } \right)}} {\text{d}}\left( {{\text{v}}_{{\text{i}}} } \right) $$
(9)
$$ {\text{GI}}\left( {{\text{v}}_{{\text{i}}} } \right) = \frac{1}{{\text{N}}}\sum\nolimits_{{{\text{i}} \ne {\text{j}}}}^{{\text{N}}} {\left( {{\text{m}}_{{{\text{v}}_{{\text{j}}} }} \cdot {\text{e}}^{{ - \frac{{{\text{c}}\left( {{\text{v}}_{{\text{i}}} {\text{,v}}_{{\text{j}}} } \right)}}{{\lambda {{\bar{d}}}}}}} } \right)} $$
(10)

In this process, \(\phi \left( {{\text{v}}_{{\text{i}}} } \right) \) is the set of neighboring nodes of node \({\text{v}}_{{\text{i}}}\), \({\overline{\text{d}}} \) is the mean degree of the network, and λ is a tunable parameter. The value of λ is determined by minimizing its entropy value as defined in Eq. (11):

$$ {\Pi }\left( {\uplambda } \right){ = - }\mathop \sum\nolimits_{{\text{i = 1}}}^{{\text{N}}} \frac{{{\text{GI}}\left( {{\text{v}}_{{\text{i}}} } \right)}}{{\text{U}}}{\text{ln(}}\frac{{{\text{GI}}\left( {{\text{v}}_{{\text{i}}} } \right)}}{{\text{U}}}{)} $$
(11)

where \({\text{U = }}\mathop \sum\nolimits_{{\text{i = 1}}}^{{\text{N}}} {\text{GI(v}}_{{\text{i}}} {)}\) is normalization factor.

After the aforementioned process, the local and global influence of a node have been obtained. Therefore, the comprehensive influence of a node is defined as shown in Eq. (12):

$$ C{\text{I}}\left( {{\text{v}}_{{\text{i}}} } \right){\text{ = LI}}\left( {{\text{v}}_{{\text{i}}} } \right) \cdot {\text{GI(v}}_{{\text{i}}} {)} $$
(12)

5 Experiment Result and Analysis

5.1 Datasets

In this paper, three real-world network datasets were selected for experimentation: ACM [19], a citation network containing three types of nodes (paper, author, and Subject); IMDB [20], a movie network subset from the Internet Movie Database containing three types of nodes (movie, actor, and director); and DBLP [19], an academic citation network containing four types of nodes (author, paper, term, and conference). The detail information about the datasets is shown in Table 1.

Table 1. Detail Information about the Datasets

5.2 Evaluation Criterion

The assessment of node importance in complex networks often involves the use of propagation dynamics models to evaluate the assessment results. In this paper, the SIR model was selected for simulation experiments [21]. The SIR model contains three types of node states: susceptible (S), infected (I), and recovered (R). There are two important parameters in the SIR model: the infection rate and the recovery rate.Nodes in the infected state have a probability of infecting first-order neighboring nodes in the susceptible state with infection rate, while nodes in the infected state have a probability of becoming nodes in the recovered state with recovery rate and nodes in the recovered state cannot be infected again.

To verify the effectiveness of node importance evaluation with the SIR model, all nodes are first initialized as susceptible. Then, a subset of nodes with the highest node importance values are selected to be infected at the initial stage according to a preset infection rate and recovery rate, and the entire network is infected accordingly. The proportion of infected and recovered nodes compared to the total number of nodes in the network when the spread reaches a steady state is then calculated as the spread range. If two methods have the same spread range, their propagation speeds will be compared by examining the time it takes for them to reach a steady state. For convenience in explaining the experimental results, the propagation coverage of nodes at the t time step is denoted as F(t), as shown in Eq. (13).

$$ {\text{F}}\left( {\text{t}} \right){ = }\frac{{\text{Propagation coverage at time step t}}}{{\text{N}}} $$
(13)

As the infection and recovery rates affect the spread process, there might be some variations in experimental results even with uniform conditions. To mitigate this issue, this paper conducted 100 independent experiments and computed the average of the 100 experiments as the final evaluation metric.

5.3 Results Analysis

In order to compare the TAE-CI algorithm proposed in this paper with LGC, KSGC, GSM and traditional centrality indices PageRank and DC, the node importance assessed by these methods is ranked from highest to lowest, and the top 20 nodes are selected as the key node set. The nodes in this set are set as the infected state in the SIR model. The propagation coverage rate F(t) at a given time step is counted, and the SIR transmission experimental curve is drawn. Figure 2 shows the SIR propagation curve of each method on three datasets.

Fig. 2.
figure 2

The SIR transmission experiment. (a): ACM dataset;(b): DBLP dataset;(c): IMDB dataset.

Figure 2a is an experiment in the ACM dataset, in which the network is relatively sparse, so methods that focus more on the local importance, such as DC, are slower in the first 10 iterations. On the other hand, the proposed method, LGC, and GSM all pay attention to both local and global influence and perform well in the experiment. The networks of the DBLP and IMDB datasets are denser, making it difficult to distinguish the performance of the methods in the first few iterations. The DBLP is the densest network, and under the same infection and recovery rates, the infection range of all methods exceeds 0.8, leading the other two datasets by about 0.3.

In order to further validate the effectiveness of our method, we conducted experiments on the information propagation ability of nodes in different ranking positions, as shown in Fig. 3. The top 20, middle 20, and bottom 20 nodes ranked by node importance were selected, as the node importance decreases, the information propagation ability of the nodes weakens, and both the propagation range and the propagation speed become smaller. Generally speaking, the propagation range and propagation speed of the top nodes are higher than those of the middle and bottom nodes. In addition, due to the different density of the networks, the difference in the propagation ability between the top 20 and bottom 20 nodes vary. For example, in the case of the ACM network, which is a sparse network, the difference in the propagation ability between the top 20 nodes and the bottom 20 nodes is the largest, with a difference of 44.7%, while for the IMDB network, which is a dense network, the difference between the top 20 nodes and the bottom 20 nodes is the smallest, with a difference of 19.2%.

Fig. 3.
figure 3

The SIR transmission experiment of nodes with different ranking positions. (a): ACM dataset; (b): DBLP dataset; (c): IMDB dataset.

6 Conclusion

This paper applies node embedding with topological potential to the evaluation of node importance in heterogeneous networks, and proposes a node importance evaluation method called TAE-CI. The method evaluates the comprehensive influence of nodes both locally and globally, avoiding the partiality of single-index evaluation, and avoiding the time cost brought by computing the shortest path between nodes. As verified by SIR propagation experiments, the proposed method in this paper can better evaluate node importance.

Although the method proposed in this paper achieved better results in the evaluation of node importance in heterogeneous networks, the semantic information obtained by using metapaths in the node embedding model is not comprehensive enough. In the future, it is considered to introduce metagraphs to extract semantic information in order to further improve the accuracy of the evaluation method.