
1 Introduction

Due to the growth of information in a wide range of applications, many real-world data sets, such as bibliographic, biological, and social networks, contain a massive number of interconnected multi-typed objects. These data sets can be modelled as heterogeneous information networks (HINs) [1] to represent their complex structure at an abstract level. For example, the research publication site (a.k.a. bibliographic network) DBLP (http://www.informatik.uni-trier.de/) can be modelled as an HIN as shown in Fig. 1. It illustrates the relationships among four object types: Paper (the titles of papers contained in the data set), Author (the people who wrote the papers), Conference (the venues where the papers are published) and Term (the set of keywords appearing in the papers).

Fig. 1. Heterogeneous information network model of the DBLP data set

Mining and analysing HINs has become a significant and interesting research topic in recent years [2], and clustering analysis is one of the most helpful processes for identifying groups of related objects within the underlying complex structure [3]. For example, finding groups of Authors based on their research areas can improve the search results on the DBLP website.

In general, the objects in an information network are of the same type (homogeneous) and are clustered based on their attributes/features or the link/structural similarity between them. Clustering on an HIN is a challenging task since each type of object has unique characteristics and each relation implies a different semantic meaning [4]. Moreover, HIN data sets contain much richer information than the homogeneous model. Though it is possible to project the links among HIN objects into multiple homogeneous networks, the mining results would then suffer significant information loss. Therefore, to preserve the information in the network, it is necessary to compute the similarity between target-typed objects via the other types of objects linked with them. This can be achieved by exploring the different meta-path relations [5, 6] among objects, where each relation produces different clustering results.

On the other hand, clustering algorithms are useful for finding meaningful patterns in network data. Affinity Propagation (AP) clustering [7] is one of the most popular algorithms and has shown promising results in many applications, including protein interaction networks [8]. It takes as input a collection of negative real-valued similarities between objects and automatically generates cluster exemplars for the objects. However, AP has some limitations related to the choice of its parameters (i.e., the damping factor \(\lambda \) and the preference p), which can cause the algorithm to fail to converge due to oscillation and which affect the number of resulting clusters, respectively.

In [9], an adaptive AP (AAP) algorithm is developed to resolve these limitations by iteratively updating the parameter values, eliminating the oscillation automatically and selecting an optimal clustering from various clustering results. Recently, to resolve the problem of setting the initial preference p, a modified AP (MAP) [10] algorithm introduced a similarity-distribution-based model to set a sensible p for each object in the data set.

Though the above modifications achieve better clustering results than the original AP, the clustering process is either slow, stuck in oscillation, or does not achieve the desired clustering results. To address these issues, this paper proposes an enhanced AP (EAP) algorithm that introduces new strategies for updating the values of the two key parameters in AP. Our contributions are highlighted as follows:

  1. We propose a new method to detect the occurrence of oscillation and increase convergence speed by reducing the computational time for monitoring and eliminating oscillation in AP.

  2. We use the MAP algorithm to initialize the preference p and update its value based on the previous p, in order to retrieve an appropriate p and the desired clustering.

  3. The experimental results on an HIN data set show that the proposed EAP approach can handle the issue of slow convergence and produces better clustering results compared to the existing methods.

The remainder of this paper is organized as follows. Section 2 describes meta-path relations, similarity matrix construction, and the AP algorithm and its derivatives. Section 3 introduces the proposed approach. In Sect. 4, empirical studies are presented. Finally, Sect. 5 concludes the paper and outlines our future work.

Fig. 2. Meta-path extraction from the DBLP network schema.

2 Related Work

2.1 Meta-paths Extraction and Similarity Matrices Construction

According to [5], a meta-path \(\mathcal{M}\mathcal{P}\) is defined as a sequence of relations that connects different object types in a network, and it represents the composite relation between objects \( X_{1},X_{2},\cdots, X_{l+1} \) in the form \(X_{1}\xrightarrow {r_{1}} X_{2}\xrightarrow {r_{2}}\cdots \xrightarrow {r_{l}}X_{l+1}\). The meta-paths for simple and small HINs can be provided by domain experts, but there are no effective methods for generating meta-paths for large-scale networks. For example, the algorithm proposed in [11] automatically derives meta-paths for a given pair of objects based on feature extraction and link generation methods, but its calculation is very complex. Thus, in this paper we use the general model, in which a meta-path is generated by multiplying the adjacency matrices along the path between two target object types, and it is encoded in the commuting matrix [12] as shown in Fig. 2. Similarity matrix construction is the process of evaluating the closeness between objects in a network; for instance, two objects are similar if they are connected by many paths/links or if they share more similar attributes in a data set. In [5], a meta-path based similarity metric is introduced which is designed to overcome the disadvantage of estimating similarity based only on highly visible and concentrated objects.

The similarity score (S) between target objects (\(x_{i},x_{j}\)) of the same type is defined as:

$$\begin{aligned} S(x_{i},x_{j})\,=\,\frac{2 \times |ins(x_{i} \rightarrow x_{j}):ins(x_{i} \rightarrow x_{j}) \in \mathcal{M}\mathcal{P}|}{|ins(x_{i} \!\rightarrow \! x_{i}):ins(x_{i} \!\rightarrow \!x_{i}) \!\in \! \mathcal{M}\mathcal{P}| \!+\! |ins(x_{j} \!\rightarrow \! x_{j}):ins(x_{j} \!\rightarrow \! x_{j}) \!\in \! \mathcal{M}\mathcal{P}|} \end{aligned}$$
(1)

where a path instance (ins) represents a path between two objects in the network that follows a specific meta-path (\(\mathcal{M}\mathcal{P}\)). For example, in Fig. 1 a path instance (a1 \(\rightarrow \) p1 \(\rightarrow \) a2) between Author objects a1 and a2 based on the meta-path APA is depicted in blue dotted lines. In Eq. (1), \(ins(x_{i} \rightarrow x_{j})\) is a path instance between objects \(x_{i}\) and \(x_{j}\), and \(ins(x_{i} \rightarrow x_{i})\), \(ins(x_{j} \rightarrow x_{j})\) are the path instances from \(x_{i}\) to itself and from \(x_{j}\) to itself, respectively.
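To make this concrete, the following is a minimal NumPy sketch (ours, not from [5]; the toy Author-Paper adjacency matrix `W_AP` is hypothetical) that builds the commuting matrix for the APA meta-path and evaluates Eq. (1):

```python
import numpy as np

# Hypothetical binary Author-Paper adjacency matrix (rows: authors, cols: papers).
W_AP = np.array([[1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]])

# Commuting matrix for the meta-path A-P-A: entry (i, j) counts the path
# instances a_i -> p -> a_j, i.e., papers co-authored by authors i and j.
# Longer paths chain more factors, e.g., APCPA: W_AP @ W_PC @ W_PC.T @ W_AP.T.
M = W_AP @ W_AP.T

def pathsim(M, i, j):
    # Eq. (1): twice the i-j path-instance count over the two self-instance counts.
    return 2.0 * M[i, j] / (M[i, i] + M[j, j])

n = M.shape[0]
S = np.array([[pathsim(M, i, j) for j in range(n)] for i in range(n)])
print(np.round(S, 3))   # symmetric matrix with ones on the diagonal
```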

2.2 Affinity Propagation (AP) Clustering and AP Variants

The Affinity Propagation (AP) clustering algorithm introduced in [7] identifies appropriate cluster exemplars (centres) for a given object set through message-passing techniques. The main objective of the AP algorithm is to maximise the net similarity, that is, the score of how appropriate the exemplars are for explaining the clusters.

Initially, it takes the similarity measures between pairs of objects as inputs and considers all objects as potential cluster centres (exemplars). For example, for a data set with n objects, the similarity matrix among the objects, \(S(x_{i},x_{j}) \text { for } i,j= 1,2,...,n\), is taken as the initial input.

Also, for each object \(x_{i}\) a preference (p) value should be assigned to \(S(x_{i},x_{i})\), which defines the preference of an object to be chosen as an exemplar. This value influences the final number of clusters and exemplars, so the clustering results may vary by adjusting the \(p(x_{i})\) value. The median \(p_{med}\) or minimum \(p_{min}\) value of the similarity matrix can be set as the initial preference, which results in a moderate or the least number of clusters, respectively.

Based on the similarity scores, two kinds of messages are then exchanged among objects. For example, any object \(x_{j}\) sends request messages to all other objects \(x_{i}\) \((i \ne j)\), as it is willing to be their exemplar (centre object). This information is stored in a responsibility matrix \(R(x_{i},x_{j})\) that describes how well-suited object \(x_{j}\) is to become an exemplar for \(x_{i}\). Similarly, the objects’ response messages are stored in an availability matrix \(A(x_{i},x_{j})\) that indicates how appropriate it would be for \(x_{i}\) to select \(x_{j}\) as its exemplar.

These two matrices are computed by the following functions:

$$\begin{aligned} R(x_{i},x_{j}) = \left\{ \begin{array}{l} S(x_{i},x_{j}) - max_{x_{j'} \ne x_{j}}[A(x_{i},x_{j'}) + S(x_{i},x_{j'})] \quad \text {if } x_{i} \ne x_{j} \\[1mm] p(x_{i}) - max_{x_{j'} \ne x_{j}}[A(x_{i},x_{j'})+S(x_{i},x_{j'})] \quad \text {otherwise.} \end{array} \right. \end{aligned}$$
(2)
$$\begin{aligned} A(x_{i},x_{j}) = \left\{ \begin{array}{l} min [0, R(x_{j},x_{j}) + \sum \limits _{x_{j'} \not \in \{x_{i},x_{j}\}} max(0, R(x_{j'},x_{j}))] \quad \text {if } x_{i} \ne x_{j} \\[1mm] \sum _{x_{j'} \ne x_{j}} max[0, R(x_{j'},x_{j})] \quad \text {otherwise.} \end{array} \right. \end{aligned}$$
(3)

After computing both matrices, AP iteratively updates their values with the damping factor (\(\lambda \)), combining the newly computed messages with their values from the previous iteration:

$$\begin{aligned} R_{t}(x_{i},x_{j}) \leftarrow (1-\lambda ) * R_{t}(x_{i},x_{j}) + \lambda * R_{t-1}(x_{i},x_{j}) \end{aligned}$$
(4)
$$\begin{aligned} A_{t}(x_{i},x_{j}) \leftarrow (1-\lambda ) * A_{t}(x_{i},x_{j}) + \lambda * A_{t-1}(x_{i},x_{j}), \end{aligned}$$
(5)

where the damping factor \(\lambda \) is a fixed value \((0 \le \lambda \le 1)\) that is used to avoid numerical oscillation.

At each iteration, the exemplar \(E(x_{i})\) of every object \(x_{i}\), \(i = 1,2,...,n\), is assigned by a decision matrix as follows:

$$\begin{aligned} E(x_{i}) = \arg \max _{x_{j}}[R(x_{i},x_{j}) + A(x_{i}, x_{j})] \end{aligned}$$
(6)

The algorithm terminates when the cluster exemplars do not change for a certain number of iterations. Otherwise, the process continues until the maximum number of iterations is reached, yielding the final clustering and the corresponding exemplars.
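For reference, a compact NumPy sketch of the full message-passing loop of Eqs. (2)–(6) (a simplified re-implementation under our own naming, not the authors’ code) could look as follows:

```python
import numpy as np

def affinity_propagation(S, p, lam=0.5, max_iter=200, conv_iter=15):
    """Minimal AP sketch: S is an (n, n) similarity matrix, p the preference(s)."""
    n = S.shape[0]
    S = S.astype(float)
    np.fill_diagonal(S, p)                       # store p(x_i) in S(x_i, x_i)
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    last_exemplars, stable = None, 0
    for _ in range(max_iter):
        # Responsibility, Eq. (2): subtract the best competing candidate.
        AS = A + S
        top = np.argmax(AS, axis=1)
        first = AS[np.arange(n), top]
        AS[np.arange(n), top] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), top] = S[np.arange(n), top] - second
        R = (1 - lam) * R_new + lam * R          # damping, Eq. (4)
        # Availability, Eq. (3): collect positive support for each candidate.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))         # keep self-responsibilities
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        np.fill_diagonal(A_new, col - np.diag(Rp))
        A = (1 - lam) * A_new + lam * A          # damping, Eq. (5)
        # Decision, Eq. (6): each object picks the exemplar maximising R + A.
        exemplars = np.argmax(R + A, axis=1)
        if last_exemplars is not None and np.array_equal(exemplars, last_exemplars):
            stable += 1
        else:
            stable = 0
        last_exemplars = exemplars
        if stable >= conv_iter:                  # unchanged for conv_iter rounds
            break
    return last_exemplars
```

This omits the sparse-matrix and numerical refinements of [7]; with the PathSim scores of Eq. (1) shifted to negative values and p set, e.g., to the matrix median, it behaves as described above.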

Adaptive Affinity Propagation (AAP). AP may fail to converge or may produce unsatisfactory clustering results if it uses inappropriate values for its two key parameters, i.e., the preference p and the damping factor \(\lambda \). The AAP algorithm [9] was proposed to tackle these issues by adapting the values of the two parameters. AAP uses a moving monitoring window to detect the oscillation that leads to non-convergence. If oscillation is detected, AAP uses Eq. (7) to gradually increase the damping factor, which helps to reduce the oscillation.

$$\begin{aligned} \lambda _t = \lambda _{t-1} + 0.05 \end{aligned}$$
(7)

If AAP fails to converge even when \(\lambda \ge 0.85\), then the preference value is adjusted adaptively as follows:

$$\begin{aligned} p_t = p_{t-1} + 0.01 \end{aligned}$$
(8)

which helps it escape from oscillation. If AAP converges, then the preference p is adjusted as follows:

$$\begin{aligned} p_t = \frac{0.01*p_{t-1}}{0.1\sqrt{K+\alpha }} \end{aligned}$$
(9)

where K is the number of exemplars generated based on the initial preference \(p_{med}\) and \(\alpha \) is a constant. This produces a different number K of exemplars with each new preference value, and the process continues until the number of clusters reaches 2, the smallest possible partition of a data set.
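The schedule can be summarised in a short sketch (our paraphrase of Eqs. (7)–(9); the monitoring-window logic of [9] is omitted):

```python
def aap_adjust(lam, p, oscillating, converged, K, alpha=1.0):
    """One AAP parameter-adjustment step (sketch of Eqs. (7)-(9))."""
    if oscillating:
        if lam < 0.85:
            lam = lam + 0.05                          # Eq. (7): increase damping first
        else:
            p = p + 0.01                              # Eq. (8): then nudge the preference
    elif converged:
        p = 0.01 * p / (0.1 * (K + alpha) ** 0.5)     # Eq. (9): rescan with a new p
    return lam, p
```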

Modified Affinity Propagation (MAP). The modified AP algorithm was introduced in [10] to resolve the problem of determining the preference value that yields an optimal clustering with the AP algorithm. It models the preference value based on the similarity distribution rather than setting it arbitrarily as done in AAP.

Given the input similarity matrix \(S(x_{i},x_{j})\), \(i,j= 1,2,...,n\), the sum of similarities from each \(x_{i}\) to the other data points, TDS(i), is first computed as in Eq. (11) and then normalised as:

$$\begin{aligned} NTDS(i) = \frac{TDS(i)}{\sum _{j}^{n} TDS(j)} \end{aligned}$$
(10)

where

$$\begin{aligned} TDS(i) = \sum \nolimits _{j=1, j \ne i}^{n} S(x_{i},x_{j}). \end{aligned}$$
(11)

Then, the preference for each data point \(x_{i}\) can be calculated as:

$$\begin{aligned} p(x_i) = n * NTDS(i) * const \end{aligned}$$
(12)

where const is a real-valued constant, e.g., the minimum value of the similarity matrix.
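A direct NumPy transcription of Eqs. (10)–(12) (ours; `const` defaults here to the similarity minimum, one of the choices mentioned above) is:

```python
import numpy as np

def map_preferences(S, const=None):
    """Similarity-distribution-based preferences, sketching Eqs. (10)-(12)."""
    n = S.shape[0]
    off = S - np.diag(np.diag(S))     # exclude self-similarities
    tds = off.sum(axis=1)             # Eq. (11): TDS(i) = sum_{j != i} S(x_i, x_j)
    ntds = tds / tds.sum()            # Eq. (10): normalised similarity distribution
    if const is None:
        const = S.min()               # one suggested choice for const
    return n * ntds * const           # Eq. (12): per-object preference p(x_i)
```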

3 Proposed Method - Enhanced Affinity Propagation (EAP)

As discussed earlier, the AP algorithm is sensitive to the values of the parameters p and \(\lambda \). AAP addresses this by adaptively adjusting the parameters, but their initial values are chosen arbitrarily. MAP sets p rationally based on the similarity matrix, but it does not address the choice of \(\lambda \). Here, we propose two new strategies to enhance the performance of AP, as shown in Fig. 3. The first strategy is a new oscillation detection mechanism that does not require the costly monitoring of changes in cluster exemplars over many iterations, as done in AAP. The second strategy combines the advantages of AAP and MAP in parameter setting, leading to fast convergence of the clustering process and improved clustering results.

3.1 Oscillation Detection

Clustering that fails to converge may imply the presence of oscillation, where (a) the number of cluster exemplars goes up and down or (b) the number remains the same but the cluster exemplars change over iterations. With this in mind, at iteration t we use two indicator variables: \(\alpha _t=1\) if the number \(K_t\) of cluster exemplars is not less than the number \(K_{t-1}\), and 0 otherwise; \(\beta _t=1\) if the intersection of the set \(E_t\) of cluster exemplars at iteration t with \(E_{t-1}\) is empty, and 0 otherwise. Mathematically, they are expressed as follows:

$$\begin{aligned} \alpha _{t} = \left\{ \begin{array}{ll} 1, &{} \text {if } K_{t} \ge K_{t-1},\\ 0, &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(13)
$$\begin{aligned} \beta _{t} = \left\{ \begin{array}{ll} 1, &{} \text {if } |E_{t} \cap E_{t-1}| = 0,\\ 0, &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(14)

Based on the above two conditions we determine whether the oscillation occurs \(O_{t} = 1\) or not \(O_{t} = 0\) at iteration t as follows:

$$\begin{aligned} O_{t} = \left\{ \begin{array}{ll} 1, &{} \text {if } \alpha _{t} = 1 \text { and } \beta _{t} = 1,\\ 0, &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(15)

Compared with the oscillation detection mechanism in AAP, which only monitors the change in the number of clusters over time, the proposed one is not only computationally efficient but can also prevent false detections of oscillation, as a changing number of cluster exemplars does not necessarily mean that oscillation has occurred (e.g., the clustering may undergo temporary adjustments in order to find better clustering results).
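The detector reduces to a constant-time comparison of two consecutive iterations; a minimal sketch (with our own variable names) is:

```python
def oscillation_detected(K_t, K_prev, E_t, E_prev):
    """Eqs. (13)-(15): flag oscillation only when both indicators fire."""
    alpha = 1 if K_t >= K_prev else 0                      # Eq. (13)
    beta = 1 if len(set(E_t) & set(E_prev)) == 0 else 0    # Eq. (14)
    return 1 if alpha == 1 and beta == 1 else 0            # Eq. (15)
```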

Fig. 3. The flowchart of the proposed EAP algorithm.

3.2 Updating p and \(\lambda \)

If an oscillation is detected as described in the above subsection, i.e., \(O_{t} = 1\), then EAP increases the \(\lambda \) value slightly as shown in Eq. (7) and updates the message matrices iteratively until the algorithm converges.

If the oscillation persists even when \(\lambda \ge 0.85\), then updating the preference value helps to avoid the oscillation and leads to an optimal clustering. Here, we set the initial preference values as in Eq. (12), since we presume that, rather than a common arbitrary p value for every object, the similarity-distribution-based \(p_{i}\) values for each object produce better clustering even at the initial stage.

The p value is adaptively updated by:

$$\begin{aligned} p_t(x_{i}) = p_{t-1}(x_{i})+0.01 * p_{t-2}(x_{i}) \end{aligned}$$
(16)

where \(p_t\) is the p value at iteration t; since the preference values are negative (the similarities are negative real values), it decreases gradually in each iteration. After updating the preference value, the algorithm moves to the message-update stage and continues to produce different clustering exemplars. At this point, the algorithm checks for an optimal clustering by evaluating the Silhouette score of Eq. (18). If the score is higher than the one for the previous solution, then the algorithm continues with the same preference values. Otherwise, the preference values are updated according to Eq. (16), and this process continues until a better Silhouette score is achieved or the least possible number of clusters (\(K = 2\)) is produced.
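The two decision rules of this subsection can be sketched as follows (our paraphrase of Fig. 3 and Eq. (16), not the authors’ code):

```python
def eap_update_p(p_t1, p_t2):
    """Eq. (16): p_t = p_{t-1} + 0.01 * p_{t-2}, applied per object.
    With negative preferences this decreases p gradually."""
    return p_t1 + 0.01 * p_t2

def eap_should_stop(sil_t, sil_prev, K):
    """Stop once the Silhouette score (Eq. (18)) improves over the
    previous solution or the minimum of K = 2 clusters is reached."""
    return sil_t > sil_prev or K <= 2
```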

4 Empirical Study

4.1 The Data Set

The star-schema subset of the DBLP dataset (dblp.uni-trier.de/xml/) is used to evaluate the performance of the proposed method. The subset covers four research areas: database, data mining, information retrieval and artificial intelligence, where five representative Conferences are extracted in each area along with the relevant Papers, Authors and Topics. More specifically, this subset contains the following numbers of objects: 14,376 Papers (P), 14,475 Authors (A), 20 Venues/Conferences (C) and 8,920 Terms (T), with a total of 170,794 links connecting these objects. In our experiments, 1000 Author objects are randomly selected, and the other types of objects linked with these Authors are used for training. The DBLP subset with 20% labelled data (seeds) is used for testing the different clustering methods.

Unlike traditional attribute data sets, the features of objects in an HIN cannot be explicitly specified. Therefore, the possible meta-paths connecting Author objects in DBLP, namely A-P-A, A-P-C-P-A and A-P-T-P-A, are considered as features. Based on these relations, different similarity matrices of Authors are evaluated using Eq. (1), and the similarity matrices are tested by their performance in clustering tasks.

4.2 Evaluation Metrics

Silhouette Coefficient. This is an internal validation metric for evaluating clustering. The Silhouette score illustrates how close the objects are within a cluster (cohesion) and how distant an object is from the other clusters (separation). For a given dataset \(X = \{x_{1},x_{2},\cdots ,x_{n}\}\), the Silhouette score of each object \(x_{i}\) is defined as follows:

$$\begin{aligned} Sil(x_{i}) = \frac{b(x_{i}) - a(x_{i})}{max \{a(x_{i}),b(x_{i})\}} \end{aligned}$$
(17)

where \(a(x_{i})\) is the average distance from object \(x_{i}\) to the other objects in the same cluster and \(b(x_{i})\) is the minimum average distance from object \(x_{i}\) to the objects in a different cluster. The score ranges from \(-1\) to 1, where a Silhouette score of 1 indicates that the object is well clustered, 0 means that clusters overlap, and \(-1\) means that objects are grouped in the wrong clusters. The Silhouette Index for the data set X is expressed as follows:

$$\begin{aligned} Sil(X) = \frac{\sum _{i=1}^{n}(Sil(x_{i}))}{n} \end{aligned}$$
(18)
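As a practical note (our suggestion, not part of the paper), scikit-learn implements Eqs. (17)–(18); with PathSim similarities in [0, 1] one can pass \(1-S\) as a precomputed distance matrix:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy stand-ins: a symmetric PathSim-like similarity matrix and cluster labels.
rng = np.random.default_rng(42)
S = rng.uniform(size=(10, 10))
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)
labels = np.array([0] * 5 + [1] * 5)

D = 1.0 - S                        # similarity -> dissimilarity
np.fill_diagonal(D, 0.0)
print(silhouette_score(D, labels, metric="precomputed"))   # mean of Eq. (17)
```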

Normalized Mutual Information (NMI). This measures the degree of agreement between the resulting clusters and the ground truth data. The NMI score is calculated as below:

$$\begin{aligned} NMI = \sum _{c=1}^{k} \sum _{j=1}^{q} N_{c}^{j}log \left( \frac{N \cdot N_{c}^{j}}{N_{c} \cdot N_{j}}\right) / \sqrt{\left( \sum _{c=1}^{k} N_{c} log \frac{N_{c}}{N}\right) \left( \sum _{j=1}^{q} N_{j} log \frac{N_{j}}{N}\right) } \end{aligned}$$
(19)

where N is the total number of objects, \(N_{c}\) and \(N_{j}\) are the numbers of objects in the c-th resulting cluster and the j-th true cluster, respectively, and \(N_{c}^{j}\) is the number of objects common to c and j.
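For reference, scikit-learn’s implementation with geometric averaging matches the square-root denominator of Eq. (19); a usage sketch with made-up labels:

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth research areas
pred_labels = [0, 0, 1, 2, 2, 2]   # hypothetical clustering output
nmi = normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="geometric")
print(round(nmi, 3))
```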

4.3 Clustering Analysis

Silhouette Score Based Clustering Validation. The proposed algorithm is compared with AP, AAP, and MAP on clustering Author objects. Table 1 provides the clustering results for 1000 Authors based on the three meta-paths for the four algorithms. From Table 1, we can see that the AP algorithm leads to inappropriate results: the given DBLP subset contains only 4 groups of Authors, whereas AP clustering produces a large number of clusters and results in negative Silhouette scores. Although the AAP algorithm produces better clustering results, it requires more iterations to converge than the other algorithms. It produces numerous different clustering results, and after analysing them all we obtain the smallest numbers of clusters (2, 3, 3), which result in high Silhouette scores. The clustering results based on MAP are moderate in both convergence and cluster number, but it fails to update the preference to find an optimal clustering. The Silhouette score of the EAP algorithm is high compared to the others, and it produces an appropriate clustering result. The following Silhouette plots show the distribution of Author objects in 4 clusters obtained using the EAP algorithm. As mentioned, each meta-path based similarity measure produces different clustering results; hence, we studied how the EAP algorithm works with these three meta-paths.

Table 1. Silhouette scores of different clustering methods (best values are highlighted in boldface)
Fig. 4. Silhouette plots of EAP clustering based on 3 Meta-Paths

As presented in Fig. 4, most of the objects are grouped in the wrong clusters in the APA-based clustering, and the average Silhouette score of the objects is low. The APTPA path acquires a better average score than APA but does not separate the objects evenly across the clusters. The APCPA path produces a better result than the other two paths since it achieves the highest average Silhouette score. Also, the thickness of the clusters shows that the objects are separated with less variation, although certainly some objects are still grouped in the wrong clusters.

NMI Score Based Clustering Validation. The effectiveness of the enhanced AP approach is further analysed by the NMI score, and the results are compared with the existing clustering approaches. The NMI scores of the four algorithms based on the three meta-paths are depicted in Table 2. We observe that the EAP clustering results are the most consistent with the given ground truth of the DBLP dataset. AAP produces results close to those of EAP, which indicates that it can also work well on HIN datasets, but it takes a longer time to compute. Finally, MAP reveals better performance than the original AP but obtains the lowest score among the three improved methods, because there is a large variation between the obtained exemplars and the given cluster labels.

Table 2. NMI score of different clustering algorithms (best values are highlighted in boldface)

5 Conclusion and Future Work

In this paper, we propose an enhanced Affinity Propagation clustering method that provides promising clustering performance without losing the rich information among multi-typed objects. To evaluate the performance of our method, we first construct the similarity matrices between 1000 Author objects based on various meta-path relations over a subset of the DBLP dataset, to show how objects of a specific type (Author) are related to each other in different aspects. Thereafter, the similarity matrices are fed into the Affinity Propagation algorithm to produce clusterings of the Author objects. From the obtained Silhouette scores, we see that each path provides different results depending on the link count: the APCPA path has more links between Author objects and achieves better results than the APA and APTPA paths. Meanwhile, the proposed approach provides considerably more efficient convergence in AP clustering by gradually updating the values of the preference and damping factor with a simple oscillation detection strategy. Finally, we examine the quality of different clustering algorithms on DBLP. The results show that the proposed algorithm can provide valid clustering results on HIN data sets.

In the future, we will continue our experiments by calculating similarity scores based on combined meta-paths [13, 14] associated with various weights, in order to preserve the diverse semantics in a single similarity matrix. Furthermore, we will investigate other strategies for optimizing the preference value to speed up convergence and improve the efficiency of the enhanced Affinity Propagation algorithm.