(PDF) Evaluating authoritative sources using social networks: an insight from Wikipedia | Nikos Korfiatis - Academia.edu
The current issue and full text archive of this journal is available at www.emeraldinsight.com/1468-4527.htm OIR 30,3 252 Accepted March 2006 Evaluating authoritative sources using social networks: an insight from Wikipedia Nikolaos Th. Korfiatis Department of Informatics, Copenhagen Business School (CBS), Copenhagen, Denmark, and Marios Poulos and George Bokos Department of Archives and Library Sciences, Ionian University, Corfu, Greece Abstract Purpose – The purpose of this paper is to present an approach to evaluating contributions in collaborative authoring environments, and in particular, Wikis using social network measures. Design/methodology/approach – A social network model for Wikipedia has been constructed, and metrics of importance such as centrality have been defined. Data has been gathered from articles belonging to the same topic using a web crawler, in order to evaluate the outcome of the social network measures in the articles. Findings – Finds that the question of the reliability regarding Wikipedia content is a challenging one and as Wikipedia grows, the problem becomes more demanding, especially for topics with controversial views such as politics or history. Practical implications – It is believed that the approach presented here could be used to improve the authoritativeness of content found in Wikipedia and similar sources. Originality/value – This work tries to develop a network approach to the evaluation of Wiki contributions, and approaches the problem of quality Wikipedia content from a social network point of view. Keywords Social networks, Encyclopaedias Paper type Research paper Online Information Review Vol. 30 No. 3, 2006 pp. 252-262 q Emerald Group Publishing Limited 1468-4527 DOI 10.1108/14684520610675780 1. Introduction and background Since, the invention of writing as a method of encoding human knowledge, the preservation and dissemination of information and knowledge has become a matter of great importance to humanity. People, as intelligent entities, produce and consume information that is preserved in, and accessed from, various sources such as books, articles and encyclopaedias. Since, a high level of complexity characterizes the organization of information, reference works that assist the retrieval of relevant information resources are crucial for dissemination and further development of human knowledge in a particular subject. Encyclopaedias and dictionaries represent the major instances of such reference works, since their principal scope is to assist, through associative trailing, the retrieval of the relevant resources through a particular domain (collection of relevant lemmas). Visions of the world wide web such as the Memex, envisioned by Bush (1945), and the original intuition behind the design of the world wide web by Berners-Lee et al. (1994), tend to represent the world wide web as a huge encyclopaedia, where lemmas are associated by using a hypermedia model. Nonetheless, an encyclopaedia and any other kind of reference work is subject to evaluation as to the level of its quality characterizing. Since, the production of such works is subject to a very small number of individuals, the development process is characterized by high complexity. Efforts in the world wide web such as Wikipedia, try to distribute that kind of complexity to several contributing authorities by allowing the synchronous editing and publication of lemmas through its publication model. However, since its inception Wikipedia has been subject to criticism (Fasoldt, 2004; Orlowski, 2005; Lipczynska, 2005), due to what level the information contained can be trusted and referenced in research works. In that case, models of credibility, which are used extensively on search engine research and information retrieval, can be used to evaluate the trustworthiness of the topic covered by Wikipedia. On the web, several successful approaches to credibility such us PageRank (Brin and Page, 1998; Page et al., 1999) use methods derived from graph theory to model credibility, which utilize the connections of the resource for evaluation. Several graph theoretic models of credibility and text retrieval (Faloutsos, 1985) rely on the consideration of the in-degree of the node (the sum of the incoming arcs of a node in a directed graph) to extract importance and trustworthiness. This is also implied by the publication workflow and the resulting context on which those models are based. For instance, on the world wide web and similar hypermedia systems, such models of credibility evaluate a web page using the in-degree extracted by the hypertextual context. However, there are publication models supporting social activities (e.g. collaborative authoring) that derive much of their credibility from their productions (e.g. authorship), where the hyperlink context does not depict that kind of activity. In that case, the in-degree cannot provide input to evaluate the importance of that entity; therefore, a holistic approach is needed. This alternative evaluation has to consider the outputs of the entity (productions), as happens with several informal social communication models (Festinger, 1950). In a graph theoretic interpretation this can be modelled as the outer-degree of the node, which conceptually represents the entity evaluated. 2. The Wiki publication model The web has given rise to new forms of collaboration and interaction facilitating the manipulation of shared artefacts and information spaces (Cadiz et al., 2000). In the current state, the web ecosystem consists of resources (web pages/ files) linked though hypertext connectors, thus forming a system of links denoting references to those resources as well as providing views to requesting authorities. However, one of the initial design goals of the web was not only to facilitate views of the resources requested but also allow editing and annotation of these resources in a simple way (Berners-Lee and Fischetti, 1999). The concept of Wiki (Leuf and Cunningham, 2001) has given a response to this challenge. WikiWiki (Hawaiian for “quick”) applications facilitate a way of collaborative editing, supported by a revision mechanism that allows the monitoring of changes and contributions to the sections edited. The use of WikiWiki applications is common in cases such as formation of collaborative document editing (e.g. in communities of practice), or formation of shared knowledge repositories such us dictionaries, etc. One of the best-known implementations of the usefulness of the Wiki system to support collaborative document editing is the Wiki-based encyclopaedia Wikipedia and its related projects (www.wikimedia.org). Evaluating authoritative sources 253 OIR 30,3 254 Figure 1. Growth of articles in accordance with the growth of users in the English Wikipedia Traditional encyclopaedias such us Britannica are often characterized by a high level of credibility by domain experts, taking into account the background process that has resulted – domain authorities contribute to the final outcome. Alternatively, since it uses the WikiWiki system, Wikipedia allows the editing and creation of encyclopaedic articles by anyone who wishes to contribute. Its primary aim is to provide free editing access and gather knowledge representing the consensus of the term presented, and thus not to evaluate the contributing authorities. However, as the content increases along with the contributing sources (Figure 1), a critical issue has emerged regarding the credibility of Wikipedia as an authoritative reference source (Andrew et al., 2005; Lih, 2004). The issue is extended not only to the outcome (article) but also to the process of shaping the article, in which a contributor would allow another authority to submit, change or delete a contribution accepted or not accepted by him/her. Wikipedia has internal mechanisms of managing those cases such as a permission ranking system, where a contributor is accredited by the level of participation in the shaping of the article, as well as a discussion tab on most of the articles or notifications and warnings regarding the content. Nevertheless, the research question looks at how to provide a clue to the credibility for an article based on the contributing authorities, and their acceptance by the community of their fellow contributors. Values on X-axis represent the articles on logarithmic scale. Values on Y-axis represent the number of contributors. (Statistics obtained from http://en.wikipedia.org/ wikistats/EN/TablesWikipediaEN.htm). In this paper we present an initial attempt to model the problem towards providing an authoritative ranking mechanism based on social interaction data collected through the Wiki. Social interaction is approached from social communication facilitated by the Wikipedia platform (e.g. edits on edits) (Cobley, 1996). We then model the credibility of each contributor using the metric of centrality, thus producing an overall centrality measure for the article depicting the social activity/process that has taken place through the shaping of the article. We argue that this factor can be used as a metric of credibility, representing the article and the contributing authorities. 3. A network approach on the Wiki publication model Social networks and social network analysis in particular (Wasserman et al., 1994; Scott, 2000), is a research paradigm that tries to unravel patterns of social relationships across various individuals in a social context. Following the patterns and measuring structural and compositional values in the networks, we ought to understand the basic structure and properties of the network and explain its behaviour; thus, uncover those actions that characterize most of the activity described by the network. Social network analysis focuses on a more rationalistic approach to research on organizations and social groups (Borgatti and Foster, 2003), since it aims to expand interdependent relations between the members of the group. WikiWiki applications facilitate a case where social relationships are established over a domain of social actions such as acceptance, objection or rejection of a contribution. As in the case of Wikipedia, the Wiki facilitates a collaborative document editing effort, which relies on the contribution of multiple authors in a concurrent system. This enables the combination of contributions in an effective and democratic way, allowing all the ground knowledge about the article/lemma to be present in the most recent revision of the article. By democratic, we also refer to the ability of anyone who uses the Wiki to contribute or to make modifications to content contributed by someone else. In that sense, as the Wiki-fication continues, the final document (or the most recent revision) is the outcome of a community process involving certain social interactions embedded in the content modification. From a social research point of view, what makes such a case interesting is the negotiation process that takes place when writing and structuring the article. For example, a user makes a contribution that is erased, and the user tries to establish its contribution back (to make it visible and accepted by the others). In both cases (article and negotiation), there are interaction ties characterizing the final outcome and the dynamics of the process. In this paper, we will focus on the interaction ties between multiple contributors working on the same article or domain of articles in the Wikipedia namespace. However, to do this kind of study, we first need to define the structural and compositional variables which characterize such a network. In the Wiki publication model, we can see the necessary structural and compositional variables important in the construction of a social network of contributors to an article or topic in the Wikipedia. Structural and compositional properties of the publication model can be found in the following use-cases: . When a contributor edits content submitted by someone else it establishes a tie with him/her. This is depicted by an acceptance factor, which represents the percentage of the previous contributor’s content that is visible after. . Every contributor who has a single contribution or more to the article establishes a relational tie with the other content contributors. Evidence of participation in common projects strengthens this tie. Evaluating authoritative sources 255 OIR 30,3 256 We can also link actors through two different layers of networks (Figure 2): (1) The articles network. Every article in the Wikipedia contains references to other articles as well as external references. A set of links used for classification purposes is also available in most of the active articles of the encyclopaedia. Every article represents a vertex in the article network and the internal connections between the article edges of the network. (2) The contributors network. Wikipedia is a collaborative writing effort, which means that an article has multiple contributors. We assume that a contributor establishes a relationship with another contributor if they work on the same article. In the resultant signed network, a vertex represents each contributor, and their social ties (positive or negative) are represented by an edge denoting the sequence of their social interaction. The resultant graph is a two-mode network with two sets of entities: articles and contributors. Contributors can be either connected (belong to the same article) or interconnected (common contributions on two or more articles in the same domain). In an article domain of high credibility it is expected that more interrelations will be found, since the contributors may contribute content to more than one article, thus depicting their common interest. Therefore, the more affiliated a contributor becomes with a domain, the more interested he/she is in the article; thus representing knowledge of the domain. Let us consider, for instance, a contributor who has made a lot of contributions to the domain regarding the history of Spanish colonies in Latin America. The author has also completed some contributions to an article on anatomy. However, the author is more affiliated to the articles regarding the history of the Spanish colonies than to medicine. Therefore, his contribution to medicine may be considered less authoritative than those contributions made in the other domain, as his knowledge of anatomy is not as extensive as his knowledge of Spanish colonies. In social network analysis, there are a variety of measures that can assess this kind of social activity in a sociometric study. As we have already defined our graph, we can use some common social network metrics to extract this kind of information from Wikipedia data. Figure 2. Network layers in the Wiki publication model. Contributors are linked together by working on common projects (articles) in the Wikipedia namespace 4. Network measures in the Wikipedia contributions As previously mentioned, contributors contribute to one or more articles belonging to the same or different domains. Based on this, we can evaluate the activity of the contributors and thus extract metrics of their presence in the domain of the articles. The development of those metrics is based on the following assumptions: . The more decentralized the editing of an article, the better the article represents a consensus. . The contributors whose content has been most accepted (seen from the result of the diff operation in the Wiki) are attributed a level of authority regarding the article. . This level of authority remains only in the domain of the article. However, domains that belong to the same topic retain the level of authority for a contributor. The graph we model is a signed directed network, with arcs as a factor depicting the level of acceptance of the content submitted by contributor A and accepted by contributor B. In order to model the authoritativeness of contributors, we selected the centrality index (Freeman, 1979; Sabidussi, 1966) of the resultant graph and, in particular, a measure of centrality dealing with the degree (total of incoming, outgoing edges) of the vertex/contributor in the examined article. The concept of centrality is well accepted in social network analysis, as there are numerous studies showing the usefulness of such a metric for measuring activity in social networks (Everett and Borgatti, 1999; Freeman, 1979). In sociometric studies, the usage of centrality is targeted to unfold the person/individual who is the most prominent in a network, thus ranking the actors according to their positions in the network; and is interpreted as the prominence of actors embedded in a social structure. In our study, we use the degree centrality index, which is the simplest definition of centrality and is based on the incoming and outgoing adjacent connections to other contributors in an article graph. To measure the centrality at an individual level, we define the contributor degree centrality; and to an article level, the article degree centralization, which represents the collective of contributor degree centrality. 4.1 Contributor degree centrality In classical social network models, the inner degree (the amount of edges coming into a node) represents the choices the actor has over a set of other actors. However, in our Wiki network model, the amount of incoming edges represents edits to the text; therefore, the metric of inner degree is the opposite, meaning that the person with the biggest inner degree has the biggest amount of objection/rejection in the contributor community, and thus receives a kind of negative evaluation from his/her fellow contributors. On the other hand, the outer-degree of the vertex represents edits/participation in several parts of the article, and thus gives a clue to the activity of the person in relation to the article and the domain. Mathematically, we can represent such formalism as follows: considering a graph representing the network of contributors for an article contributed in the Wiki, the contributor degree centrality – a contextualized expression of actor degree centrality – is a degree index of the adjacent connections between the contributor and others who edit the article. From graph theory, the outer degree of a vertex is the cumulative value of its adjacent connections: Evaluating authoritative sources 257 OIR 30,3 X C D ðni Þ ¼ dðni Þ xij 258 The adjacent xij represents the relational tie between the contributors and their contribution over the domain of the article. This also is characterized by the visibility of the contribution in the final article and can be either 1 or 0. To provide the centrality, we divide the degree with the highest obtained degree from the graph, which in graph theory is proved to be the number of remaining vertices (g) minus the self (g-1). Therefore, the contributor degree centrality can be calculated as: j C 0D ðni Þ ¼ dðni Þ g21 4.2 Article degree centralization We define an article’s degree centrality CDM as the variability of the individual contributor centrality indices. The C D ðn * Þ represents the largest observed contributor degree centrality:  g  X C D ðn * Þ 2 C D ðni Þ C DM ¼ i ðg 2 1Þðg 2 2Þ Again we divide the variability with the highest variability observed in the graph. Having defined the metrics, we apply them and explain their qualitative values in a case study of the English language Wikipedia. 5. An insight from Wikipedia As previously mentioned, Wikipedia follows a hypermedia model to categorize the articles (lemmas) in an associative taxonomy. In that particular taxonomic classification, we define the following structures: . Domain. A collection of articles that tackle a common subject (e.g. philosophy). . Category. A collection of domains that have a common categorical and etymological root. For example, the domains philosophy and economics have a connection in the category of social sciences. In order to provide a qualitative analysis of the metrics deployed in the evaluation of the articles, we picked ten articles with a similar number of contributors to the domain philosophy from the English language Wikipedia. Table I shows the list of articles used in the case study, as well as the values of their article degree centralization. The data for each article was collected using the python Wikipedia robot framework (www.pywikipedia.org). The resulting networks contained an average of 259 contributors per article, and the average article inter-relations per contributor were approximately two. We used the diff function of the Wiki to assess the tie between the pair of contributors as modelled in Section 2. The data was then analysed using a python script, in order to calculate the individual contributor degree centrality along with the article degree centralization. Article name Adam Smith Aristotle Immanuel Kant Johann Wolfgang von Goethe John Locke Karl Marx Ludwig Wittgenstein Philosophy Plato Socrates Number of contributors Article degree centrality (max 1) 276 274 231 242 292 232 220 280 284 289 0.039114 0.0232 0.20484 0.016682 0.008581 0.006601 0.006328 0.00254 0.001207 0.000405 As can be observed from the table, the article degree centralization is relatively low because of the small collections of articles used in the case study and the inter-connections of the actors in the domain. However, it is enough to let us discuss some qualitative interpretations: . The dispersion of the actor indices denotes how dependent this article is on individual contributors. For instance, if an article has a very low degree of centralization, then it means that the social process to shape it was highly distributed, thus resulting in an article that has been submitted by multiple authorities. In our case, the articles represent a low degree of centralization, which means that contributions have been made by individuals who have interests in other domains as well. . The range of the group degree centralization reflects the heterogeneity of the authoring sources of the article. In our case, the article “Immanuel Kant” has a significantly higher degree of centralization (Figures 3 and 4 and Table II), which means that it has been contributed to by authorities concentrated in the domain of the article, and thus who have contributed to other articles. Contributors with higher inter-relation over the same domain represent higher authorities, based on the assumption that their interest spans the domain to which the article belongs and, therefore, have conducted background research regarding the material they have contributed. On the other hand, contributors with lesser authority tend to have their content erased/objected by contributors with higher authority. As can be observed from Figure 3, there exist a number of contributors subject to objections regarding their submissions, and therefore, they are situated on the periphery, whereas contributors with accepted contributions (authorities) tend to be in the centre. 6. Discussion and further research The question of the reliability regarding Wikipedia content is a challenging one. As long as the size of Wikipedia grows, the problem becomes more demanding, especially for topics with controversial views such as politics or history. Our study represents an early attempt at getting to the problem and thus working towards a more sophisticated solution to address it. However, there are a number of open issues that can extend the merit of this report. Evaluating authoritative sources 259 Table I. The Wikipedia articles from which empirical data was gathered OIR 30,3 260 Figure 3. Visualization of the social network of the contributors for the article “Immanuel Kant”. Nodes in the core denote high degree centrality Figure 4. A decomposition of the network to the contributors with the highest degree centrality for the article “Immanuel Kant” The in-degree can be calculated using a more sophisticated factor, representing how much of the text contributed by one actor has been edited by another. In our case, we represent the editing or the objection by using a scale from 0 to 1, thus aggregating the factors using simple sums. A fuzzy operator could provide a solution for aggregating the results obtained by undertaking a fuzzy diff between the current version of the Cluster (outerdegree) Freq Freq per cent CumFreq CumFreq per cent 1 2 4 6 8 10 12 16 18 20 31 1 199 14 6 3 1 2 2 1 1 1 0.4329 86.1472 6.0606 2.5974 1.2987 0.4329 0.8658 0.8658 0.4329 0.4329 0.4329 s 1 200 214 220 223 224 226 228 229 230 231 0.4329 86.5801 92.6407 95.2381 96.5368 96.9697 97.8355 98.7013 99.1342 99.5671 100 Representative 65.6.92.153 82.3.32.71 80.202.248.28 Snowspinner Tim Ivorson StirlingNewberry 24.162.198.123 JimWae Jjshapiro SlimVirgin Amerindianart article and the version submitted. In that case, the social tie also needs to be expressed in terms of fuzziness, along with the relevant cases. Expressions of credibility using imprecise criteria (Sicilia and Garcı́a, 2004, 2005) can also contribute to further advancement in that direction. The organization of topics and the definition of inter-connections is also a matter for research, since there are related domains with contributing authorities. For instance, in the category of the social sciences, a contributor who edits the article of Adam Smith and has an acceptance factor can be retained on both the economics and philosophy domains, as an article about Adam Smith is represented in both. In that case, network modelling using two layer networks (document reference, authority reference) can enhance the trust of the contributions (Hess et al., 2006). Furthermore, the measures developed and presented in this report do not actually measure the subjective quality of an article, since such a task is a cognitive process characterized by a high level of complexity. Those measures can contribute to providing an indicator of consensus related to an article, and thus assert that it does not provide controversial views or expressions of a small group of persons (especially in articles with political content). Thus, a level of neutrality expressed in the writing of the article is asserted. Finally, specific attention should be given to the diffusion of different affiliations related to one actor. For example, a contributor may have many affiliations to unrelated subjects. This may imply that the contributor has knowledge of both fields, but in special topic cases (e.g. cardiology), the contribution in subjects such as Renaissance can be attributed as a non-expert one. Therefore, a classification of the competencies of each contributor may need to be promoted to strengthen their credibility and association with the subject or the article contributed. References Andrew, L., Jakob, V., Cathy, M., Samuel, K. and Reinhold, H. (Eds) (2005), Proceedings of Wikimania 2005 – The First International Wikimedia Conference. Berners-Lee, T. and Fischetti, M. (1999), Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its Inventor, Harper, San Francisco, CA. Berners-Lee, T., Masinter, L. and McCahill, M. (1994), Uniform Resource Locators (URL), RFC 1738. Evaluating authoritative sources 261 Table II. Contributor degree centrality for the article “Immanuel Kant” OIR 30,3 262 Borgatti, S.P. and Foster, P.C. (2003), “The network paradigm in organizational research: a review and typology”, Journal of Management, Vol. 29 No. 6, pp. 991-1013. Brin, S. and Page, L. (1998), “The anatomy of a large-scale hypertextual web search engine”, Proceedings of the Seventh International Conference on World Wide Web, available at: www-db.stanford.edu/%7Ebackrub/google.html, pp. 107-17. Bush, V. (1945), “As we may think”, The Atlantic Monthly, July. Cadiz, J.J., Gupta, A. and Grudin, J. (2000), “Using web annotations for asynchronous collaboration around documents CSCW ’00”, Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pp. 309-18. Cobley, P. (1996), The Communication Theory Reader, Routledge, London. Everett, M.G. and Borgatti, S.P. (1999), “The centrality of groups and classes”, Journal of Mathematical Sociology, Vol. 23 No. 3, pp. 181-201. Fasoldt, A. (2004), “Librarian: don’t use Wikipedia as a source”, Syracuse Post Standard, 25 August. Faloutsos, C. (1985), “Access methods for text”, ACM Computing Surveys, Vol. 17 No. 1, pp. 49-74. Freeman, L.C. (1979), “Centrality in social networks: conceptual clarification”, Social Networks, Vol. 1 No. 3, pp. 2152-39. Festinger, L. (1950), “Informal social communication”, Psychological Review, Vol. 57 No. 5, pp. 271-82. Hess, C., Stein, K. and Schlieder, C. (2006), “Trust-enhanced visibility for personalized document recommendations”, Proceedings of the 21st ACM Symposium on Applied Computing, Dijon, France. Leuf, B. and Cunningham, W. (2001), The Wiki Way: Collaboration and Sharing on the Internet, Addison-Wesley, Reading, MA. Lih, A. (2004), “Wikipedia as participatory journalism: reliable sources? Metrics for evaluating collaborative media as a news resource”, Proceedings of the International Symposium on Online Journalism. Lipczynska, S. (2005), “Power to the people: the case for Wikipedia”, Reference Reviews, Vol. 19 No. 2. Orlowski, A. (2005), “Wikipedia science 31% more cronky than Britannica’s”, The Register. Page, L., Brin, S., Motwani, R. and Winograd, T. (1999), “The PageRank citation ranking: bringing order to the web”, Technical Report, Stanford Digital Libraries Project. Sabidussi, G. (1966), “The centrality index of a graph”, Psychometrika, Vol. 31, pp. 581-603. Scott, J. (2000), Social Network Analysis: A Handbook, 2nd ed., Sage, London. Sicilia, M.A. and Garcı́a, E. (2004), “Fuzzy group models for adaptation in cooperative information retrieval contexts”, Lecture Notes in Computer Science 2932, Springer, New York, NY, pp. 324-34. Sicilia, M.A. and Garcı́a, E. (2005), “Filtering information with imprecise social criteria: a FOAF-based backlink model”, Proceedings of the Fourth Conference of the European Society For Fuzzy Logic and Technology (EUSLAT). Wasserman, S., Faust, K. and Iacobucci, D. (1994), Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences), Cambridge University Press, Cambridge. To purchase reprints of this article please e-mail: reprints@emeraldinsight.com Or visit our web site for further details: www.emeraldinsight.com/reprints