1 Introduction

Wikipedia has long presented itself as “the biggest multilingual free-content encyclopedia on the Internet” (quoted in Ensslin 2011: 555, note 1). It now brands itself as “the encyclopedia that everyone can edit,” but the multilingual and global ambitions remain explicit in the phrasing used in the opening of the Wikipedia article about Wikipedia:

Wikipedia […] is a multilingual, web-based, free encyclopedia based on a model of openly editable content. It is the largest and most popular general reference work on the Internet. (https://en.wikipedia.org/wiki last accessed 15 June 2018)

In the same vein, Wikipedia founder and original funder, Jimmy Wales, is known for having expressed the ambition that the encyclopedia should be “the sum of all knowledge” (a title used in Salor’s dissertation in 2012). Or as the vision of the Wikimedia Foundation proclaims:

Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment. (https://wikimediafoundation.org/about/vision/ last accessed 15 September 2018)

All these declarations suggest that Wikipedia is a particularly fascinating website for studying how the Internet shapes the new language map. This chapter does so by bringing a political geography perspective to Wikipedia studies. It does not deal with linguistic aspects as such but with geographical aspects of languages (in the plural), linguistic diversity, and multilingualism. While Chap. 198, “Writing the World in 301 Languages: A Political Geography of the Online Encyclopedia Wikipedia” (Mamadouh, this volume), presents a short history of Wikipedia and a short account of its organization before discussing how it circulates (disputed) geopolitical representations unevenly across the world, this chapter focuses on Wikipedia as a multilingual project. It discusses how linguistic diversity and multilingualism are represented and enhanced (Wikipedia as mirror of global linguistic diversity), how multilingualism is practiced by multilingual Wikipedians, what language policies have been developed for the establishment of a new language version, how languages are sometimes contested (Wikipedia as microcosm of global linguistic diversity), and how Wikipedia might affect the evolution of specific languages and the relations between language groups (Wikipedia as motor of global linguistic diversity).

2 Linguistic and Political Geographies of Wikipedia

Since its creation in 2001 as an English-language online encyclopedia using wiki technology (a website enabling users to work collaboratively to modify its content and structure), Wikipedia has expanded dramatically, becoming the fifth most visited website in the world (according to the authoritative webwatcher Alexa Internet) and, foremost, a multilingual project with almost 300 language versions. Each version is called a Wikipedia in everyday language, so that Wikipedia is actually a collection of monolingual Wikipedias. The oldest and largest one, the English Wikipedia, is said to have surpassed in word count the most voluminous encyclopedia of all time, the Spanish-language Enciclopedia Espasa, in December 2005 (Van Dijk 2009: 234). A few months later, in March 2006, the English Wikipedia reached the symbolic threshold of one million articles, followed by German (December 2009), French (September 2010), Dutch (December 2011), and 11 others since, most recently Chinese (April 2018) and Portuguese (June 2018).

Linguists have long tapped into Wikipedia to generate corpora for their research on single languages and on crosslinguistic communication. Sociolinguists have used it as a valuable source to study transcultural encounters. Most of the work is quantitative, using large data dumps to search for editing patterns. Computer scientists use them to trace networks and patterns of meaning, sometimes explicitly looking for an Ur-Wikipedia, i.e., the proto-Wikipedia, behind the diversity of the different versions (Warncke-Wang et al. 2012, see also Bao et al. 2012). Other studies are more qualitative. Kopf (2018), for example, applies critical discourse analysis to the discussions on the talk page of the article on the European Union in the English Wikipedia between 2001 and 2015 to analyze how the EU was perceived and represented and how Wikipedians negotiated contested issues.

In this chapter languages are not studied for their linguistic characteristics but for their sociospatial ones. The objective is to discuss the way Wikipedia engages with global linguistic diversity from a geographical perspective. For geographers, languages have spatial characteristics, not only regarding the distribution of their speakers but also in the way they are employed to make sense of the world and the ways people use them to contribute to place-making. As Claude Raffestin (1995, 2012) documented, language and territory are co-constitutive. Informed by political geography and critical geopolitics, this study is sensitive to the way languages are used politically, for example, to justify territorial claims and to make sense of lived territory and place-making. Territorial claims of states or nationalist movements are often justified by the foregrounding of language differences and/or similarities, and language claims can be instrumental to territorial projects: promoting a specific language variety as a language can be part of the project to achieve political autonomy or independence from the state in which one resides. On the other hand, states can promote one language as the national language and discourage or even forbid others in order to culturally homogenize their population and foster a sense of national identity rooted in a common language. Language conflicts (which are of course not conflicts between languages and often are not about languages as such but about conflicts between language groups and about power and resources) have been extensively studied by political geographers such as Claude Raffestin, Colin H. Williams, and Alexander B. Murphy.

By contrast, the representation of linguistic diversity is an understudied topic in geography, and linguistic diversity is commonsensically represented as a juxtaposition of monolingual territories. Monolingualism is taken for granted as the default, and it is far too often perceived and represented as the norm, both at the individual and the collective level, while multilingualism is seen as a complication or even a danger for sanity and cohesion. In Europe, where the modern territorial states emerged after the Peace of Westphalia, “one state, one nation, one language, one territory” has long been taken for granted as the ideal for a stable political organization. Many conflicts, ranging from wars about mixed borderlands through forced cultural assimilation and planned linguicide to ethnic cleansing, have been justified by narratives proclaiming the objective of achieving such an ideal. More recently, more attention has been paid to the fact that multilingualism is a more common condition and that the way it is organized socially and spatially can be studied. But then again, economic globalization and transnational migration have created new configurations of linguistic diversity, with an extreme linguistic variety in local contexts, of which the urban multilingualism of New York City, London, or even Amsterdam is emblematic.

A linguistic geography of Wikipedia could entail an analysis of the representation of languages as geographic phenomena and as features of geographic objects such as places, countries, or networks. Such a geographical inquiry could examine how Wikipedians in different Wikipedias represent languages and more specifically their geographical reach, language groups, language contacts, and language conflicts. Likewise, it could examine the linguistic data provided in articles about places, for example, the languages noted as being spoken in a specific country. This focus, however, is not the purpose of this chapter. Instead it will discuss how Wikipedias by their very existence inform us about languages in plural, i.e., about the global linguistic diversity and how multilingualism is practiced.

The analysis is based both on secondary literature and on primary documents (Wikipedia articles, talk pages, Wikimedia articles, and debates). Due to the finite linguistic capabilities of the author, it focuses mainly on sources in English and other major European languages such as French, Dutch, German, Spanish, Italian, and Portuguese (all languages with a large Wikipedia featuring over one million articles). This is problematic because it overlooks the debates in smaller Wikipedias and in non-Western languages (unless they are reported in exchanges in these selected languages) and, therefore, misses important voices and insights. However, most meta-discussions on Wikipedia projects, especially those on the opening and the closure of Wikipedias, are carried out in English. The position of the English language as lingua franca and as a hegemonic language (see Mamadouh 2018) in the world of Wikipedia is, therefore, also an important issue.

3 Wikipedia as a Mirror

Wikipedia can first be studied for the way it reflects global linguistic diversity: first through the mere existence of so many language versions and second through the explicit representation of multilingualism, i.e., of contacts and crossings between languages.

3.1 A Plurality of Wikipedias

There are two ways to stress the scope of Wikipedia. One is to stress its size (the sheer number of articles, editors, users, etc.; for statistics in real time, see https://stats.wikimedia.org/EN/Sitemap.htm), and the second is to focus on the extraordinary number of language versions. New Wikipedias have been created since 2001 and now amount to almost 300 (see Chap. 198, “Writing the World in 301 Languages: A Political Geography of the Online Encyclopedia Wikipedia” (Mamadouh, this volume)). Figure 1 is a snapshot of an animation showing the extraordinary growth of the number of Wikipedias over time, including both their creation and their growing size. This is truly unmatched by other projects. It does represent the global linguistic diversity adequately in the sense that it represents a fraction of all the languages spoken in the world (about 7097 according to ethnologue.com, of which about half are not written at all) and that all larger and more institutionalized languages (e.g., the official languages of independent states) are well represented.

Fig. 1

A still of the animation by Erik Zachte of the growth of the language Wikipedias. (Available at https://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html. Accessed 21 September 2018)

Linguistic diversity is mirrored in the organization of Wikipedia through the coexistence of a plurality of monolingual Wikipedias named after their language (such as English Wikipedia, Portuguese Wikipedia, and Japanese Wikipedia). But Wikipedia actively engages in multilingualism, not just through the parallel existence of different language versions. Even more significantly, these Wikipedias are prominently linked to each other through interwiki links, allowing users to switch between equivalent articles in different language versions. These articles are not identical: although an article in one language has sometimes been translated from another language, each is the result of a unique process of collective editing. Actually, quite a large share of the entries are unique to a language version, which means that they are not linked to any entry addressing the same topic in another language. The interwiki links are a prominent feature of a Wikipedia page, bringing linguistic diversity to each monolingual page, since the names of the languages in which an equivalent article is available are listed in the original language (e.g., Deutsch for German, Tiếng Việt for Vietnamese, 日本語 for Japanese, etc.) in the left margin of every page (this is not true for displays on the Wikipedia app for mobile devices). Even if users are monolingual and do not click on these interwiki links, their visibility conveys a strong sense of global linguistic diversity.

3.2 Multilingualism on Display

The multilingual character of Wikipedia is announced upfront in the slogan “the biggest multilingual free-content encyclopedia on the Internet” (a phrase mentioned at the beginning of this chapter) as well as in the logo. The logo is a three-dimensional jigsaw puzzle representing a globe, and each piece of the puzzle is marked with a glyph (a letter or a sign) in a different script, representing the beginning of the name Wikipedia in a main language using that script (Fig. 2).

Each piece bears a glyph (a letter or other character), or glyphs, symbolizing the multilingualism of Wikipedia. As with the Latin letter “W,” these glyphs are in most cases the first glyph or glyphs of the name “Wikipedia” rendered in that language. They are as follows:

  • Near the center is Latin ⟨W⟩. Above that is Japanese ⟨ウィ⟩ wi; below it are Cyrillic ⟨И⟩ i, Hebrew ⟨ו⟩ w, and (barely visible at the bottom) Tamil ⟨வி⟩ vi.

  • To the left of the ⟨W⟩ is Greek ⟨Ω⟩ ō, and below that are Chinese ⟨維⟩ wéi, Kannada ⟨ವಿ⟩ vi, and (barely visible at the bottom) Tibetan ⟨ཝི⟩ wi.

  • At left, from the top down, are Armenian ⟨Վ⟩ v, Cambodian ⟨វិ⟩ vĕ (lying on its side), Bengali ⟨উ⟩ U, Devanagari ⟨वि⟩ vi, and Georgian ⟨ვ⟩ v.

  • The rightmost column is Ethiopic ⟨ው⟩ wə, Arabic ⟨و⟩ w, Korean ⟨위⟩ wi, and Thai ⟨วิ⟩ wi.

  • The empty space at the top represents the incomplete nature of the project, the articles, and languages yet to be added (https://en.wikipedia.org/wiki/Wikipedia_logo).

Fig. 2

The globe in the Wikipedia logo (For the evolution of the logo from Nupedia to the present logo, see the illustrations at https://es.wikipedia.org/wiki/Marcas_corporativas_de_Wikipedia#/media/File:WikipediaLogo-TheOfficialFive.jpg)

There have been debates about the logo, as some have questioned the central positions of the Latin and the Greek alphabets (with the W and Ω), followed by the associated Kanji and the Cyrillic letter, in contrast with the peripheral positions of the Arabic or the Devanagari characters, for example. There have been other issues as well: Klingon (a fictional language from the television series Star Trek) was included until 2010, when it was replaced by Amharic, and users have reported a lack of precision and care in the representation of the letters and signs in Japanese, Devanagari, Chinese, and Greek (Ensslin 2011, pp. 549–553).

The text at the bottom of the logo was originally in English, Wikipedia The Free Encyclopedia, and is available in translation for the main pages of the other Wikipedias, for example, Wikipédia L’encyclopédie libre in French, Wikipedia de vrije encyclopedie in Dutch, Wikipedia Vapaa tietosanakirja in Finnish, Vikipēdija Brīvā enciklopēdija in Latvian, Esperanta Vikipedio in Esperanto, ويكيبيديا الموسوعة الحرة in Arabic, etc. This way the logo is automatically localized depending on where one accesses the Internet.

The splash page of the overall project at https://www.wikipedia.org/ also reflects both diversity and clear language hierarchies. The largest Wikipedias are listed around the logo. Below is a search engine (with a menu for language choice among the larger Wikipedias), followed by a menu “Read Wikipedia in your own language.” (This page displays the interface in the language dominant at the location of the IP address, but the names of the languages are given in the original language.) The menu includes a list of languages; the languages are grouped by size, and within each cluster, they are ranked alphabetically.

  • Those with 100,000+ articles

  • Those with 10,000+ articles (but fewer than 100,000)

  • Those with 1,000+ articles (but fewer than 10,000)

  • Those with 100+ articles (but fewer than 1,000)

  • And a link to a list with other languages (the smaller Wikipedias with fewer than 100 articles)
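
The grouping logic of this menu can be expressed as a simple bucketing rule. The following is a minimal sketch in Python, assuming only the thresholds listed above; the function names `size_tier` and `group_by_tier`, the placeholder edition names, and the article counts are hypothetical illustrations, not Wikimedia code or real statistics.

```python
from collections import defaultdict

# Lower bounds of the size tiers named in the menu, largest first.
TIER_BOUNDS = [100_000, 10_000, 1_000, 100]


def size_tier(article_count: int) -> str:
    """Return the menu tier label for an edition with the given article count."""
    for bound in TIER_BOUNDS:
        if article_count >= bound:
            return f"{bound:,}+"
    # Editions below the smallest threshold appear only in the linked list.
    return "other"


def group_by_tier(counts: dict) -> dict:
    """Group edition names by tier, sorted alphabetically within each tier."""
    groups = defaultdict(list)
    for edition, count in counts.items():
        groups[size_tier(count)].append(edition)
    return {tier: sorted(names) for tier, names in groups.items()}


# Hypothetical editions and counts, for illustration only.
sample = {"Edition B": 250_000, "Edition A": 120_000, "Edition C": 4_500, "Edition D": 42}
print(group_by_tier(sample))
```

The sketch makes the hierarchy explicit: size alone decides the tier in which a language appears, and alphabetical order only applies within a tier.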

A reproduction of the main page in Ensslin (2011, p. 548) shows that these hierarchies have long been in place. The structure is intact in 2018, but the size categories and the exact list of languages have evolved, since the language versions have not necessarily developed at the same pace.

For editors, that is, active Wikipedians contributing to the encyclopedia, multilingualism is additionally foregrounded in many features of Wikipedia apart from the interwiki links and in many projects of the Wikimedia Foundation. For example, it is clearly represented in the Babel user language template that Wikipedians can use on their personal page to announce their linguistic skills. To label a language, they use codes allocated to languages by the International Organization for Standardization (ISO), the main international standard-setting body, based in Geneva and composed of representatives from various national standards organizations. The Babel user template uses ISO codes (en for English, fr for French, nl for Dutch, etc.) and a standardized classification of expertise: mother tongue, or a level ranging from 1 (for basic ability) to 5 (for professional level). The box displays the description of the level of expertise in the language in question, for example, “This user is able to contribute with a professional level of English” for en-5 vs “This user is able to contribute with a basic level of English” for en-1 (Fig. 3).

Fig. 3

Snapshot codes for user-en
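
How such a code maps onto the sentence displayed in the box can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual template implementation: the function name `describe_babel_code` is hypothetical, only the three language codes and the two level descriptions quoted in the text are included, and the marker for mother tongue is left out because the text does not give its code.

```python
import re

# Language codes mentioned in the text; any other code is shown as-is.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "nl": "Dutch"}

# Only the two level descriptions quoted in the text are filled in.
LEVEL_TEXT = {1: "a basic level", 5: "a professional level"}

# A language code, a hyphen, and a level from 1 to 5 (assumed code shape).
BABEL_CODE = re.compile(r"^([a-z]{2,3})-([1-5])$")


def describe_babel_code(code: str) -> str:
    """Turn a code such as 'en-5' into the sentence displayed in the box."""
    match = BABEL_CODE.match(code)
    if match is None:
        raise ValueError(f"not a recognized code: {code!r}")
    language = LANGUAGE_NAMES.get(match.group(1), match.group(1))
    level = int(match.group(2))
    if level not in LEVEL_TEXT:
        # Wording for the intermediate levels is not quoted in the text.
        return f"{language}: level {level}"
    return f"This user is able to contribute with {LEVEL_TEXT[level]} of {language}"


print(describe_babel_code("en-5"))
print(describe_babel_code("en-1"))
```

Even in this reduced form, the scheme presupposes familiarity with Latin-script codes and numerals.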

It is also noteworthy that members of the language committee are listed along with their linguistic skills (no other information is provided about them), and they are all multilingual, mentioning at least three languages, and up to 17 (of which 11 are at the basic level) for one member (Fig. 4). All mention fluency in English, at least en-3, i.e., “can write in this language with some minor errors.” The coding presupposes, however, some familiarity with the Latin alphabet, Arabic numerals, and the language codes used online; in other words “Wikipedia’s multilingual endeavours are skewed towards a power imbalance in favour of code-savvy Western (and specifically Anglophone) users” (Ensslin 2011, p. 554). So it celebrates multilingualism, but participates in the reproduction and deepening of English hegemony.

Fig. 4

Snapshots of the Babel box of one member of the language committee

Another expression of the multilingual habitus of Wikipedia is the Wiktionary, a sister project of Wikipedia to construct free-content multilingual dictionaries (now existing in 171 languages). Wiktionaries collect lexicographic data that can be used for various natural language processing tasks (see, e.g., the English Wiktionary at https://en.wiktionary.org/wiki/Wiktionary:Main_Page) which can be particularly useful resources for Wikipedians editing articles on Wikipedia.

Moreover, there is Babylon, the Wikimedia translators’ portal at https://meta.wikimedia.org/wiki/Meta:Babylon, with a talk page, a mailing list, a newsletter, and an IRC channel. There are multilingual Wikisources. Translation requests can be filed to encourage volunteers to work on specific articles deemed worthy of translation between Wikipedia A and Wikipedia B, either because the existing article has been noted for its qualities or because the topic is particularly relevant and prioritized, since the lack of an entry is assessed as an urgent need.

3.3 Multilingualism in Practice

Wikimedia provides a wide range of statistics about its own activities including Wikipedias at https://stats.wikimedia.org/, generating tables about articles, views, editors, etc. for the different Wikipedias. They document the wide differences between the Wikipedias and the uneven representations of languages (Table 1). These impressive figures should not distract the observers from the deep inequalities between the language versions: in size and in quality (comprehensiveness, readability, and reliability).

Table 1 Data for the 15 largest Wikipedias

An earlier comparative study of different Wikipedias warns against comparing the raw statistics provided by Wikipedia. For example, counting articles gives an approximation of the size of each Wikipedia, but remains a very rough estimate. Articles differ widely in size and quality. Some are very long and detailed, refer to many sources, and feature plenty of illustrations. For example, the article about the European Union in English at https://en.wikipedia.org/wiki/European_Union is 26 pages long if printed and includes almost 300 references and a large number of maps, tables, and pictures. But some articles are one-sentence stubs that are very incomplete entries (e.g., the entry about the EU in Gagauz at https://gag.wikipedia.org/wiki/Evropa_Birlii), and sometimes articles appear in the wrong language (created in or copied to another Wikipedia, often with the intent to get them translated, but not yet translated and left in limbo for an unknown length of time) (Van Dijk 2009, p. 236). In a sample of 50 random articles for 53 language editions, Van Dijk found some major differences between what he calls “real” and “pseudo” articles (the former being content rich and strongly edited by individuals), with the Japanese Wikipedia scoring 100% real articles, while others were as low as 10% (Upper Sorbian and Corsican) and still others at 0% (in the artificial language Volapük). He uses these findings to distinguish four groups of Wikipedias:

  1. Large ones (like German, French, Dutch, Russian, and Chinese with over 100,000 articles in 2008) covering a vast range of subjects and very active

  2. Medium ones (like Catalan or Esperanto) quite similar, but more modest, more than 10,000 articles in 2008

  3. Small ones (like Afrikaans, Swahili, and Bavarian) with then over 1000 real articles according to his estimation, covering only fragments of human knowledge and not very active

  4. Micro ones with very few real articles and hardly active (his 2008 sample included Scots, Zealandic, and Tok Pisin) (Van Dijk 2009, pp. 237–238).

Almost 10 years later, the sizes of the Wikipedias have changed; the largest ones in particular have become much larger, all now with more than one million articles (as shown earlier in this chapter), but the typology combining size and activity is still a useful categorization.

Generally speaking, the community does not promote the translation of articles without localization in the societal context associated with the language to serve the intended audience. For example, the article about James Joyce in Italian is expected to feature more details about his life in Trieste than the article in English, while the article about a US movie in language X is expected to feature details about the circulation and reception of the movie in that language zone, including the name of the voice actors dubbing for the main characters (when dubbing is applied) or the name of the translators of the subtitles.

Some versions have expanded dramatically using machine translation through the work of bots, or web robots, generating articles by translating them automatically from other Wikipedias, often the English Wikipedia. As early as 2007, the Volapük Wikipedia increased from about 800 articles to over 110,000, thanks to the use of bot-generated stubs (sketches of articles that are largely empty and need to be filled by editors with additional information). The most productive bot is called lsjbot; it was created by the Swedish Wikipedian Sverker Johansson and used to generate content for the Swedish, Cebuano, and Waray-Waray Wikipedias (the latter two are languages in the Philippines, where his wife is from), especially about animals and plants. He proudly describes himself on his personal page https://en.wikipedia.org/wiki/User:Lsj as the single biggest producer of articles. All three languages are now ranked in the top 15 (over one million articles). Despite these achievements, the use of bots and machine translation is highly disputed. In winter 2017–2018, a lively discussion ensued about closing the Cebuano Wikipedia altogether because it consists mainly of bot-generated content (99% of the edits, see Table 1). The proposal to close it was eventually rejected, but is likely to resurface in the near future since the general discussion goes on.

The disagreement between proponents and opponents of machine translation boils down to quality assessment. The proponents think it is better to have a (poorly) translated article than nothing; the opponents fear that it circulates hegemonic Anglo-American representations. Moreover, it is disputed whether it is easier and more convenient for local editors to edit a translated article and localize it or to create a new one from scratch.

Disputes about new articles are as old as the ongoing debates between the inclusionists and the deletionists (Ford 2011, pp. 262–263); the latter are also called exclusionists. The former favor the inclusion of new articles, even if short and/or poorly written (note that this could also mean written in a nonstandard version of the language). The latter prioritize quality and favor the deletion of articles that do not match their high standards. The dispute was particularly serious among German Wikipedians (Wikimedia Deutschland 2011: especially pp. 164–182). The dilemma is as old as the making of encyclopedias. Achim Raschka, a German editor opposing the use of bots, referred in this context to an entry written by Denis Diderot for the Encyclopédie in 1751, titled “Aguaxima”:

Aguaxima, a plant growing in Brazil and on the islands of South America. This is all that we are told about it; and I would like to know for whom such descriptions are made. It cannot be for the natives of the countries concerned, who are likely to know more about the aguaxima than is contained in this description, and who do not need to learn that the aguaxima grows in their country. It is as if you said to a Frenchman that the pear tree is a tree that grows in France, in Germany, etc. It is not meant for us either, for what do we care that there is a tree in Brazil named aguaxima, if all we know about it is its name? What is the point of giving the name? It leaves the ignorant just as they were and teaches the rest of us nothing. If all the same I mention this plant here, along with several others that are described just as poorly, then it is out of consideration for certain readers who prefer to find nothing in a dictionary article or even to find something stupid than to find no article at all. (quoted in Wikipedia Signpost 19 June 2013, available at https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2013-06-19/News_and_notes)

Geographical entries are particularly well suited for such exercises, since a lot of stubs can be generated stating no more than the existence of a place (see, e.g., the entry for Logan, Utah, in Turkish Wikipedia at https://tr.wikipedia.org/wiki/Logan,_Utah or the Finnish city of Mikkeli in Greenlandic https://kl.wikipedia.org/wiki/Mikkeli or in Chinese https://zh.wikipedia.org/wiki/%E7%B1%B3%E5%87%B1%E5%88%A9). Machine translation or editing based on foreign articles does not necessarily use the English Wikipedia as its starting point. Articles in a related language (linguistically and/or culturally) might prove much more useful, for example, using a Danish article to produce one in Bokmål, a Czech article to produce one in Slovak, a Portuguese article to produce one in Galician, an Indonesian article to produce one in Malay, etc.

A final reason to acknowledge that Wikipedia mirrors global linguistic diversity, including existing inequalities and hierarchies between languages, is the realization that English is the editorial and auxiliary metalanguage, in other words a co-language (Ensslin 2011; Mamadouh 2018) used for discussions among editors and administrators across Wikipedias and other Wikiprojects, not to mention the Wikimedia Foundation itself. This illustrates both the enabling and the disabling function of English (Ensslin 2011, p. 555): it makes cross-cultural communication possible, but at the same time disables some users/editors and favors those with a good command of the language, especially native speakers. In that sense, one might wonder whether the English Wikipedia is an example of the use of English as lingua franca, i.e., a language shared by everyone, or an instrument of English hegemony, i.e., the hegemonic position of the core speakers of English (the Anglosphere dominated by the UK and the USA). In any event, the English Wikipedia is different from the others because it clearly serves a global audience, while other versions serve more localized audiences, even if the Portuguese, Spanish, and French Wikipedias also serve a public spread across different continents.

4 Wikipedia as a Microcosm

As an encyclopedia, and as a set of monolingual encyclopedias linked together, Wikipedia reflects the uneven relations between languages as well as the differences between languages regarding their status, prestige, and resources. But as an online community or as a network of online communities of editors, Wikipedias also deal with linguistic diversity by “doing” multilingualism. As such Wikipedia is a microcosm of global linguistic diversity.

4.1 Negotiating Multilingualism

Multilingual users navigate between language versions to find the information they need if no relevant article is published in their first language and to find the most complete article in the languages they can read. But often the purpose is to compare the content of the articles in different language versions, or sometimes to use Wikipedia as a translation tool between the languages they know in order to enlarge their vocabulary (i.e., to check how to translate the name of a disease, a plant, or a specific place or the proper transliteration of a person’s name in another script). This multilingual reading of Wikipedia greatly enriches multilinguals, expanding their mastery of the different languages as well as their understanding of cultural or environmental differences (e.g., the arrival of swallows is associated with spring in French sayings but with summer in Dutch ones, a difference linked to the location of the core areas of both languages: after the winter, swallows arrive sooner in the south of France than in the Netherlands). It is also a reality check on taken-for-granted assumptions about the global notoriety of national artists, scientists, or politicians when they seem not worthy of a page in another language.

Studies have shown that editors work across language versions. According to Hale (2014a), 15% of active Wikipedia editors are multilingual and edit in multiple language versions. They sometimes edit in languages they do not master well in order to comply with layout rules or systematize interwiki links. Cross-language editing has been studied by Hale (2014a, b); he found a strong negative correlation between the size of the group of users primarily editing a language edition and the percentage of multilingual editors. In other words, he found a higher level of multilingualism among smaller language editions and a lower level of multilingualism among the larger language editions, with the Japanese Wikipedia being the most monolingual. Hara and Doney (2015) compared English and Japanese editors editing articles about Okinawa and found differences in content and interaction style (also between the interventions in the two languages for bilingual editors). Kim et al. (2016) analyzed the content written by multilingual editors in the English, German, and Spanish Wikipedias and found differences among the characteristics of the editors, the policies they adopted, and their behaviors. The English Wikipedia has the largest and most varied group of multilingual editors by primary language (only 33% are primary users of English). Editors whose primary language was Spanish or German made more complex edits than those who edited in these languages as their second (or third or fourth…) language. By contrast, editors working in the English Wikipedia as their second (or third or fourth…) language made edits as complex as those of primary users of English. This suggests that multilinguals working in the Spanish and German Wikipedias as a second language were mainly editing interwiki links, adding illustrations, standardizing layout, etc., while multilinguals working in English contribute to the contents. These findings stress the role of the English Wikipedia as a common source for a global community. Content is provided and edited by a linguistically diverse population.

Many studies have used Wikipedias as a source of information about cross-language and cross-cultural differences and have compared Wikipedias and their content. Despite the many interwiki links, it is very common for articles to have no counterpart in another edition. For example, local politicians, artists, or scientists might only have a page in the main language of their country of origin. By contrast, Donald Trump (since his election as president of the USA), Wolfgang Amadeus Mozart, and Albert Einstein are likely to have an article in a very large number of Wikipedias. Likewise, the USA, the United Nations, Paris (France), and the Olympic Games are likely to be well covered in most languages, while Moresnet, the Indian Ocean Tuna Commission, the small community of Paris, Iowa, or the Dutch championship of Frisian handball will generate far fewer entries.

Academic studies have systematically compared the coverage of the main Wikipedias in terms of topics. They report very moderate overlap, even between languages with a similar cultural background and a very active community of editors (e.g., only 51% overlap between the English and the German Wikipedias, according to Hecht and Gergle 2010). A large number of articles are specific to a single Wikipedia (75% according to Hecht and Gergle 2010). Even the smaller Wikipedias hold specific knowledge and are centered on the place where the language is used. Other studies have ranked the people who are most covered (in the most languages), most networked (through links to other articles), and most consulted by users. The Swedish botanist Carl Linnaeus scores high, for example, because in most languages many articles, including those on botanical classification, refer to an article about him.

Samoilenko et al. (2016) carried out a cluster analysis based on co-editing among 110 Wikipedias, clustering languages that share a large number of concepts, that is, articles about the same concept that are linked together by an interwiki link. The resulting map (Fig. 2 on p. 8, to be found in the open access article at https://epjdatascience.springeropen.com/track/pdf/10.1140/epjds/s13688-016-0070-8) shows 23 clusters, based on linguistic distance (Romance languages), but also on geographical proximity (e.g., Hungarian, Czech, Slovak, Romanian, and Esperanto; Japanese, Korean, Chinese, and Thai; or Scandinavian languages and Finnish, despite important linguistic differences among them). Likewise, different patterns of controversies have been studied (Apic et al. 2011; Yasseri et al. 2014), identifying, in samples of languages, clusters on the basis of shared contested topics.

4.2 Creating a New Wikipedia: How to Make Your Language Count in the World

As a community, the editors of Wikipedia also have to make decisions about the creation of new Wikipedias. The procedure they have developed, for what it is worth, is important since Wikipedians function as gatekeepers. The language proposal policy was codified following a proposal in June 2006 (which can still be consulted at https://meta.wikimedia.org/wiki/Language_proposal_policy), in response to the sharp increase in the number of language versions (already 250 by then). A language committee was established (presently 14 members); it has published resources to make the procedure more transparent, with handbooks for those requesting a new language and for editors who want to participate in the decision-making process (for details about both the members of the committee and the handbooks, see https://meta.wikimedia.org/wiki/Language_committee).

The specific steps to be followed when filing a request for a new language are specified:

  • Check that the project does not already exist.

  • Obtain an ISO 639 code (for the language name).

  • Ensure the requested language is sufficiently unique that it could not exist on a more general wiki.

  • Ensure that there are a sufficient number of native editors of that language to merit an edition in that language.

    (see https://meta.wikimedia.org/wiki/Requests_for_new_languages)

The language community needs to develop an active test project and keep it active until approval (this is checked by a bot; the threshold is three active editors, i.e., editors having made at least one edit in the past 30 days). Translations of the MediaWiki interface are also required (to implement the architecture of the website and the interwiki links), in a process called localization.
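The activity criterion just described can be illustrated with a minimal Python sketch. It only restates the rule as reported here (three editors with at least one edit in the past 30 days); the function names and the input format are hypothetical and are not taken from the bot that actually performs the check.

```python
from datetime import date, timedelta

# Values as reported in the text; the names are illustrative assumptions.
ACTIVE_EDITOR_THRESHOLD = 3
ACTIVITY_WINDOW_DAYS = 30


def active_editors(edits, today):
    """Return the set of editors with at least one edit in the window.

    `edits` is an iterable of (editor_name, edit_date) pairs; this input
    format is an assumption made for the illustration.
    """
    cutoff = today - timedelta(days=ACTIVITY_WINDOW_DAYS)
    return {editor for editor, edit_date in edits if cutoff <= edit_date <= today}


def is_test_project_active(edits, today):
    """True if enough distinct editors were active in the window."""
    return len(active_editors(edits, today)) >= ACTIVE_EDITOR_THRESHOLD


if __name__ == "__main__":
    sample = [
        ("editor_a", date(2018, 7, 20)),
        ("editor_b", date(2018, 7, 5)),
        ("editor_c", date(2018, 6, 15)),  # older than 30 days, not counted
        ("editor_a", date(2018, 7, 30)),  # same editor, counted once
    ]
    print(is_test_project_active(sample, today=date(2018, 8, 1)))
```

Counting distinct editors rather than edits reflects the wording of the criterion: a single very prolific contributor would not be enough to keep a test project alive.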

The language code (a valid ISO 639-1 or 639-3 language code like en for English) should be accompanied by the name of the language in English and in the language itself (as it will eventually appear on the list of available languages for the interwiki links). The URL is modelled after the language code and reads like langcode.wikiproject.org and eventually, if accepted, langcode.wikipedia.org.
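As a small illustration of the naming pattern, the following Python sketch checks whether a string has the shape of an ISO 639-1 or 639-3 code (two or three lowercase letters) and builds the corresponding host name. The helper names are hypothetical, and the shape check is an assumption made for the example: it does not verify that a code has actually been assigned to a language.

```python
import re

# Two lowercase letters (ISO 639-1 shape) or three (ISO 639-3 shape).
ISO_639_SHAPE = re.compile(r"[a-z]{2,3}")


def looks_like_iso_639(code):
    """Check only the shape of the code, not whether it is assigned."""
    return ISO_639_SHAPE.fullmatch(code) is not None


def wikipedia_host(code):
    """Build the host name of an approved edition from its language code."""
    return f"{code}.wikipedia.org"


if __name__ == "__main__":
    for code in ("en", "nn", "nds", "simple"):
        print(code, looks_like_iso_639(code), wikipedia_host(code))
```

A code such as simple fails the shape check, which matches the fact that a few editions run under invented codes rather than ISO ones.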

If there is no valid ISO 639 code available, the volunteers must obtain one, i.e., convince the standards organization to create an ISO 639 code. The evolution of the global regime of language recognition has been analyzed by Tomasz Kamusella, who situates its origins in the Summer Institute of Linguistics (SIL), established in 1934 in the USA and closely connected to the Wycliffe Bible Translators (Kamusella 2012, p. 71), and in their need to assess and classify the indigenous languages to whose speakers the society wanted to bring the scripture. In 1946 it merged into the United Bible Societies (UBS), which by 2010 had made the New Testament available in 1231 languages (Kamusella 2012, p. 71). With computer storage revolutionizing libraries in the 1960s, classification was later driven by the Library of Congress and other national libraries in need of a standard to categorize their holdings in a systematic way (Kamusella 2012, p. 63). In 1967 the International Organization for Standardization (ISO) came up with the ISO 639 standards covering the 184 main languages with two-letter codes. Eventually the Internet and its standards extended the possibilities to include languages: the ISO 639-3 codes no longer rely on Romanization but on Unicode, an Internet standard that supports over 600 languages in about 160 scripts (Kamusella 2012, p. 65).

Invented language codes have been allowed occasionally (for 13 Wikipedias at the moment), mostly because standard language codes were not available (e.g., for Simple English, Ripuarian, or Dutch Low Saxon), but sometimes an invented code was used instead of an official one, or an official code was created only after the Wikipedia (e.g., for Samogitian, Norman, Cantonese, or Classical Chinese).

Each proposal is discussed online between proponents and opponents of the new Wikipedia, and arguments are filed (often in English or with an English translation). “The project will be assessed on its linguistic merits and chances of flourishing.” Discussions are public and can be retrieved via the same webpage (https://meta.wikimedia.org/wiki/Requests_for_new_languages). They can be very procedural: some participants tend to oppose languages with no ISO code, taking the existence of a code as an indicator of the singularity and notability of a language. Others oppose this rigid interpretation and suggest either that the proponents try to convince ISO first or that exceptions be allowed.

Other arguments revolve around the singularity of the language and its relations with other languages. When the proposed version is commonly perceived as a local variety (e.g., when Valencian is conceived as a variety of Catalan or Cajun French as a variety of French), a new Wikipedia is opposed, but the arguments used to prove similarities and differences can be linguistic (differences or similarities in vocabulary or syntax) or political (in the case of Valencian, the reference to the Valencian Autonomous Community in Spain, as distinct from the Catalan one). Although political arguments are not admissible, since Wikipedia is meant to serve individuals, not political communities, the political status of a language does play a role, as it is closely linked to different cultural practices. For example, different political and social institutions generate different vocabularies in Belgium and the Netherlands, or in the USA, Canada, and the UK.

With regional languages, the discussions often revolve around the respective advantages of having one Wikipedia for a whole language family or separate Wikipedias for more local, and by definition smaller, languages. For example, the proposal for a Gronings Wikipedia (Gronings being a Dutch Low Saxon variety) was opposed as detrimental to the existing Dutch Low Saxon Wikipedia (which includes Gronings), despite the fact that Gronings had a specific ISO code and is often acknowledged as a specific language; opponents emphasized that using specific dialects was allowed in that Wikipedia. The existing Wikipedia was therefore seen as a place to cultivate intercomprehension between the different dialects. By contrast, the existence of a specific Dutch Low Saxon Wikipedia had been justified, despite the absence of an ISO code for it and despite the dialect continuum among Low Saxon dialects across the border between the Netherlands and Germany, by the fact that speakers on each side of the state border use the national language (Dutch and German, respectively) to write in their regional language and to create new words such as cell phone (see also Van Dijk 2009 on these languages).

The number of living native speakers should be sufficient to form “a viable community and audience.” Nevertheless, Wikipedias have been proposed for historical languages such as Latin and for artificial languages such as Esperanto, and they could be approved (and maintained) if “a reasonable degree of recognition as determined by discussion” (the phrasing is from the policy, see https://meta.wikimedia.org/wiki/Language_proposal_policy) is demonstrated, according to the language committee. There are Wikipedias in Esperanto, Ido, Interlingua, Interlingue, Lojban, Volapük, and Novial, all constructed languages with a sizeable community of users, but Toki Pona has been removed for lacking any societal grounding.

If a language is verified as eligible, the project can be developed in the incubator (“at least five editors must edit that language regularly before a test project will be considered successful”), and eventually the Wikipedia can be published (or closed if there is little or no activity). To develop the project, a list has been consolidated of a thousand articles every Wikipedia should have, pertaining to basic content regarding biographies, history, geography, society, culture, science, technology, and mathematics. When the test wiki is finally approved, all pages developed in the test wiki are transferred to the actual Wikipedia.

Over time, ten of the smallest Wikipedias have been closed due to inactivity. Some were moved back to the incubator, where the community can develop them further. The list also mentions Wikipedias that have been deprecated (Alsatian has been extended to Alemannic; Akan is now considered a family of languages, and the Wikipedia is now in Twi, one of the Akan languages). Some wikis (Toki Pona and Klingon, the first an artificial language, the other a fictional one) do not belong to the Wikipedia family anymore; they have been removed and are now hosted by wikia.com, as are a few rejected proposals (Lingua Franca Nova, Korean Hanja, and Prussian). Moldovan was closed because Moldovan was found to be a version of Romanian (even according to the 1989 Language Law of Moldova) written in Cyrillic, and there is software to navigate between the two scripts. Finally, one language has been deleted: this is the infamous case of the so-called Siberian language in 2006, a terrible embarrassment for the community (although it was created before the language policy was put in place following a proposal in June 2006, the very name given to that language should have alarmed editors); it was a fantasy language created by Yaroslav Zolotaryov, a Siberian separatist with a misogynistic, xenophobic, and anti-Semitic political agenda.

4.3 Disputes About Languages

At first sight the criteria formalized in the language policy seem rather straightforward, but they are not. Even the number of editors is disputable, and having a Wikipedia can be a strategic step toward the revitalization of a language. A rather small group might start a virtuous circle mobilizing a larger number of participants. By contrast an enthusiastic group might quickly become exhausted and demotivated and stop writing articles.

Two other criteria are even trickier and might generate some discussion in specific cases. Mutual intelligibility is not a technical matter: it greatly depends on attitude and exposure, because linguistic differences reflect power relations. The name given to a language (its recognition as a language as opposed to being a variety, a dialect, a sociolect, or a regiolect of another language) is politically motivated. Small differences in language use (pronunciation, vocabulary, syntax, idioms, etc.) pertain to linguistic norms and cultural differences between social groups, geographically bounded or not. Cultural differences are easier to tackle than linguistic ones. An article can provide information about different practices in different places in different sections and make them visible with subheadings. For example, an article on academic degrees or on municipal institutions in English will discuss the matter in general terms before presenting the practices and associated concepts in different societal contexts, typically the USA, Canada, the UK, Ireland, and Australia.

By contrast, issues regarding the linguistic code itself are less easy to solve within the article. This is especially true when the language is monocentric, with a clear hierarchy between different varieties, such as French, where editors writing in another variety will be “corrected” and see their language framed as incorrect and improper for Wikipedia purposes. In English, similar problems arise for varieties beyond British and American English (and even between them regarding vocabulary and spelling, e.g., colour/color), especially postcolonial World Englishes (Kachru 1996) such as Indian, Nigerian, or Singaporean varieties. The style rules specify that consistency within an article should be achieved. Furthermore, the preferred variety depends on the content: for localized topics the spelling of the local variety is favored, and for general topics the spelling and vocabulary most commonly used (generally meaning the variety used by the largest group of speakers, i.e., American English, French of France, Brazilian Portuguese, Standard German of Germany, etc.).

The hegemony of the core users of a language applies both to the decision to add an article on a specific topic (which can be considered not worthy of an entry by editors who come from the core) and to the syntax and the vocabulary used in an entry. Notable disputes arise around the use of measurements, especially in the English Wikipedia: most editors favor the metric system as a global scientific system, but others argue for the imperial units used in the USA, since the metric system mystifies most American readers, the largest group of English speakers. Again, the global status of the English edition is the cause of the debates. The importance of a topic can also be very local and be rejected by a wider community. There was, for example, a controversy surrounding an article for the English Wikipedia about Makmende, a Kenyan fictional superhero revived in a video that went viral in 2009, but that, nevertheless, was not seen as noteworthy enough for an entry by other English-speaking editors (see Ford 2011). The entry has since been stabilized (see https://en.wikipedia.org/wiki/Makmende).

The naming of the language is particularly sensitive. The existence of a Wikipedia can become important in political struggles as an endorsement of claims to autonomy. Political issues are particularly significant when nationalist separatist or irredentist movements are mobilizing around language distinction. Most notably, controversies emerged around the disintegration of Yugoslavia and the process of distinction that was strategically devised among the languages of the successor states of Yugoslavia: Croatian and Bosnian as opposed to Serbian. Montenegrin was rejected four times by Wikipedia’s language committee, and a new discussion was started in 2018 after an ISO code was attributed to the language. The Serbo-Croatian edition was revived after some discussion in 2005 by proponents of maintaining the language and a Wikipedia version that promotes intercommunication between Wikipedians in the successor states of Yugoslavia and intercomprehension between speakers of Serbian, Croatian, Bosnian, and Montenegrin, in order to resist national language policies promoting further distinction between them for political reasons. Discussions about these languages show that some strongly oppose linguistic diversity within the Serbo-Croatian Wikipedia (e.g., the use of different dialects and, foremost, of both Latin and Cyrillic scripts; Serbian is more and more associated exclusively with the latter, which was not the case in the past). In general, code mixing seems to generate opposition. The Norwegian Wikipedia originally accepted articles in both official languages, Bokmål and Nynorsk, but the users of Nynorsk felt marginalized, and in 2004 they split. There are now two Norwegian Wikipedias, one in Bokmål (no.wikipedia.org) and a much smaller one in Nynorsk (nn.wikipedia.org) (see also Van Dijk 2009, p. 243). Egyptian Arabic has its own Wikipedia, next to Arabic, and that has been disputed too, for fear of fragmentation of the latter if other varieties of Arabic were also granted a separate version (Moroccan and Algerian have been verified as eligible, but a proposal for South Levantine Arabic, a Lebanese variety written in Latin script, has been rejected).

Next to disputes about whether or not to create a new Wikipedia, debates about proposals for closure are also worth considering. An overview and more information about closure proposals (and sometimes removal proposals) are available at https://meta.wikimedia.org/wiki/Proposals_for_closing_projects. The arguments put forward by Wikipedians to convince each other and move toward a consensus are insightful and echo political disputes about languages and linguistic identities in society at large. They also provide information about the working of Wikipedias since they reflect upon existing editing practices.

One dispute is slightly different, but particularly insightful regarding global linguistic diversity. It pertains to Simple English, a Wikipedia written in controlled English. With over 100,000 articles, it is a medium-sized Wikipedia. The Simple English Wikipedia is meant for “people with different needs, such as students, children, adults with learning difficulties, and people who are trying to learn English” (as phrased on https://en.wikipedia.org/wiki/Simple_English_Wikipedia). Editors use simpler vocabulary and shorter sentences than in the regular English version and stick to commonly accepted facts (Dowling 2008, see also Yasseri et al. 2012).

In 2018 a third attempt to close Simple English led to a long discussion (54 printed pages) that shows the divide between Wikipedians. Arguments for closure include the fact that Simple English is not a separate language and that the Wikipedia should be merged with the English Wikipedia. Others criticized the poor quality and consistency of its articles and its vulnerability to vandalism. It is also said to divert resources (volunteers’ time) from other Wikipedias and more notably from efforts to write in plain English in the regular English Wikipedia. Its raison d’être is contested: simplification is seen as contradictory to encyclopedic comprehensiveness. Its efficiency is contested too: it does not reach the intended audience, that is, students and language learners do not know about it. Therefore, some want to close it altogether; others want to add more visible tabs on English articles to improve the interconnection between comprehensive and simple articles about the same topic. But then again, others want to prevent any association between English and Simple English for fear that what they perceive as the poor quality of the latter might taint the reputation of the former.

Clearly Simple English is experienced in very different ways. Foreign learners write that they want to read the real thing and to learn English, not some simplified language. On the other hand, some editors claim to read Simple English articles in technical domains they do not master, and some non-native speakers of English report that they dare to write and edit an article for Simple English, but not for the English Wikipedia. Finally, the exceptional status of Simple English is resented, as has been noted a number of times, and proposals to create Wikipedias in simple versions of other languages have been rejected (on the grounds that they were not separate languages and had no language community). But, again, Simple English was created before these rules were formulated and adopted. All in all, the proposal to close Simple English was rejected on August 1, 2018. But the discussion is likely to be raised again in the near future.

Finally, and in a similar vein, the question is whether or not to have several Wikipedias in the same language, but this time using different scripts. A fascinating ongoing discussion concerns the creation of a Chinese Wikipedia in Pinyin (in the Latin alphabet instead of ideograms). Originally, there were virtually two Chinese Wikipedias under the names of “zh” (or “zh-cn”) and “zh-tw,” but from 2005 this division has been made redundant by the availability of an automatic system to convert between traditional and simplified Chinese. Like Simple English, a Pinyin Wikipedia could be helpful for learners of Chinese, either children or foreigners. The matter is at the time of writing still under deliberation.

5 Wikipedia as a Motor

A last aspect to consider is whether Wikipedia can bring about changes in a language offline. As we have seen, creating a Wikipedia version is a demanding process, and it could well be seen more and more as an up-to-date criterion to measure the vitality of a language and as such become an important aspect of sociolinguistics and language policies.

The complicated impact of Wikipedia (and more generally speaking Web 2.0) on “weaker” languages, i.e., languages without much institutional support, needs more scrutiny. On the one hand, it offers an incredible opportunity (low-cost production and circulation of content), while on the other it generates a huge pressure on volunteers to standardize and harmonize their languages. Typically, issues concern local varieties, spelling, and vocabulary, especially when neologisms are needed (which also contradicts Wikipedia’s core principle regarding “no new knowledge”). Moreover, Wikipedians are not necessarily well connected to established language activists and their institutions.

Baxter (2009) examined the process in more detail for the Breton language, showing how the Breton Wikipedia evolves from being a terminology consumer to being a terminology provider. Disputes revolve around the rejection of French calques (as a result of some kind of inferiority complex) and alternatives such as the use of international calques for the nativization of new words (but then again which one? e.g., Latvia, following the name of the country in Latvian, English, and Slavic languages, or Letonia, following German, Dutch, Scandinavian, and Romance languages, cited in Baxter 2009, p. 70), borrowing from other Celtic languages (Welsh especially), as well as different shades of purism (reformist purism, elitist purism, xenophobic purism) (see Baxter 2009, pp. 71–72) and a preference for continuity with earlier translations (especially to serve children enrolled in Breton language schools). By its sheer volume, the Breton Wikipedia is the biggest corpus in Breton today and the only encyclopedia in Breton, and, therefore, it is bound to be influential in the evolution of the language, especially since in this case no reliable sources like established dictionaries and encyclopedias are available; Wikipedia is, unavoidably, itself becoming such a source of language standardization.

Regional languages lacking full-fledged state support are only one kind of lesser-resourced language greatly affected by Wikipedia. Van Dijk (2009) compares them with third world languages (sic!), by which he means poorly institutionalized languages from the Global South; his analysis does not include Portuguese or Spanish, but he deals with Bahasa Indonesia, Arabic, and Swahili. At the time, poor access to the Internet, lack of software standards (for a long time ASCII did not deal properly with Arabic characters), and censorship were likely explanations for their poor visibility in Wikipedia. Van Dijk also notes differences between them using local geographical knowledge as an indicator: the Swahili Wikipedia covers cities in the region much less than the English Wikipedia, the situation is slightly more balanced for Arabic, and the Indonesian Wikipedia has more regional city content than the English Wikipedia. This suggests different roles for English in these regional contexts.

More generally, Van Dijk (2009) signals interesting absences that relate to the different repertoires of multilingual speakers. For example, he observes that the Afrikaans Wikipedia is relatively small, most likely because Afrikaners do not feel the urge to create it as they routinely consult the English Wikipedia instead. By contrast, the Luxembourgian Wikipedia is much larger than expected despite the diglossic situation in Luxembourg and the role of the umbrella language German as the standard language. But then again, Swiss Germans seem to have no incentive to contribute to an Alemannic Wikipedia, and they happily consult the German Wikipedia, since High German is the proper language for an encyclopedia (Van Dijk 2009: 246). These examples show that Wikipedia can be both a tool of language revitalization and a tool of further marginalization. This influences not only the balance between minority and majority languages in regional contexts (e.g., Breton and French in Brittany) or between low and high varieties (e.g., Luxembourgian and German in Luxembourg). It can also affect the balance between national languages and English as a global language. This phenomenon has been signaled very strongly for Icelandic, which has been threatened by digital extinction in the age of the English Internet (see Henley 2018). Wikipedia contributes to this development even if there is an Icelandic Wikipedia, which presently (summer 2018) features fewer articles than the Luxembourgian one (see https://stats.wikimedia.org/EN/Sitemap.htm).

Last but not least, Wikipedia is producing new bridges between languages. The interwiki links allow for smooth crossing between any pair of languages (as long as they feature articles identified by editors and bots as dealing with the same topic). This is important because it undermines the hub function English has in the world of translation (both for literary and scientific publications), which has been identified as the hypercentral function of English (De Swaan 2001). In Wikipedia a user can develop her or his bilingualism and knowledge through navigation between, let us say, Finnish and Maltese, German and Russian, Chinese and Japanese, or Spanish and Cebuano, without necessarily passing through English. This “crossing” shows that Wikipedia can be a tool for the promotion of English (through the role of English as a language of communication among editors of different Wikipedias and through the wide use of the English Wikipedia) and, at the same time, can offer tools to counterbalance the hegemonic position of English.

6 Conclusion

Wikipedia, as an interlinked set of monolingual Wikipedias, shows a particularly sustained engagement with linguistic diversity. It is both a mirror and a motor of global linguistic diversity, which is represented in all its complexity, including difficult power relations between language groups. It has developed complex policies and mechanisms to regulate the creation of new Wikipedias and at the same time contributes to the changing world language map through the many bridges it can create between languages. The role of multilingual editors is particularly important in shaping the relations between Wikipedias. It is, however, dependent on local contingencies whether the tool adds to the pressure of the growing use of English on the Internet or whether it provides opportunities to revitalize smaller languages. Likewise, the special position of the English Wikipedia as a global encyclopedia serving both an English-speaking monolingual audience in the hegemonic power and a global, multilingual audience is noteworthy.