28

I am looking to download the full text of Wikipedia for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online?

To give you some overview of my project: I want to find the interesting words in a few articles I am interested in. To find these interesting words, I am planning to apply tf-idf and pick the words with the highest scores. But for the idf part of the score, I need to know how common each word is across the whole of Wikipedia.
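
For reference, the usual form of the score is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of articles in the corpus and df(t) is the number of articles containing t. A rough sketch of what I want to compute (plain Python, made-up names):

# Rough sketch: score the terms of one article against a corpus with tf-idf.
import math
from collections import Counter

def tf_idf_scores(article_tokens, corpus_token_sets):
    n_docs = len(corpus_token_sets)
    tf = Counter(article_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus_token_sets if term in doc)  # document frequency
        scores[term] = (count / len(article_tokens)) * math.log(n_docs / (1 + df))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)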

How can this be done?

  • Although I have answered your question, and although simply pointing out that Google has your answers is frowned upon, if you googled 'download full Wikipedia text' the link is the first hit. I say this in the hope that it will help improve your google-fu.
    – Sam Holder
    Apr 21, 2010 at 14:04
  • @Sam Holder Just want to confirm. Is this the correct link to download all the pages - dumps.wikimedia.org/enwiki/latest/…
    – Boolean
    Apr 21, 2010 at 14:27
    yeah that seems to be all current pages, and is probably what you want, though without knowing exactly, it's hard to say for sure
    – Sam Holder
    Apr 21, 2010 at 16:50
  • Thanks a lot @Boolean. That was as simple as clicking your link <3
    – Trect
    Oct 2, 2018 at 19:04

8 Answers

30

From Wikipedia: http://en.wikipedia.org/wiki/Wikipedia_database

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

Seems that you are in luck too. From the dump section:

As of 12 March 2010, the latest complete dump of the English-language Wikipedia can be found at http://download.wikimedia.org/enwiki/20100130/ This is the first complete dump of the English-language Wikipedia to have been created since 2008. Please note that more recent dumps (such as the 20100312 dump) are incomplete.

So the data is only 9 days old :)

EDIT: new link as old is broken: https://dumps.wikimedia.org/enwiki/
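
If you go the dump route, here is a minimal sketch of streaming the pages-articles dump and counting word occurrences, assuming you have downloaded enwiki-latest-pages-articles.xml.bz2 from the link above (the page text is raw wiki markup, so the counts will include markup tokens unless you strip it first):

# Minimal sketch: stream the compressed dump and count word occurrences
# without loading the whole file into memory. The filename is an assumption;
# use whichever dump file you actually downloaded.
import bz2
import re
from collections import Counter
from xml.etree import ElementTree

counts = Counter()
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as dump:
    for _, elem in ElementTree.iterparse(dump):
        # each <text> element holds the wiki markup of one page
        if elem.tag.endswith("}text") and elem.text:
            counts.update(re.findall(r"[a-z']+", elem.text.lower()))
        elem.clear()  # drop the element so memory stays bounded

print(counts.most_common(20))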

  • I upvoted your answer over the others simply because you did more than just post a link.
    – UnkwnTech
    Apr 21, 2010 at 14:00
  • Just want to confirm. Is this the correct link to download all the pages - dumps.wikimedia.org/enwiki/latest/…
    – Boolean
    Apr 21, 2010 at 14:28
    yeah that seems to be all current pages, and is probably what you want, though without knowing exactly, it's hard to say for sure.
    – Sam Holder
    Apr 21, 2010 at 16:49
  • Link is broken.
    – FilippoCosta
    Jan 4, 2017 at 21:31
  • @FilippoCosta dumps.wikimedia.org/enwiki is still useful
    – Wolf
    May 28, 2020 at 11:08
11

If you need a text-only version, not the MediaWiki XML, then you can download it here: http://kopiwiki.dsd.sztaki.hu/

  • Doesn't seem to be seeded by anyone at this time.
    – yegeniy
    Aug 5, 2018 at 2:52
4

Considering the size of the dump, you would probably be better served using an existing word-frequency list for English, or using the MediaWiki API to poll pages at random (or the most consulted pages). There are frameworks for building bots on top of this API (in Ruby, C#, ...) that can help you.
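
For example, here is a rough sketch of the random-sampling approach in Python (parameters as I understand the API's generator=random and TextExtracts modules; treat it as a starting point, not a tested bot):

# Sketch: fetch plain-text intro extracts for a few random articles.
import requests

params = {
    "action": "query",
    "format": "json",
    "generator": "random",
    "grnnamespace": 0,   # articles only (main namespace)
    "grnlimit": 5,       # how many random pages per request
    "prop": "extracts",
    "exintro": 1,        # intro only; full-page extracts are limited to one page per request
    "explaintext": 1,    # plain text rather than HTML
    "exlimit": "max",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
for page in resp["query"]["pages"].values():
    print(page["title"], "-", len(page.get("extract", "")), "characters")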

1

http://en.wikipedia.org/wiki/Wikipedia_database#Latest_complete_dump_of_english_wikipedia

1

See http://en.wikipedia.org/wiki/Wikipedia_database

0

All the latest Wikipedia datasets can be downloaded from Wikimedia. Just make sure to click on the latest available date.

0

Use this script, which fetches the plain-text extract for each page ID in a range from the MediaWiki API and saves the raw JSON response:

# Example request:
# https://en.wikipedia.org/w/api.php?action=query&prop=extracts&pageids=18630637&inprop=url&format=json
import os, sys, requests

os.makedirs("wikipedia", exist_ok=True)  # make sure the output directory exists
for i in range(int(sys.argv[1]), int(sys.argv[2])):
    print("[wikipedia] getting source - id " + str(i))
    text = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "extracts",
                "pageids": i, "inprop": "url", "format": "json"},
    ).text
    print("[wikipedia] putting into file - id " + str(i))
    with open("wikipedia/" + str(i) + "--id.json", "w") as f:
        f.write(text)  # store the raw JSON response
    print("[wikipedia] archived - id " + str(i))

Page IDs 1 to 1062 are at https://costlyyawningassembly.mkcodes.repl.co/.

0

I found a relevant Kaggle dataset at https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011

From the dataset description:

Content

This dataset includes ~40MB JSON files, each of which contains a collection of Wikipedia articles. Each article element in the JSON contains only 3 keys: an ID number, the title of the article, and the text of the article. Each article has been "flattened" to occupy a single plain text string. This makes it easier for humans to read, as opposed to the markup version. It also makes it easier for NLP tasks. You will have much less cleanup to do.

Each file looks like this:

[
 {
  "id": "17279752",
  "text": "Hawthorne Road was a cricket and football ground in Bootle in England...",
  "title": "Hawthorne Road"
 }
]

From this it is trivial to extract the text with a JSON reader.
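
For example (the directory name below is just a placeholder for wherever you unpack the dataset):

# Sketch: load every JSON file from the unpacked dataset and index article text by title.
import glob
import json

texts = {}
for path in glob.glob("plain-text-wikipedia-202011/*.json"):  # placeholder path
    with open(path, encoding="utf-8") as f:
        for article in json.load(f):
            texts[article["title"]] = article["text"]

print(len(texts), "articles loaded")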
