
I want to count entities/categories in a wiki dump of a particular language, say English. The official documentation is very hard to find and follow for a beginner. What I have understood so far is that I can download an XML dump (but which of the many available files should I download?) and parse it to count entities (the article topics) and categories.

This information, if it exists anywhere, is very difficult to find. Please help with some instructions on how to work with the dumps, or point me to resources where I can learn about them.

Thanks!

  • Did you try to search for the dumps, download a recent one and open it with the 'less' command from a bash terminal?
    – Debasis
    Jul 23, 2020 at 8:07
  • See: stackoverflow.com/q/30387731/6276743. It helped a lot. What exactly are you trying to do, in terms of counting categories and stuff?
    – user6276743
    Jul 26, 2020 at 21:58
  • Somewhat related: stackoverflow.com/questions/63934708/…. Download the .zim file, then scrape the pages like regular web scraping (or rely on DBpedia).
    – amirouche
    Sep 20, 2020 at 8:49
  • There are also HDT dumps, which might be easier to use: rdfhdt.org/what-is-hdt
    – amirouche
    Oct 29, 2020 at 15:14

3 Answers

6

The exact instructions differ a lot based on your use case. You can either download the dumps from https://dumps.wikimedia.org/enwiki/ and parse them locally, or you can query the API.
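
If all you need are headline counts, the API route can be much simpler than the dumps. Here is a minimal sketch (Python standard library only) using the standard siteinfo/statistics query; the User-Agent string is just a placeholder:

import json
import urllib.request

# Ask the MediaWiki API for site-wide statistics instead of parsing a dump.
URL = ("https://en.wikipedia.org/w/api.php"
       "?action=query&meta=siteinfo&siprop=statistics&format=json")

req = urllib.request.Request(URL, headers={"User-Agent": "dump-stats-example/0.1"})
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)["query"]["statistics"]

print("articles:", stats["articles"])   # content pages in the main namespace
print("pages:", stats["pages"])         # all pages, including categories, talk, etc.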

If you want to parse the dumps, https://jamesthorne.com/blog/processing-wikipedia-in-a-couple-of-hours/ is a good article that shows how one could do that.
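
To make the parsing route concrete, here is a minimal sketch (not the code from that article; the filename is a placeholder) that streams a pages-articles dump with the Python standard library and counts article pages (namespace 0) and category pages (namespace 14):

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder filename

articles = categories = 0
with bz2.open(DUMP, "rb") as f:
    context = ET.iterparse(f, events=("start", "end"))
    _, root = next(context)                    # the <mediawiki> root element
    for event, elem in context:
        if event != "end" or not elem.tag.endswith("}page"):
            continue
        # Tags carry the export XML namespace, hence the endswith() checks.
        for child in elem:
            if child.tag.endswith("}ns"):
                if child.text == "0":
                    articles += 1              # main namespace (includes redirects)
                elif child.text == "14":
                    categories += 1            # Category: pages
                break
        root.clear()                           # drop processed pages to keep memory flat

print("articles (ns 0):", articles)
print("categories (ns 14):", categories)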

However, parsing the dumps isn't always the best solution. If you want to know the three largest pages, for instance, you could use https://en.wikipedia.org/wiki/Special:LongPages.

In addition to all of this, you can also use https://quarry.wmcloud.org to query the live replica of Wikipedia's database. An example can be found at https://quarry.wmcloud.org/query/38441.

4

The dumps are rather unwieldy: even the small "truthy" dump is 25 GB, and because RDF is rather verbose, that expands to >100 GB. So my generic advice is to avoid the dumps.

If you can't avoid them, https://wdumps.toolforge.org/dumps allows you to create customised subsets of the dumps with just the languages/properties/entities you want.

Then, just read it line by line and ... do something with each line.
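
If the subset is in N-Triples (.nt.gz), "do something with each line" can be as simple as splitting the line into subject, predicate and object. A minimal sketch (filename is a placeholder) that counts truthy "instance of" (P31) statements while streaming:

import gzip

DUMP = "latest-truthy.nt.gz"                       # placeholder filename
P31 = "<http://www.wikidata.org/prop/direct/P31>"  # truthy "instance of" property

total = instance_of = 0
with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        # Each N-Triples line looks like: <subject> <predicate> <object> .
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue                               # skip blank or malformed lines
        total += 1
        if parts[1] == P31:
            instance_of += 1

print("triples:", total)
print("instance-of statements:", instance_of)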

  • how to download without dumps?
    – ihsan
    Jan 6, 2023 at 16:24
1

mysqldump-to-csv

https://stackoverflow.com/a/28168617/895245 pointed me to https://github.com/jamesmishra/mysqldump-to-csv, which semi-hackily converts the dumps to CSV, allowing one to bypass the MySQL slowness (note that the encoding/errors arguments the patch below passes to fileinput.input require Python 3.10+):

git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
     # listed in sys.argv[1:]
     # or stdin if no args given.
     try:
-        for line in fileinput.input():
+        sys.stdin.reconfigure(errors='ignore')
+        for line in fileinput.input(encoding="utf-8", errors="ignore"):
             # Look for an INSERT statement and parse it.
             if is_insert(line):
                 values = get_values(line)
EOF

Then I add csvtool (as per "How to extract one column of a csv file") plus awk to filter the columns. The basic conversion is:

zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py

You can then process the resulting CSV with standard tools such as csvtool or awk.
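
For instance, here is a small Python filter (a stand-in for the csvtool/awk step; it assumes the CSV preserves the page table's column order, with page_namespace as the second column, and the script name is hypothetical) that counts Category-namespace pages:

# count_categories.py (hypothetical helper): count CSV rows whose
# page_namespace column (assumed to be the 2nd column of the page table)
# is 14, i.e. pages in the Category: namespace. Reads CSV from stdin.
import csv
import sys

count = sum(1 for row in csv.reader(sys.stdin) if len(row) > 1 and row[1] == "14")
print(count)

Used as: zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | python count_categories.py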
