
I want to count entities/categories in a wiki dump of a particular language, say English. The official documentation is very hard to find and follow for a beginner. What I have understood so far is that I can download an XML dump (but which of the many available files should I download?) and parse it to count entities (the article topics) and categories.

This information, if it exists anywhere, is very difficult to find. Please help with some instructions on how to work with the dumps, or point me to resources where I can learn about them.

Thanks!

  • Did you try to search for the dumps, download a recent one and open it with the 'less' command from a bash terminal?
    – Debasis
    Jul 23, 2020 at 8:07
  • See: stackoverflow.com/q/30387731/6276743. It helped a lot. What exactly are you trying to do, in terms of counting categories and stuff?
    – user6276743
    Jul 26, 2020 at 21:58
  • Somewhat related: stackoverflow.com/questions/63934708/…. Download the .zim file, then scrape the pages like regular web scraping (or rely on DBpedia).
    – amirouche
    Sep 20, 2020 at 8:49
  • There are also HDT dumps, which might be easier to use: rdfhdt.org/what-is-hdt
    – amirouche
    Oct 29, 2020 at 15:14

3 Answers

6

The exact instructions differ a lot based on your use case. You can either download the dumps from https://dumps.wikimedia.org/enwiki/ and parse them locally, or you can query the API.
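
If all you need are headline counts, the API route can be much simpler than the dumps. Here is a minimal sketch (Python standard library only) using the standard siteinfo/statistics query; the User-Agent string is just a placeholder:

import json
import urllib.request

# Ask the MediaWiki API for site-wide statistics instead of parsing a dump.
URL = ("https://en.wikipedia.org/w/api.php"
       "?action=query&meta=siteinfo&siprop=statistics&format=json")

req = urllib.request.Request(URL, headers={"User-Agent": "dump-stats-example/0.1"})
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)["query"]["statistics"]

print("articles:", stats["articles"])   # content pages in the main namespace
print("pages:", stats["pages"])         # all pages, including categories, talk, etc.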

If you want to parse the dumps, https://jamesthorne.com/blog/processing-wikipedia-in-a-couple-of-hours/ is a good article that shows how one could do that.
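
To make the parsing route concrete, here is a minimal sketch (not the code from that article; the filename is a placeholder) that streams a pages-articles dump with the Python standard library and counts article pages (namespace 0) and category pages (namespace 14):

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder filename

articles = categories = 0
with bz2.open(DUMP, "rb") as f:
    context = ET.iterparse(f, events=("start", "end"))
    _, root = next(context)                    # the <mediawiki> root element
    for event, elem in context:
        if event != "end" or not elem.tag.endswith("}page"):
            continue
        # Tags carry the export XML namespace, hence the endswith() checks.
        for child in elem:
            if child.tag.endswith("}ns"):
                if child.text == "0":
                    articles += 1              # main namespace (includes redirects)
                elif child.text == "14":
                    categories += 1            # Category: pages
                break
        root.clear()                           # drop processed pages to keep memory flat

print("articles (ns 0):", articles)
print("categories (ns 14):", categories)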

However, parsing the dumps isn't always the best solution. If you want to know the three largest pages, for instance, you could use https://en.wikipedia.org/wiki/Special:LongPages.

In addition to all of this, you can also use https://quarry.wmcloud.org to query the live replica of Wikipedia's database. An example can be found at https://quarry.wmcloud.org/query/38441.

4

The dumps are rather unwieldy: even the small "truthy" dump is 25 GB, and because RDF is rather verbose, that expands to >100 GB. So my generic advice is to avoid the dumps.

If you can't avoid them, https://wdumps.toolforge.org/dumps allows you to create customised subsets of the dumps with just the languages/properties/entities you want.

Then, just read it line by line and ... do something with each line.
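
If the subset is in N-Triples (.nt.gz), "do something with each line" can be as simple as splitting the line into subject, predicate and object. A minimal sketch (filename is a placeholder) that counts truthy "instance of" (P31) statements while streaming:

import gzip

DUMP = "latest-truthy.nt.gz"                       # placeholder filename
P31 = "<http://www.wikidata.org/prop/direct/P31>"  # truthy "instance of" property

total = instance_of = 0
with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        # Each N-Triples line looks like: <subject> <predicate> <object> .
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue                               # skip blank or malformed lines
        total += 1
        if parts[1] == P31:
            instance_of += 1

print("triples:", total)
print("instance-of statements:", instance_of)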

  • how to download without dumps?
    – ihsan
    Jan 6, 2023 at 16:24
1

mysqldump-to-csv

https://stackoverflow.com/a/28168617/895245 pointed me to https://github.com/jamesmishra/mysqldump-to-csv, which semi-hackily converts the dumps to CSV, allowing one to bypass the MySQL slowness (note that the encoding/errors arguments the patch below passes to fileinput.input require Python 3.10+):

git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
     # listed in sys.argv[1:]
     # or stdin if no args given.
     try:
-        for line in fileinput.input():
+        sys.stdin.reconfigure(errors='ignore')
+        for line in fileinput.input(encoding="utf-8", errors="ignore"):
             # Look for an INSERT statement and parse it.
             if is_insert(line):
                 values = get_values(line)
EOF

Then I add csvtool (as per "How to extract one column of a csv file") plus awk to filter the columns. The basic conversion is:

zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py

You can then process the resulting CSV with standard tools such as csvtool or awk.
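
For instance, here is a small Python filter (a stand-in for the csvtool/awk step; it assumes the CSV preserves the page table's column order, with page_namespace as the second column, and the script name is hypothetical) that counts Category-namespace pages:

# count_categories.py (hypothetical helper): count CSV rows whose
# page_namespace column (assumed to be the 2nd column of the page table)
# is 14, i.e. pages in the Category: namespace. Reads CSV from stdin.
import csv
import sys

count = sum(1 for row in csv.reader(sys.stdin) if len(row) > 1 and row[1] == "14")
print(count)

Used as: zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | python count_categories.py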
