If I have the URL of a page, how would I obtain the infobox information on the right using MediaWiki web services?

9 Answers

Use the MediaWiki API through this Python library: https://github.com/siznax/wptools

Usage:

import wptools

# Fetch and parse the page, then read the infobox fields from the parse data.
so = wptools.page('Stack Overflow').get_parse()
infobox = so.data['infobox']
print(infobox)

Output:

{'alexa': '{{Increase}} 34 ( {{as of|2019|12|15|lc|=|y}} )',
 'author': '[[Jeff Atwood]] and [[Joel Spolsky]]',
 'caption': 'Screenshot of Stack Overflow in February 2017',
 'commercial': 'Yes',
 'content_license': '[[Creative Commons license|CC-BY-SA]] 4.0',
 'current_status': 'Online',
 'language': 'English, Spanish, Russian, Portuguese, and Japanese',
 'launch_date': '{{start date and age|2008|9|15}}',
 'logo': 'Stack Overflow logo.svg',
 'name': 'Stack Overflow',
 'owner': '[[Stack Exchange]], Inc.',
 'programming_language': '[[C Sharp (programming language)|C#]]',
 'registration': 'Optional',
 'screenshot': 'File:Stack Overflow homepage, Feb 2017.png',
 'type': '[[Knowledge market]]',
 'url': '{{URL|https://stackoverflow.com}}'}
  • I used wptools, FTW! Oct 20, 2021 at 15:12
  • Unable to get wptools to install on Windows due to the pycurl dependency. Tried for hours and gave up. It works great on Linux, though. Dec 16, 2023 at 22:19

If you just want to parse the infobox, or you want some digested data, take a look at the DBpedia project: http://dbpedia.org

The DBpedia project scans the infoboxes in Wikipedia to create an RDF database from them: https://github.com/dbpedia/extraction-framework/
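
If you only need a few fields, you can also query DBpedia's public SPARQL endpoint directly. Here is a rough sketch using the requests library; the dbo:author property is just an illustrative field, so swap in your own resource and property names:

import requests

# Ask DBpedia for the infobox-derived "author" values of Stack Overflow.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?author WHERE { dbr:Stack_Overflow dbo:author ?author }
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["author"]["value"])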

There is no trivial way to do that. You can try fetching the page content using action=raw, e.g. http://en.wikipedia.org/w/index.php?action=raw&title=Douglas_Jardine. Then find the start of the infobox by searching for {{Infobox, and find the end by locating the matching }}, taking into account that the infobox itself can also contain nested {{ }} and {{{ }}} pairs.
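
A minimal sketch of that brace-matching approach, assuming the requests library; the matching here is naive (it only counts {{ / }} pairs), so treat it as a starting point rather than a robust parser:

import requests

def extract_infobox(title):
    # Fetch the raw wikitext of the article.
    resp = requests.get(
        "https://en.wikipedia.org/w/index.php",
        params={"action": "raw", "title": title},
    )
    text = resp.text
    start = text.find("{{Infobox")
    if start == -1:
        return None
    # Walk forward, counting {{ and }} until the opening template closes.
    depth, i = 0, start
    while i < len(text) - 1:
        if text[i:i + 2] == "{{":
            depth += 1
            i += 2
        elif text[i:i + 2] == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None

print(extract_infobox("Douglas_Jardine"))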

Each Wikipedia page is associated with a Wikidata item, and those items include most of the parameters from the page's infobox templates. So you only need to access the data associated with your Wikipedia page through the Wikidata API.

An example of how to get the data for the Wikipedia page on Donald Trump from its Wikidata item:

https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&props=claims&titles=Donald Trump

The response will include date and place of birth, image, religion, mother, father, children, height, signature, official website, and so on: all the main information about Donald Trump that appears in the Wikipedia infobox.
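
A small sketch of reading one claim out of that response with the requests library; the property ID P569 (date of birth) is just an illustrative example, other infobox fields live under other P-numbers:

import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetentities",
        "sites": "enwiki",
        "titles": "Donald Trump",
        "props": "claims",
        "format": "json",
    },
)
# The response is keyed by the item's Q-id, so take the first (only) entity.
entity = next(iter(resp.json()["entities"].values()))
# P569 is the Wikidata property for date of birth.
birth = entity["claims"]["P569"][0]["mainsnak"]["datavalue"]["value"]["time"]
print(birth)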

  • Wikidata is probably the way to go to extract semantic information. It seems way more robust and maintainable than parsing Wikipedia pages. Apr 10, 2020 at 0:27

Tomxu - what you're talking about is a template, which is simply a page you can include on another page. For the infobox you need to start by looking at Template:Infobox. This gives you detailed instructions.

You can also press edit (or view code) and copy the contents to your own wiki. Bear in mind that templates tend to be in a hierarchy so you might need to copy other templates that Infobox uses (if you want to use them). Each template can be identified with {{}} so e.g. the Infobox template will look like this: {{Infobox}}.

I mentioned a hierarchy: you'll actually find multiple templates that all use Template:Infobox. To find them, just type Template:Infobox into Wikipedia's search field and you'll find multiple examples, e.g. Template:Infobox writer.

Update: if you mean Navboxes, then see this information.

  • The Template:Infobox page seems to be entirely about describing the data structure of an infobox, with no information on how to access that data on a specific page. Can you clarify how to use the information on that page? May 8, 2018 at 9:27

In our project we use queries like this for fetching data from Wiktionary:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fen.wiktionary.org%2Fwiki%2Flife%22%20and%20xpath%3D'%2F%2Fdiv%5B%40id%3D%22bodyContent%22%5D'&format=xml&diagnostics=false&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=recwiki

I have no comprehensive understanding of it, but it works. The output can be filtered using jQuery or something else.
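
The YQL query above essentially just runs an XPath selection over the page HTML, and the YQL service itself has since been retired, so here is a rough equivalent done directly with the requests and lxml libraries (a swapped-in approach, not the original service):

import requests
from lxml import html

# Fetch the Wiktionary page and select the same #bodyContent div that the
# YQL query targets with its XPath expression.
resp = requests.get("https://en.wiktionary.org/wiki/life")
tree = html.fromstring(resp.content)
body = tree.xpath('//div[@id="bodyContent"]')[0]
print(body.text_content()[:500])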

What about using the edit mode? You could just grab the correct textarea (most of the time it has id="wpTextbox1") and parse its content. The URL I used to find that out was (note: section=0):

https://de.wikipedia.org/w/index.php?title=Pelephone&action=edit&section=0
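
A rough sketch of that idea, assuming the requests and lxml libraries; it pulls the wikitext of section 0 out of the edit form's textarea:

import requests
from lxml import html

resp = requests.get(
    "https://de.wikipedia.org/w/index.php",
    params={"title": "Pelephone", "action": "edit", "section": "0"},
)
tree = html.fromstring(resp.content)
# The edit form's textarea usually has the id "wpTextbox1".
textarea = tree.get_element_by_id("wpTextbox1")
print(textarea.text_content())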

Greetings

It is possible using pandas too:

import pandas as pd

# read_html returns a list of DataFrames, one per matching table; here we keep
# only tables with class "infobox" and use the first column as the index.
page = 'https://pt.wikipedia.org/wiki/Python'
infoboxes = pd.read_html(page, index_col=0, attrs={"class": "infobox"})
print(infoboxes)
  • While this works, it's slower than other solutions and replaces the special characters that separate the items in the box. So under an artist infobox's 'Genre' it may say blues blues grass. How would you know how to correctly separate those, given that one is a two-word phrase? Dec 16, 2023 at 22:23

Using MediaWiki, you can view the infobox on the right of a Wikipedia page by using the link below. As you can see, the format is JSON (this can be changed), and by replacing the word "hydrogen" with the specific title you want, you will get a page with an infobox.

https://en.wikipedia.org/w/api.php?action=parse&page=Template:Infobox%20hydrogen&format=json
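
A small sketch of consuming that endpoint with the requests library; with this action the parsed infobox comes back as rendered HTML under parse.text:

import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "parse",
        "page": "Template:Infobox hydrogen",
        "format": "json",
    },
)
# The rendered HTML of the template lives under parse -> text -> "*".
html_text = resp.json()["parse"]["text"]["*"]
print(html_text[:500])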
