r/YouShouldKnow
YSK: You can freely and legally download the entire Wikipedia database

Technology

Why YSK: Imagine a scenario with prolonged internet outages, such as a war or natural disaster. Having offline access to Wikipedia's knowledge in such a scenario could be extremely valuable.

The full English Wikipedia without images/media is only around 20-30 GB, so it can even fit on a flash drive.

Links:

https://en.wikipedia.org/wiki/Wikipedia:Database_download

or

https://meta.wikimedia.org/wiki/Data_dump_torrents

Remember to grab an offline renderer to get correct formatting and clickable links.

u/MayUrShitsHavAntlers

Hell I might do that and throw it on my NAS. I wonder if there is a way to have it auto-update? That would be hella cool.

[deleted]

Shouldn't be too difficult to make a script and schedule it, but I'm not sure if there's a way to download only the changes instead of the entire new database.

I've never scripted, but this would be a fun project to learn how. What would you recommend for building a script for such a task, if I wanted it to update and replace my file? Python, PowerShell, batch? I use Windows both at home and at the office and would like to learn PowerShell or batch for scripting things like this. Any info would be helpful!

Cron jobs, curl, and text manipulation: these are the three main topics you should study from the perspective of whichever language you choose. All of your language proposals are valid. I would suggest a Bash script since it's the most portable, but it really doesn't matter; approach this with searches like "how to run a cron job in <language of choice>" and work your way from there. Don't be afraid to ask for help, but ask for it once you have something to show, so that the helper can eyeball your level of understanding and actually point you to a solution.
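A minimal sketch of the download step in Python, which a cron job or the Windows Task Scheduler could run on a schedule. The dump filename below follows the naming used on dumps.wikimedia.org but should be treated as an assumption; check the database download page for the current layout.

#!/usr/bin/env python3
"""Fetch the latest English Wikipedia dump, but only if the remote copy is newer.

Assumptions: the URL matches the current naming scheme on dumps.wikimedia.org
and the server sends a Last-Modified header. Intended to run unattended from
cron or the Windows Task Scheduler.
"""
import email.utils
import os
import urllib.request

URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
DEST = "enwiki-latest-pages-articles.xml.bz2"


def remote_mtime(url: str) -> float:
    """Return the remote file's Last-Modified time as a Unix timestamp."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return email.utils.parsedate_to_datetime(resp.headers["Last-Modified"]).timestamp()


def main() -> None:
    if os.path.exists(DEST) and os.path.getmtime(DEST) >= remote_mtime(URL):
        print("Local copy is already up to date.")
        return
    print("Newer dump available, downloading...")
    urllib.request.urlretrieve(URL, DEST)  # streams straight to disk; the file is ~20 GB


if __name__ == "__main__":
    main()

On Linux, a weekly crontab entry such as "0 3 * * 0 python3 /path/to/fetch_dump.py" (path is a placeholder) would keep the copy fresh; on Windows, point a Task Scheduler task at the same script.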


Since they're using Windows, they'd be better off using PowerShell.

Also, instead of implementing the scheduling in the language, they'd be better off just using the built-in Windows Task Scheduler.

I'm not entirely sure how to download only the changes, but zip files have a central directory of stored files and their CRCs (basically like a hash) at the end of the archive. So you could download the last few bytes, read the size and offset of the directory, then download just the bytes holding the directory itself, and use it to work out which files have changed (a rough sketch of this follows below).

I'm not sure if you can start downloading from the middle of a file with FTP, but there might be some fuckery you could do.

Edit: also, for something this complicated I'd probably use Python, or another more fleshed-out programming language, but I like Python. Bash and PowerShell get unwieldy very quickly when you try to use them for complex tasks like this.
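For what it's worth, the "grab just the directory" idea can be sketched with an HTTP Range request. Purely illustrative: the real Wikipedia dumps are .bz2/.gz rather than zip, the URL is a placeholder, and it assumes the server honours Range requests and the archive has no trailing comment.

"""Peek at a remote zip's central directory without downloading the whole file."""
import struct
import urllib.request

URL = "https://example.org/some-archive.zip"  # placeholder, not a real dump URL

def fetch_tail(url: str, size: int) -> bytes:
    """Download only the last `size` bytes of the remote file (needs Range support)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{size}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# The End Of Central Directory record is the last 22 bytes when there is no comment.
eocd = fetch_tail(URL, 22)
sig, _, _, _, entries, cd_size, cd_offset, _ = struct.unpack("<IHHHHIIH", eocd)
assert sig == 0x06054B50, "not a plain EOCD record (archive comment present?)"
print(f"{entries} entries; central directory is {cd_size} bytes at offset {cd_offset}")
# A second Range request for bytes cd_offset..cd_offset+cd_size-1 would fetch the
# directory itself, whose per-file CRCs could be compared against a local copy.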

u/MayUrShitsHavAntlers

Thanks! I might give this a go

u/Supergoose5000

Honest to Christ, as a non-IT person, if what you've said is actually legit then that's fucking insane. Well done you.

Python would be well suited for the task as well.

u/therealmofbarbelo

Might check out r/datahoarder

WHY ARE YOU SHOUTING?


Thank you

u/Bliztle

Personally I only know very basic PowerShell/Bash scripting, so I would probably make a Python script and schedule it on my Raspberry Pi to run one night a week.

This is actually a great idea for a hobby project I might make

u/MayUrShitsHavAntlers

Nice. I might try it too

[deleted]

I was thinking of having the script run on your NAS, in which case it would make the most sense to write it in Bash or whichever shell it uses. But if you're using a preconfigured NAS, this could totally be done on a client device instead.

I'd advise against using batch, since it's hard to make it do anything complex if you ever want to add additional functionality.

If you want something platform-agnostic, with intuitive syntax and a massive community, go with Python. If you want to be able to run the script on pretty much any Windows computer without installing anything beforehand, go with PowerShell.

Personally, I'd choose Python. It's by far the most powerful and versatile, and a great starting point if you're new at all this. If you're already somewhat familiar with programming, I'd suggest Learn Python in Y Minutes. Otherwise, check out Automate the Boring Stuff.

u/much_longer_username

+1 for 'Automate The Boring Stuff'. That book will change your life if you're even a little bit computer savvy and want to be lazy - I'm not being hyperbolic in the least.


I'm saving this thread for when I know enough to understand the replies.

Good idea. I don't understand all of it yet, but I understand some. I think I might do the same.

[deleted]

Lots of answers on here about Python, the relative merits of Bash vs. PowerShell, curl, etc.

The crux of this technical challenge will be how to download only the new/changed data.

You would need some way of comparing the data in the new file with the data in the old file on your NAS. You would need to do this without downloading all the data in the new file.

One way of doing this is to compute hashes of the data in the new file by running code on the remote server. You could then compare those hashes with ones computed on your local file and redownload any parts of the file where the hashes differ.

However, you would need to compute hashes for small parts of the file, not the entire file, and you would need to run code on the remote servers, which they won't let you do.

Now, your saving grace here might be the BitTorrent files. BitTorrent works by dividing files up into chunks so that you can download each chunk from a different person. To facilitate this, each chunk is hashed.

So it could be as simple as: 1) download the old file using BitTorrent; 2) start downloading the new file using BitTorrent, then pause it and replace the partial new file with the old complete file; 3) force a recheck of your "new" file (actually a copy of the old one), and BitTorrent will compare each chunk of it to the chunks it expects in the new file; any chunks that are the same will be kept, and any that differ will be downloaded (a rough sketch of the piece comparison is below, after the edit).

There are BitTorrent clients that can be scripted, or code libraries that you could use.

Even this might not work if the entire file is compressed (but that depends on how the compression has been done).

EDIT: I tested the BitTorrent option. It doesn't work because of the compression. Even if the uncompressed data is largely the same between two versions of the Wikipedia dump, the compressed files appear to share no common chunks. The bz2 files do have a separate index listing each article in the wiki, but this won't work either, as it doesn't include a hash of each article.
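The chunk comparison in step 3 boils down to hashing fixed-size pieces and comparing them, which is roughly what a client does on a recheck. A rough Python sketch of that mechanism; the 256 KiB piece size and the filenames are made up, since a real torrent carries both the piece length and the expected SHA-1 hashes:

"""Compare two local dump files the way a BitTorrent recheck would: piece by piece."""
import hashlib

PIECE_SIZE = 256 * 1024  # assumed; a real .torrent file specifies the piece length

def piece_hashes(path: str) -> list[bytes]:
    """SHA-1 of each fixed-size piece of the file, in order (BitTorrent uses SHA-1)."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(PIECE_SIZE):
            hashes.append(hashlib.sha1(chunk).digest())
    return hashes

old = piece_hashes("enwiki-old.xml.bz2")  # placeholder filenames
new = piece_hashes("enwiki-new.xml.bz2")
unchanged = sum(a == b for a, b in zip(old, new))
print(f"{unchanged} of {len(new)} pieces identical; only the rest would need downloading")

As the edit above found, compression scrambles the byte layout between dump versions, so in practice essentially no pieces line up.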

I love the fact that you tested it and then edited the results. Thank you.


I'd imagine at the rate Wikipedia is getting edited, it would be a nonstop write/rewrite schedule… you're probably better off just redownloading it once a week.

Rsync can probably do that.

[deleted]

The way it's currently implemented, there is no way to do this.

Really big git repo?

u/fh3131

If you do, make it available for people to access via the internet...hang on...

u/thisisntadam

It'd be great if people could edit and moderate your online version. We may be on to something here!

u/Bmandk

https://dumps.wikimedia.org/

If you read the part about database backup dumps, it says you can just subscribe here: https://lists.wikimedia.org/postorius/lists/xmldatadumps-l.lists.wikimedia.org/

Pretty easy to set up a script that will react to any mails from that sender.
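A bare-bones sketch of that reaction step in Python, assuming an IMAP mailbox; the host, credentials, and sender address are placeholders, and the "react" part is just a print instead of kicking off a download:

"""Check an IMAP mailbox for unread announcements from the dumps mailing list."""
import imaplib

HOST = "imap.example.com"                       # placeholder
USER, PASSWORD = "me", "app-password"           # placeholder
SENDER = "xmldatadumps-l@lists.wikimedia.org"   # assumed posting address of the list

imap = imaplib.IMAP4_SSL(HOST)
try:
    imap.login(USER, PASSWORD)
    imap.select("INBOX")
    # UNSEEN plus FROM narrows the search to unread mail from the announcement list.
    status, data = imap.search(None, "UNSEEN", f'FROM "{SENDER}"')
    if status == "OK" and data[0]:
        count = len(data[0].split())
        print(f"{count} new dump announcement(s); time to trigger the download script.")
finally:
    imap.logout()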

Nice. Thank you


Set up a GitHub Actions job that automatically updates your copy weekly and deploys it somewhere.

u/500ls

If they have FTP access, FileZilla has an option to download only newer versions of existing files, plus any new files.

u/57hz

Yes, but it’s not in an easily readable format. It’s a pain in the ass to process it.

I heard about Internet-in-a-Box, which is this plus Khan Academy plus a couple of other things. If memory serves, it can run on a Raspberry Pi.

u/DonnerVarg

Check out Stackdump if you’re interested in StackExchange offline. Dash or Zeal for computer programming language documentation offline.

Wouldn’t that be a negative? I’d prefer knowing it’s the current iteration of the website rather than running the risk of some attack or troll farm updating a bunch of stuff with wrong info and that syncing to my local version.

u/justjake274

How do you know the current version is 100% accurate right now? Have you checked every article?

No, I haven't. But it's better than it could be in a worst-case scenario, that's what I mean. Let's say some group attacks the site.

u/radicalelation

Which could be at any given time you view a Wikipedia page anyway and you might never notice.

I'd rather have a shit ton of knowledge and information available at any given moment with some inaccuracies than none. As useful as libraries are, I'm not going to really count it against them if they have an outdated book or two, or even some full of lies.

And it so happens that the best defense against misinformation applies all over: Don't take everything at face value and if it's important enough to act on or spread, verify accuracy first.

You can download the entire database including page history if you are worried about that sort of thing.

But I think a few copies a few weeks apart would work fine too.


Run it once a week and keep last week's version. Since Wikipedia is updated multiple times per second, chances are any bad edit will have been caught and reverted within that time frame.

[deleted]

I'd have expected that to be way more GB... o.0

u/hl3official

It's about 80 GB uncompressed, but yeah, it's pretty amazing.

Jesus, that still seems small lol.

u/hl3official

You aren't wrong, but again this is without images and media, it's just the text.

But yeah, having access to so much knowledge in your pocket is truly a wonder. Humans are great (sometimes)

u/Uhh_JustADude

Ok now I’m curious; how big is the entirety of Wikipedia, including media files?

[deleted]

This might be a stupid question, but is it formatted? Or is it just a big ol' fuckoff .txt file?

Literally in my pocket. I could download that on to my phone right now.

How is it organized? Like, is this in a format where, assuming I download and uncompress it and all that, I can just read through it on my phone? Say civilization ends but I've got this file on an e-reader hooked to a solar panel. Am I good to rebuild society?


In ASCII, one character is 1 byte. Unicode is more complicated (UTF-8 uses 1 to 4 bytes per character), but English text is almost entirely 1 byte per character.

If you think of it as 80 billion characters, it's a lot more obvious. Similarly, if you figure most words are 5-10 characters, that's 8 to 16 billion words (quick check below).

Text is very small in terms of computer storage.
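A quick sanity check of that arithmetic in Python:

size_bytes = 80 * 10**9              # ~80 GB uncompressed, 1 byte per ASCII character
for chars_per_word in (5, 10):       # assume words average 5-10 characters
    words = size_bytes / chars_per_word / 10**9
    print(f"{words:.0f} billion words at {chars_per_word} characters per word")
# prints 16 billion at 5 chars/word and 8 billion at 10 chars/word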

u/Fancy_o_lucas

That’s roughly 80 billion letters worth of information.

[deleted]

80GB of nothing but text is a lot of data.

80 GB of text seems like quite a lot.

That means Wikipedia minus the images is smaller than the most recent Call of Duty.

AFAIK it's one byte for each character if it's uncompressed, so around 80 billion letters and spaces.


I remember downloading the 7 GB file to my jailbroken iPod touch in high school back in like 2008.

The school didn’t have student WiFi, and the rich kids still had blackberries, so pulling out Wikipedia on demand to answer a question was always great.

What about including images/media?

u/Pons__Aelius

The answer would always be: it depends.

Most images on Wikimedia are stored/available at several resolutions, and images with a lot of text are also available in several languages.

One Example: https://en.wikipedia.org/wiki/File:Falaise_Pocket_map.svg

Seven resolutions and four different languages, so a possible 28 different combinations of a single graphic.

Do you grab one, a couple or all of them?

So I would expect the answer to be somewhere between 100 and 10,000 times larger than the text-only size.

[deleted]

Text doesn't take up much space at all. Try to create a one-gigabyte .txt file.

Someone's going to take this literally

[deleted]

Go right ahead. Nothing wrong with someone wasting their time in front of a text document.

[deleted]

That was basically my college experience anyways.

u/Amphorax

cat /dev/urandom | base64 | head -c 1000000000 > 1gb.txt

u/LvS

sudo journalctl > large-enough.txt

u/destroys_burritos

Or go the other way and find out what a zip bomb is


That's only characters, though!!! NO media (pics, sound, videos)!!! 30 GB of letters and numbers compressed is still enormous!! 🙂


I did this! Lived and worked on a cruise ship and I did not want to catapult back to the dark ages when I couldn’t prove people wrong after we disagreed.

/s

I really did though. Highly recommend downloading information.

On a laptop? What hardware do people recommend for doing this? A tablet might be good as a dedicated device.

Ha! I wish I’d thought of that. I used my regular iPhone. Got max storage when I bought it though, knowing that I was about to join the ship.

I used my iPad solely for comfort movies and television season downloads. You live like a zombie on a ship crew. You literally can’t remember ever feeling alert in your life — old favorite sitcoms are your priceless treasure.

[deleted]

I did it for 3 years. Man, I am happy not needing to go back XD. The lack of internet and the crazy pricing is like torture lol. Also, screw those safety drills in the mornings haha

Really?? Those were my absolute favorite. But! Clarification: I worked hard in entertainment. Every minute was booked with tasks synonymous with “jumping around!,” “cartwheels!,” “dance party!,” and “evening club party!”

Drills were my lifeblood because I got to stand still for a little bit. Wear my vest. Stand next to my friends. Say “here” when they called my name. Blessed quiet time.


“Comfort movies”

Princess Bride and Pirates of the Caribbean, but you’re funny :)


How was your experience on that ship?

Hi! Oh I loved it. Time of my life. There were terrible downsides though, and I’m happy to elaborate on details, but I don’t want to inundate you with them.

But overall, fun!

It’s like a twilight zone — nothing is normal, nothing is what it seems. I’m not even being dramatic. I was left at the altar by a man I met, and was courted by, and knew for a long time, and thought I knew well! But it’s the Twilight Zone. I forgot that. And it turned out I didn’t know him at all, I was just in extremely close proximity to him all the time, which felt like the same thing.

I’m actually considering going back, to be honest with you. It’s been three years, my heart is no longer broken, I’m a bit stir crazy from the pandemic and my current office job. Sooooo…. I started my research last week! We’ll see.

What's the bad part? I'm curious; it sounds like a fun thing to do after college for a lil bit.

I hope everything turns out well for you. Also, I hope you eventually find what you deserve. About me: I'm finally close to my tour-guiding diploma and I'm contemplating working on a cruise ship for a season; that was the reason I asked.


Brilliant solution.

How were you able to browse it on your phone?

Ummm, it was an app! Damned if I can remember the name now, but I'm curious enough to go look it up after this thread.

You had to set it all up and choose your settings and your content level and then leave your phone alone for like an hour. I always scheduled this activity a week before embark while I was still at home on fast wifi.


Once you have it downloaded, how do you view/access it?

u/hl3official

Kiwix or WikiTaxi are great offline wiki renderers. There are more out there, but I've tried these two and they work great. They're not perfect, but they're pretty convenient.

u/National-Aardvark-72

How is working on a cruise ship? I’ve thought about applying to be a cook on one. What is the work culture like?

I honestly had the time of my life.

But. It is an extreme environment. Nothing is normal. You never ever ever get rid of your co-workers. Hope you like them, because you’ll be working together, then eating dinner together, then going to crew bar together, then probably sharing a small cabin. That’s the kind of extreme that you would never find on land, and people often aren’t prepared for.

Another example is complete loss of freedom. Again, not a reason not to join — but wildly extreme. What to wear, where you’re allowed to go and when, signing up for privileges that are assumed on land (for example, eating dinner in a restaurant) - that’s now a privilege, not a right, and you have to sign up for it and then keep it with good behavior.

Do allllll of your research. Or message me! I love talking about my experiences, I even write fiction about it. Whatever you do, don’t decide blind or show up blind. The surprises will be too much to handle and you’ll leave.

(Money is good, by the way. Not good good. But good as in - no bills or rent, and therefore you bank every penny you make, never have time to spend it, and thus it accumulates very fast.)

u/dugernaut1

I do this at least once a month, just in case I get transported back in time.