Gopherpedia - The Free Encyclopedia via gopher
My last release for Project Dump week is Gopherpedia -- a mirror of Wikipedia in gopherspace. If you happen to have a gopher client, you can point it at gopherpedia.com on port 70. Otherwise, you can visit gopherpedia.com in a regular web browser and view it via a web proxy.
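If you've never spoken gopher directly, the protocol is about as simple as it gets: the client opens a TCP connection on port 70, sends a selector string terminated by CRLF, and the server streams back the document and closes the connection. Here's a minimal sketch in Python that fetches the Gopherpedia root menu that way (the empty selector for the root menu is standard gopher; everything else is just illustration):

```python
import socket

def gopher_fetch(host, selector="", port=70):
    """Fetch a single gopher document: send the selector, read until the server closes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(selector.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Request the Gopherpedia root menu.
print(gopher_fetch("gopherpedia.com"))
```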
A couple of years ago, I landed on the idea of a gopher interface to Wikipedia. Originally it was probably a joke, but it stuck with me. So one day I registered a domain name and got to work. The first thing I needed to do was build a gopher server, because none of the currently available options were up to the task. So I built Gopher2000. Then I quickly realized that the current gopher proxies weren't any good either, so I built GoPHPer. Once both of those were written (well over a year ago), it didn't seem like there was much left to be done -- Gopherpedia should've been ready to launch.
But I hadn't reckoned on the challenges of churning through a database dump of Wikipedia.
Wikipedia is very open. They have an API which you can use to search and query documents, and they provide downloadable archives of their entire collection of databases. They encourage you to download these, mirror them, etc.
My first implementation of Gopherpedia used the API. This worked well, but had two problems. First, it was a little slow, since it needed to query a remote server for every request. Second, Wikipedia prohibits using the API this way -- if you want to mirror their site, they want you to download an archive and use that, so their servers aren't overloaded.
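For the curious, the API-backed version worked roughly like this. This is only a sketch against the public MediaWiki API, not the actual Gopherpedia code; the action=parse call and the "Gopher (protocol)" title are there purely for illustration:

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch one article's raw wikitext via the public MediaWiki API."""
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    })
    req = urllib.request.Request(
        f"{API}?{params}",
        headers={"User-Agent": "gopherpedia-example/0.1"},  # Wikipedia expects a real UA
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["parse"]["wikitext"]["*"]

print(fetch_wikitext("Gopher (protocol)")[:500])
```

Every page view turns into a round trip like that, which is where both the latency and the terms-of-use problem come from.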
So I downloaded a dump of their database, which is a single 9GB compressed XML file. Nine. Gigabytes. Compressed. A single file.
Then I took the opportunity to learn about streaming XML parsers. Basically, I wrote a script that parses the file as it reads it, rather than loading the whole thing into memory at once, which was clearly impossible. The script splits the dump into individual Wikipedia entries and stores them as flat text files. Running it took a couple of days on my extremely cheap Dreamhost server -- that's right, I have a gopher server hosted on Dreamhost.
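If you've never used a streaming parser, the idea looks something like this sketch (Python's iterparse here, purely for illustration -- the real script is messier, and the element names just follow the standard dump layout; the filename handling is naive):

```python
import os
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace, e.g. '{http://...export-0.10/}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def split_dump(dump_path, out_dir):
    """Stream through the dump, writing each article's wikitext to its own flat file."""
    os.makedirs(out_dir, exist_ok=True)
    context = ET.iterparse(dump_path, events=("start", "end"))
    _, root = next(context)  # the opening <mediawiki> element
    for event, elem in context:
        if event != "end" or local(elem.tag) != "page":
            continue
        title = text = None
        for child in elem.iter():
            if local(child.tag) == "title":
                title = child.text or ""
            elif local(child.tag) == "text":
                text = child.text or ""
        if title and text is not None:
            fname = title.replace("/", "_") + ".txt"  # naive sanitation, fine for a sketch
            with open(os.path.join(out_dir, fname), "w", encoding="utf-8") as f:
                f.write(text)
        root.clear()  # drop finished pages so memory stays flat

split_dump("enwiki-latest-pages-articles.xml", "articles/")
```

On the real dump you'd point this at the decompressed file (or pass a bz2.open() file object instead of a filename); either way, memory stays flat because each page is thrown away as soon as it's written out.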
So, when someone requests a page, the gopher server reads that file, does some parsing, and returns the result as a gopher response. Sounds simple, right? Not quite, because parsing the contents of a Wikipedia entry is also a mess. It's part wikitext, part HTML, and there are plenty of places where both are broken. If I were just outputting HTML, I could probably get away with it. But since this is Gopher, I really needed to format the results as plain text. I spent a while writing an incredibly messy parser, and the imperfect results are what you see on Gopherpedia now. Sorry for all the flaws.
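To give a flavor of what "some parsing" means, here's a toy version of the wikitext cleanup. It's nothing like the real parser -- just a few regexes to show the shape of the problem; real entries have nested templates, tables, and broken markup that this won't touch:

```python
import re

def wikitext_to_plain(src):
    """A very rough wikitext-to-plain-text pass: strip markup, keep the prose."""
    text = re.sub(r"<ref[^>]*/>", "", src)                            # self-closing footnotes
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # inline footnotes
    text = re.sub(r"<[^>]+>", "", text)                               # stray HTML tags
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                        # non-nested templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)     # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                 # bold/italic quote marks
    text = re.sub(r"\n{3,}", "\n\n", text)                            # collapse blank runs
    return text.strip()

print(wikitext_to_plain("'''Gopher''' is a [[communications protocol|protocol]] for..."))
# -> Gopher is a protocol for...
```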
Anyway, this was a fun project, and it occupied a pleasant chunk of my spare time over the last year or two, but it's time to release it into the wild. Unless I'm mistaken, this is now the largest gopher site in existence. There are about 4.2 million pages on Gopherpedia, totaling somewhere over 10GB of data.
Here's my favorite page on the site -- the gopherified Wikipedia entry for Gopher.
Please note, this is in extreme beta and is likely to break; just let me know if you have any problems. Enjoy!