My last release for Project Dump week is Gopherpedia --
a mirror of Wikipedia in gopherspace. If you happen to have a gopher
client, you can see it at gopherpedia.com on port 70. Otherwise, you
can browse to gopherpedia.com and view it via a web proxy.
A couple of years ago, I landed on the idea of a gopher interface to
Wikipedia. Originally it was probably a joke, but it stuck with me. So
one day I registered a domain name and got to work. The first thing I
needed to do was build a gopher server, because none of the currently
available options were up to the task. So I built
Gopher2000. Then, I quickly realized that the current
gopher proxies weren't any good either, so I built GoPHPer.
Once both of those were written (well over a year ago), it didn't seem
like there was much left to be done -- gopherpedia should've been
ready to launch.
But I hadn't reckoned on the challenges of churning through a database dump.
Wikipedia is very open. They have an API which you can use to search
and query documents, and they provide
downloadable archives of their entire collection of
databases. They encourage you to download these, mirror them, etc.
My first implementation of gopherpedia used the API. This worked well,
but had two problems. First, it was a little slow, since it needed to
query a remote server for every request. Second, Wikipedia prohibits
using the API this way -- if you want to make a mirror of their
website, they want you to download an archive and use that, so their
servers aren't overloaded.
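For reference, that first version did roughly the equivalent of this (a Python sketch against the public MediaWiki api.php endpoint, just for illustration -- not the actual gopherpedia code):

```python
# Rough sketch: fetch an article's plain-text extract from the MediaWiki API.
import requests

def fetch_extract(title):
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "titles": title,
            "format": "json",
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    # Results are keyed by page id, so grab the first (and only) entry.
    return next(iter(pages.values())).get("extract", "")

print(fetch_extract("Gopher (protocol)")[:500])
```

It works, but every gopher request turns into a round trip to Wikipedia's servers, which is exactly what they ask mirrors not to do.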
So I downloaded a dump of their database, which is a single 9GB
compressed XML file. Nine. Gigabytes. Compressed. A single file.
Then I took the opportunity to learn about streaming XML parsers.
Basically, I wrote a parser script that processes the file as it reads
it, instead of loading the whole thing into memory
at once, which was clearly impossible. The script splits up the Wikipedia
entries and stores them as flat text files. Running that script took a
couple days on my extremely cheap Dreamhost server -- that's right, I
have a gopher server hosted on Dreamhost.
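If you haven't used a streaming parser before, the core idea looks roughly like this (a Python sketch using iterparse; the real script differs, and the dump filename here is made up):

```python
# Rough sketch of streaming a Wikipedia pages-articles dump without loading
# it into memory. Filenames and output layout are illustrative only.
import bz2
import os
import xml.etree.ElementTree as ET

DUMP = "enwiki-pages-articles.xml.bz2"  # hypothetical dump filename
OUT_DIR = "pages"
os.makedirs(OUT_DIR, exist_ok=True)

def localname(tag):
    # iterparse reports tags as "{namespace}page"; strip the namespace.
    return tag.rsplit("}", 1)[-1]

with bz2.open(DUMP, "rb") as dump:
    for _event, elem in ET.iterparse(dump, events=("end",)):
        if localname(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = localname(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        if title and text:
            filename = title.replace("/", "_") + ".txt"
            with open(os.path.join(OUT_DIR, filename), "w", encoding="utf-8") as out:
                out.write(text)
        # Throw away the parsed element so memory use stays flat.
        elem.clear()
```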
So, when someone requests a page, the gopher server reads that file,
does some parsing, and returns the result as a gopher response. Sounds
simple, right? Not quite, because parsing the contents of a Wikipedia
entry is also a mess. It's part wikitext, part HTML, and there are
plenty of places where both are broken. If I were just outputting HTML,
I could probably get away with it. But since this is Gopher, I really
needed to format the results as plain text. I spent a while writing an
incredibly messy parser, and the imperfect results are what you see on
gopherpedia now. Sorry for all the flaws.
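To give a flavor of the problem, here's a toy version of that kind of cleanup (a few naive Python regexes, nothing like complete -- the real parser is messier and still imperfect):

```python
# Toy wikitext-to-plain-text cleanup for illustration only; real entries
# contain nested templates, tables, and broken markup that this ignores.
import re
import textwrap

def wikitext_to_plain(source):
    text = source
    # Drop simple templates like {{Infobox ...}} (naive: ignores nesting).
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Turn [[target|label]] into "label" and [[target]] into "target".
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Strip bold/italic quote runs ('''like this''').
    text = re.sub(r"'{2,}", "", text)
    # Remove leftover HTML tags such as <ref> markers.
    text = re.sub(r"<[^>]+>", "", text)
    # Gopher clients like short lines, so wrap each line to 70 columns.
    return "\n".join(textwrap.fill(line, width=70) for line in text.splitlines())

print(wikitext_to_plain("'''Gopher''' is a [[communications protocol|protocol]]."))
```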
Anyway, this was a fun project, and it occupied a pleasant chunk of my
spare time over the last year or two, but it's time to release it into
the wild. Unless I'm mistaken, this is now the largest gopher site in
existence. There are about 4.2 million pages on gopherpedia, totaling
somewhere over 10GB of data.
Here's my favorite page on the site -- the
gopherified Wikipedia entry for Gopher.
Please note, this is in extreme beta and is likely to break -- just let
me know if you have any problems. Enjoy!