The Making of @lists_of_lists
I thought I'd write something about how I made the bot @lists_of_lists, from start to finish. It's a relatively simple idea, so if you're interested in writing a bot for the first time, this might be a helpful guide.
I have a bit of an advantage for two reasons. First, I'm a professional programmer, and have been for many years. I know ruby very well, and it's the language I use to build most of my bots. Second, I wrote the library that I use to make most of my bots, so it's basically adapted to my needs.
That said, if you are not a developer, but want to make a bot, you definitely can, but you should probably expect to have to learn a little bit about coding, and also a little bit about server management, because getting your bot to run consistently is sometimes the hardest part of the process.
The Idea
I spent a lot of time exploring wikipedia's data downloads when I was building gopherpedia. I knew that there were a lot of 'list of' pages, and that some of them were amusing and interesting. I decided to see if I could download a list of them so that I could play around with the data.
Wikipedia offers database dumps at https://dumps.wikimedia.org/. The main files here are gigantic XML files that represent the complete contents of the website. Depending on what you are interested in, some of these XML files are 12GB or larger. That's a single XML file! Parsing those is a real challenge.
Luckily, they offer a much smaller file of just page titles. I downloaded that file, and searched it for pages with the words 'list of' or 'lists of' in the title. I ended up running this a few times, so I combined it all into a single shell command that looks like this:
curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz > enwiki-latest-all-titles-in-ns0.gz && gzcat enwiki-latest-all-titles-in-ns0.gz | grep -i 'List_of\|Lists_of' > lists.txt
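If you'd rather do the filtering in ruby instead of shelling out to grep, the same idea can be sketched in a few lines. This is just an illustration, not part of the original workflow -- the method name is mine, and it assumes you've already downloaded the gzipped title dump:

```ruby
require 'zlib'

# Stream a gzipped title dump and write any line containing
# 'list_of' or 'lists_of' (case-insensitive) to a plain-text file.
def filter_list_titles(src, dest)
  pattern = /lists?_of/i
  Zlib::GzipReader.open(src) do |gz|
    File.open(dest, "w") do |out|
      gz.each_line { |line| out.write(line) if line =~ pattern }
    end
  end
end
```

You'd call it with something like filter_list_titles("enwiki-latest-all-titles-in-ns0.gz", "lists.txt"). Streaming through GzipReader avoids decompressing the whole file to disk first.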
At that point, I had a text file that looked a little like this:
"List_of_the_works_of_Charles_Cottet_depicting_scenes_of_Brittany
"List_of_the_works_of_Charles_Cottet_depicting_scenes_of_Brittany"
'List_of_Mongolian_musical_instruments
(List_of_Toni,la_Chef_episodes)
/List_of_Parliament_of_Australia_Reports_on_Sport
1996_World_Monuments_Fund_List_of_Most_Endangered_Sites
1996_World_Monuments_Watch_List_of_Most_Endangered_Sites
1998_World_Monuments_Fund_List_of_Most_Endangered_Sites
1998_World_Monuments_Watch_List_of_Most_Endangered_Sites
2000_World_Monuments_Fund_List_of_Most_Endangered_Sites
Sit on it for a year
Once I had the data, I had no idea what I actually wanted to do with it. I thought about running it through a Markov chain tool, or maybe swapping out words randomly, adding adjectives and modifiers, etc, etc.
I couldn't really decide what to do, so I didn't do anything. I let the data sit around for a year or so.
Eventually, I decided to just keep it simple and make a bot that would simply iterate through the list of lists. I randomized the data to make it a little more interesting:
gshuf lists.txt > lists-random.txt
(gshuf is an OSX command to randomly shuffle the lines of a file. If it's not installed already, you can install it via brew install coreutils. On Linux, there's a command called shuf that does the exact same thing. I suspect it's pre-installed on most Linux systems. Thanks to @ckolderup for pointing all of this out!)
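If installing coreutils is a hassle, the shuffle can also be done in a couple lines of ruby (shuffle_file is just a name I've made up for this sketch):

```ruby
# Read every line of the source file, shuffle them, and write the
# result to a new file -- a dependency-free stand-in for gshuf/shuf.
def shuffle_file(src, dest)
  File.write(dest, File.readlines(src).shuffle.join)
end
```

For a file this size (a few MB of titles), reading it all into memory is no problem; you'd only need a streaming approach for something much larger.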
Start the Bot
I had the data, now I needed the bot. Amazingly, when I went to Twitter to register a new account, my first choice was available, so @lists_of_lists was born.
I made myself a directory to hold onto my bot files, and copied the data there. Then, I set up a Gemfile and got ready to install chatterbot:

mkdir lists_of_lists

I made a Gemfile that looks like this:

source "https://rubygems.org"
gem "chatterbot", :git => "git://github.com/muffinista/chatterbot.git"

Then I ran bundle to install chatterbot.
Chatterbot has a script which will walk you through the process of setting up a Twitter bot. It will also create a template file for the bot, and set up your credentials file. I ran it!
NOTE I ran all of this while being logged into Twitter as the account for the bot.
bundle exec chatterbot-register
It prints out a message telling me what happens next:
Welcome to Chatterbot. Let's walk through the steps to get a bot running.
Hey, looks like you need to get an API key from Twitter before you can get started.
Have you already set up an app with Twitter? [Y/N]
I haven't set up an app yet, so I put 'N':
> N
OK, I can help with that!
Please hit enter, and I will send you to https://apps.twitter.com/app/new to start the process.
(If it doesn't work, you can open a browser and paste the URL in manually)
Hit Enter to continue.
Twitter then presents a form for creating the app (they change this form a lot).
Once you've filled out that form, Twitter will issue you some API keys. I copied those keys into chatterbot-register, which was waiting for the input:
Once you've filled out the app form, click on the 'Keys and Access Tokens' link
Paste the 'Consumer Key' here: 123456
Paste the 'Consumer Secret' here: abcdefg
Now it's time to authorize your bot!
Do you want to authorize a bot using the account that created the app? [Y/N]
I do want to authorize this account, so I say so:
> Y
OK, on the app page, you can click the 'Create my access token' button
to proceed.
I do that, then I paste the results:
Paste the 'Access Token' here: 123456
Paste the 'Access Token Secret' here: 45678
Hooray, now I have two files! lists_of_lists.rb is a template file for my bot. It lists a bunch of features of chatterbot and gives you something to work from. lists_of_lists.yml has the credentials for the bot, and will also track some other information needed to send out tweets.
My idea for the bot is pretty simple. Each time it runs, it should open up the file with all the lists in it, read the next one, and tweet it out.

The bot will need to keep track of which line it sent out last, and update that value every time it runs. One of the features of chatterbot is that the YAML file which holds the configuration data is accessible to the bot, and is updated with any changes each time the bot is run. This means you can use it to track variables that you need to persist over time, such as the last index of a file that you used.
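Chatterbot handles that persistence for you, but the underlying idea is simple enough to sketch in plain ruby. Everything here (the method name, the string key, the file layout) is hypothetical -- it's the concept, not chatterbot's actual internals:

```ruby
require 'yaml'

# Load a YAML config file (or start fresh), bump a stored counter,
# and write the config back out so the value survives between runs.
def bump_index(path)
  config = File.exist?(path) ? YAML.load_file(path) : {}
  config["index"] ||= 0
  config["index"] += 1
  File.write(path, config.to_yaml)
  config["index"]
end
```

Each invocation of the script picks up where the last one left off, which is exactly what a cron-driven bot needs.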
So I start with some ruby to handle all of that:
SOURCE = "lists-random.txt"

bot.config[:index] ||= 0

if ENV["FORCE_INDEX"]
  bot.config[:index] = ENV["FORCE_INDEX"].to_i
end

data = File.read(SOURCE).split(/\n/)
source = data[ bot.config[:index] ]

puts source

# the page title will have underscores in it, get rid of those
tweet_text = source.gsub(/_/, " ")

This code sets the index variable (which can be overridden with the FORCE_INDEX environment variable), opens the file "lists-random.txt", turns it into an array by splitting on newlines, and then reads the proper value from that array.
Make it Nicer
At this point, I could just tweet that value out like this:
tweet tweet_text
And be done. I decided that would be a little boring though, and I started to wonder about pulling an image from the wikipedia page for the list. Some lists have images on them, and they can be pretty funny.
Wikipedia has an API, and there are a few ruby libraries for accessing it. I decided to check out the official client since I had never used it before. My assumption was that I would need to parse out images from the source text, but it turns out that there is a method you can use to get a list of images! Anyway, here's that code
page = Wikipedia.find( source )
opts = {}

# check if there are any images
if page.image_urls && ! page.image_urls.empty?
  puts page.image_urls.inspect

  # pick an image at random
  image_url = filter_images(page.image_urls).sample
  puts image_url

  if image_url && image_url != ""
    # make a local copy of the image
    opts[:media] = save_to_tempfile(image_url)
  end
end
I added a simple method filter_images which rejects any SVG files:

def filter_images(list)
  list.reject { |l| l =~ /\.svg$/ }
end
And a second method save_to_tempfile which makes a local copy of the image:

def save_to_tempfile(url)
  uri = URI.parse(url)
  ext = [".", uri.path.split(/\./).last].join("")
  dest = File.join "/tmp", Dir::Tmpname.make_tmpname(['list', ext], nil)
  puts "#{url} -> #{dest}"

  open(dest, 'wb') do |file|
    file << open(url).read
  end

  # if the image is too big, let's lower the quality a bit
  if File.size(dest) > 5_000_000
    `mogrify -quality 65% #{dest}`
  end

  dest
end
This method has one additional twist, which is that it checks the size of the downloaded file. If it's too large, it runs the ImageMagick command mogrify on it to drop the quality down.
At this point, I have the text of a tweet, a page object from the Wikipedia API library, and a hash that might have a file in it. I combine it all together and tweet it out:
output = [ tweet_text, page.fullurl ].join("\n")

begin
  tweet(output, opts)
rescue Exception => e
  puts e.inspect
end
Finally, I increment the index variable.
bot.config[:index] += 1
When the script is done running, this value will be updated in the YAML config file for the bot.
During this whole process, I ran the script a couple times. Chatterbot has a debug_mode command, which you can use to run a script without actually sending a tweet, which is pretty handy.
I'm a pretty messy coder, especially when I'm working on personal side projects, so I fixed a couple bugs, spent a while cleaning up my junky code, etc, etc. Once I was happy with it, I uploaded my code to the server where I run my bots.
Then I needed to set up a cron job to run the bot every few hours. I decided to run the bot every two hours for starters (I might slow it down later), and for variety I run it at 2 minutes past the hour. This is what the job looks like:
2 */2 * * * . ~/.bash_profile; cd /var/stuff/lists_of_lists/; bundle exec ./lists_of_lists.rb >> tweets.log 2>&1
The first bit specifies when the job runs. The rest of it is the command that executes the bot. cron jobs usually run in a different environment than you get when you login to a server via SSH, so you need to explicitly load your environment, cd into the directory where the script is, and run the script. The >> tweets.log 2>&1 bit sends any output into the tweets.log file, which I can check for any errors/etc.
Anyway, that's about it! I've put the code on github -- please feel free to take it and adapt it to your needs!