Real-Time Tracking of Swearing on Twitter

Today I'm launching WTFLevel, a website that tracks the rate and magnitude of swearing on Twitter and displays the data in real time with a couple of snazzy dynamic charts. I wanted to launch it in time for the election, and I just made it.

PREDICTION: there will be a lot of swearing on Twitter tomorrow and the next day.

I have worked on a number of projects with Twitter: a handful of bots that mostly entertain people, a library devoted to building them, and a few other random unpublished or discarded projects. Twitter can be fairly annoying, and I doubt their API will last in its current form for much longer, but it's still fun to work with their data.

Over time I noticed that people on Twitter swear a lot. I started wondering: do people swear more online than they do in real life? And could you identify trends or useful information in the rates of cursing on Twitter? WTFLevel is an attempt to begin to answer those questions.

Using the Twitter Streaming API, I scan tweets for a collection of swear words and other curse-like expressions. I calculate two values from that data: the rate of tweets that contain swears relative to those that do not, and the magnitude of sweariness in those tweets. For example, a tweet with several swears in it has a higher magnitude than one with only a single swear.
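Here is a minimal sketch of how those two values could be computed for a batch of tweets. It is not the site's actual code: the word list `SWEARS`, the tokenizer, and the functions `swear_count` and `summarize` are hypothetical, the rate is taken as the share of all tweets that contain a swear, and the magnitude is assumed to be the average number of swears per sweary tweet.

```python
import re

# Hypothetical placeholder word list; the real list is much longer.
SWEARS = {"damn", "hell", "crap"}

# Simple tokenizer: runs of letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def swear_count(text):
    """Count how many words in a tweet are on the swear list."""
    return sum(1 for word in WORD_RE.findall(text.lower()) if word in SWEARS)


def summarize(tweets):
    """Return (rate, magnitude) for a batch of tweet texts.

    rate: fraction of tweets that contain at least one swear
    magnitude: mean number of swears per sweary tweet (assumed definition)
    """
    counts = [swear_count(t) for t in tweets]
    sweary = [c for c in counts if c > 0]
    rate = len(sweary) / len(counts) if counts else 0.0
    magnitude = sum(sweary) / len(sweary) if sweary else 0.0
    return rate, magnitude
```

Matching whole words rather than substrings keeps "hello" from counting as "hell", which matters a lot when the list contains short words.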

For fun, I invented a threat level scale, basically a spoof of the Homeland Security threat level, since the notion of color-coded threat levels hasn't been sufficiently mocked yet.

threat level

I also tried to keep in mind the DEFCON system, which always gave me chills during my childhood. And of course, who could forget WarGames?

The level is assigned according to the current rate of swearing, with a little math tossed in to predict if the rate is increasing or decreasing. The colored bars displayed on the graph correspond to the levels.
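One way the "little math" could work is to fit a straight line to the most recent rate samples and project it one step ahead, then pick the level from the projected value. The sketch below assumes exactly that. The thresholds and color names in `LEVELS`, the `lookahead` parameter, and both function names are made up for illustration and are not the values the site uses.

```python
# Hypothetical (lower bound on swear rate, level name) pairs, lowest first.
LEVELS = [
    (0.00, "green"),
    (0.05, "blue"),
    (0.07, "yellow"),
    (0.09, "orange"),
    (0.11, "red"),
]


def slope(values):
    """Least-squares slope of values against their index (change per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def threat_level(recent_rates, lookahead=1):
    """Pick a level from the latest rate, nudged by the recent trend."""
    if not recent_rates:
        return LEVELS[0][1]
    projected = recent_rates[-1] + lookahead * slope(recent_rates)
    level = LEVELS[0][1]
    for floor, name in LEVELS:
        if projected >= floor:
            level = name
    return level
```

With these made-up thresholds, a rising series such as `[0.04, 0.06, 0.08]` lands one level higher than its latest value alone would, and a falling series such as `[0.10, 0.08, 0.06]` lands one level lower.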

I made a decision when I started the project to only look at tweets that are reported as being in English, and to only look for English swear words. This means I can't get a really good idea of the global swearing status, but I don't have the knowledge to implement a decent system for tracking swears in other languages. That said, I am really amazed to see that people swear a lot more during what is roughly the evening hours in America.

I expected to see a more constant level of swearing through the day. It's not like there aren't reasons to swear in the morning or something. So while I worked on the website, I spent a little time researching the use of profanity in real life, to get an idea of how it compares to online usage. According to Wikipedia's article on Profanity:

Analyses of recorded conversations reveal that roughly 80–90 spoken words each day – 0.5% to 0.7% of all words – are swear words, with usage varying from between 0% to 3.4%. In comparison, first-person plural pronouns (we, us, our) make up 1% of spoken words.

Over two months of monitoring, I found that about 6.97% of tweets had a swear in them. I need to work out the word count for that, but it would seem to be roughly comparable to this analysis. By the way, the article cited by Wikipedia is fascinating.

But I find it really interesting that there's an evening peak in the data. Since we're measuring the rate here, and not the totals, I expected swearing to be at least somewhat constant -- it didn't seem like there would be a reason for there to be fewer sweary tweets in the morning as opposed to the evening. I need to do a little digging into the data to see if I can figure out if there's anything obvious that can explain what is happening here. I might also change the output a bit -- it's interesting to see the current rate of swearing, but it might also be interesting to know how much higher/lower it is than it usually is for the given time of day.
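A rough sketch of that "compared to usual" idea: average the historical rate for each hour of the day, then express the current rate as a multiple of that average. The input format (a list of `(hour, rate)` pairs) and the function names are assumptions for illustration, not how the site stores its data.

```python
from collections import defaultdict


def hourly_baseline(history):
    """Average swear rate for each hour of day.

    history: iterable of (hour, rate) pairs, with hour in 0..23.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hour, rate in history:
        totals[hour] += rate
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


def relative_to_usual(current_rate, hour, baseline):
    """Current rate as a multiple of the usual rate for that hour.

    Returns None when there is no (or a zero) baseline for the hour.
    """
    usual = baseline.get(hour)
    if not usual:
        return None
    return current_rate / usual
```

A value of 1.0 would mean "about normal for this time of day", so an evening peak that happens every day would no longer look like an event, while an unusually sweary morning would stand out.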

If you like to follow computer programs on Twitter, you can follow @WTFLevel, and get notifications whenever shit blows up or calms down.

Technical Notes

If you care about how things like this are implemented, you can check out the WTFLevel implementation details.
