An analysis of 1 month of tweets for event detection

Although I’ve worked in a variety of industries doing data science and machine learning projects, I’d never actually worked directly with Twitter data before today, so this was a new challenge.  I found it an exciting exercise and am guilty of diving down the rabbit hole exploring unfamiliar territory.

I knew immediately a major challenge was going to be preprocessing the data, as social media data is notoriously noisy.  How noisy is it?  Noisier than Gimli’s gastrointestinal tract after eating way too much lembas bread.  As A. Bhoi (2017) described it succinctly: “Identification of named entities (NEs) from microblog contents like twitter is a challenging task due to their noisy and short nature and lack of contextual information.”  Couldn’t have said it better myself.

The first thing to do is take a look at the raw data. Let’s open the chicago.csv file in both Excel and LibreOffice Calc. It turns out the two handle the data differently, particularly with regard to non-ASCII characters (e.g. emojis).
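For the Python side of things, a minimal load looks something like this (a sketch that assumes the file is UTF-8 encoded and comma-delimited; I’m not assuming anything about its column names):

```python
import pandas as pd

# Read the raw dump, being explicit about the encoding so emojis survive intact
df = pd.read_csv("chicago.csv", encoding="utf-8")

print(df.shape)   # how many tweets and columns we're working with
print(df.dtypes)  # quick sanity check on the parsed column types
print(df.head())  # eyeball a few rows, emojis and all
```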

First things first: create a virtual environment. I’ve found from past experience that keeping your projects isolated from each other is paramount to preventing the chaos of unintended package versions being used with a given project. Virtual environments are a key tool for keeping your sanity and some semblance of organization. So is a package manager. The old way was virtualenv & pip. Unfortunately, that requires the unacceptably tedious chore of manually updating the requirements.txt file each time a package is added to or updated in your project. As Frank Costanza from Seinfeld said, “There had to be another way!!” https://youtu.be/cFmEYOnpEkc

Fortunately, within the last year we’ve gotten “pipenv”. I’m a fan and have found it to be the ideal tool for managing package versions and virtual environments.  I’m not the only one; it’s now the packaging tool recommended to the Python community by Python.org.

Previous Work

I’m a firm believer in not reinventing the wheel, for the sake of time-efficiency.  So the first thing I did was a fairly thorough literature review on extracting events from Twitter data.  It turns out there’s a fair amount of prior research, which will come in handy.

Zhou et al (2016) wrote an article “Real world city event extraction from Twitter data streams” that outlines an unsupervised method to extract real world events from Twitter streams.  This is a great starting point, as Zhou delineates several of the previous attempts at event detection and their shortcomings:

There is also some work which focuses on open domain events. Becker et al. [3] provide a combination of online clustering and classification to distinguish real world events and non-events, but the work does not provide detailed classifications nor any explanation of the detected events.

Ritter et al. [7] develop an open-domain event extraction and categorization system for Twitter. The system applies an LDA-based algorithm to detect topic clusters but requires manual inspection of the cluster types.

Unfortunately, Zhou’s solution focuses more on real world events that affect city services (e.g. traffic flow, weather, natural disaster), and he uses the same framework of 7 categories that Ritter et al. used.

He explains this choice:

Several similar event types are also subsumed into a categorization that encompasses those types, e.g. concert, festival, parade into ‘culture’. Since these events will result in a similar influence on the city, it is unnecessary to classify them into separate types.

Okay, so they group all events of a culture type into a single parent class, “Culture”.  We actually want to identify the child-class events (e.g. concert, festival, parade, sports game, protest), but again, this is a good starting point.

The great attribute of an unsupervised approach is that you don’t have to worry about pre-labeling a ton of events for use in your training set.  This is good because in the real world (and in your test set) the variety of events can and will be myriad.  So LDA allows a much more flexible, dynamic way of identifying potential events.

To design a generic solution and avoid the need of creating a training keyword set for each city, an unsupervised method based on Twitter-LDA (Twitter Latent Dirichlet Allocation) is proposed.
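To give a flavor of that workflow, here is a rough sketch using vanilla LDA in gensim (not the Twitter-LDA variant the paper actually describes), with a few made-up toy tweets standing in for real, cleaned data:

```python
# Vanilla LDA over tokenized tweets; `tweets` would be the real, preprocessed corpus.
from gensim import corpora, models

tweets = [
    ["reds", "game", "tonight", "gabp"],
    ["traffic", "jam", "downtown", "accident"],
    ["concert", "bogarts", "war", "drugs"],
]

dictionary = corpora.Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

# num_topics is the k we'll agonize over later in this post
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=10, random_state=42)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```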

 

Twevent (Li et al, 2012)

It’s over five years old, but the “Twevent: Segment-based Event Detection from Tweets” paper is nearly tailor-made for this task of event detection and evaluation.

In summary, Twevent solves the event detection (ED) problem with three components:

  1. tweet segmentation,
  2. event segment detection, and
  3. event segment clustering

Twevent basically approaches event detection as a clustering problem with burstiness as the most important attribute of detecting an event.
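As a rough illustration of that burstiness idea (a toy z-score-style measure, not Twevent’s actual probabilistic model):

```python
from statistics import mean, stdev

def burstiness(current_count, history):
    """How unusual is this window's count for a segment, relative to past windows?"""
    mu = mean(history)
    sigma = stdev(history) or 1.0  # guard against a zero-variance history
    return (current_count - mu) / sigma

# e.g. the segment "fountain square" shows up 42 times this hour vs. a quiet past
print(burstiness(42, [3, 5, 2, 4, 6, 3]))  # a large positive value means bursty
```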

Feature-Rich Segment-Based News Event Detection on Twitter (Y Qin et al, 2013)

Qin et al. built upon the Twevent model, but enriched it with additional features and focused specifically on newsworthy events (as the paper’s title suggests).

Event Detection in Twitter: A machine-learning approach based on term pivoting (F Kunneman, 2014)

This Dutch team expanded upon existing work that uses Twevent (2012) for event detection.  They build upon Qin et al. (2013) and focus on training a classifier over several features of an event to distinguish significant events from mundane, insignificant ones.

But rather than clustering based on segments, as Qin (2013) did, the Dutch team based their clustering model on unigrams.  When I read “unigrams”, it gave me pause, as most NLP-literate individuals recognize that with an n-gram model, setting n=1 (i.e. a unigram) is suboptimal for English texts.  Rather, you can often get much better predictive performance by using bigrams or even trigrams.  Kunneman et al. explain their rationale, however:

…in Dutch…, word formation is characterized by compounding, which means that Dutch unigrams…capture the same information as English bigrams.  Compare, for instance, ‘home owner’ to ‘huizenbezitter‘…

This is a classic teaching moment to aspiring data scientists working on an NLP problem:  always understand the language you are dealing with, and question your assumptions.  English may be the de facto language in the US, but Twitter and social media are global platforms, and you will often find that an NLP approach that works with one language (e.g. English), may completely fall apart in another.  Moving on.
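The distinction is easy to see with a few lines of plain Python, using the paper’s own “home owner” / “huizenbezitter” example:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined as strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

english = "the home owner pays the mortgage".split()
dutch = "de huizenbezitter betaalt de hypotheek".split()

print(ngrams(english, 1))  # 'home' and 'owner' are split apart
print(ngrams(english, 2))  # 'home owner' only appears at n=2
print(ngrams(dutch, 1))    # 'huizenbezitter' is already a single unigram
```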

I also liked that Kunneman defined explicitly what they meant by “significant”:

As a definition of what makes an event significant, we follow the definition given by [8]: ’Something is significant if it may be discussed in the media.’ As a proxy, we borrow the idea of [7] to include the presence of a certain name or concept as an article on Wikipedia as a weight in determining the significance of the candidate cluster of terms.

 

Location-Specific Tweet Detection and Topic Summarization in Twitter — V Rakesh – 2013

Rakesh’s team argue (rightly) that the geolocation of users does not necessarily correspond with the location specificity of the event they are tweeting about.  They

classify a tweet to be location-specific “not only based on its geographical information, but also based on the relevancy of its content with respect to that location. In this paper, we aim to discover such location-specific tweets by combining the tweets’ content and the network information of the user.”

They built a weighting scheme called Location Centric Word Co-occurrence (LCWC) that uses both the content of the tweets and the network information of tweeters’ friends to identify tweets that are location specific.  Their LCWC uses the following to build a likelihood score:

  1. mutual information (MI) score of tweet bi-grams;
  2. the tweet’s inverse document frequency (IDF);
  3. the term frequency (TF) of tweets, and
  4. the user’s network score to determine the location-specific tweets.

Why the use of bi-grams?  In their own words:

users tend to use a combination of hash-tags and words to describe the event; therefore, relying simply on a uni-gram model cannot provide the much needed information about the event.
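Just to illustrate the mutual-information piece, here is a sketch using NLTK’s PMI-based collocation finder (the full LCWC score also folds in TF, IDF, and the user’s network score, which I’m not reproducing here):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("great salsa dancing tonight at fountain square "
          "salsa dancing fountain square crowd is huge").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams we've only seen once

# The highest-PMI bigrams are the phrases most likely to name an event or place
print(finder.nbest(measures.pmi, 5))
```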

Now one drawback to Rakesh’s approach is that it relies heavily upon the network interactions of Twitter users to infer the geographic location of events.  This works well when a majority of your users have friends tweeting about the same event, but this is not always the case.


“Geoburst — Real-time local event detection in geo-tagged tweet streams” — C Zhang et al, 2016

Geoburst was considered “state of the art” not even two years ago, but unfortunately it has pretty terrible precision (~30%), which simply will not do.

The other big issue with Geoburst, and with most pre-2016 event detection models, is that it’s really hard to craft a universal ranking function for accurate candidate event filtering.


“Finding and Tracking Local Twitter Users for News Detection” — H Wei – 2017

 

Enter the best (as of early 2018) event detection system I’ve discovered:  Chao Zhang et al’s 2017 “TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams”.

Zhang’s team has successfully addressed and solved the big issues with existing social media event detection systems, notably:

  1. capturing short-text semantics, and
  2. filtering uninteresting activities

 

They use a two-step detection scheme:

  1. divide the tweets in the query window into coherent geo-topic clusters (a rough sketch of this step follows the list below)
    1. learn multimodal embeddings of the location, time, and text
    2. cluster the tweets with a Bayesian mixture model
  2. extract a feature set for classifying the candidate events

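The paper’s own machinery involves learned multimodal embeddings and a custom Bayesian mixture model, but the flavor of step 1 can be sketched with scikit-learn’s BayesianGaussianMixture over simple per-tweet feature vectors (a simplified stand-in, not the authors’ implementation):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy per-tweet features: [latitude, longitude, hour-of-day].
# In a real attempt, text embeddings would be concatenated on as well.
X = np.array([
    [41.880, -87.630, 19], [41.881, -87.629, 19], [41.879, -87.631, 20],  # one tight cluster
    [41.950, -87.655,  9], [41.949, -87.656,  9], [41.951, -87.654, 10],  # another
])

# n_components is only an upper bound; the Dirichlet prior prunes unused clusters
mix = BayesianGaussianMixture(n_components=4, weight_concentration_prior=0.01,
                              random_state=0)
labels = mix.fit_predict(X)
print(labels)  # geo-topic cluster assignment for each tweet
```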
What I found really cool about this approach was that they ranked the features of their model by importance.

It turns out that the concentration of latitude and longitude, what they term “spatial unusualness” and “temporal unusualness”, and “burstiness” are the most important features in the model.

Other Related Work

Zhou & Chen (2015) wrote an excellent paper, “An Unsupervised Framework of Exploring Events on Twitter: Filtering, Extraction and Categorization”.

In particular, Alan Ritter wrote two papers: “Named Entity Recognition in Tweets: An Experimental Study” (2011) and “Open Domain Event Extraction from Twitter” (2012).  I also found Ritter’s GitHub repo intriguing.

But the big win I discovered was “A Deep Multi-View Learning Framework for City Event Extraction from Twitter Data Streams” from Farajidavara et al (2017).  This team from the UK quite literally wrote the paper on extracting events from Twitter data.

After finishing the lit review I started doing some basic EDA (exploratory data analysis).  I always love this part of a project: getting familiar with the new dataset, analyzing relationships between the variables, and building a high-level understanding of what the data is showing.

The task is to detect (and classify) events (e.g. concert, sports game, festival, etc) from Twitter data.  As with much of data science, there are many ways of skinning the proverbial cat.

How do we infer that an event is occurring based on the data?  The first and simplest approach is to naively search each tweet’s text for specific keywords that allude to or explicitly mention an event type.  This basically amounts to creating a specific event-name dictionary (a toy version is sketched below).  The problem is that people aren’t robots, and few people would tweet something like, “I’m heading down to watch the sports game tonight!!”  Instead they’d probably substitute “Reds” or “#CincinnatiReds” for the generic “sports game”.  We might get away with this approach for certain events, as most people would refer to a concert as a “concert”.  Same for a festival.  But even that is complicated by the myriad instances of the Concert object.  For example, someone tweets, “I’m excited to see The War on Drugs tonight at Bogarts!”  Our model would fail to infer that “The War on Drugs” is a band and that the context of the tweet is a concert.  Indeed, the word “concert” is never explicitly mentioned.
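Here is what that naive dictionary approach looks like in practice; the keyword lists are made up for illustration:

```python
# A naive event-keyword dictionary matcher, mostly to show where it falls down.
EVENT_KEYWORDS = {
    "concert":  ["concert", "gig", "live music"],
    "sports":   ["game", "match", "#cincinnatireds"],
    "festival": ["festival", "fest"],
}

def detect_event_type(tweet):
    text = tweet.lower()
    for event_type, keywords in EVENT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return event_type
    return None

print(detect_event_type("Heading down to the Reds game tonight!!"))      # 'sports'
print(detect_event_type("Excited to see The War on Drugs at Bogarts!"))  # None: the miss described above
```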

So how to solve this conundrum?  There are several possible approaches, each with strengths and weaknesses.

One straightforward way is classic Named Entity Recognition (NER); a quick sketch follows the list below.  There are three excellent papers on this very task:

  1. Vavliakis (2013) wrote a decent paper, “Event identification in web social media through named entity recognition and topic modeling”, that is highly relevant to our use case.
  2. Abinaya (2014) also covers this in “Event identification in social media through latent dirichlet allocation and named entity recognition”.
  3. “Analysis of named entity recognition and linking for tweets” by Derczynski (2015).
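As a quick illustration (using spaCy, which is my tooling choice here, not something these papers prescribe), off-the-shelf NER on a single tweet looks like this:

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I'm excited to see The War on Drugs tonight at Bogarts!")

for ent in doc.ents:
    # Prints whatever entities the general-purpose model manages to spot;
    # band names in noisy tweets are exactly where off-the-shelf NER struggles.
    print(ent.text, ent.label_)
```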

 

From LDA (Latent Dirichlet Allocation) to LSI (Latent Semantic Indexing) to HDP (Hierarchical Dirichlet Process), there is no shortage of methods to group texts into a set of topics.

One of the many challenges in using LDA or LSI is tuning the hyperparameters, namely: how many topics should the model look for?  It’s not an easy question to answer: choose too few and your model won’t find enough variety of topics; choose too many and your model will be unusably complex.  For example, instead of basketball_game it identifies topics at a much more granular level of detail (e.g. college_basketball_game, high_school_basketball_game, basketball_game_being_played_by_elves_versus_dwarves, etc.).

Peter Ellis writes about how to solve this using good ol’ cross validation in his blog post “Determining the number of “topics” in a corpus of documents”.  It’s unfortunately written in R rather than Python, but we won’t hold that against him 😉

Zhao et al. (2015) describe the issue:

“Lacking such a heuristic to choose the number of topics, researchers have no recourse beyond an informed guess or time-consuming trial and error evaluation. For trial and error evaluation, an iterative approach is typical based on presenting different models with different numbers of topics, normally developed using cross-validation on held-out document sets, and selecting the number of topics for which the model is least perplexed by the test sets… Using the identified appropriate number of topics, LDA is performed on the whole dataset to obtain the topics for the corpus. We refer to this as the perplexity-based method. Although the perplexity-based method may generate meaningful results in some cases, it is not stable and the results vary with the selected seeds even for the same dataset.”

Basically, we train a bunch of models over a range of values of k (where k is the number of latent topics to identify) and pick the k whose model is least perplexed by a held-out set of documents.
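Here is a sketch of that sweep with gensim, scoring each k by held-out perplexity and by topic coherence (train_texts and heldout_texts are assumed to be lists of tokenized tweets):

```python
from gensim import corpora, models
from gensim.models import CoherenceModel

def score_k(train_texts, heldout_texts, k_values=(5, 10, 20, 40)):
    """Fit LDA for several k and report held-out perplexity bound and coherence."""
    dictionary = corpora.Dictionary(train_texts)
    train_corpus = [dictionary.doc2bow(t) for t in train_texts]
    heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

    results = {}
    for k in k_values:
        lda = models.LdaModel(train_corpus, id2word=dictionary,
                              num_topics=k, passes=10, random_state=42)
        # per-word likelihood bound on the held-out tweets; higher (less negative) is better
        perplexity_bound = lda.log_perplexity(heldout_corpus)
        coherence = CoherenceModel(model=lda, texts=train_texts,
                                   dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        results[k] = (perplexity_bound, coherence)
    return results
```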

 

There are also several deep learning approaches to topic modeling.  In today’s age of new libraries and tools being released daily, it’s easy to get caught up in the latest shiny new toy.  It’s important to remember, however, to always start with a stupid model.  A wise guy named Al once said, “Everything should be made as simple as possible, but not simpler.”

 

We could use a dictionary of event keywords for direct event detection, and that certainly would grab some low-hanging fruit, but many if not most events will not be explicitly spelled out in such a dictionary.  Thus we need a more flexible, dynamic approach to event detection.

Michael Kaisser presented a pretty solid and relevant talk way back in 2013 at Berlin Buzzwords called “Geo-spatial Event Detection in the Twitter Stream”.  While somewhat dated, I borrowed several of the techniques he proposed to generate a score for a Twitter event.  One of the many challenges of identifying Twitter events is judging how likely it is that a potential event “candidate” is, in fact, an actual event.  From a Bayesian perspective, we can never be fully certain that a candidate event is an actual event, but we can greatly increase our confidence as we find more supporting evidence.

One way of addressing this is to count the number of unique users who are tweeting about a given candidate event.  If only a single user is tweeting about a candidate event, we probably shouldn’t place much confidence in it being an actual event.

In contrast, if, say, 7 unique users all tweet about a related topic within a certain time window, and perhaps even within a certain geolocation radius, we should (in true Bayesian fashion) update our prior beliefs and increase our confidence that said candidate event is an actual event.

Likewise, if you have lots of different people posting about the same thing from the same geographic location, that indicates a high probability of an event.

So the factors that should increase our confidence in a positive event detection include:

  1. the number of unique users tweeting about the candidate event
  2. the geographical proximity of the tweets referencing the candidate event
  3. the temporal proximity of those tweets
  4. the lexical similarity of the text content of the candidate event’s tweets

So in general, if a set of tweets contains similar topics and similar words, being tweeted from roughly the same location, clustered over a similar time, from different users, we should have high confidence that there exists an Actual Event concerning that set of tweets.
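To make that concrete, here is a toy confidence score along those lines; the functional forms and weights are arbitrary placeholders I made up for illustration, not something tuned on this dataset:

```python
import math

def candidate_event_score(n_unique_users, radius_km, time_span_hours, avg_text_similarity):
    """Combine the four signals above into a single 0-1 confidence score."""
    user_score = 1 - math.exp(-0.5 * n_unique_users)  # more unique users, more confidence
    geo_score = 1 / (1 + radius_km)                   # tighter geographic cluster, more confidence
    time_score = 1 / (1 + time_span_hours / 6)        # tighter time window, more confidence
    return (0.35 * user_score + 0.30 * geo_score
            + 0.15 * time_score + 0.20 * avg_text_similarity)

# seven users, within half a kilometre, over two hours, with similar wording
print(candidate_event_score(7, 0.5, 2, 0.8))
```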

We can further leverage toponyms in tweets and use them to infer likely geolocations (e.g. “Having a beer tonight at @16BitBar” or “Great salsa dancing tomorrow evening down at #fountainsquare”, etc).  Ajao et al. (2017) have a great paper on location inference techniques for Twitter.  But since all the tweets in this particular dataset contain geotagged coordinates, we will ignore the issue of inferring location from non-geotagged tweets.

Geographic proximity is probably a slightly stronger factor than temporal proximity, since events can unfold over hours, days, even weeks (e.g. the Cincinnati Reds’ baseball season, taken in its totality, could be considered an event, though it occurs over several months).

To determine which set of features makes for the optimal model (as measured by precision and recall), we could apply a random forest model to various candidate feature sets.
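A hedged sketch of that idea: X would hold one row of engineered features per candidate event (unique-user count, geographic spread, time span, lexical similarity, burstiness, and so on) and y a hand-labelled “real event or not” flag, which we would have to annotate ourselves for a sample of candidates:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

def evaluate_feature_set(X, y):
    """Score one candidate feature set with 10-fold cross-validated precision/recall."""
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    preds = cross_val_predict(clf, X, y, cv=10)
    return precision_score(y, preds), recall_score(y, preds)

# Refitting on all the data afterwards also yields a feature-importance ranking,
# similar in spirit to the TrioVecEvent analysis mentioned earlier:
#   clf.fit(X, y); print(clf.feature_importances_)
```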

Evaluating the Model

So how do we evaluate our unsupervised model?  Since we don’t have a set of labeled training data, how do we know the model is valid?

There are two common approaches to evaluating an unsupervised LDA model, and they both put a “human in the loop”: word-intrusion and topic-intrusion checks, in which a human judges whether an out-of-place word (or topic) can be spotted among a topic’s top words (or a document’s assigned topics).

 

Pretty much every event detection approach heretofore has used K-fold cross-validation (usually with K=10) to evaluate the model.  This is a sound approach, and with little reason to deviate from a solid method of evaluation, it is how I evaluated my model as well.

 

The problem with evaluating an unsupervised model like LDA is that there is no ground truth, so there is no straightforward cross-validation (Nikolenko, 2014).  One solution is to hold out a subset of documents (tweets) and then check their likelihood under the resulting model.

Another solution is to put a human back in the loop, as described above.

 


Summarizing a Detected Event

This is more of a secondary priority compared to actual event detection, but once an event has been detected, we need to summarize its content, at least for labeling purposes.  This too is a non-trivial task, and Rudrapal et al. (2018) wrote a 20-page article on the various ways to summarize an event.  Since I’ve read their article, I’ll spare you the details and summarize their summary: there are two main branches of Twitter topic summarization:

  1. based on summary content
    1. extractive summaries
    2. abstractive summaries
  2. based on event category
    1. generic summaries
    2. domain-specific summaries

 

The difference being that in the former, actual tweets concerning a given event are extracted and stitched together to summarize said event, while in the latter, a new summary is generated in natural language that may paraphrase rather than reuse the original tweet text.

There’s also a great summarization framework called Sumblr, which summarizes tweet streams with timeline generation.

Lin (2004) established the de facto way of evaluating text summarization, via several scores, in his paper “ROUGE: A Package for Automatic Evaluation of Summaries”.
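For example, with the rouge-score package (one of several ROUGE implementations, and my choice here rather than anything prescribed by the paper), scoring a generated event summary against a hand-written reference looks roughly like this:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The War on Drugs played a concert at Bogarts on Tuesday night."
generated = "Concert by The War on Drugs at Bogarts Tuesday night."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, generated).items():
    print(name, round(score.fmeasure, 3))  # overlap-based F1 for each ROUGE variant
```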
