Contents
Classification of Twitter posts using NLP
►, Joint work w/ Rachel Gorenstein
Link to full description of project
This project is the topic of the Master’s project of my current advisee Rachel Gorenstein. The goal is to use techniques from Natural Language Processing to decide whether or not a Twitter post (Tweet) is an example of trolling. This task is in a similar, though not identical, vein to Sentiment Analysis. Currently, we want to improve the consistency of the model’s accuracy and, hopefully, the accuracy itself. The model uses a vector space model for words and a k-Nearest Neighbors classifier.
In trial runs, using around 12,000 Tweets in the training set, we have seen an accuracy of 75% or more1, but not consistently.
Intro to Ideas. What is…?
Natural Language Processing (read)
Sentiment Analysis (read; watch)
Neural Network (read)
k-Nearest Neighbors model (read)
Current tasks. Reduce the number of distances between tweets that must be computed (they are computationally costly); the first method involves a “centroid distance” between the centroids of the vector representations of the words in each tweet, as in the sketch below.
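A minimal sketch of this pruning idea, under the assumption that word vectors are available as a dict-like map and that the costly word mover distance is then computed only on the shortlist; the function and variable names are illustrative, not the project’s actual code.

```python
import numpy as np

def centroid(tweet_tokens, word_vectors):
    """Mean of the vector representations of the tweet's in-vocabulary words."""
    vecs = [word_vectors[w] for w in tweet_tokens if w in word_vectors]
    return np.mean(vecs, axis=0)

def shortlist_by_centroid(query_tokens, training, word_vectors, keep=50):
    """Keep only the training tweets whose centroids lie closest to the query's centroid.

    `training` is assumed to be a list of (tokenized_tweet, label) pairs.
    """
    q = centroid(query_tokens, word_vectors)
    dists = [np.linalg.norm(q - centroid(tokens, word_vectors))
             for tokens, _label in training]
    order = np.argsort(dists)[:keep]
    return [training[i] for i in order]
```

The centroid distances are cheap to compute, so the expensive word mover distances only need to be evaluated against the shortlisted tweets.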
Augmented reality mobile game for practicing school mathematics
►, Joint work w/ Lin Deng, Jing Tian, Trevor Amburgey*, and Jeremy Keith*
(*: indicates a student)
Link to full description of project
(More coming soon)
Data analysis for insights in college basketball lineups
❚❚, Joint work w/ Derek Margulies*
(*: indicates a student)
Link to full description of project
(More coming soon)
Cycle diagrams, (Un)knotting, and pattern-avoiding permutations
❚❚, Joint work w/ Nathan McNew
Link to full description of project
(More coming soon)
Augmentation varieties in knot contact homology
❚❚, Paused
Link to full description of project
This topic was the main focus of my post-doctoral work. I believe it to be a fairly rich source of interesting questions at the intersection of knot theory and contact geometry. It is part of a niche topic (the study of Legendrian DGAs) within the subject of Symplectic Geometry, an area studied by a reasonably sized, but tight-knit, research community.
The goal is to understand the augmentation variety of knots and how it provides insights into, and ties together, a wide range of research in low-dimensional topology (an area with a different, but overlapping research community). While the topic might be considered to have a high entry point, there are many approachable sub-problems that could likely be solved without knowledge of any symplectic or contact geometry (or even topology!), but which would, nevertheless, be interesting to researchers within the community.
Intro to Ideas. What is…?
Big, wide-ranging topics (lots of jargon, even in the overviews)
Knot Theory. (Wikipedia) Cursory overview: read.
Contact & Symplectic Geometry. (Wikipedia) Cursory overviews: read & read. (For sort-of general audience) Interesting recent drama in the subject: read. (For the mathematically initiated!) MSRI lectures: watch; survey on CG: read; survey on SG: read.
Low-dimensional Topology. (Wikipedia) Cursory overview: read. (Mostly aimed at mathematicians) Good blog about LD topology: read.
Classification of Twitter Posts using NLP.
The data set was obtained from the online community kaggle.com.
Our approach is to build a Vector Space Model or Word Embedding (specifically, a Word2Vec model) from a set of training Tweets that each have a given label, troll or not_troll. Roughly speaking, this involves assigning a vector representation (of a fixed length) to each word, or rather, to each word in a restricted dictionary. The goal is to do this so that words which are very similar in meaning have vector representations that are very similar, or near each other2. Once the vector assignment is made, we use a distance between two Tweets called the Word Mover Distance, which uses the (Euclidean) distance between vector representations for the “cost of transportation.”3 Using this Word Mover Distance, we employ a k-Nearest Neighbors model for classification of the Tweet.
(⚠️ Technical details below) 🔝
Vector Representations. Building the vector space model involves an unsupervised approach4 that utilizes a neural network. The neural network has a structure like an autoencoder: a type of neural network that, mathematically speaking, is equal to a composition of two functions, and which tries to approximate the identity function (so the number of input nodes equals the number of output nodes) while passing through a lower-dimensional space in the middle.
However, unlike an autoencoder, the loss function to be minimized does not simply check how closely the input is reconstructed by the output. Instead, to describe the loss function, fix a number $c$ and define the context of a word within a Tweet to be the union of the set of $c$ words to its left and the set of $c$ words to its right. The loss function then measures how well the context can be predicted from the position of the word’s vector representation5, and a stochastic gradient descent process is used to look for weights on the network that optimize the loss function.
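To make the notion of “context” concrete, here is a small illustrative helper (hypothetical, not from the project) that generates the (center word, context word) pairs a skip-gram model is trained to predict.

```python
def skipgram_pairs(tokens, c=2):
    """Yield (center word, context word) pairs, where the context of a word is
    the c words to its left together with the c words to its right."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - c):i]
        right = tokens[i + 1:i + 1 + c]
        for context in left + right:
            yield (center, context)

# Example (c = 1):
# list(skipgram_pairs(["you", "wont", "believe", "this"], c=1))
# -> [('you', 'wont'), ('wont', 'you'), ('wont', 'believe'),
#     ('believe', 'wont'), ('believe', 'this'), ('this', 'believe')]
```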
Why do we get a semantically meaningful vector representation by trying to predict the context from only the position of the vector? Think of two words $w$ and $w'$ that are essentially perfect synonyms. Then (presuming the training corpus of text is sufficiently large to “capture” their synonymity) $w$ and $w'$ should be nearly interchangeable: the contexts of $w$ that appear in the training corpus are very much like the contexts of $w'$. This model would then highly favor having the vectors of $w$ and $w'$ be near each other.6
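A minimal sketch of building such an embedding with the gensim library’s Word2Vec implementation; the toy tweets, hyperparameters, and variable names below are assumptions for illustration, not the project’s actual setup.

```python
from gensim.models import Word2Vec

# Toy stand-in for the real corpus: a list of tokenized Tweets.
training_tweets = [
    ["free", "followers", "click", "this", "link", "now"],
    ["great", "game", "last", "night", "go", "team"],
    ["click", "now", "for", "free", "followers"],
]

model = Word2Vec(
    sentences=training_tweets,
    vector_size=50,   # length of each word's vector representation
    window=2,         # the context size c
    sg=1,             # use the skip-gram formulation
    min_count=1,      # with real data, a larger value restricts the dictionary
    workers=1,
)

word_vectors = model.wv                           # word -> vector lookup
print(word_vectors.similarity("free", "click"))   # cosine similarity of two words
```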
Word Mover Distance. Having a vector representation of words gives a notion of distance between two words: the Euclidean distance between their vectors. How do we extend this to a Tweet containing multiple words? The Word Mover Distance7 is a distance metric designed for exactly this purpose. It is a special case of the 1st Wasserstein or Kantorovich-Rubinstein metric8, which can be applied to transportation problems in optimization.
The idea is the following. The cost of moving a word $w$ to a word $w'$ is the Euclidean distance between their vector representations. Given two Tweets $T_1$ and $T_2$, the idea is to move each word in $T_1$ to a word in $T_2$, subject to two constraints (made precise below). The word mover distance between $T_1$ and $T_2$ is the minimal total (weighted) cost of doing this.
More precisely, suppose that there are $n$ words $w_1, \ldots, w_n$ being represented by vectors. Write $c(i,j)$ for the distance from the vector of $w_i$ to the vector of $w_j$. Also, write $d_i$ for the fraction of the words in $T_1$ that equal $w_i$ (similarly, $d'_j$ for the fraction in $T_2$ that equal $w_j$). The word mover distance is the minimum, over possible coefficients $T_{i,j} \ge 0$, of $\sum_{i,j} T_{i,j}\, c(i,j)$, under the constraints that $\sum_j T_{i,j} = d_i$ for each $i$ and $\sum_i T_{i,j} = d'_j$ for each $j$.
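As an illustration of this optimization problem, here is a small sketch that sets up the transportation linear program directly and solves it with scipy; `word_vectors` is assumed to be a dict-like map from words to vectors (such as a trained gensim model’s `wv`), and the function name is illustrative.

```python
from collections import Counter
import numpy as np
from scipy.optimize import linprog

def word_mover_distance(tweet1, tweet2, word_vectors):
    """Minimal-cost transport of the words of tweet1 onto the words of tweet2."""
    t1 = [w for w in tweet1 if w in word_vectors]
    t2 = [w for w in tweet2 if w in word_vectors]
    words1, words2 = sorted(set(t1)), sorted(set(t2))
    # d_i and d'_j: fraction of each tweet made up of each distinct word
    d1 = np.array([Counter(t1)[w] / len(t1) for w in words1])
    d2 = np.array([Counter(t2)[w] / len(t2) for w in words2])
    # cost c(i, j): Euclidean distance between the words' vectors
    cost = np.array([[np.linalg.norm(word_vectors[u] - word_vectors[v])
                      for v in words2] for u in words1])
    n1, n2 = len(words1), len(words2)
    # Equality constraints: row sums of T equal d1, column sums equal d2
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1.0
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1.0
    b_eq = np.concatenate([d1, d2])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun
```

In practice one would likely not set up the linear program by hand: gensim’s KeyedVectors class provides a `wmdistance` method that computes the same quantity.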
k-Nearest Neighbors Classifier (kNN). With the word mover distance defined, we use kNN to classify a Tweet $T$. That is, there is some set of pre-labeled training Tweets (say there are $N$ of them) and some number $k$ we have determined. To classify $T$ as troll or not_troll, find the $k$ Tweets, among those $N$ training Tweets, whose word mover distance to $T$ is smallest.
For the time being, we are using the “basic” kNN model that uses a simple majority vote to classify: having found those $k$ nearest Tweets, if the majority of them have label troll, then label $T$ as troll. If the majority are not_troll, then label $T$ as not_troll.
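Putting the pieces together, a minimal sketch of the majority-vote step might look like the following, assuming the `word_mover_distance` and `word_vectors` names from the earlier sketches and a training set of (tokenized tweet, label) pairs with labels "troll" and "not_troll".

```python
from collections import Counter

def classify_tweet(tweet, training, word_vectors, k=5):
    """Label a tokenized tweet by a majority vote among its k nearest training tweets."""
    dists = [(word_mover_distance(tweet, tokens, word_vectors), label)
             for tokens, label in training]
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _dist, label in nearest)
    return votes.most_common(1)[0][0]
```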
- Obtaining 75% accuracy in these trials is promising, particularly since there is a small training set (cf. accuracy comparisons here and here; note that the prediction of trolling/not_trolling is at least as difficult as positive/negative sentiment) ↩
- There are several ways this has been done, the Word2Vec model being a particular one. ↩
- The definition of this distance metric is inspired by the so-called Earth Mover Distance (or Wasserstein metric) between probability distributions. ↩
- No information is given on what the vector representation “should” be ↩
- In this project, the “skip-gram” model is used, with the context size $c$ being small, say 1 or 2. ↩
- This approach to semantic understanding is an application of the Distributional Hypothesis, which says that words that occur in mostly the same contexts carry similar meaning. See Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162. doi:10.1080/00437956.1954.11659520 ↩
- Kusner et al., 2015. “From word embeddings to document distances”. Proceedings of Machine Learning Research 37 (2015): 957–966. ↩
- Also known as the Earth Mover Distance in computer science ↩