Thomas Yokota



Wrap up: Jigsaw Toxic Comment Classification Challenge



Problem Definition

Founded by Jigsaw and Google, the Conversation AI research initiative asked participants in the Toxic Comment Classification Challenge to develop a model that classifies text as toxic across several subtypes (threat, obscene, insult, and identity hate). The data was sourced from Wikipedia's talk page edits. The competition was evaluated using ROC AUC, a scale-invariant metric that measures how well predictions are ranked.
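To see why scale invariance matters, here is a small sketch with toy labels and scores (not competition data): rescaling the predictions leaves the AUC unchanged, because the metric depends only on how the predictions are ranked.

```python
from sklearn.metrics import roc_auc_score

# Toy binary labels and prediction scores.
y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]

auc = roc_auc_score(y_true, scores)

# Multiplying every score by a constant preserves the ranking,
# so the AUC is identical.
auc_scaled = roc_auc_score(y_true, [s * 100 for s in scores])
assert auc == auc_scaled
```

This is why calibrating probabilities was unnecessary for the leaderboard: only the ordering of comments by predicted toxicity counted.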


Word embeddings

Word embeddings allowed us to handle a large set of text data and saved us from needing to go deep with our neural network models. They solve the high-dimensionality problem of text data (think bag-of-words (BoW) and sparse matrices) by mapping words to vectors of real numbers in a pre-defined vector space. Our neural network models also did not have to go deep because, unlike word-frequency approaches where text features (e.g., word counts) are independent of each other, embeddings are distributed representations. This means we get additional information from the text, such as semantic relationships, that frequencies alone cannot provide.
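To make "semantic relationships" concrete, here is a minimal sketch using made-up 3-dimensional vectors (real GloVe/fastText vectors have hundreds of dimensions, and these values are purely illustrative): related words end up close together under cosine similarity, while unrelated words do not.

```python
import numpy as np

# Toy 3-d vectors; illustrative values only, not real embedding weights.
vectors = {
    "insult": np.array([0.9, 0.1, 0.2]),
    "offend": np.array([0.8, 0.2, 0.3]),
    "banana": np.array([0.1, 0.9, 0.7]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(vectors["insult"], vectors["offend"])    # high
sim_unrelated = cosine(vectors["insult"], vectors["banana"])  # low
```

A frequency-based feature set has no notion that "insult" and "offend" are related; in an embedding space that relationship is part of the representation itself.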

Training word embeddings on the competition data

We used three pre-trained word embeddings: GloVe, fastText and LexVec. In addition, we used Facebook’s fastText library to train the word embeddings to the competition data set. This helped to maximize vocabulary coverage and consequently improved our neural network models.
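A quick way to see the effect of training embeddings on the competition text is to measure vocabulary coverage, i.e., what fraction of the corpus vocabulary has a pretrained vector. The helper below is a hypothetical sketch; the `embeddings` dict stands in for a loaded GloVe/fastText lookup table.

```python
import numpy as np

def embedding_coverage(vocab, embeddings):
    """Fraction of the corpus vocabulary that has a pretrained vector."""
    hits = sum(1 for word in vocab if word in embeddings)
    return hits / len(vocab)

# Toy stand-ins: a 2-word lookup table and a 3-word corpus vocabulary.
embeddings = {"toxic": np.zeros(300), "comment": np.zeros(300)}
vocab = ["toxic", "comment", "wikipediaslang"]

coverage = embedding_coverage(vocab, embeddings)  # 2 of 3 words covered
```

Training fastText on the competition corpus fills in exactly these gaps: rare tokens and misspellings common in talk-page comments get vectors instead of falling out of vocabulary.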

Establishing a strong baseline

We established a strong baseline using a BoW approach. Our baseline model was motivated by Jeremy Howard of fast.ai fame, who provided a baseline kernel using NB-SVM, and NB-SVM indeed served as a good baseline for this competition. We did not have to compromise on text features, because the model separates data at the margins rather than in the feature space: it learns independently of the feature-space dimensionality. In other words, the model was both fast and strong. In addition, we stacked on a set of character-level n-gram features, an idea motivated by the paper Words vs. Character N-Grams for Anti-Spam Filtering by Kanaris et al. The authors suggested that because character-level n-grams have fewer combinations than word combinations, they can provide a feature set that is both rich and less sparse (i.e., fewer n-grams with zero frequency). Adding this set of features was quite simple:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Word-level unigram TF-IDF features.
text_vectorizer = TfidfVectorizer(
    ngram_range=(1, 1),
    max_features=30000)
train_word_features = text_vectorizer.fit_transform(train['comment_text'])
test_word_features = text_vectorizer.transform(test['comment_text'])

# Character-level 1-5 gram TF-IDF features.
char_vectorizer = TfidfVectorizer(
    analyzer='char',
    ngram_range=(1, 5),
    max_features=35000)
train_char_features = char_vectorizer.fit_transform(train['comment_text'])
test_char_features = char_vectorizer.transform(test['comment_text'])

x = hstack([train_char_features, train_word_features]).tocsr()
x_test = hstack([test_char_features, test_word_features]).tocsr()

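For reference, the core of the NB-SVM idea can be sketched as follows: scale each TF-IDF feature by its naive-Bayes log-count ratio, then fit a linear classifier on the scaled features. This is a minimal sketch modeled on the public baseline kernel (which used logistic regression in place of a true SVM); the demo matrix at the bottom is toy data, not the competition features.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

def nbsvm_fit(x, y, alpha=1.0):
    """Fit an NB-SVM-style model: scale features by naive-Bayes
    log-count ratios, then train a linear classifier on the result."""
    p = alpha + x[y == 1].sum(axis=0)  # smoothed positive-class feature counts
    q = alpha + x[y == 0].sum(axis=0)  # smoothed negative-class feature counts
    r = np.log(np.asarray(p / p.sum()) / np.asarray(q / q.sum()))  # log-count ratio
    clf = LogisticRegression(max_iter=1000).fit(x.multiply(r), y)
    return clf, r

# Tiny demo: four comments, two features, one binary toxicity label.
x_demo = csr_matrix([[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]])
y_demo = np.array([1, 1, 0, 0])
clf, r = nbsvm_fit(x_demo, y_demo)
```

In the competition setting, `x` would be the stacked word + character TF-IDF matrix built above, fit once per toxicity label; test probabilities come from `clf.predict_proba(x_test.multiply(r))[:, 1]`.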


Our best predictions came from recurrent neural network (RNN) models. I believe this was mainly due to the power of word embeddings. We learned that shallow networks performed well and that "going deep" did not help. We also saw that the more coverage we had between our prepared vocabulary and the word embeddings (i.e., fewer OOV tokens), the better our RNNs could learn and predict.


After reviewing Sijun's post-competition analysis, it made sense to stack (meta-ensemble), as many of our models had overfit to the public test data, as can be seen below (thanks, Sijun!).

| Model | Private Leaderboard | Public Leaderboard | Delta (Private - Public) |
| --- | --- | --- | --- |
| Extra Tree | 0.9792 | 0.9805 | -0.0013 |
| NB-SVM | 0.9821 | 0.9813 | +0.0008 |
| Shallow & Wide CNN | 0.9835 | 0.9846 | -0.0011 |
| GRU + Attention | 0.9856 | 0.9864 | -0.0008 |
| GRU + Capsule Net | 0.9857 | 0.9863 | -0.0006 |
| Lasso Stacking | 0.9870 | 0.9874 | -0.0004 |
| XGBoost Stacking | 0.9870 | 0.9873 | -0.0003 |
| Average of Lasso & XGBoost Stacking | 0.9872 | 0.9875 | -0.0003 |
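The Lasso stacking step can be sketched as fitting a sparse linear blender on out-of-fold predictions from the base models. Everything below is toy data standing in for our actual out-of-fold predictions; the `alpha` value is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy out-of-fold predictions from three base models (columns) on six
# comments (rows), for a single toxicity label.
oof_preds = np.array([
    [0.91, 0.88, 0.95],
    [0.10, 0.15, 0.05],
    [0.80, 0.70, 0.85],
    [0.20, 0.30, 0.15],
    [0.95, 0.90, 0.99],
    [0.05, 0.10, 0.02],
])
y = np.array([1, 0, 1, 0, 1, 0])

# A small positive alpha shrinks the weights of redundant base models;
# positive=True keeps the blend an (approximately) weighted average.
stacker = Lasso(alpha=0.01, positive=True)
stacker.fit(oof_preds, y)
blend = stacker.predict(oof_preds)
```

At inference time the same fitted weights are applied to each base model's test-set predictions, which is why a model that overfit the public leaderboard simply receives less weight in the blend.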

Lessons Learned

  • Word Embeddings: word embeddings are the future. They are efficient for handling large text data and provide semantic relationships that aren't available with word-frequency methods.

  • Model Diversity: deep learning models for NLP appear to be more prone to overfitting. With that said, it may be preferable to ensemble many models rather than rely on predictions from a single one.

  • Teamwork: I learned a lot about working alongside other data scientists, particularly around reproducibility and communication. GitHub and Slack were great for sharing ideas quickly, and I really enjoyed working with passionate data scientists, something I have yet to experience working in Hawaii.
