Thomas Yokota


Wrap up: Quora Insincere Questions Classification


Problem Definition

Quora is a platform that allows users to ask and answer questions. In the Quora Insincere Questions Classification competition, participants were asked to identify insincere questions, defined as questions intended to make a statement rather than to seek genuine answers, or questions founded on false premises. The competition was evaluated using the F1 score, which gives equal weight to precision and recall.
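For reference, the F1 score is the harmonic mean of precision and recall:

$$ F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $$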

Solution

Maximizing vocabulary coverage

It was important to maximize vocabulary coverage with the word embeddings. Rather than relying solely on preprocessing during the data-prep stage, our solution also preprocessed tokens while building the embedding matrix. This let us substitute known word vectors for out-of-vocabulary (OOV) tokens.

Let’s say we searched for the word ‘Cat’ in our embedding lookup table and found no match. A random word vector would be assigned to this word, rendering it somewhat meaningless. If, however, we found the word ‘cat’ in our lookup table, we could substitute its word vector to represent ‘Cat’.

$$ Cat = \begin{bmatrix} ???\end{bmatrix} $$

$$ cat = \begin{bmatrix} 0.4546 & -0.1112 & 0.8891 & -0.3439 & 0.7611 & 0.5111 & 0.4321 & 0.9999 \end{bmatrix} $$

$$ Cat = \begin{bmatrix} 0.4546 & -0.1112 & 0.8891 & -0.3439 & 0.7611 & 0.5111 & 0.4321 & 0.9999 \end{bmatrix} $$
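In code, the idea boils down to a small fallback lookup. This is just a sketch with a hypothetical helper name; the full version we actually used appears in the code further below.

def lookup_with_fallback(word, embeddings_index):
    # try the exact token first, e.g. 'Cat'
    vec = embeddings_index.get(word)
    if vec is None:
        # fall back to the case-folded form, e.g. 'cat'
        vec = embeddings_index.get(word.lower())
    return vec  # None means the token is truly out of vocabulary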

Concatenating word vectors

We concatenated word embeddings to create a richer set of features for our neural network models, at the cost of longer training times. In retrospect, I should not have prematurely written off the idea of averaging embeddings: the first-place team shared that they had used a weighted average to leverage multiple embeddings. A smaller embedding dimension has advantages such as speed, which was important to consider in this kernels-only competition.
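To make the trade-off concrete, here is an illustrative comparison of the two approaches. The vectors and the 0.7/0.3 weights below are arbitrary stand-ins, not the first-place team's values.

import numpy as np

glove_vec = np.random.rand(300).astype('float32')     # stand-in for a GloVe vector
wiki_vec = np.random.rand(300).astype('float32')      # stand-in for a wiki-news vector

concatenated = np.concatenate([glove_vec, wiki_vec])  # 600 dims: richer features, slower training
weighted_avg = 0.7 * glove_vec + 0.3 * wiki_vec       # 300 dims: smaller and faster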

Concatenating word embeddings

import gc
import numpy as np
from nltk.stem import WordNetLemmatizer

# FILE_DIR, num_words, and tokenizer (with its fitted word_index) are defined earlier in the kernel.

def load_embedding(embedding):
    # Build a {word: 300-d vector} lookup for the requested pretrained embedding
    print(f'Loading {embedding} embedding..')
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    if embedding == 'glove':
        EMBEDDING_FILE = f'{FILE_DIR}/embeddings/glove.840B.300d/glove.840B.300d.txt'
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8"))
    elif embedding == 'wiki-news':
        EMBEDDING_FILE = f'{FILE_DIR}/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8") if len(o)>100)
    elif embedding == 'paragram':
        EMBEDDING_FILE = f'{FILE_DIR}/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)
    elif embedding == 'google-news':
        from gensim.models import KeyedVectors
        EMBEDDING_FILE = f'{FILE_DIR}/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
        embeddings_index = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
    return embeddings_index

embeddings_index_1 = load_embedding('glove')
embeddings_index_2 = load_embedding('wiki-news')

def build_embedding_matrix(embeddings_index_1, embeddings_index_2, lower=False, upper=False):
    # (the lower/upper flags are unused here; case handling happens in the fallback cascade below)

    wl = WordNetLemmatizer().lemmatize

    word_index = tokenizer.word_index

    nb_words = min(num_words, len(word_index))
    # 601 columns: 300 from the first embedding + 300 from the second + 1 all-caps flag
    embedding_matrix = np.zeros((nb_words, 601))

    # Fallback vector for unmatched words: the concatenated vectors for "something"
    something_1 = embeddings_index_1.get("something")
    something_2 = embeddings_index_2.get("something")
    something = np.zeros((601,))
    something[:300,] = something_1
    something[300:600,] = something_2
    something[600,] = 0

    def all_caps(word):
        return len(word) > 1 and word.isupper()

    hit, total = 0, 0

    def embed_word(embedding_matrix,i,word):
        # Fill dims 0-299 from embedding 1, dims 300-599 from embedding 2; dim 600 flags all-caps tokens
        embedding_vector_1 = embeddings_index_1.get(word)
        if embedding_vector_1 is not None:
            if all_caps(word):
                last_value = np.array([1])
            else:
                last_value = np.array([0])
            embedding_matrix[i,:300] = embedding_vector_1
            embedding_matrix[i,600] = last_value
            embedding_vector_2 = embeddings_index_2.get(word)
            if embedding_vector_2 is not None:
                embedding_matrix[i,300:600] = embedding_vector_2

    # Try the raw token first, then progressively relaxed forms: lemma, upper case, upper-cased lemma
    for word, i in word_index.items():
        if i >= num_words: continue
        if embeddings_index_1.get(word) is not None:
            embed_word(embedding_matrix,i,word)
            hit += 1
        else:
            if len(word) > 20:
                embedding_matrix[i] = something
            else:
                word2 = wl(wl(word, pos='v'), pos='a')
                if embeddings_index_1.get(word2) is not None:
                    embed_word(embedding_matrix,i,word2)
                    hit += 1
                else:                   
                    if len(word) < 3: continue
                    word2 = word.upper()
                    if embeddings_index_1.get(word2) is not None:
                        embed_word(embedding_matrix,i,word2)
                        hit += 1
                    else:
                        word2 = word.upper()
                        word2 = wl(wl(word2, pos='v'), pos='a')
                        if embeddings_index_1.get(word2) is not None:
                            embed_word(embedding_matrix,i,word2)
                            hit += 1
                        else:
                            embedding_matrix[i] = something  
        total += 1
    print("Matched Embeddings: found {} out of total {} words at a rate of {:.2f}%".format(hit, total, hit * 100.0 / total))
    return embedding_matrix

embedding_matrix = build_embedding_matrix(embeddings_index_1, embeddings_index_2, lower=True, upper=True)

del embeddings_index_1, embeddings_index_2
gc.collect()
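For completeness, this is one way the resulting matrix could feed a model downstream. A minimal sketch assuming a PyTorch embedding layer, not our exact model code:

import torch
import torch.nn as nn

weights = torch.tensor(embedding_matrix, dtype=torch.float32)
embedding_layer = nn.Embedding.from_pretrained(weights, freeze=True)  # keep the pretrained vectors fixed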


Establishing a cross-validation strategy

Many teams slid in placement after the second stage of this competition. I believe this was because many participants relied on publicly shared solutions early on and missed that those solutions overfit to the public test set. I know this because I had run cross-validation on the shared solutions, a lesson I learned during my early days on Kaggle. With that said, I am still thankful for those solutions, because it meant I could spend most of my time walking back from the overfit models toward something more generalizable.
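A minimal sketch of what such a cross-validation setup can look like, with a threshold search because F1 is cutoff-sensitive. The train_model routine is a hypothetical placeholder returning a model that predicts probabilities; this is illustrative, not our exact pipeline.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def cross_validate(X, y, n_splits=5, seed=42):
    oof_preds = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model = train_model(X[train_idx], y[train_idx])     # hypothetical training routine
        oof_preds[valid_idx] = model.predict(X[valid_idx])  # hypothetical: predicted probabilities
    # F1 depends on the cutoff, so search it on the out-of-fold predictions
    thresholds = np.arange(0.1, 0.6, 0.01)
    scores = [f1_score(y, (oof_preds > t).astype(int)) for t in thresholds]
    best_t = thresholds[int(np.argmax(scores))]
    print(f'Best OOF F1 {max(scores):.4f} at threshold {best_t:.2f}')
    return oof_preds, best_t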

Parallelization

This was the final key to our team’s success. We developed a pipeline that ran deep learning models on the GPU and bag-of-words models on the CPU simultaneously, producing many predictions that were averaged for our final submission. A rough sketch of the idea is shown below.
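A minimal sketch of the overlap, assuming hypothetical train_bow_and_predict and train_nn_and_predict functions and train_df/test_df dataframes; our actual pipeline and blend weights differed.

from threading import Thread

results = {}

def run_bow_models():
    # run the CPU-bound bag-of-words models in a background thread
    results['bow'] = train_bow_and_predict(train_df, test_df)  # hypothetical

bow_thread = Thread(target=run_bow_models)
bow_thread.start()
results['dl'] = train_nn_and_predict(train_df, test_df)        # hypothetical; GPU work in the main thread
bow_thread.join()

final_pred = (results['bow'] + results['dl']) / 2              # simple average blend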

Lessons Learned

  • PyTorch > Keras: Initially, I had wondered what was so great about going “low-level” with programming DL models. However, Wojtek re-coded our pipeline to use PyTorch, which allowed us to make deterministic runs. Consequently, we were not shooting in the dark for most of the competition; a seeding sketch of that kind of setup appears after this list.

  • Planning ahead: I often hear people assume that ML/DL/AI is “easy”, as if you just throw data into a model and great predictions come out. Kernels-only competitions are a sobering counterpoint to that misconception: they impose business-like constraints on time and compute, which means that having a plan at every step of the competition is critical.
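As a concrete illustration of the determinism point above, here is a typical seeding recipe for PyTorch runs; a generic sketch, not our exact code.

import random
import numpy as np
import torch

def seed_everything(seed=1234):
    # fix every source of randomness so repeated runs produce the same results
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything()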

With that said, I will leave you now with words from Dr. Jean-Francois Puget (aka Kaggle Grandmaster CPMP).

Kaggle proved to be way more competitive than I would have imagined. People who don’t enter Kaggle competitions have no idea of how elaborate and advanced winning solutions are.

–Dr. Jean-Francois Puget

