Google QUEST Q&A Labeling

Improving automated understanding of complex question-answer content

A deep learning case study on the multi-target regression problem from the Kaggle competition Google QUEST Q&A Labeling.

Table of Contents

  1. Problem Statement
  2. Business objective & Constraints
  3. Data Source
  4. Spearman’s Rank-order Correlation
  5. Loss Function
  6. Existing Approaches
  7. Improvements
  8. Exploratory Data Analysis
  9. Models Explanation
  10. Models Comparison
  11. Error Analysis
  12. Final Model
  13. Deploying the Model on Google Compute Engine
  14. Summary and Future work
  15. References

1. Problem Statement

Computers are really good at answering questions with single, verifiable answers. But humans are often still better at answering questions about opinions, recommendations, or personal experiences. Humans are better at addressing subjective questions that require a deeper, multidimensional understanding of context, something computers aren’t trained to do well.

Unfortunately, it’s hard to build better subjective question-answering algorithms because of a lack of data and predictive models. That’s why the team at Google Research has collected data on question-answer pairs, scored on a number of quality aspects, and challenged us to use this new dataset to build predictive algorithms for different subjective aspects of question answering.


2. Business objective & Constraints

  • We are given question-answer pairs and, for the train data only, 30 question-answer quality labels; we have to predict these 30 labels for each given Q&A pair (qa_id).
  • The evaluation metric for this competition is the mean column-wise Spearman’s correlation coefficient.
  • The following conditions must be met:
    • CPU Notebooks <= 9 hours run-time
    • GPU Notebooks <= 2 hours run-time
    • Internet must be turned off

Demonstrating these subjective labels can be predicted reliably can shine a new light on this research area. Results from this competition will inform the way future intelligent Q&A systems will get built.

3. Data Source

This case study is based on the Kaggle competition Google QUEST Q&A Labeling; the dataset can be downloaded from the competition page on Kaggle.

Data Description

The dataset consists of two files, train.csv and test.csv, plus a sample submission file:

  • train.csv: the training data contains a total of 41 columns; the last 30 columns are the target labels we need to predict.
screenshot of training data
  • test.csv: the test set contains a total of 11 columns (you must predict 30 labels for each test set row).
screenshot of test data
  • Submission file: for each qa_id in the test set, you must predict a probability for each target variable. The predictions should be in the range [0,1]. A sample submission file looks like this:
screenshot of sample submission file

4. Spearman’s Rank-order Correlation

The evaluation metric for this competition is the mean column-wise Spearman’s rank-order correlation. Spearman’s rank-order correlation is a measure of the strength and direction of the association between two ranked variables. Pearson correlation works well when the relationship between two variables is linear (or close to it), but it understates the association when the relationship is monotonic but non-linear.

monotonic non-monotonic function

Spearman’s rank-order correlation measures the strength and direction of the monotonic association between two variables. It is often denoted by the Greek letter ρ (rho) or “rs” and lies in the range -1 ≤ ρ ≤ 1; a value closer to -1 or +1 indicates a stronger monotonic association.

Spearman vs Pearson correlation

The key takeaway is that Spearman’s correlation considers only the order of the values, so making the distribution of the predictions similar to the distribution of the target labels can improve the rho value. Consider the example below:

Spearman’s rank-order correlation examples
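To make the takeaway concrete, here is a minimal pure-Python sketch of Spearman's rho (average ranks for ties, then Pearson correlation on the ranks). The sample values are made up for illustration and the helper names are my own; in practice `scipy.stats.spearmanr` does the same job.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(a, b):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)

# Made-up example: the target takes only three distinct values.
y_true = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
raw_pred = [0.10, 0.20, 0.45, 0.55, 0.80, 0.90]  # correct order, but no ties
binned_pred = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]     # same tie structure as the target

rho_raw = spearman_rho(y_true, raw_pred)
rho_binned = spearman_rho(y_true, binned_pred)
print(round(rho_raw, 4), round(rho_binned, 4))
```

Both prediction lists order the rows correctly, yet the raw predictions score about 0.956 while the binned ones reach 1.0, because the binned predictions reproduce the ties in the target.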

5. Loss Function — MSE or BCE?

This is a multi-target regression problem: the values of the 30 Q&A labels need to be predicted, and the predictions should be in the range [0,1]. The target values are real numbers, so shouldn’t we use MSE (mean squared error) as the loss function? The answer is ‘no’, because it is not a great fit for this specific problem. Why? Look at the unique values of some target labels in the train data:

unique values in some target labels

The values are small and lie in the range [0,1]. Let’s calculate the MSE:

mean squared error
code snippet for the loss function explain

The MSE we got is very small, and when the loss is this small the gradients, and therefore the weight updates, are also very small. The model then learns very slowly, much like in a vanishing gradient problem. So what loss function is good for this problem? Now check the binary cross-entropy (BCE) calculated in the code snippet above. The BCE we got is much larger than the MSE, so it gives a stronger training signal, and that makes binary cross-entropy a good fit for this specific problem.
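The original snippet is not reproduced here, so the following is only a minimal sketch of the comparison, written in plain Python. The target and prediction values are invented for illustration; they merely mimic labels that take a few fractional values in [0,1].

```python
from math import log

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy; it also accepts soft (fractional) targets."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * log(p) + (1 - t) * log(1 - p))
    return total / len(y_true)

# Invented values for illustration only.
y_true = [0.0, 0.333333, 0.5, 0.666667, 1.0]
y_pred = [0.1, 0.3, 0.45, 0.6, 0.9]

mse_value = mse(y_true, y_pred)
bce_value = bce(y_true, y_pred)
print(f"MSE = {mse_value:.4f}, BCE = {bce_value:.4f}")
```

On these values the BCE is well over ten times the MSE. Note that with fractional targets the BCE does not reach zero even for perfect predictions, so its absolute value matters less than the size of the gradients it produces.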


6. Existing Approaches


Every second kernel in this competition used transformers, and transformers are indeed the best fit for this competition: they learn the contextual relationships between the words in a text better than other models, and the main features of this dataset are question-answer pairs. The winning submission also used transformers. The winners trained 4 models: 2 BERT, 1 RoBERTa, and 1 BART model. You can find the winning submission here: link

7. Improvements

I experimented with the following things —

CNN and LSTM Models with FastText word vectors

As my first cut approach, I tried fastText word vectors for the question_title, question_body, and answer columns with CNN and LSTM models. The CNN model with fastText word vectors gave the cross-validation rho of 0.2958 and the LSTM model with fastText embeddings gave the cross-validation rho of 0.3107.

Universal Sentence Encoder (USE) Embeddings + MLP

In this model, I used the USE sentence embeddings. Instead of converting words into high-dimensional vectors, I used a pre-trained USE model that encodes the whole sentence into a high-dimensional vector. I calculated the embeddings for the question_title, question_body, and answer columns. With the USE embeddings I trained a simple MLP model that gave the best cross-validation score, rho = 0.38.

BERT-Uncased, BERT-Multi-cased

I trained two BERT models, BERT-uncased and BERT-Multi-cased. I used the pre-trained Bidirectional Encoder Representations from Transformers (BERT) models from TF-Hub and calculated whole-sentence embeddings (the pooled output) for the question_title, question_body, and answer columns. With these embeddings, I trained simple MLP models.

USE + BERT-Uncased

To combine the power of two good models I merged the embeddings from the USE and BERT model and trained a simple MLP model with these embeddings.

8. Exploratory Data Analysis (EDA)

Let’s analyze the data and try to discover the main characteristics, patterns, and anomalies in the data —

8.1 Sample Question and Answer

Let’s see what a question-answer pair looks like in the given dataset:

question_title —

Given Ohm’s law, how can current increase if voltage increases, given fixed resistance?

question_body —

According to Ohm’s law, V=IR (voltage equals current times resistance). So if the voltage increases, then the current increases provided that the resistance remains constant. I know that Voltage or potential difference means work done per unit positive charge in bringing that charge from one point to another. So according to Ohm’s law, if the work done per unit charge increases then current will increase. How can this be true? Point out my mistakes.

answer —

It is better to think of Ohm’s Law as I=V/R. What it is telling you is that if you apply a voltage (V) to a resistive material (characterised by R), then that voltage is capable of driving a current I. The material could be anything, a piece of copper, or the plasma in a star. The voltage is constantly supplying energy to the electrons in the material, but the resistivity is constantly taking that energy back out (converting it to thermal energy). The higher the voltage, the more energy you can give to the electrons and hence the higher the current. On the other hand, the higher the resistance, the more energy is taken away from the electron flow and hence the lower the current.

8.2 Category

The plot shows that there are a total of 5 categories: Culture, Technology, Life_Arts, Stackoverflow, and Science. The train data has more question-answer pairs in the Science category, while the test data has more in the Technology category.

code snippet to countplot category
Countplot of train and test category column

8.3 Question and Answer Quality Labels

These are the 30 question and answer quality labels we need to predict. Of these 30 target labels, 21 are question quality labels and 9 are answer quality labels. Plotting histograms of the target labels shows that most of them have skewed distributions and some of them take almost only one or two values.

Code snippet for quality labels
question-answer quality labels: part 1
question-answer quality labels: part 2

8.4 Host

The host is the platform (website) from which these questions and answers were gathered. A bar plot of the top 10 hosts shows that most of the question-answer pairs were gathered from the StackOverflow website. The top 10 hostnames are stackoverflow, english, superuser, electronics, serverfault, math, physics, tex, askubuntu, and programmers.

Code snippet of host barplot
Barplot of top 10 hosts name

8.5 Question Title

Almost half of the question_title values are duplicates. Question title length ranges between 2 and 31 words, and 75% of the titles are between 2 and 11 words long.

question_title length
Distplot of question_title length

8.6 Question Body

Almost half of the question_body values are duplicates. Question body length ranges between 1 and 2041 words and most question bodies are shorter than 500 words; the percentiles show that 99% of the question bodies have a length ≤ 765 words.

Distplot of question_body length

8.7 Answer

All answers are unique. Answer length ranges between 3 and 2464 words and most answers are shorter than 500 words; the percentiles show that 99% of the answers have a length ≤ 772 words.
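The length statistics in sections 8.5 to 8.7 can be reproduced with a few lines of code. The sketch below is an assumption about how such numbers are obtained: it counts whitespace-separated words and uses a nearest-rank percentile, and the `answers` list is a made-up stand-in for the real answer column.

```python
def word_count(text):
    """Number of whitespace-separated words in a text."""
    return len(text.split())

def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    rank = max(1, (q * len(sorted_values) + 99) // 100)  # ceil(q * n / 100)
    return sorted_values[rank - 1]

# Made-up stand-in for the answer column: 99 short answers and one very long one.
answers = ["word " * 10] * 99 + ["word " * 2000]
lengths = sorted(word_count(a) for a in answers)

print("min:", lengths[0], "max:", lengths[-1])
print("99th percentile:", percentile(lengths, 99))
```

A high percentile such as the 99th is a more useful guide than the maximum when choosing a maximum sequence length, because a handful of very long texts does not move it.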


8.8 Question’s Sentiment Analysis

Plotting the heatmap of question quality labels revealed that some of the question quality labels hold considerable correlation with the sentiment features that may be useful during modeling.

Code snippet calculating sentiments of features
Heatmap of question quality labels: part 1
Heatmap of question quality labels: part 2

8.9 Answer Sentiment Analysis

Plotting the heatmap of answer quality labels revealed that some of the answer quality labels hold considerable correlation with the sentiment features that may be useful during modeling.

Heatmap of answer quality labels

9. Models Explanation

The explanation of the models I experimented with is as follows —

9.1 CNN and LSTM Models with FastText word vectors

FastText word vectors

The key idea was to handle out-of-vocabulary (OOV) words, so I used the fastText pre-trained word vectors to get the word vectors for the question_title, question_body, and answer columns.

The fastText pre-trained word vectors can be downloaded from here — fastText. It’s a collection of 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus, and news dataset (16B tokens).

Bidirectional LSTM

Long Short-Term Memory (LSTM) networks can retain sequence information, which makes them suitable for text data. The key idea was to train an LSTM model with bidirectional layers and fastText word vectors. The model has four input layers: one each for question_title, question_body, and answer, and a fourth for the one-hot encoded category feature. There are 3 bidirectional LSTM layers, one each for the question_title, question_body, and answer columns. The structure of the LSTM model is as follows:

Convolutional neural network ( CNN )

The structure of the CNN model is the same as that of the LSTM model. There are 4 input layers and 4 convolution layers, one each for the question_title, question_body, answer, and category features. The category feature is one-hot encoded and the other three features are word vectors taken from the pre-trained fastText word vectors. The structure of the CNN model is as follows:

code snippet for CNN model
structure of CNN model

9.2 Universal Sentence Encoder (USE) v4 + MLP

USE is a pre-trained sentence encoding model that encodes whole sentences into embedding vectors. In this model, I used the pre-trained USE model to get the text embeddings and then applied a simple MLP model to these embeddings to get the final output.

USE ( v4 ) Embeddings

TF-Hub has a collection of pre-trained models that can be used for different purposes with minimal code. USE v4 model can be downloaded from the TF-Hub website — USE v4 Model. USE converts the whole sentence into a 512-dimensional vector rather than converting the words into vectors. USE embeddings can be calculated as follows—


You can run out of memory if you try to pass the whole list to embed() at once. So my suggestion is to pass the sentences in small batches and combine the embeddings at the end.
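The batching suggestion can be sketched as below. The original embedding snippet is not reproduced here, so `toy_embed` is a hypothetical stand-in for the loaded USE model; the only assumption about the real model is that it takes a list of sentences and returns one vector per sentence.

```python
def embed_in_batches(sentences, embed_fn, batch_size=32):
    """Embed sentences batch by batch and concatenate the results in order."""
    embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

def toy_embed(batch):
    """Hypothetical stand-in for the USE model: one 2-d vector per sentence."""
    return [[float(len(s)), float(len(s.split()))] for s in batch]

sentences = [f"question number {i}" for i in range(100)]
vectors = embed_in_batches(sentences, toy_embed, batch_size=16)
print(len(vectors), vectors[0])
```

With the real model, `embed_fn` would be the loaded USE module, and each batch result would typically be converted to a NumPy array before being collected, so that memory held by the framework can be released between batches.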

Multilayer perceptron ( MLP )

With this pre-trained USE model, I calculated the embeddings for the question_title, question_body, and answer columns and applied a simple MLP model with three dense layers, one each for question_title, question_body, and answer. With this model, I got the best cross-validation Spearman rho score of 0.3864. This model tends to overfit quickly, so a low learning rate is required to get the best results. The best learning rate I found for this model is 0.0009 with Adam as the optimizer.

code snippet model with USE
structure of the MLP model
train and validation loss

9.3 BERT Embeddings + MLP

In this model, I calculated the BERT embeddings from BERT pre-trained model and applied a simple MLP model with these embeddings.

Bidirectional Encoder Representations from Transformers ( BERT )

BERT is a state-of-the-art natural language processing model. BERT is developed in a two-stage process: pre-training and fine-tuning. In pre-training, the model is trained on a certain task with a large amount of text (books, Wikipedia, etc.), and by the end of the training the model has acquired its language processing capabilities. In the second stage, you just need to fine-tune the pre-trained model for a specific language-related task. Pre-training takes a lot of time and requires a computationally powerful machine. TF-Hub has a collection of pre-trained BERT and other models. You can download a pre-trained BERT model from TF-Hub: link

I have used the BERT base uncased model — bert_base_uncased. It has L=12 hidden layers (i.e., Transformer blocks), a hidden size of H=768, and A=12 attention heads. To get the embeddings you just need to feed three inputs to the BERT model input_word_ids, input_mask, and segment_ids. Look at the example below —


The input_word_ids can be calculated through tokenizer. To download the tokenizer use this — link.


The tokenizer converts text into tokens (words or word pieces); you then call the convert_tokens_to_ids() method to convert the tokens into ids. For BERT you also need to add two special tokens, ‘[CLS]’ and ‘[SEP]’. The ‘[CLS]’ token marks the start of the input and ‘[SEP]’ marks the end of a sentence (segment). After converting to ids you need to pad the list of ids. You can calculate input_word_ids as follows:

# convert to tokens
tokens = tokenizer.tokenize(sentence)
# add the special tokens, truncating so that the result fits in max_seq_len
tokens = ['[CLS]'] + tokens[:max_seq_len - 2] + ['[SEP]']
# convert tokens to ids
ids = tokenizer.convert_tokens_to_ids(tokens)
# apply zero padding up to max_seq_len
input_word_ids = ids + [0] * (max_seq_len - len(ids))

Zero padding is applied to the token ids so that the input_word_ids list has the same length for every sentence (qa_id). The input_mask separates the real token ids from the padding and can be calculated as follows:

# 1 for every real token (the unpadded ids), 0 for every padding position
input_mask = [1] * len(ids) + [0] * (max_seq_len - len(ids))

The maximum sequence length BERT can handle is 512; you can set any length up to this. The segment_ids are token type ids, which tell BERT which of two sentences each token belongs to. I calculated the embeddings for question_title, question_body, and answer separately, so every input is a single segment and I calculated segment_ids as follows:

# single segment, so every token type id is 0
segment_ids = [0] * max_seq_len

Passing input_word_ids, input_mask, and segment_ids to the BERT model gives two outputs: a pooled_output of shape [batch_size, 768] with a representation of each entire input sequence, and a sequence_output of shape [batch_size, max_seq_length, 768] with a representation of each input token (in context). You can save the pooled_output and use it as the embeddings.
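Putting the three inputs together, the preparation of a single text can be sketched as one function. This is a self-contained illustration, not the original code: `build_bert_inputs` and `toy_vocab` are hypothetical, and the dictionary lookup stands in for the real tokenizer's convert_tokens_to_ids().

```python
def build_bert_inputs(tokens, vocab, max_seq_len):
    """Return (input_word_ids, input_mask, segment_ids) for a single text."""
    # Reserve two positions for [CLS] and [SEP], truncating long inputs.
    tokens = ["[CLS]"] + tokens[:max_seq_len - 2] + ["[SEP]"]
    # Stand-in for tokenizer.convert_tokens_to_ids(); unknown tokens map to [UNK].
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    n_pad = max_seq_len - len(ids)
    input_word_ids = ids + [0] * n_pad           # zero padding
    input_mask = [1] * len(ids) + [0] * n_pad    # 1 = real token, 0 = padding
    segment_ids = [0] * max_seq_len              # single segment
    return input_word_ids, input_mask, segment_ids

# Toy vocabulary, hypothetical; a real BERT vocabulary has about 30,000 entries.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "ohm": 4, "law": 5}

word_ids, mask, segments = build_bert_inputs(
    ["ohm", "law", "voltage"], toy_vocab, max_seq_len=8
)
print(word_ids)  # [2, 4, 5, 1, 3, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The mask is built from the unpadded length, so it marks exactly the positions that hold real tokens, and inputs longer than max_seq_len are truncated before the special tokens are added.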

The Model

Once you get the embeddings for question_title, question_body, and answer you can pass these embeddings to an MLP model. With this model, I got the cross-validation spearman score of 0.3392. The structure of the MLP model is as follows —

code snippet for MLP model
structure of the MLP model
train and validation loss

BERT Multi-cased

This is the multi-cased BERT model. Here “cased” means that the input text keeps its case and accent markers, so the model distinguishes between lower-case and upper-case letters. The TF-Hub BERT-Multicased model has been pre-trained on the multilingual Wikipedia, which means it can also handle multilingual words. The multi-cased BERT can be downloaded from here: tfhub_bert_multicased. The embeddings can be calculated just like for the BERT-Uncased model; you just need to change the path of the BERT-Uncased model to the path of the BERT-Multicased model. The structure of the MLP model after calculating the embeddings is also the same. With this model, I got a Spearman score of rho=0.36037.

code snippet for MLP model
train and validation loss

9.4 USE + BERT-Uncased

In this model, I combined the embeddings of the question_title, question_body, and answer columns from the USE and BERT models and passed these embeddings to the MLP model. With this model, I got a cross-validation Spearman score of 0.3672. The structure of this model is as follows:

structure of the MLP model

10. Models Comparison

I trained a total of 6 models with hold-out cross-validation. The loss function I used is binary cross-entropy: the BCE is computed individually for each target column and then averaged. The summary of the models is as follows:

models comparison table

11. Error Analysis

Error analysis is a process where we try to find out patterns or features for which the model is not performing well so that the performances of the models can be further improved. The diagram below is a plot of spearman rho scores calculated individually for each of the 30 target columns for cross-validation data. The Spearman’s score plot clearly shows that the worst predicted target labels by the model are question_not_really_a_question, question_type_consequence, question_type_spelling, answer_plausible, and answer_well_written.

Individual spearman score of target labels

To get a better picture, we can plot the distribution of the predictions for the worst predicted labels. The distribution plots revealed two things: 1. the model follows the targets reasonably well but its predictions do not cover the whole range of target values, which results in a somewhat lower Spearman score, and 2. the model does not perform well on target labels with highly imbalanced data.

When I sorted the answers for these worst predicted labels by their prediction error, I found multilingual words in the answers with a high prediction error. That was a useful finding: it led me to apply the BERT-Multicased model, which can also handle multilingual words, and with the BERT-Multicased model I got a 2% increase in the Spearman score. You can check the screenshot of one of the answers containing multilingual words below:

screenshot of a multilingual answer

Similarly, error analysis can be done for the other features, which may reveal other patterns. For the category feature, I plotted a word cloud of the categories of the questions with a high prediction error. This revealed that questions in the ‘Technology’ category have a higher prediction error than those in other categories.

word cloud of category feature for question_not_really_a_question

12. Final Model


The training data is small. If we split it further into train and validation sets for hold-out cross-validation, the train set becomes even smaller, which can result in less learning or in overfitting. We have also seen in the EDA that almost half of the question_title and question_body values are duplicates, which can result in data leakage when using hold-out cross-validation. We could use K-Fold cross-validation here, but it can suffer from the same data leakage problem because of the high number of duplicate questions.

GroupKFold is a better option for this dataset: it splits the data by group and makes sure the train and validation sets never contain data points from the same group. Check the example below for better clarification:

>>> from sklearn.model_selection import GroupKFold

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

For the final submission, I used 5 folds with GroupKFold cross-validation and took the average of the predictions over the 5 folds for each of the USE, BERT-Uncased, and BERT-Multicased models.
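To show why grouping removes the leakage, here is a simplified pure-Python stand-in for group-aware splitting (it is not scikit-learn's implementation). It assumes the grouping key is the question text, which the write-up does not state explicitly; `group_folds` and the sample data are hypothetical.

```python
from collections import defaultdict

def group_folds(groups, n_splits):
    """Assign whole groups to folds, largest groups first, balancing fold sizes."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(members[group])  # a group never spans two folds
    return folds

# Hypothetical question bodies; duplicated questions share a group.
question_body = ["q1", "q1", "q2", "q3", "q3", "q3", "q4", "q5", "q5", "q6"]
folds = group_folds(question_body, n_splits=3)

# Check every fold for groups that appear in both the train and validation parts.
leaks = []
for fold_id, val_idx in enumerate(folds):
    val_groups = {question_body[i] for i in val_idx}
    train_groups = {
        question_body[i] for i in range(len(question_body)) if i not in val_idx
    }
    leaks.extend(val_groups & train_groups)
    print(fold_id, sorted(val_idx))
print("leaking groups:", leaks)
```

With scikit-learn, the same effect is obtained by passing the chosen key as the `groups` argument of `GroupKFold.split`, as in the example above.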


Different models can learn different things about the data and this is the essence of the ensemble. In Machine Learning, Ensemble is a paradigm of combining results from several models trained on the same data to improve the overall performance (result).

The simple ensemble techniques are Majority Voting, Averaging, and Weighted Averaging. Advanced ensemble techniques are Bagging, Boosting, Stacking, and Blending.

I used weighted averaging of the USE, BERT-Uncased, and BERT-Multicased models, where more weight was given to the USE model as it gave the best results. A better way is to plot the validation score against the weights and take the weight combination that gives the best result.

final_prediction = 0.5 * use + 0.3 * bert_multicased + 0.2 * bert_uncased
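The one-line formula works directly on NumPy arrays. As a self-contained illustration, the sketch below does the same blend on plain nested lists; the weights are the ones from the formula, while `weighted_average` and the tiny prediction matrices are made up.

```python
def weighted_average(predictions, weights):
    """Blend per-model prediction matrices (rows x targets) with the given weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1 so the blend stays in [0, 1]")
    names = list(weights)
    n_rows = len(predictions[names[0]])
    n_cols = len(predictions[names[0]][0])
    return [
        [sum(weights[m] * predictions[m][r][c] for m in names) for c in range(n_cols)]
        for r in range(n_rows)
    ]

weights = {"use": 0.5, "bert_multicased": 0.3, "bert_uncased": 0.2}

# Made-up predictions: 2 rows x 2 target columns per model.
predictions = {
    "use": [[0.8, 0.2], [0.6, 0.4]],
    "bert_multicased": [[0.7, 0.3], [0.5, 0.5]],
    "bert_uncased": [[0.9, 0.1], [0.4, 0.6]],
}

final_prediction = weighted_average(predictions, weights)
print([[round(v, 2) for v in row] for row in final_prediction])
```

Keeping the weights summing to 1 guarantees that the blended values stay within [0,1] whenever the individual predictions do.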

Post Processing

As discussed in the Spearman’s section, I applied post-processing to the final prediction, which gave me the best results with these models. In the post-processing, I tried to make the distribution of the test predictions similar to that of the train data. I tried it on many target columns, but the columns for which I got a positive gain are “question_conversational”, “question_type_compare”, “question_type_definition”, “question_type_entity”, and “question_has_commonly_accepted_answer”. You can check the post-processing example code here: link
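The linked code is not reproduced here, so the following is only one plausible way to do such post-processing, not necessarily the method used for the submission: each prediction is snapped to the nearest value that the column takes in the train data. `snap_to_train_values` and the sample numbers are hypothetical.

```python
from bisect import bisect_left

def snap_to_train_values(predictions, train_values):
    """Replace each prediction with the closest value seen in the train column."""
    levels = sorted(set(train_values))
    snapped = []
    for p in predictions:
        pos = bisect_left(levels, p)
        # Only the neighbours on either side of p can be the closest level.
        candidates = levels[max(0, pos - 1):pos + 1]
        snapped.append(min(candidates, key=lambda v: abs(v - p)))
    return snapped

# Made-up column that takes only three distinct values in the train data.
train_column = [0.0, 0.0, 0.0, 0.5, 1.0, 0.0, 0.5]
test_predictions = [0.03, 0.07, 0.41, 0.58, 0.93]

processed = snap_to_train_values(test_predictions, train_column)
print(processed)  # [0.0, 0.0, 0.5, 0.5, 1.0]
```

Snapping introduces ties that mirror the ties in the targets, which is what the Spearman metric rewards. It can also hurt, for example if it collapses a column to a single value, which may be why a gain shows up on only some columns.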

With just the ensemble of three simple models and post-processing, I got the Spearman’s score of private lb = 0.35334 and public lb = 0.38834.

13. Deploying the Model on Google Compute Engine

In the real world, you build machine learning models for a particular business problem, so you need an API or web app to serve the model: you feed it the test (or unseen) data, and its output is either the solution itself or the input to some other application. For demonstration purposes, I created a simple web app based on Flask. It takes the input data, makes predictions with the USE, BERT-Uncased, and BERT-Multicased models, calculates the weighted ensemble of these models, and finally returns the 30 question and answer quality labels after applying the post-processing. The steps are as follows:

  • Create a web app with Flask
  • Deploy the app on a google compute engine

13.1 Create a Web app with Flask

Flask is a simple, lightweight, and minimalist web framework for Python. To create an API with Flask, first set up a virtual environment and install Flask as described in this link. Once you have set up the virtual environment and installed Flask, you can use flask-restful to create a simple API, or you can use blueprints to build large modular applications. Blueprints provide a better way to manage and build large applications in Flask. The directory structure of my web app is given below.

The module directory contains the main API logic file. It has two HTTP methods, GET and POST: the GET method is used to take the input through the web form and the POST method is used to process the input and display the final output in the browser. I used the weights of the 5 folds to make 5 predictions for each data point with each of the USE, BERT-Uncased, and BERT-Multicased models and calculated the mean of the 5 fold predictions. That gives the final prediction array for each of the three models. Then I applied the ensembling and the post-processing. The workflow and the code of the file are given below:

The workflow of API
code snippet for Flask API

Two further files contain the code to calculate the embeddings and to initialize the model architecture, so that the saved model weights (in .hdf5 format) can be loaded into the model instance and the model can make predictions. The code snippet to calculate the USE embeddings and to build the model is given below.

code snippet to calculate embeddings and to build a model

The static directory contains the CSS, jQuery, and image files used to build the front end of the web app. The templates directory contains the HTML file that displays a web form to take the input through the browser. The code for the web app is available on my GitHub repo; to run the app you just need to type the command > python

Screenshot of the web app

13.2 Deploy the App on Google Compute Engine

Google Compute Engine is a virtual machine service provided by Google Cloud Platform (GCP) that lets you run on-demand, configurable virtual machine instances in the cloud. If you don’t have a GCP account you can create one; GCP gives you one year of free access to a limited set of services via a free quota. To deploy the app on GCP you first need to create a virtual machine instance; you can follow this tutorial to create one: create_vm_instance. Once you have created a virtual machine instance, you can log in to it via ssh from the following Google console panel.

GCP console panel

Once you have logged in and installed all the necessary packages, clone the GitHub repo, change the IP and port in the launch file according to your virtual machine instance’s firewall rules, and run the command >python. The content of that file is below:

from app import app

if __name__ == "__main__":
    app.run(host='', port=5000, debug=True)

You can check the inference through the video of the working web app below

14. Summary & Future Work

The final submission can be further improved by ensembling other models such as BART and RoBERTa with the USE and BERT models. Different models can learn different things about the data, so ensembling these models with the current ones should most probably give an improved score. In my future work, I am also looking forward to experimenting with ideas like SWA, pseudo-labeling, and using the sequence and pooled outputs of transformers together.

If you have any questions or comments please feel free to get in touch using the comment section below or through LinkedIn

Aspiring Machine Learning Engineer.
