Price Recommendation for online sellers using Machine Learning

Rahul Lodhi · Published in The Startup · Jul 1, 2020 · 12 min read


Mercari Price Suggestion Challenge

My first machine learning self case study, tackling a regression problem from a Kaggle competition.

Photo by Sebastian Unrau on Unsplash

1. Business Problem

Mercari is Japan’s biggest community-powered shopping app. Some items on Mercari cannot be sold because their listing prices are too high compared to the market price. Conversely, if the listing price is lower than the market price, customers lose out. Product pricing gets even harder at scale, considering just how many products are sold online.

Therefore, listing becomes easier if we automatically display a suitable price for users when they list an item.

In this problem, the goal is to predict a suitable price for an item given its user-inputted text description, category name, brand name, and item condition, so as to minimize the difference between the predicted price and the actual price.

source: https://www.kaggle.com/c/mercari-price-suggestion-challenge

2. Use of Machine Learning / Deep Learning

This is a regression problem, since price is a continuous variable that we have to predict. Machine learning provides a range of ML/DL regression (predictive modeling) algorithms that can learn the relationships among the dataset variables, such as product name, category, and price, and predict product prices from them. In this post, we’ll discuss how effectively some of these regression algorithms, such as Ridge, SGD, and MLP, solve the problem.

3. Data Source

This case study is based on the Kaggle competition Mercari Price Suggestion Challenge; the dataset can be downloaded from https://www.kaggle.com/c/mercari-price-suggestion-challenge/overview/evaluation

Data Description

The dataset consists of two files, train.tsv and test.tsv, with the following fields:

  1. train_id or test_id — the id of the listing
  2. name — the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
  3. item_condition_id — the condition of the items provided by the seller
  4. category_name — the category of the listing
  5. brand_name — brand name of the product
  6. price — the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn’t exist in test.tsv since that is what you will predict
  7. shipping — 1 if the shipping fee is paid by seller and 0 by buyer
  8. item_description — the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
train data

4. Evaluation Metric — RMSLE

The evaluation metric (or error metric) for this competition is the Root Mean Squared Logarithmic Error (RMSLE), which needs to be minimized. The lower the score, the more accurate the price suggestion function will be.

RMSLE = sqrt( (1/n) · Σᵢ ( log(pᵢ + 1) − log(aᵢ + 1) )² ), where n is the number of items, pᵢ is the predicted price of item i, and aᵢ is its actual price.

Why RMSLE? Because of the logarithm, RMSLE is asymmetric: if the predicted price is higher than the actual price, the penalty is relatively low, whereas if the predicted price is lower than the actual price, the penalty is high. Since the business problem is to suggest prices to online sellers, it makes more sense to err on the side of suggesting prices slightly above the actual price than below it. RMSLE is also robust to outliers: since the price distribution of the given products follows a log-normal distribution, the effect of outliers is small.
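To make the metric concrete, here is a minimal sketch of it in Python (variable names are illustrative):

import numpy as np

def rmsle(actual, predicted):
    # Root Mean Squared Logarithmic Error between actual and predicted prices
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Over-prediction is penalised less than under-prediction of the same size:
print(rmsle([100], [150]))  # ~0.40
print(rmsle([100], [50]))   # ~0.68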

5. Exploratory Data Analysis (EDA)

It is not easy to look at a single data field (column), or at a whole spreadsheet, and determine the important characteristics of the data. EDA is a crucial step in understanding how the data fields are distributed in the dataset, summarizing their main characteristics, and discovering patterns and anomalies, all of which can be useful in building a better machine learning model. Let’s analyze the data and summarize its characteristics:

5.1. Univariate Analysis

Analysis based on one feature or variable only.

5.1.1 Price

Plotting the price histogram shows that price follows a log-normal distribution, i.e. most of the products lie within a small price range while a few products have very high prices. We can also see that prices range between $0 and $2,500.

code snippet for plotting price histogram
Distribution of price
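The embedded snippet is not reproduced here; below is a minimal sketch of how such a histogram can be plotted (file path and column names as in the competition data):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train = pd.read_csv('train.tsv', sep='\t')

# log1p makes the log-normal shape visible and handles zero-priced listings
plt.hist(np.log1p(train['price']), bins=50)
plt.xlabel('log(price + 1)')
plt.ylabel('number of products')
plt.title('Distribution of price')
plt.show()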

A boxplot gives a nice visualization of the data through quartiles. Boxplotting the price showed that 25% of the products are priced between $0 and $10, 50% between $0 and $17, and 75% between $0 and $29. Investigating further by plotting percentiles revealed that only 1% of the products are priced above $170.

code snippet for price boxplot
Price Distribution visualization through Boxplot


90–100 percentile value of the price
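The same numbers can be read off directly from quantiles (continuing with the train DataFrame loaded above):

print(train['price'].quantile([0.25, 0.50, 0.75, 0.99]))
# roughly $10, $17, $29 and $170 respectively, matching the boxplot and percentile plots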

5.1.2 Brand Name

Plotting a bar plot of the top 10 brand_name values revealed that almost 50% of the products have no brand_name (i.e. brand_name = missing). The 10 most frequent values are ‘missing’, ‘PINK’, ‘Nike’, “Victoria’s Secret”, ‘LuLaRoe’, ‘Apple’, ‘FOREVER 21’, ‘Nintendo’, ‘Lululemon’, and ‘Michael Kors’.

code snippet for brand_name barplot
BarPlot of Top 10 Brand Names
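A minimal sketch of the bar plot; filling empty brand names with the placeholder 'missing' is an assumption about the preprocessing:

import matplotlib.pyplot as plt

train['brand_name'] = train['brand_name'].fillna('missing')
train['brand_name'].value_counts().head(10).plot(kind='barh')
plt.xlabel('number of products')
plt.title('Top 10 brand names')
plt.show()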

5.1.3 Category Name

This is the category of the item, e.g. Men/Tops/T-shirts, Women/Jewelry/Necklaces, etc. It is separated by ‘/’ into three parts, so to generalize better we split category_name into three features: category_one, category_two, and category_three, as sketched below.
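A minimal sketch of the split; handling missing or malformed categories with a 'missing' placeholder is my assumption:

def split_category(text):
    # category_name looks like "Men/Tops/T-shirts"; anything after the
    # second "/" stays inside category_three
    try:
        cat1, cat2, cat3 = text.split('/', 2)
        return cat1, cat2, cat3
    except (AttributeError, ValueError):   # NaN or malformed category strings
        return 'missing', 'missing', 'missing'

train['category_one'], train['category_two'], train['category_three'] = \
    zip(*train['category_name'].apply(split_category))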

5.1.3.1 Category One

Looking at the histogram, we can say that this is the main category and the other two are its subcategories. There are 10 unique category_one values: ‘Women’, ‘Beauty’, ‘Kids’, ‘Electronics’, ‘Men’, ‘Home’, ‘Other’, ‘Vintage & Collectibles’, ‘Handmade’, and ‘Sports & Outdoors’. Most products fall under the Women and Beauty categories, which reveals that there are more products for women than for the other categories.

BarPlot of category_one

5.1.3.2 Category two

There are 111 unique category_two values, and the most frequent one is “Athletic Apparel”.

The top 10 category_two values are plotted in the bar plot below.

Barplot of Top 10 Category Two

5.1.3.3 Category three

There are 825 unique category_three values, and the most frequent one is “T-shirt”. The top 10 category_three values are plotted in the bar plot below.

BarPlot Top 10 Category Three

5.1.4 Shipping

Shipping is either 1, if the shipping fee is paid by the seller, or 0, if it is paid by the buyer.

Distribution of shipping

5.1.5 Item Description

The description of the product or item is given by item_description, and to get a quick visualization of its contents we plotted a word cloud of item_description. Frequently occurring words appear in a larger font, while less frequent words appear smaller.

WordCloud of item_description
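A minimal sketch using the wordcloud package; sampling the descriptions is my addition to keep it fast:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = ' '.join(train['item_description'].dropna().astype(str).sample(20000, random_state=42))
wc = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()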

5.1.6 Item Condition ID

This is the condition of the item as provided by the seller, e.g. brand new, used, etc. item_condition_id takes 5 values, ranging from 1 to 5.

5.2. Bivariate Analysis

Analysis based on two features or variables to find out the empirical relationship between them.

5.2.1 Price and Shipping

Boxplotting revealed that there is some relationship between price and shipping: prices are higher when the shipping is paid by the buyer (i.e. shipping = 0) and lower when the shipping is paid by the seller (i.e. shipping = 1).

Boxplot shipping and price

5.2.2 Item Condition Id and Price

  • Prices for item_condition_id 5 are higher than others.

5.2.3 Category One and Price

Prices in the Men category are higher than in the other categories.

BoxPlot of category_one and price

6. Feature Engineering

Feature engineering is the process of extracting new features from raw data. These features can be useful in improving the performance of a machine learning model.

Filling missing brand_name — Mining the item_description and name features gave the insight that plenty of them contain the brand name of products whose brand_name is missing in the dataset. So we filled the missing ~50% of brand_name values using matching words from the name and item_description features, which reduced the ‘missing’ brand_name share to around 20% by filling brand_name with reasonable values. Filling brand_name with this method was worthwhile, as it helped reduce the RMSLE. A sketch of the idea follows.
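A minimal sketch of one way to do this, matching the listing text against the set of known brands; the exact matching rule used in my submission may differ:

train['brand_name'] = train['brand_name'].fillna('missing')
known_brands = set(train['brand_name'].unique()) - {'missing'}

def guess_brand(row):
    # keep existing brands; otherwise look for a known brand inside the listing text
    if row['brand_name'] != 'missing':
        return row['brand_name']
    text = str(row['name']) + ' ' + str(row['item_description'])
    for brand in known_brands:
        if brand in text:
            return brand
    return 'missing'

train['brand_name'] = train.apply(guess_brand, axis=1)  # slow but illustrative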

NLP Features — Of all the NLP features I tried, the only ones that worked in my case were the length of item_description (desc_len) and the word count of item_description (desc_cnt), computed as sketched below.
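A minimal sketch of these two features:

train['item_description'] = train['item_description'].fillna('missing')
train['desc_len'] = train['item_description'].str.len()               # character length
train['desc_cnt'] = train['item_description'].str.split().str.len()   # word count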

Magic Feature — I saw in a kernel that combining text features with ‘_’ was found to be useful, so I tried making three new features by combining category, item_condition_id, and shipping in this way: category_one + ‘_’ + shipping + ‘_’ + item_condition_id. This reduced the RMSLE by about 1%. Other features that contributed to reducing the RMSLE were the sentiment scores of item_description, the length of item_description, and the number of words in item_description.

df['cat_1_ship'] = df.category_one + '_' + df.shipping.astype(str) + '_' + df.item_condition_id.astype(str)
df['cat_2_ship'] = df.category_two + '_' + df.shipping.astype(str) + '_' + df.item_condition_id.astype(str)
df['cat_3_ship'] = df.category_three + '_' + df.shipping.astype(str) + '_' + df.item_condition_id.astype(str)

Sentiment Scores — scores ranging between 0 and 100% that show the polarity of a sentence, i.e. how positive, negative, and neutral it is; a score of 100% indicates total positivity, and vice versa. An overall score is given by the compound value. NLTK has a handy SentimentIntensityAnalyzer() class, which I used to get the sentiment scores of item_description.

from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

si_obj = SentimentIntensityAnalyzer()
si_obj.polarity_scores(sentence)  # returns a dict with 'neg', 'neu', 'pos' and 'compound' scores

Correlation Heatmap — Features showing a good correlation with the target variable (price, in this scenario) can be useful for linear models. So, to judge the credibility of the added features, I plotted a heatmap, which revealed that desc_cnt (item_description word count), desc_len (item_description length), and the positive, negative, neutral, and compound sentiment features all have some correlation with price, so these features can be considered. Applying the Ridge model with these features showed that only desc_cnt and desc_len helped reduce the RMSLE, so in my final submission I dropped the sentiment features.

Correlation HeatMap of features
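A minimal sketch of the heatmap, assuming the engineered columns above (with the sentiment scores stored under illustrative column names) exist on train:

import matplotlib.pyplot as plt
import seaborn as sns

cols = ['price', 'desc_len', 'desc_cnt', 'positive', 'negative', 'neutral', 'compound']
sns.heatmap(train[cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation of engineered features with price')
plt.show()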

7. Existing Approaches

Some of the existing approaches to this problem are given below

7.1. Ridge Model — link

Used the Damerau-Levenshtein distance to fill the missing brand_name and item_description fields, and discovered regex patterns in the name and item_description fields with negative weights that can reduce the RMSLE.

7.2 CNN with GloVE — link

A single Deep Learning model (CNN with GloVE for word embeddings initialization).

7.3 LGB and FM — link

Ensemble of Light GBM model and Factorization Machine.

8. Improvements

Encoding and ColumnTransformer

ColumnTransformer is very useful when you have a heterogeneous dataset and a different transformer needs to be applied to each column.

I applied ColumnTransformer with CountVectorizer, TfidfVectorizer, and Normalizer transformers to encode the text and numerical features into a single feature space, which speeds up the encoding process, saves memory, and keeps the code clean.

code snippet data encoding
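The embedded snippet is not reproduced here; below is a minimal sketch of such an encoding pipeline (the vectorizer settings and column choices are illustrative, not the exact ones from my submission):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

# assumes missing text fields were already filled with 'missing'
encoder = ColumnTransformer([
    ('name_cv',    CountVectorizer(ngram_range=(1, 2), max_features=50000), 'name'),
    ('desc_tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=100000), 'item_description'),
    ('brand_cv',   CountVectorizer(token_pattern='.+'), 'brand_name'),
    ('cat_cv',     CountVectorizer(token_pattern='[^/]+'), 'category_name'),
    ('num',        Normalizer(), ['desc_len', 'desc_cnt', 'shipping', 'item_condition_id']),
])

vec_fea_train = encoder.fit_transform(train)  # one sparse matrix for all features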

Cross-Validation and Ridge Regression Model

I used RidgeCV() with 5-fold cross-validation to improve the performance of the Ridge regression model. I tried different alphas = [0.1, 0.5, 1, 5, 7, 10], and it turned out that alpha = 7 gives the best RMSLE score of 0.43.

from sklearn.linear_model import RidgeCV

ridgeReg = RidgeCV(alphas=[0.1, 0.5, 1, 5, 7, 10], cv=5)
ridgeReg.fit(vec_fea_train, y_train)  # y_train: the (log-transformed) price target
code snippet ridge model

ELI5 — Debugging Machine learning Model

ELI5 is a nice library that allows us to visualize, debug, and explain machine learning models. Its show_weights() method visualizes the positive and negative weights a model assigns to the features. It helped me improve my data cleaning process (added lemmatization and stop words, removed emoji symbols) and discover some patterns in the item_description and name features, which helped me reduce the RMSLE a bit.

Some useful findings that eli5 revealed are —

  • stopwords=[‘why’, ‘x’, ‘w’, ‘s’, ‘e’, ‘ty’, ‘wgf’] for name feature
  • WordNetLemmatizer() — there were some inflected words in the name feature that contributed high negative weights, such as sticker/stickers and coupon/coupons, so lemmatization helped.
  • sz — in the name feature, ‘sz’ was used as an abbreviation of size; replacing sz with size using a regex was useful.
import eli5

# top = (number of positive weights, number of negative weights) to display
eli5.show_weights(ridgeReg, top=(1000, 7000), vec=vectorizer)
Debugging model with Eli5

LGBM Model

Light GBM (LGBM) is a fast, high-performance gradient boosting framework based on decision tree algorithms. Applying the LGBM model after hyperparameter tuning with RandomizedSearchCV() gave an RMSLE of 0.45.

code snippet for hyperparameter tuning LGBM Model
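The embedded snippet is not reproduced here; below is a minimal sketch of the tuning setup (the search space is illustrative, not my exact grid):

from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'learning_rate': [0.05, 0.1, 0.3, 0.5],
    'max_depth': [5, 7, 9, 11],
    'n_estimators': [200, 400, 600, 800],
    'num_leaves': [31, 60, 120, 240],
}
search = RandomizedSearchCV(LGBMRegressor(objective='regression'), param_dist,
                            n_iter=10, cv=3,
                            scoring='neg_root_mean_squared_error', random_state=42)
search.fit(vec_fea_train, y_train)
print(search.best_params_)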

Best hyperparameters returned by RandomizedSearchCV are —

params = { 
'objective': 'regression',
'learning_rate': 0.5,
'max_depth': 7,
'n_estimators': 600,
'num_leaves': 120
}

SGD Model

The SGDRegressor model, applied with hyperparameter tuning, gave an RMSLE of 0.47. The best hyperparameters returned by GridSearchCV are —

alpha = 1e-08, l1_ratio = 0.3
code snippet for SGD Regressor
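A minimal sketch of the model with those hyperparameters; the elasticnet penalty is my assumption, since l1_ratio only has an effect with it:

from sklearn.linear_model import SGDRegressor

sgdReg = SGDRegressor(penalty='elasticnet', alpha=1e-08, l1_ratio=0.3, random_state=42)
sgdReg.fit(vec_fea_train, y_train)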

MLP Model

A simple feedforward neural network with densely connected hidden layers gave an RMSLE of 0.48. The best submission for this challenge used an MLP model with sparsely connected features, so the performance of this MLP model could be further improved with sparsely connected layers. That is a little bit tricky, so in my future work I would like to implement this model with sparse features. The structure of the MLP model with dense layers is given below.

the plot of MLP model
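As a stand-in for the dense network in the plot, here is a minimal sketch using scikit-learn's MLPRegressor (which accepts the sparse feature matrix directly); the hidden layer sizes are illustrative, not the exact architecture I used:

from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(256, 64), activation='relu',
                   max_iter=10, random_state=42)
mlp.fit(vec_fea_train, y_train)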

9. Comparison of Models

As Ridge was outperforming the rest of the models, I first made a submission with a weighted ensemble of 4 Ridge models: 3 models were trained per category and the 4th was trained on all the categories. This weighted ensemble gave an RMSLE of 0.44 on the Kaggle submission, but that was not good enough to place my submission high on the leaderboard. Below is a summary of the models —

screenshot of models comparison

10. Final Submission — Stacking

For the final submission, I ensembled the Ridge and LGBM models through StackingRegressor(). Stacking is an ensembling method in which the outputs of multiple estimators (models) become the input of a final estimator, which learns how to best combine them to give the final prediction. Stacking two Ridge models with alpha = 7 and alpha = 10 respectively, plus one LGBM model with a tweaked max_depth, and using another Ridge model with alpha = 7 as the final estimator, gave an RMSLE score of 0.4215.

code snippet for ensemble
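The embedded snippet is not reproduced here; below is a minimal sketch of the stacking setup described above (the LGBM parameters and the name vec_fea_test are illustrative):

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor

stack = StackingRegressor(
    estimators=[
        ('ridge_7',  Ridge(alpha=7)),
        ('ridge_10', Ridge(alpha=10)),
        ('lgbm',     LGBMRegressor(objective='regression', max_depth=9, n_estimators=600)),
    ],
    final_estimator=Ridge(alpha=7),
)
stack.fit(vec_fea_train, y_train)
preds = stack.predict(vec_fea_test)  # vec_fea_test: the encoded test features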

My final submission on Kaggle, with ensembling through the stacking regressor, gave an RMSLE score of 0.42912, which would place it within the top 8% of the submissions.

screenshot of Kaggle submission

11. Future Work

  • Sparse MLP — The best submission for this problem used an MLP model with sparsely connected hidden layers, and I’ll be focusing on that to further improve my submission.
  • Wordbatch, FM — Wordbatch with an RNN and a Factorization Machine (FM) can also be tried.
  • Multiprocessing — Running models on separate CPU cores is one of the key ideas that can be very useful when you have a small time frame for training and prediction.

12. Summary, References

It was a great experience doing this case study, and I hope this blog will help you solve similar problems. A lot more could be done to further improve the models’ performance, but due to time limitations I am stopping this blog here.

For suggestions, code related queries or if you want to connect with me you can follow me on medium or can join my LinkedIn network — LinkedIn Profile

References

  1. https://www.kaggle.com/c/mercari-price-suggestion-challenge/discussion/50256
  2. https://www.kaggle.com/dromosys/ridge-lb-0-41943-716acd
  3. https://www.kaggle.com/mchahhou/second-place-solution
  4. https://www.kaggle.com/lopuhin/mercari-golf-0-3875-cv-in-75-loc-1900-s
  5. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  6. https://www.kaggle.com/lopuhin/eli5-for-mercari
  7. https://www.kaggle.com/peterhurford/lgb-and-fm-18th-place-0-40604
