Price Recommendation for online sellers using Machine Learning

Rahul lodhi

Published in

The Startup

12 min read · Jul 1, 2020

Mercari Price Suggestion Challenge

My first machine learning self case study on a Regression problem and on a Kaggle competition.

1. Business Problem

Mercari is Japan’s biggest community-powered shopping app. Some items on Mercari cannot be sold because their listing prices are too high compared to the market price. Conversely, if the listing price is lower than the market price, customers lose out. Product pricing gets even harder at scale, considering just how many products are sold online.

Therefore, listing becomes easier if we automatically display a suitable price for users when they list an item.

In this problem statement, the goal is to predict a suitable price for an item from the seller’s text description of the product, the product category name, the brand name, and the item condition, so as to minimize the difference between the predicted price and the actual price.

source: https://www.kaggle.com/c/mercari-price-suggestion-challenge

2. Use of Machine Learning / Deep Learning

This is a regression problem, since the price we have to predict is a continuous variable. Machine learning and deep learning offer a range of regression (predictive modeling) algorithms that can learn the relationship among dataset variables such as product name, category, and price, and predict product prices from it. In this post, we’ll discuss how effectively regression algorithms such as Ridge, SGD, and MLP solved the problem.

3. Data Source

This case study is based on the Kaggle competition Mercari Price Suggestion Challenge. The dataset can be downloaded from https://www.kaggle.com/c/mercari-price-suggestion-challenge/overview/evaluation

Data Description

The dataset consists of two files, train.tsv and test.tsv, with the following fields:

train_id or test_id — the id of the listing
name — the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
item_condition_id — the condition of the items provided by the seller
category_name — the category of the listing
brand_name — brand name of the product
price — the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn’t exist in test.tsv since that is what you will predict
shipping — 1 if the shipping fee is paid by seller and 0 by buyer
item_description — the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]

4. Evaluation Metric — RMSLE

The evaluation metric (or error metric) for this competition is Root Mean Squared Logarithmic Error (RMSLE) which needs to be optimized. The lower the score is, the higher the accuracy of the price suggestion function will be.

https://hrngok.github.io/images/cost.jpg

Why RMSLE? For the same absolute error, RMSLE is lower when the predicted price is higher than the actual price and higher when the predicted price is lower than the actual price, so it penalizes underestimation more than overestimation. Since the business problem is to suggest prices to online sellers, it makes more sense to err on the side of suggesting prices above the actual price than below it. RMSLE is also robust to outliers: because the price distribution of the given products follows a log-normal distribution, taking logarithms keeps the effect of outliers small.
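
To make the asymmetry concrete, here is a minimal sketch of the metric, sqrt(mean((log(1 + predicted) - log(1 + actual))^2)). The function name `rmsle` and the example prices are illustrative assumptions, not code from the competition.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which also handles a price of 0
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Same absolute error of $10 around a true price of $20 (illustrative numbers)
over = rmsle([20.0], [30.0])   # prediction is $10 too high
under = rmsle([20.0], [10.0])  # prediction is $10 too low
print(f"over-prediction:  {over:.4f}")
print(f"under-prediction: {under:.4f}")
```

The under-prediction gets the larger error even though both predictions are off by the same dollar amount.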

5. Exploratory Data Analysis (EDA)

It is not easy to look at a single data field (column) or a whole spreadsheet and determine the important characteristics of the data. EDA is a crucial step for understanding how the data fields are distributed, summarizing their main characteristics, and discovering patterns and anomalies, all of which can be useful in building a better machine learning model. Let’s analyze the data and summarize its characteristics:

5.1. Univariate Analysis

Analysis based on one feature or variable only.

5.1.1 Price

Plotting the price histogram showed that price follows a log-normal distribution, i.e. most of the products lie within a small price range and only a few have high prices. The histogram also shows that prices range between $0 and $2500.

code snippet for plotting price histogram
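
The original plot is not reproduced here. As a rough numeric check of what a log-normal shape implies, the sketch below compares the skewness of prices before and after a log(1 + price) transform. The prices are synthetic (drawn from a log-normal distribution with made-up parameters), not the Mercari data.

```python
import numpy as np
import pandas as pd

# Synthetic log-normal "prices" (an assumption for illustration, not the Mercari data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.75, size=10_000))

raw_skew = prices.skew()            # strongly right-skewed on the raw scale
log_skew = np.log1p(prices).skew()  # much closer to symmetric after log(1 + price)

print(f"skew of price:        {raw_skew:.2f}")
print(f"skew of log1p(price): {log_skew:.2f}")
```

With the real data, the same two lines could be run on the `price` column of train.tsv.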

A boxplot gives a nice visualization of the data through quartiles. The price boxplot showed that 25% of the products are priced between $0 and $10, 50% between $0 and $17, and 75% between $0 and $29. Investigating further by plotting percentiles revealed that only 1% of the products have a price of more than $170.

code snippet for price boxplot
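
The quartiles behind a boxplot can also be read off directly with pandas. This is a minimal sketch on a hypothetical toy series; in the case study the input would be the `price` column of train.tsv.

```python
import pandas as pd

# Hypothetical toy prices chosen for illustration only
prices = pd.Series([3, 5, 8, 10, 12, 15, 17, 20, 24, 29, 35, 60, 170], dtype=float)

# The box of a boxplot is drawn from these three quartiles
quartiles = prices.quantile([0.25, 0.50, 0.75])

# Upper percentiles show how thin the expensive tail is
tail = prices.quantile([0.90, 0.95, 0.99])

print(quartiles)
print(tail)
```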

Price Distribution visualization through Boxplot

5.1.2 Brand Name

A barplot of the top 10 brand_name values revealed that almost 50% of the products don’t have a brand name (i.e. brand_name = missing). The 10 most frequent values are ‘missing’, ‘PINK’, ‘Nike’, “Victoria’s Secret”, ‘LuLaRoe’, ‘Apple’, ‘FOREVER 21’, ‘Nintendo’, ‘Lululemon’, ‘Michael Kors’.

code snippet for brand_name barplot

5.1.3 Category Name

This is the category of the item, e.g. Men/Tops/T-shirts, Women/Jewelry/Necklaces, etc. It is separated by ‘/’ into three parts, so to generalize better we’ll split category_name into three features, namely category_one, category_two, and category_three.
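
A minimal sketch of that split with pandas is shown below. The placeholder used for missing categories and the choice to keep any extra ‘/’ inside category_three are assumptions, not details from the original code.

```python
import pandas as pd

df = pd.DataFrame(
    {"category_name": ["Men/Tops/T-shirts", "Women/Jewelry/Necklaces", None]}
)

# Fill missing categories with a placeholder so every row yields three parts (assumption)
filled = df["category_name"].fillna("missing/missing/missing")

# n=2 limits the split to the first two '/', so any extra '/' stays in category_three
parts = filled.str.split("/", n=2, expand=True)
parts.columns = ["category_one", "category_two", "category_three"]

df = pd.concat([df, parts], axis=1)
print(df)
```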

5.1.3.1 Category One

Looking at the histogram, we can say this is the main category and the other two are its subcategories. There are 10 unique category_one values: ‘Women’, ‘Beauty’, ‘Kids’, ‘Electronics’, ‘Men’, ‘Home’, ‘Other’, ‘Vintage & Collectibles’, ‘Handmade’, ‘Sports & Outdoors’. ‘Women’ and ‘Beauty’ are the most frequent category_one values, which reveals that there are more products for women than for the rest of the categories.

5.1.3.2 Category two

There are 111 unique category_two values, and the most frequent one is “athletic apparel”.

The 10 most frequent category_two values are plotted in the barplot.

5.1.3.3 Category three

There are 825 unique category_three values, and the most frequent one is t-shirt. The top 10 category_three values are plotted in a barplot.

5.1.4 Shipping

Shipping is either 1, if the shipping fee is paid by the seller, or 0, if it is paid by the buyer.

5.1.5 Item Description

The description of the product or item is given by item_description. To get a quick visualization of what it contains, we plotted a word cloud of item_description, in which frequently occurring words appear in larger font sizes and less frequent words appear smaller.

5.1.6 Item Condition ID

This is the condition of the item as provided by the seller, e.g. brand new, used, etc. The item_condition_id takes 5 values, from 1 to 5.

5.2. Bivariate Analysis

Analysis based on two features or variables to find out the empirical relationship between them.

5.2.1 Price and Shipping

Boxplotting revealed some relationship between price and shipping: prices are higher when shipping = 0 and lower when shipping = 1. Going by the field definition in the data description (1 if the fee is paid by the seller, 0 if by the buyer), items whose shipping is paid by the buyer tend to be priced higher.

5.2.2 Item Condition Id and Price

Prices for item_condition_id 5 are higher than others.

5.2.3 Category One and Price

Prices in the ‘Men’ category are higher than in the other categories.

6. Feature Engineering

Extracting new features from raw data is feature engineering. These features can be useful in improving the performance of the machine learning model.

Filling missing brand_name — Mining the item_description and name features showed that many of their values contain the brand name of products whose brand_name is missing in the dataset. So we filled the missing 50% of brand_name values using the common words of the name and item_description features, which reduced the share of ‘missing’ brand_name values to around 20%. Filling brand_name with this method was worthwhile, as it helped in reducing the RMSLE.
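
One way to sketch this idea is a lookup of known brands against the words of the listing name. The `known_brands` mapping and the `guess_brand` helper below are hypothetical, and this simplified version only handles single-word brands; it is not the original implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Nike running shoes", "Vintage lamp", "apple iphone case"],
    "brand_name": ["missing", "missing", "missing"],
})

# Hypothetical lookup of known brands; in practice it could be collected from
# the non-missing brand_name values of the training set
known_brands = {"nike": "Nike", "apple": "Apple"}

def guess_brand(row):
    """Return the existing brand, or the first known brand found in the name."""
    if row["brand_name"] != "missing":
        return row["brand_name"]
    for word in row["name"].lower().split():
        if word in known_brands:
            return known_brands[word]
    return "missing"

df["brand_name"] = df.apply(guess_brand, axis=1)
print(df)
```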

NLP Features — Of all the NLP features I tried, the only ones that worked in my case were the length of item_description (desc_len) and the count of words in item_description (desc_cnt).
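
Both features take a couple of lines with pandas string methods. This is a minimal sketch; treating a missing description as an empty string and counting whitespace-separated tokens as words are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"item_description": ["Brand new with tags", "Used once", None]})

# Treat a missing description as empty text (assumption)
desc = df["item_description"].fillna("")

df["desc_len"] = desc.str.len()              # number of characters
df["desc_cnt"] = desc.str.split().str.len()  # number of whitespace-separated words
print(df)
```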

Magic Feature — I saw in a kernel that joining features as text with ‘_’ was found to be useful, so I made three new features by combining category, item_condition_id, and shipping in the form category_one + ‘_’ + shipping + ‘_’ + item_condition_id, which reduced the RMSLE by 1%. Other features I tried for reducing RMSLE were the sentiment scores of item_description, the length of item_description, and the number of words in item_description.

df['cat_1_ship'] = df.category_one + '_' + df.shipping.astype(str) + '_' + df.item_condition_id.astype(str)
df['cat_2_ship'] = df.category_two + '_' + df.shipping.astype(str) + '_' + df.item_condition_id.astype(str)
df['cat_3_ship'] = df.category_three + '_' + df.shipping.astype(str) + '_' + df.item_condition_id.astype(str)

Sentiments Score — A set of scores that shows the polarity of a sentence, i.e. how positive, negative, and neutral it is. The positive, negative, and neutral scores each range from 0 to 1 (0–100%), and an overall score is given by compound, which ranges from -1 (most negative) to +1 (most positive). nltk provides a class called SentimentIntensityAnalyzer, whose polarity_scores() method I used to get the sentiment scores of item_description.

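from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs the 'vader_lexicon' resource
si_obj = SentimentIntensityAnalyzer()
si_obj.polarity_scores(sentence)  # dict with 'neg', 'neu', 'pos' and 'compound'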

Correlation Heatmap — Features that correlate well with the target variable (here, price) can be useful for linear models. To judge the added features, I plotted a correlation heatmap, which showed that desc_cnt (item_description word count), desc_len (item_description length), and the positive, negative, neutral, and compound features have some correlation with the price, so they were worth considering. Applying the Ridge model with these features showed that only desc_cnt and desc_len help in reducing RMSLE, so I left the sentiment features out of my final submission.

7. Existing Approaches

Some of the existing approaches to this problem are given below

7.1. Ridge Model — link

Used Damerau-Levenshtein distance to fill the missing brand_name and item_description fields, and discovered regex patterns in the name and item_description fields with negative weights that can reduce RMSLE.

7.2 CNN with GloVe — link

A single deep learning model (CNN with GloVe for word embedding initialization).

7.3 LGB and FM — link

Ensemble of Light GBM model and Factorization Machine.

8. Improvements

Encoding and ColumnTransformer

ColumnTransformer is very useful when you have a heterogeneous dataset and a different transformer needs to be applied to each column.

I applied ColumnTransformer with the transformers CountVectorizer, TfidfVectorizer, and Normalizer to transform the text and numerical features into a single feature space, which sped up the encoding process, saved memory, and made the code look clean.

code snippet data encoding
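
The original encoding snippet is not reproduced here. The following is a minimal sketch of how those three transformers can be combined in one ColumnTransformer. The toy rows and the assignment of transformers to columns are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Toy rows for illustration only
df = pd.DataFrame({
    "name": ["nike shoes size 9", "apple phone case", "nike running shirt"],
    "item_description": ["brand new never worn", "used but works fine", "worn once"],
    "desc_len": [20, 19, 9],
    "desc_cnt": [4, 4, 2],
})

encoder = ColumnTransformer(
    transformers=[
        # Text vectorizers expect a single column name (a string, not a list)
        ("name_bow", CountVectorizer(), "name"),
        ("desc_tfidf", TfidfVectorizer(), "item_description"),
        # Normalizer rescales each row of the selected numeric columns to unit norm
        ("num", Normalizer(), ["desc_len", "desc_cnt"]),
    ]
)

# All columns end up side by side in one feature matrix
features = encoder.fit_transform(df)
print(features.shape)
```

Fitting the encoder once on the training data and reusing it with `transform` on the test data keeps both in the same feature space.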

Cross-Validation and Ridge Regression Model

I used RidgeCV() with 5-fold cross-validation to improve the performance of the Ridge regression model. I tried alphas = [0.1, 0.5, 1, 5, 7, 10], and alpha = 7 gave the best RMSLE score of 0.43.

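from sklearn.linear_model import RidgeCV
# vec_fea_train: encoded training features, y_train: training target
ridgeReg = RidgeCV(alphas=[0.1, 0.5, 1, 5, 7, 10], cv=5)
ridgeReg.fit(vec_fea_train, y_train)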

code snippet ridge model

ELI5 — Debugging Machine learning Model

ELI5 is a nice library for visualizing, debugging, and explaining machine learning models. Its show_weights() method lets us visualize the positive and negative weights a model gives to the features. It helped me improve my data cleaning process (adding lemmatization and stop words, removing emoji symbols) and discover some patterns in the item_description and name features that reduced RMSLE by a bit.

Some useful findings that eli5 revealed are:

stopwords = [‘why’, ‘x’, ‘w’, ‘s’, ‘e’, ‘ty’, ‘wgf’] for the name feature
WordNetLemmatizer() — there were some inflected words in the name feature that were contributing to high negative weights, like sticker and stickers, coupon and coupons, etc.
sz — in the name feature ‘sz’ was used as an abbreviation of size; replacing sz with size using a regex was useful.

import eli5
# top = (number of positive weights, number of negative weights) to show
eli5.show_weights(ridgeReg, top=(1000, 7000), vec=vectorizer)
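
The ‘sz’ replacement and the extra stop words listed above can be applied with a few lines of standard-library Python. This is a minimal sketch; the `clean_name` helper is a hypothetical name, and lemmatization is left out to keep the example dependency-free.

```python
import re

# Extra stop words for the name feature, taken from the list above
NAME_STOPWORDS = {"why", "x", "w", "s", "e", "ty", "wgf"}

def clean_name(text):
    """Lowercase, expand 'sz' to 'size', and drop the extra stop words."""
    text = text.lower()
    # \b restricts the match to the standalone token 'sz'
    text = re.sub(r"\bsz\b", "size", text)
    tokens = [tok for tok in text.split() if tok not in NAME_STOPWORDS]
    return " ".join(tokens)

print(clean_name("Nike Shoes sz 9 w box"))
```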

LGBM Model

LightGBM (LGBM) is a fast, high-performance gradient boosting framework based on decision tree algorithms. Applying the LGBM model after hyperparameter tuning with RandomizedSearchCV() gave an RMSLE of 0.45.

code snippet for hyperparameter tuning LGBM Model

The best hyperparameters returned by RandomizedSearchCV are:

params = {
    'objective': 'regression',
    'learning_rate': 0.5,
    'max_depth': 7,
    'n_estimators': 600,
    'num_leaves': 120
}

SGD Model

An SGD Regressor model with hyperparameter tuning gave an RMSLE of 0.47. The best hyperparameters returned by GridSearchCV are:

alpha = 1e-08, l1_ratio = 0.3

code snippet for SGD Regressor
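
The original snippet is not reproduced here. Below is a minimal sketch of a grid search over alpha and l1_ratio with scikit-learn. The synthetic data, the grid values, and the elastic-net penalty (the setting under which l1_ratio takes effect) are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the case study used the encoded Mercari features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

param_grid = {
    "alpha": [1e-8, 1e-6, 1e-4],  # regularization strength
    "l1_ratio": [0.1, 0.3, 0.5],  # mix between L1 and L2 penalties
}

search = GridSearchCV(
    SGDRegressor(penalty="elasticnet", max_iter=1000, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

If the model is trained on log1p(price), the RMSE reported by this scorer on that target is the same quantity as the RMSLE on the original prices.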

MLP Model

A simple feedforward neural network with densely connected hidden layers gave an RMSLE of 0.48. The best submission for this challenge used an MLP model with sparsely connected features, so the performance of this MLP model could be improved further by using sparsely connected layers. That is a little tricky, so I would like to implement the model with sparse features in future work. The structure of the MLP model with dense layers is given below.

9. Comparison of Models

As Ridge was outperforming the rest of the models, I first made a submission with a weighted ensemble of 4 Ridge models: 3 models were trained per category and the 4th was trained on all the categories. The final weighted ensemble gave an RMSLE of 0.44 on Kaggle submissions, but that was not sufficient to place my submission on the leaderboard. Below is the summary of the models:

10. Final Submission — Stacking

For the final submission, I ensembled the Ridge and LGBM models through StackingRegressor(). Stacking is an ensembling method in which the outputs of multiple estimators (models) become the input of a final estimator, which learns how best to combine them to give the final prediction. Stacking two Ridge models with alpha = 7 and alpha = 10 respectively and one LGBM model with a tweaked max_depth, with another Ridge model with alpha = 7 as the final estimator, gave an RMSLE score of 0.4215.

code snippet for ensemble
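
The original ensemble snippet is not reproduced here. The sketch below shows the same stacking layout with scikit-learn on synthetic data. Note that scikit-learn's GradientBoostingRegressor is swapped in for the LGBM model purely to avoid an extra dependency, and its settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Synthetic stand-in data; the case study used the encoded Mercari features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

stack = StackingRegressor(
    estimators=[
        ("ridge_a7", Ridge(alpha=7)),
        ("ridge_a10", Ridge(alpha=10)),
        # Stand-in for the LGBM model (assumption, not the original estimator)
        ("gbm", GradientBoostingRegressor(max_depth=7, n_estimators=50, random_state=0)),
    ],
    # The final estimator is trained on cross-validated predictions of the base models
    final_estimator=Ridge(alpha=7),
    cv=5,
)
stack.fit(X, y)
preds = stack.predict(X[:5])
print(preds)
```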

My final submission on Kaggle, ensembled through the stacking regressor, gave an RMSLE score of 0.42912, which would be within the top 8% of the submissions.

11. Future Work

Sparse MLP — The best submission for this problem used an MLP model with sparsely connected hidden layers, and I’ll be focusing on that to further improve the submission.
Wordbatch, FM — Wordbatch with an RNN and a Factorization Machine (FM) can also be tried.
Multiprocessing — Running models per system core is one of the key ideas that can be very useful when you have a small timeframe for training and prediction.

12. Summary, References

It was a great experience doing this case study, and I hope this blog will help you in solving similar problems. A lot can be done to further improve the model's performance, but due to time limitations I am stopping this blog here.

For suggestions or code-related queries, or if you want to connect with me, you can follow me on Medium or join my LinkedIn network: LinkedIn Profile

References

7. https://www.kaggle.com/peterhurford/lgb-and-fm-18th-place-0-40604