In this post I will be talking about different types of recommendation systems and how I developed a basic product recommendation engine for Wayfair.
First, let's discuss the different types of recommendation systems. There are three main types: collaborative filtering, content-based filtering, and hybrid systems that combine the two.
Since I don't have access to Wayfair's user data and the other datasets a collaborative approach would need, I will build a content-based recommendation system. I will use data that is already public, scraping Wayfair's website with a service called import.io.
I will only scrape these particular fields: 'product_name', 'product_name_link', 'manufacturer', 'price', 'original_price', 'review count', 'review data', 'shipping time', 'product descrip', 'product data'. And I will only consider these particular product categories: lamps, sofas, and cribs.
Let's get started.
Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Step 2: Import and clean the data
df = pd.read_csv("wayfair_uniq.csv")
df_new=df.iloc[:,[3,4,5,7,9,11,13,15,17,18,19]]
df_new['index'] = df_new.index
df_new.shape
(17209, 12) This tells us that there are 17,209 products and 12 fields.
df_new.head()
I had to clean the data so that the recommendation system would be accurate.
First I will remove the "$" sign from the prices, convert them to floats, and drop the old fields.
df_new['original_price'] = df_new.original_price.str.replace('$', '')
df_new['original_price'] = df_new.original_price.str.replace(',', '').astype(float)
new_price = df_new["price"].str.split(";\s\$", n = 1, expand = True)
df_new["new_price"] = new_price[1]
df_new.drop(columns =["price"], inplace = True)
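To see what that split is doing, here is a tiny illustration on a made-up raw price string; the exact format in the scraped file may differ, so treat this as a sketch of the idea rather than the real data.
raw = pd.Series(["$299.99; $199.99"])              # hypothetical raw value, not from the actual scrape
print(raw.str.split(";\s\$", n = 1, expand = True))
# column 0 -> "$299.99", column 1 -> "199.99"; we keep column 1 as the current price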
There are a few rows in the price column that are in a "$200 - $900" format. I will split the string and take the lower value. First I search for "-" in the string and then apply the split.
price_range = df_new[df_new["new_price"].str.contains("-",regex=True)]
price_range['new_price'] = (price_range['new_price'].str.split("\s-", n = 1, expand = True))[0]
df_new.loc[price_range.index] = price_range
df_new['new_price'] = df_new.new_price.str.replace(',', '').astype(float)
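A quick sanity check I like to run at this point (my own addition), just to confirm the column really is numeric now:
print(df_new['new_price'].dtype)        # should be float64
print(df_new['new_price'].describe())   # basic price statistics, no "$" or ranges left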
Now that the price data is clean, I will create price buckets to categorize products based on their price.
bins = [0,500,1000,1500,2000,2500,3000,3500,4000,4500,35000]
labels = ["<$500","$500-$1000","$1000-$1500","$1500-$2000","$2000-$2500","$2500-$3000","$3000-$3500","$3500-$4000","$4000-$4500",">$4500"]
df_new['price_bracket'] = pd.cut(df_new['new_price'], bins, labels=labels)
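As another optional check (not strictly required), you can look at how the products spread across these buckets:
print(df_new['price_bracket'].value_counts().sort_index())   # number of products in each price bucket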
Now I will take the fields that are required for the recommendation system and remove all the NA values.
features = ["product_data1","product_data2","product_descrip","manufacturer"]
for feature in features:
    df_new[feature] = df_new[feature].fillna('')
Now I will combine all these fields into one large string and name it "combined_features".
def combine_features(row):
    try:
        # price_bracket is a categorical label (or NaN), so cast it to str before concatenating
        return row['product_data1'] + " " + row['product_data2'] + " " + row["product_descrip"] + " " + str(row["price_bracket"]) + " " + row["manufacturer"]
    except:
        print("Error:", row)
df_new["combined_features"] = df_new.apply(combine_features,axis=1)
I will remove all the stopwords from "combined_features". Stopwords are common words such as "I", "on", "in", "the", and "an". Removing them makes the recommendation system more accurate.
stop_words = stopwords.words('english')
df_new['combined_features'] = df_new['combined_features'].str.lower().str.split()
df_new["features"]=df_new["combined_features"].apply(lambda x: [word for word in x if word not in stop_words])
df_new["features"]=df_new["features"].apply(lambda x: " ".join(x))
Step 3: Recommendation Engine
I will now build the count matrix. The count matrix is simply the number of occurrences of each word in each product's feature string. This is done using the "CountVectorizer()" function.
cv = CountVectorizer()
count_matrix = cv.fit_transform(df_new["features"])
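If you have never used CountVectorizer before, a tiny toy example makes the idea concrete. The strings below are made up, and on older scikit-learn versions the vocabulary call is get_feature_names() instead of get_feature_names_out():
toy = ["modern fabric sofa", "modern wood crib", "fabric sofa sofa"]
toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy)
print(toy_cv.get_feature_names_out())  # vocabulary, e.g. ['crib' 'fabric' 'modern' 'sofa' 'wood']
print(toy_matrix.toarray())            # word counts per toy string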
I will now compute the cosine similarity between these count vectors to measure how similar the products are to each other. This is done using the "cosine_similarity()" function.
cosine_sim = cosine_similarity(count_matrix)
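For intuition, cosine similarity is just the dot product of two count vectors divided by the product of their lengths. The two vectors below are made up simply to show the arithmetic:
a = np.array([1, 1, 1, 0, 0])
b = np.array([0, 1, 2, 0, 0])
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.775
print(cosine_similarity([a, b])[0, 1])                     # same value from sklearn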
Now I have created three functions that return the product name and product URL from a product index, and vice versa.
def get_title_from_index(index):
    return df_new[df_new.index == index]["product_name"].values[0]

def get_home(index):
    return df_new[df_new.index == index]["product_name_link"].values[0]

def get_index_from_title(title):
    return df_new[df_new.product_name == title]["index"].values[0]
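A quick way to sanity-check these helpers is to round-trip a product through them (this assumes the product name below exists in the scraped data; it is the same one I query later):
idx = get_index_from_title('Rosalie Sofa')
print(get_title_from_index(idx), get_home(idx))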
Now I will take the name of a product the user likes as input.
product_user_likes = 'Rosalie Sofa'
Now I will look up the product's index, build a list of (index, similarity) pairs from the cosine similarity matrix, and sort it in descending order of similarity.
product_index = get_index_from_title(product_user_likes)
similar_products = list(enumerate(cosine_sim[product_index]))
sorted_similar_products = sorted(similar_products,key=lambda x:x[1],reverse=True)
Now I will print the product names and product URLs of the first 50 similar products.
i = 0
for element in sorted_similar_products:
    print(get_title_from_index(element[0]), get_home(element[0]))
    i = i + 1
    if i > 50:
        break
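One small thing to note: the first entry in sorted_similar_products is always the query product itself (its cosine similarity with itself is 1.0), so you may prefer to skip it. A small variant of the loop above:
for index, score in sorted_similar_products[1:51]:   # skip the query product itself
    print(get_title_from_index(index), get_home(index), round(score, 3))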
List of all recommended Products
Now if I check the product links...
Product that I searched for:
First product that was recommended:
The products are similar to each other. The recommendation system is working.
Next Step