In this post I will be talking about different types of recommendation systems and how I developed a basic product recommendation engine for Wayfair.
First, let's discuss the different types of recommendation systems. There are three main types: collaborative filtering, content-based filtering, and hybrid systems that combine the two.
Since I don't have access to Wayfair's user data and the other datasets a collaborative approach would need, I will build a content-based recommendation system. I will use data that is already public, scraping Wayfair's website with a service called import.io.
I will only scrape these particular fields: 'product_name', 'product_name_link', 'manufacturer', 'price', 'original_price', 'review count', 'review data', 'shipping time', 'product descrip', 'product data'. And I will only consider these particular product categories: lamps, sofas, and cribs.
Let's get started.
Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Step 2: Import and clean the data
df = pd.read_csv("wayfair_uniq.csv")
df_new=df.iloc[:,[3,4,5,7,9,11,13,15,17,18,19]]
df_new['index'] = df_new.index
df_new.shape
(17209, 12) This tells us that there are 17,209 products and 12 fields.
df_new.head()
I had to clean the data so that the recommendation system would be accurate.
First I will remove the "$" sign from the prices, convert them to floats, and drop the old fields.
df_new['original_price'] = df_new.original_price.str.replace('$', '')
df_new['original_price'] = df_new.original_price.str.replace(',', '').astype(float)
new_price = df_new["price"].str.split(";\s\$", n = 1, expand = True)
df_new["new_price"] = new_price[1]
df_new.drop(columns =["price"], inplace = True)
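To see what that split is doing, here is a tiny illustration on a made-up raw price string; the exact format in the scraped file may differ, so treat this as a sketch of the idea rather than the real data.
raw = pd.Series(["$299.99; $199.99"])              # hypothetical raw value, not from the actual scrape
print(raw.str.split(";\s\$", n = 1, expand = True))
# column 0 -> "$299.99", column 1 -> "199.99"; we keep column 1 as the current price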
There are a few rows in the price column that are in a "$200 - $900" format. I will split the string and take the lower value. First I search for "-" in the string and then apply the split.
price_range = df_new[df_new["new_price"].str.contains("-",regex=True)]
price_range['new_price'] = (price_range['new_price'].str.split("\s-", n = 1, expand = True))[0]
df_new.loc[price_range.index] = price_range
df_new['new_price'] = df_new.new_price.str.replace(',', '').astype(float)
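A quick sanity check I like to run at this point (my own addition), just to confirm the column really is numeric now:
print(df_new['new_price'].dtype)        # should be float64
print(df_new['new_price'].describe())   # basic price statistics, no "$" or ranges left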
Now that the price data is clean, I will create price buckets to categorize products based on their price.
bins = [0,500,1000,1500,2000,2500,3000,3500,4000,4500,35000]
labels = ["<$500","$500-$1000","$1000-$1500","$1500-$2000","$2000-$2500","$2500-$3000","$3000-$3500","$3500-$4000","$4000-$4500",">$4500"]
df_new['price_bracket'] = pd.cut(df_new['new_price'], bins, labels=labels)
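As another optional check (not strictly required), you can look at how the products spread across these buckets:
print(df_new['price_bracket'].value_counts().sort_index())   # number of products in each price bucket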
Now I will take the fields that are required for the recommendation system and remove all the NA values.
features = ["product_data1","product_data2","product_descrip","manufacturer"]
for feature in features:
    df_new[feature] = df_new[feature].fillna('')
Now I will combine all these fields into one large string and name it "combined_features".
def combine_features(row):
    try:
        # price_bracket is a categorical label (or NaN), so cast it to str before concatenating
        return row['product_data1'] + " " + row['product_data2'] + " " + row["product_descrip"] + " " + str(row["price_bracket"]) + " " + row["manufacturer"]
    except:
        print("Error:", row)
df_new["combined_features"] = df_new.apply(combine_features,axis=1)
I will remove all the stopwords from "combined_features". Stopwords are common words such as "I", "on", "in", "the", and "an". Removing them makes the recommendation system more accurate.
stop_words = stopwords.words('english')
df_new['combined_features'] = df_new['combined_features'].str.lower().str.split()
df_new["features"]=df_new["combined_features"].apply(lambda x: [word for word in x if word not in stop_words])
df_new["features"]=df_new["features"].apply(lambda x: " ".join(x))
Step 3: Recommendation Engine
I will now build the count matrix. The count matrix is simply the number of occurrences of each word in each product's feature string. This is done using the "CountVectorizer()" function.
cv = CountVectorizer()
count_matrix = cv.fit_transform(df_new["features"])
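If you have never used CountVectorizer before, a tiny toy example makes the idea concrete. The strings below are made up, and on older scikit-learn versions the vocabulary call is get_feature_names() instead of get_feature_names_out():
toy = ["modern fabric sofa", "modern wood crib", "fabric sofa sofa"]
toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy)
print(toy_cv.get_feature_names_out())  # vocabulary, e.g. ['crib' 'fabric' 'modern' 'sofa' 'wood']
print(toy_matrix.toarray())            # word counts per toy string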
I will now compute the cosine similarity between these count vectors to measure how similar the products are to each other. This is done using the "cosine_similarity()" function.
cosine_sim = cosine_similarity(count_matrix)
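For intuition, cosine similarity is just the dot product of two count vectors divided by the product of their lengths. The two vectors below are made up simply to show the arithmetic:
a = np.array([1, 1, 1, 0, 0])
b = np.array([0, 1, 2, 0, 0])
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.775
print(cosine_similarity([a, b])[0, 1])                     # same value from sklearn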
Now I have created three functions that return the product name and product URL from a product index, and vice versa.
def get_title_from_index(index):
    return df_new[df_new.index == index]["product_name"].values[0]

def get_home(index):
    return df_new[df_new.index == index]["product_name_link"].values[0]

def get_index_from_title(title):
    return df_new[df_new.product_name == title]["index"].values[0]
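A quick way to sanity-check these helpers is to round-trip a product through them (this assumes the product name below exists in the scraped data; it is the same one I query later):
idx = get_index_from_title('Rosalie Sofa')
print(get_title_from_index(idx), get_home(idx))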
Now I will take the name of a product the user likes as input.
product_user_likes = 'Rosalie Sofa'
Now I will look up the product's index, build a list of (index, similarity) pairs from the cosine similarity matrix, and sort it in descending order of similarity.
product_index = get_index_from_title(product_user_likes)
similar_products = list(enumerate(cosine_sim[product_index]))
sorted_similar_products = sorted(similar_products,key=lambda x:x[1],reverse=True)
Now I will print the product names and product URLs of the first 50 similar products.
i = 0
for element in sorted_similar_products:
    print(get_title_from_index(element[0]), get_home(element[0]))
    i = i + 1
    if i > 50:
        break
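One small thing to note: the first entry in sorted_similar_products is always the query product itself (its cosine similarity with itself is 1.0), so you may prefer to skip it. A small variant of the loop above:
for index, score in sorted_similar_products[1:51]:   # skip the query product itself
    print(get_title_from_index(index), get_home(index), round(score, 3))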
List of all recommended Products
Now if I check the product links...
Product that I searched for:
First product that was recommended:
The products are similar to each other. The recommendation system is working.
Next Step