The film market attracts sustained interest. Competition among production companies and film distributors has grown fiercer over time. In this situation, improving the quality of the films themselves is not enough to hold a market position, and another area where these companies may need to invest resources is the digital market.
The objective of this project is to recommend products based on existing customer experience.
Source: Amazon review data (2018) http://deepyeti.ucsd.edu/jianmo/amazon/index.html
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from IPython.display import clear_output
import os
from library.sb_utils import save_file
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:.4f}'.format
The goal of this project is to create a hybrid recommendation system.
In this notebook, we wrangle the data and prepare it for EDA, which is the next step.
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
# Load in the data
metaPath = '../data/meta_Movies_and_TV.json.gz'
reviewPath = '../data/Movies_and_TV.json.gz'
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmal/Movies_and_TV_5.json.gz
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV.csv
metaData = []
reviewData = []
# each line of the gzipped file is one JSON record
with gzip.open(metaPath) as f:
    for line in f:
        metaData.append(json.loads(line.strip()))
with gzip.open(reviewPath) as f:
    for line in f:
        reviewData.append(json.loads(line.strip()))
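The loading loop above keeps every field of every record in memory. As a minimal sketch of an alternative, the helper below reads a gzipped JSON-lines file and optionally keeps only a chosen set of fields while reading. The name `load_json_lines` and its `keep` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import gzip
import json

import pandas as pd


def load_json_lines(path, keep=None):
    """Read a gzipped JSON-lines file into a DataFrame.

    keep: optional list of field names. When given, all other fields are
    dropped while reading, so they never accumulate in memory.
    (Hypothetical helper, not part of the original notebook.)
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if keep is not None:
                # fields missing from a record become None
                record = {k: record.get(k) for k in keep}
            rows.append(record)
    return pd.DataFrame(rows, columns=keep)
```

With the paths defined above, the review data could then be loaded as `load_json_lines(reviewPath, keep=['overall', 'reviewTime', 'reviewerID', 'asin'])`, which may lower peak memory use because unused fields such as the review text are discarded line by line.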
# convert list into pandas dataframe
meta = pd.DataFrame.from_dict(metaData)
meta.info()
# convert list into pandas dataframe
review = pd.DataFrame.from_dict(reviewData)
review.info()
meta.head(2)
review.head(2)
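Beyond looking at the first rows, a per-column summary can help judge which columns are usable. The sketch below reports the dtype, the share of missing values, and the number of distinct values for each column. `summarize_columns` is a hypothetical helper, and it compares values as text because some metadata cells may hold lists, which cannot be hashed.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, share of missing values, distinct count.

    Hypothetical helper for illustration. Values are compared as text, so a
    missing value counts as one distinct value in n_unique.
    """
    as_text = df.astype(str)  # list-valued cells are unhashable, so compare as text
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": as_text.nunique(),
    })
```

Calling `summarize_columns(review)` or `summarize_columns(meta)` returns a small table indexed by column name. On the full dataset the text conversion makes a temporary copy, so it may be worth running on a sample first.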
Of the available columns, the ones needed for a recommendation system are those that may carry predictive information about how a given user will rate an item.
meta data
review data
# # make a small size dataframe of meta
# dfm = meta[[
# 'category',
# 'description',
# 'title',
# 'brand',
# 'asin',
# ]]
# For meta data, we use raw data
dfm = meta
# make a small size dataframe of review
dfr = review[[
    'overall',
    'reviewTime',
    'reviewerID',
    'asin',
]]
import gc
del meta
del review
gc.collect()
# convert dataframe values to string
dfm = dfm.astype(str)
dfr = dfr.astype(str)
# save the data to a new csv file
datapath = '../data'
save_file(dfm, 'step1_dfm.csv', datapath)
save_file(dfr, 'step1_dfr.csv', datapath)
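Casting to strings before saving does not guarantee that the values come back as strings. Assuming the saved files are plain CSV, `pd.read_csv` infers dtypes on load, so an identifier made up only of digits would be parsed as an integer and lose any leading zeros. The sketch below shows the effect on a toy frame; the ASIN values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame; the ASIN values are invented for this example.
toy = pd.DataFrame({
    "asin": ["0001527665", "0005019281"],
    "overall": ["5.0", "4.0"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path, index=False)

    inferred = pd.read_csv(path)            # pandas guesses the dtypes
    as_text = pd.read_csv(path, dtype=str)  # every column stays a string

print(inferred["asin"].tolist())  # integers, leading zeros are gone
print(as_text["asin"].tolist())   # original strings are preserved
```

When reading `step1_dfm.csv` and `step1_dfr.csv` in the next notebook, passing `dtype=str` (or a per-column dtype mapping) is one way to keep `asin` and `reviewerID` intact.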
The key processes and findings from this notebook are as follows.
The remaining columns