Step 01_Introduction¶

Context¶

The film market attracts sustained interest. Over time, competition among production companies and film distributors has grown fiercer than ever. In this situation, improving the quality of the films themselves is not enough to maintain a company's current market position; another area where these companies may need to invest resources is the digital market.

Criteria for success¶

The objective of this project is to recommend products based on existing customers' experience.

Source of a data set¶

Source: Amazon review data (2018) http://deepyeti.ucsd.edu/jianmo/amazon/index.html

Major features and labels¶

asin: Amazon Standard Identification Number. This string uniquely identifies a product and can be treated as the product ID.
reviewerID: The unique identifier for each customer.

Import¶

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import clear_output

import os
from library.sb_utils import save_file

Configuration¶

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:.4f}'.format

Objective¶

The goal of this project is to create a hybrid recommendation system.

In this notebook, we wrangle the data and prepare it for EDA, which is the next step.


Load data¶

import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

# Load in the data
metaPath = '../data/meta_Movies_and_TV.json.gz'
reviewPath = '../data/Movies_and_TV.json.gz'

# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV.csv


metaData = []
reviewData = []


# Each line of the gzipped files holds one JSON record
with gzip.open(metaPath) as f:
    for line in f:
        metaData.append(json.loads(line.strip()))

with gzip.open(reviewPath) as f:
    for line in f:
        reviewData.append(json.loads(line.strip()))

# convert the list of meta records into a pandas DataFrame

meta = pd.DataFrame(metaData)
meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   category         203766 non-null  object
 1   tech1            203766 non-null  object
 2   description      203766 non-null  object
 3   fit              203766 non-null  object
 4   title            203766 non-null  object
 5   also_buy         203766 non-null  object
 6   tech2            203766 non-null  object
 7   brand            203766 non-null  object
 8   feature          203766 non-null  object
 9   rank             203766 non-null  object
 10  also_view        203766 non-null  object
 11  main_cat         203766 non-null  object
 12  similar_item     203766 non-null  object
 13  date             203766 non-null  object
 14  price            203766 non-null  object
 15  asin             203766 non-null  object
 16  imageURL         203766 non-null  object
 17  imageURLHighRes  203766 non-null  object
 18  details          195392 non-null  object
dtypes: object(19)
memory usage: 29.5+ MB

# convert the list of review records into a pandas DataFrame

review = pd.DataFrame(reviewData)
review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         float64
 1   verified        bool   
 2   reviewTime      object 
 3   reviewerID      object 
 4   asin            object 
 5   style           object 
 6   reviewerName    object 
 7   reviewText      object 
 8   summary         object 
 9   unixReviewTime  int64  
 10  vote            object 
 11  image           object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 744.0+ MB

meta.head(2)

review.head(2)
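The loading step above parses every field of all 8,765,568 review records before any column is dropped. As a minimal sketch of a lighter alternative, assuming only a few keys are needed downstream, the records can be trimmed while they are parsed. The helper name `load_json_lines` and its `keep` parameter are hypothetical and not part of the original notebook.

```python
import gzip
import json

import pandas as pd


def load_json_lines(path, keep=None):
    """Read a gzipped JSON-lines file into a DataFrame.

    `keep` is an optional list of keys (a hypothetical parameter). When given,
    only those keys are retained from each record, so the intermediate list
    stays much smaller than the full parsed data.
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if keep is not None:
                # Missing keys become None rather than raising KeyError
                record = {key: record.get(key) for key in keep}
            rows.append(record)
    return pd.DataFrame(rows, columns=keep)
```

A call such as `load_json_lines(reviewPath, keep=['overall', 'reviewTime', 'reviewerID', 'asin'])` would then build the reduced review frame directly. Whether this is worthwhile depends on available memory; the full load shown above is still needed if the raw columns are inspected first.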

Extract only the necessary data¶

Of the available columns, the ones needed for a recommendation system are those that may carry predictive information about how a given user will rate an item.

meta data
- category
- description
- title
- brand
- asin
- all other columns (the raw meta data is kept as-is)
review data
- overall
- reviewTime
- reviewerID
- asin
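Selecting the listed columns can be wrapped in a small check so that a misspelled or missing column name fails early with a clear message. This is only a sketch: `REVIEW_COLUMNS` and `select_columns` are hypothetical names introduced for illustration, and the toy frame stands in for the real review data.

```python
import pandas as pd

# Column names taken from the review list above
REVIEW_COLUMNS = ["overall", "reviewTime", "reviewerID", "asin"]


def select_columns(df, columns):
    """Return a copy of `df` restricted to `columns`, failing early if any is absent."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    # .copy() detaches the result from the original frame
    return df[columns].copy()


# Toy stand-in for the review frame, with one extra column that gets dropped
toy_review = pd.DataFrame({
    "overall": [5.0],
    "reviewTime": ["03 11, 2013"],
    "reviewerID": ["A3478QRKQDOPQ2"],
    "asin": ["0001527665"],
    "summary": ["great"],
})
toy_dfr = select_columns(toy_review, REVIEW_COLUMNS)
```

Returning a copy means the large source frame can be deleted afterwards without the reduced frame holding a reference to it.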

# # make a small size dataframe of meta
# dfm = meta[[
#     'category',
#     'description',
#     'title',
#     'brand',
#     'asin',
# ]]


# For meta data, we use raw data
dfm = meta

# make a small size dataframe of review
dfr = review[[
    'overall',
    'reviewTime',
    'reviewerID',
    'asin',
]]

import gc

del meta
del review
gc.collect()

63

# convert dataframe values to string
dfm = dfm.astype(str)
dfr = dfr.astype(str)

# save the data to a new csv file
datapath = '../data'
save_file(dfm, 'step1_dfm.csv', datapath)
save_file(dfr, 'step1_dfr.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step1_dfm.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step1_dfr.csv"
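Many asin values, such as 0001527665, begin with zeros. Assuming `save_file` writes an ordinary CSV (its implementation is not shown in this notebook), reading the file back with default settings would infer asin as an integer and drop those zeros. The sketch below illustrates the effect on a toy frame (`dfr_demo`, a hypothetical name) using an in-memory buffer instead of the real files.

```python
import io

import pandas as pd

# Toy stand-in for dfr; the rows mirror the review sample in this notebook
dfr_demo = pd.DataFrame({
    "overall": ["5.0", "5.0"],
    "reviewTime": ["03 11, 2013", "02 18, 2013"],
    "reviewerID": ["A3478QRKQDOPQ2", "A2VHSG6TZHU1OB"],
    "asin": ["0001527665", "0001527665"],
})

# Write to an in-memory buffer so the sketch has no side effects on ../data
buffer = io.StringIO()
dfr_demo.to_csv(buffer, index=False)

# Default parsing infers integers, so the leading zeros are lost
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the ID columns as strings preserves them exactly
buffer.seek(0)
safe = pd.read_csv(buffer, dtype={"asin": str, "reviewerID": str})

print(naive["asin"].iloc[0])  # 1527665
print(safe["asin"].iloc[0])   # 0001527665
```

Passing the same `dtype` mapping when the next notebook loads `step1_dfr.csv` and `step1_dfm.csv` should keep asin usable as a join key.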

Summary¶

The key processes and findings from this notebook are as follows.
- Original rows in the data: 203766 rows in meta data, 8765568 rows in review data.
- Rows deleted: 0 rows
- Remaining columns: 19 columns for meta data (kept raw), 4 columns for review data.
The review data will be used for analysis, and the meta data is needed to look up the details of a product by asin, which is the unique ID of a product.
The summary and vote columns have been dropped, but they are potentially good features.
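The asin lookup described in the summary can be sketched as a left join from reviews to product details. The frames below are toy data with made-up IDs and titles, not values from the dataset, and the names `reviews` and `products` are hypothetical stand-ins for the review and meta frames.

```python
import pandas as pd

# Toy reviews: the last one refers to a product absent from the metadata
reviews = pd.DataFrame({
    "reviewerID": ["U1", "U2", "U3"],
    "asin": ["P001", "P002", "P999"],
    "overall": [5.0, 3.0, 4.0],
})

# Toy metadata: one row per product
products = pd.DataFrame({
    "asin": ["P001", "P002"],
    "title": ["Toy Movie A", "Toy Movie B"],
})

# A left join keeps every review. validate="many_to_one" raises an error if
# asin is duplicated in the metadata, and indicator=True adds a "_merge"
# column that flags reviews without a matching product.
merged = reviews.merge(
    products,
    on="asin",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
```

If the real meta data contains repeated asin values, the `validate` check would fail, and something like `drop_duplicates('asin')` on the meta frame might be needed first. That is an assumption to verify during EDA rather than a finding of this notebook.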

	overall	verified	reviewTime	reviewerID	asin	style	reviewerName	reviewText	summary	unixReviewTime	vote	image
0	5.0000	True	03 11, 2013	A3478QRKQDOPQ2	0001527665	{'Format:': ' VHS Tape'}	jacki	really happy they got evangelised .. spoiler a...	great	1362960000	NaN	NaN
1	5.0000	True	02 18, 2013	A2VHSG6TZHU1OB	0001527665	{'Format:': ' Amazon Video'}	Ken P	Having lived in West New Guinea (Papua) during...	Realistic and Accurate	1361145600	3	NaN

	category	tech1	description	fit	title	also_buy	tech2	brand	feature	rank	also_view	main_cat	similar_item	date	price	asin	imageURL	imageURLHighRes	details
0	[Movies & TV, Movies]		[]		Understanding Seizures and Epilepsy	[]			[]	886,503 in Movies & TV (	[]	Movies & TV				0000695009	[]	[]	NaN
1	[Movies & TV, Movies]		[]		Spirit Led—Moving By Grace In The Holy S...	[]			[]	342,688 in Movies & TV (	[]	Movies & TV				0000791156	[https://images-na.ssl-images-amazon.com/image...	[https://images-na.ssl-images-amazon.com/image...	NaN