Step 01_Introduction

Context

The film market attracts constant interest. Over time, competition among production companies and film-distributing agencies has become fiercer than ever. In this situation, improving the quality of the movies themselves is not enough to maintain current market position, and another area in which these companies may have to invest their resources is the digital market.

Criteria for success

The objective of this project is to recommend products based on existing customer experience.

Source of the data set

Source: Amazon review data (2018) http://deepyeti.ucsd.edu/jianmo/amazon/index.html

Major features and labels

  • asin: Amazon Standard Identification Number. This string uniquely identifies a product and can be treated as the product ID.
  • reviewerID: The unique identifier for each customer.

Import

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import clear_output

import os
from library.sb_utils import save_file

Configuration

In [3]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:.4f}'.format

Objective

The goal of this project is to create a hybrid recommendation system.

In this notebook, we wrangle the data and prepare it for EDA, which is the next step.


Load data

In [4]:
# os and pandas are already imported above; only the loaders are new here
import json
import gzip
In [5]:
# Load in the data
metaPath = '../data/meta_Movies_and_TV.json.gz'
reviewPath = '../data/Movies_and_TV.json.gz'

# Smaller alternatives from the same source:
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV_5.json.gz
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Movies_and_TV.csv


metaData = []
reviewData = []


# Each line of the gzip file holds one JSON record
with gzip.open(metaPath) as f:
    for line in f:
        metaData.append(json.loads(line.strip()))

with gzip.open(reviewPath) as f:
    for line in f:
        reviewData.append(json.loads(line.strip()))
In [6]:
# convert list into pandas dataframe

meta = pd.DataFrame.from_dict(metaData)
meta.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   category         203766 non-null  object
 1   tech1            203766 non-null  object
 2   description      203766 non-null  object
 3   fit              203766 non-null  object
 4   title            203766 non-null  object
 5   also_buy         203766 non-null  object
 6   tech2            203766 non-null  object
 7   brand            203766 non-null  object
 8   feature          203766 non-null  object
 9   rank             203766 non-null  object
 10  also_view        203766 non-null  object
 11  main_cat         203766 non-null  object
 12  similar_item     203766 non-null  object
 13  date             203766 non-null  object
 14  price            203766 non-null  object
 15  asin             203766 non-null  object
 16  imageURL         203766 non-null  object
 17  imageURLHighRes  203766 non-null  object
 18  details          195392 non-null  object
dtypes: object(19)
memory usage: 29.5+ MB
In [7]:
# convert list into pandas dataframe

review = pd.DataFrame.from_dict(reviewData)
review.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         float64
 1   verified        bool   
 2   reviewTime      object 
 3   reviewerID      object 
 4   asin            object 
 5   style           object 
 6   reviewerName    object 
 7   reviewText      object 
 8   summary         object 
 9   unixReviewTime  int64  
 10  vote            object 
 11  image           object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 744.0+ MB
In [8]:
meta.head(2)
Out[8]:
category tech1 description fit title also_buy tech2 brand feature rank also_view main_cat similar_item date price asin imageURL imageURLHighRes details
0 [Movies & TV, Movies] [] Understanding Seizures and Epilepsy [] [] 886,503 in Movies & TV ( [] Movies & TV 0000695009 [] [] NaN
1 [Movies & TV, Movies] [] Spirit Led&mdash;Moving By Grace In The Holy S... [] [] 342,688 in Movies & TV ( [] Movies & TV 0000791156 [https://images-na.ssl-images-amazon.com/image... [https://images-na.ssl-images-amazon.com/image... NaN
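The info() output reports nearly every meta column as fully non-null, yet the head() rows contain values such as [] and several cells that appear blank. A minimal sketch of one way to quantify this, assuming the blanks are empty strings or empty lists. The helper names is_effectively_empty and empty_fraction and the toy values are illustrative and not part of the notebook:

```python
import pandas as pd


def is_effectively_empty(value):
    """Return True for None/NaN, blank strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, float) and pd.isna(value):
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def empty_fraction(df):
    """Fraction of effectively empty cells in each column."""
    return df.apply(lambda col: col.map(is_effectively_empty).mean())


# Toy frame that mimics the shape of the meta data (values are made up).
toy_meta = pd.DataFrame({
    "asin": ["0000695009", "0000791156"],
    "description": [[], ["A short description"]],
    "brand": ["", ""],
    "details": [float("nan"), {"ASIN": "0000791156"}],
})

fractions = empty_fraction(toy_meta)
print(fractions)
```

Running the same function on the real meta frame would show which columns carry usable information before any of them are selected as features.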
In [9]:
review.head(2)
Out[9]:
   overall  verified  reviewTime   reviewerID      asin        style                         reviewerName  reviewText                                         summary                 unixReviewTime  vote  image
0  5.0000   True      03 11, 2013  A3478QRKQDOPQ2  0001527665  {'Format:': ' VHS Tape'}      jacki         really happy they got evangelised .. spoiler a...  great                   1362960000      NaN   NaN
1  5.0000   True      02 18, 2013  A2VHSG6TZHU1OB  0001527665  {'Format:': ' Amazon Video'}  Ken P         Having lived in West New Guinea (Papua) during...  Realistic and Accurate  1361145600      3     NaN
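The reviewTime column is stored as text ("03 11, 2013"), while unixReviewTime holds the same moment as seconds since the epoch. A small sketch, assuming the text layout is month, day, comma, four-digit year as in the two rows shown above, of how either column could be turned into a proper datetime and cross-checked:

```python
import pandas as pd

# Two rows copied from the review.head(2) output.
sample = pd.DataFrame({
    "reviewTime": ["03 11, 2013", "02 18, 2013"],
    "unixReviewTime": [1362960000, 1361145600],
})

# Assumed text format: "<month> <day>, <year>".
from_text = pd.to_datetime(sample["reviewTime"], format="%m %d, %Y")

# unixReviewTime is interpreted as seconds since 1970-01-01 UTC.
from_unix = pd.to_datetime(sample["unixReviewTime"], unit="s")

# Check whether both columns describe the same calendar dates.
dates_agree = bool((from_text == from_unix).all())
print(dates_agree)
```

If the two columns agree across the full data set, only one of them needs to be kept for time-based features.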

Extract only the necessary data

Of the available columns, those needed for a recommendation system are the ones that may help predict how a given user would rate an item.

  • meta data

    • category
    • description
    • title
    • brand
    • asin
    • and the remaining columns (the meta data is kept in full)
  • review data

    • overall
    • reviewTime
    • reviewerID
    • asin
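Because only four review columns are needed, the selection could also be applied while the file is being read, so that the unused fields of the 8.7 million records never accumulate in memory. The following is a sketch under that assumption, not the approach used in this notebook. The function names read_json_lines and load_reviews are hypothetical, and the demo writes a one-record temporary file in place of the real data:

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines(path, columns):
    """Yield one dict per non-blank line, keeping only the requested keys."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield {key: record.get(key) for key in columns}


def load_reviews(path, columns=("overall", "reviewTime", "reviewerID", "asin")):
    """Build a DataFrame from the filtered records."""
    rows = list(read_json_lines(path, columns))
    return pd.DataFrame(rows, columns=list(columns))


# Demo on a tiny temporary file (one record with an extra, unused field).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({
            "overall": 5.0,
            "reviewTime": "03 11, 2013",
            "reviewerID": "A3478QRKQDOPQ2",
            "asin": "0001527665",
            "summary": "great",
        }) + "\n")
    demo = load_reviews(demo_path)

print(demo.columns.tolist())
```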
In [10]:
# # make a small size dataframe of meta
# dfm = meta[[
#     'category',
#     'description',
#     'title',
#     'brand',
#     'asin',
# ]]


# For meta data, we use raw data
dfm = meta
In [11]:
# make a small size dataframe of review
dfr = review[[
    'overall',
    'reviewTime',
    'reviewerID',
    'asin',
]]
In [12]:
import gc

del meta
del review
gc.collect()
Out[12]:
63
In [13]:
# convert dataframe values to string
dfm = dfm.astype(str)
dfr = dfr.astype(str)
In [14]:
# save the data to a new csv file
datapath = '../data'
save_file(dfm, 'step1_dfm.csv', datapath)
save_file(dfr, 'step1_dfr.csv', datapath)
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step1_dfm.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data/step1_dfr.csv"
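The frames are cast to str before saving, but a CSV file does not store dtypes. When the files are read back, pandas infers types again by default, so an asin such as 0001527665 would lose its leading zeros. A small sketch of the issue and one way around it, using an in-memory buffer rather than save_file (whose internals are not shown here); the variable names are illustrative:

```python
import io

import pandas as pd

# Toy stand-in for dfr after astype(str); values taken from the head() output.
dfr_demo = pd.DataFrame({
    "overall": ["5.0", "5.0"],
    "reviewTime": ["03 11, 2013", "02 18, 2013"],
    "reviewerID": ["A3478QRKQDOPQ2", "A2VHSG6TZHU1OB"],
    "asin": ["0001527665", "0001527665"],
})

buffer = io.StringIO()
dfr_demo.to_csv(buffer, index=False)

# Default parsing infers an integer column and drops the leading zeros.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the ID columns as str keeps the values intact.
buffer.seek(0)
preserved = pd.read_csv(buffer, dtype={"asin": str, "reviewerID": str})

print(inferred["asin"].tolist())
print(preserved["asin"].tolist())
```

Passing the same dtype mapping when loading step1_dfm.csv and step1_dfr.csv in the next notebook would keep asin usable as a join key.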

Summary

  • The key processes and findings from this notebook are as follows.

    • Original rows in the data: 203766 rows in meta data, 8765568 rows in review data.
    • Rows deleted: 0 rows
  • The remaining columns

    • 19 columns for the meta data (kept in full), 4 columns for the review data.
  • The review data will be used for analysis, and the meta data is needed to look up the details of a product by its asin, which is the unique ID of a product.
  • The summary and vote columns have been dropped, but they are potentially good features.
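
The asin lookup mentioned in the summary can be sketched as a left join between the two frames. The frames and titles below are made-up placeholders, and validate="many_to_one" is an optional safety check, not something this notebook performs:

```python
import pandas as pd

# Placeholder frames; the titles are invented for illustration.
reviews_demo = pd.DataFrame({
    "reviewerID": ["A3478QRKQDOPQ2", "A2VHSG6TZHU1OB"],
    "asin": ["0001527665", "0001527665"],
    "overall": [5.0, 5.0],
})
meta_demo = pd.DataFrame({
    "asin": ["0001527665", "0000695009", "0001527665"],
    "title": ["Placeholder title A", "Placeholder title B", "Placeholder title A"],
})

# Duplicate asin rows in the meta data would multiply review rows in the join,
# so keep one row per product first.
meta_unique = meta_demo.drop_duplicates(subset="asin")

# Left join: every review is kept, product details are attached when available.
enriched = reviews_demo.merge(
    meta_unique[["asin", "title"]],
    on="asin",
    how="left",
    validate="many_to_one",
)

print(enriched)
```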