PROBLEM STATEMENT
Netflix provided a large amount of anonymized rating data and set a prediction accuracy bar: predictions must be 10% better than what Cinematch achieves on the same training data set. (Accuracy is a measure of how closely predicted ratings of movies match subsequent actual ratings.)
ALGORITHMS
XGBoost, Matrix Factorization (SVD, SVD++), Surprise Baseline, and KNNBaseline models.
OBJECTIVES
- Predict the rating that a user would give to a movie that he/she has not yet rated.
- Minimize the difference between predicted and actual ratings (RMSE and MAPE).
DATA FORMAT
- Each line in the file corresponds to a rating from a customer and its date, in the following format:
  CustomerID,Rating,Date
- MovieIDs range from 1 to 17770 sequentially. CustomerIDs range from 1 to 2649429, with gaps; there are 480189 users in total. Ratings are on a five-star (integral) scale from 1 to 5. Dates have the format YYYY-MM-DD.
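For orientation, a minimal sketch of loading such rating lines with pandas; the file name ratings.csv and the assumption that a movie id has already been attached to every line are illustrative, not part of the original data description:

    import pandas as pd

    # Hypothetical, pre-flattened file: one rating per line as
    # MovieID,CustomerID,Rating,Date (column names are illustrative).
    df = pd.read_csv("ratings.csv",
                     names=["movie", "user", "rating", "date"],
                     parse_dates=["date"])
    df["rating"] = df["rating"].astype("int8")      # integral 1..5 ratings
    print(df["user"].nunique(), "users,", df["movie"].nunique(), "movies")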
TYPE OF MACHINE LEARNING PROBLEM
For a given movie and user, we need to predict the rating that he/she would give to that movie. The given problem is a recommendation problem; it can also be cast as a regression problem.
PERFORMANCE METRIC
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Error (RMSE)
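As a reference for the numbers reported later, a minimal sketch of both metrics (numpy assumed; MAPE is reported as a percentage, which matches the magnitudes below):

    import numpy as np

    def rmse(y_true, y_pred):
        # Root Mean Squared Error between actual and predicted ratings.
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    def mape(y_true, y_pred):
        # Mean Absolute Percentage Error; safe here because ratings are 1..5.
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return np.mean(np.abs(y_true - y_pred) / y_true) * 100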
MACHINE LEARNING OBJECTIVE and CONSTRAINTS
- Minimize RMSE.
- Try to provide some interpretability.
EXPLORATORY DATA ANALYSIS
Preprocessing
- Splitting data into Train and Test (80:20)
- Exploratory Data Analysis on the Train data
  - Distribution of ratings
  - Number of ratings per month
  - Analysis of the ratings given by a user
  - Analysis of the ratings received by a movie
  - Number of ratings on each day of the week
- Creating a sparse matrix from the data frame (a code sketch covering this and the similarity steps follows this list)
- Finding the global average of all movie ratings, the average rating per user, and the average rating per movie
- Computing Similarity matrices
  - Computing the User-User Similarity matrix
    - Training the model:
      - Trying with all dimensions (17k dimensions per user)
      - Trying with reduced dimensions (using TruncatedSVD for dimensionality reduction of the user vectors)
  - Computing the Movie-Movie Similarity matrix
  - Finding the most similar movies using the similarity matrix
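A minimal sketch of the sparse-matrix and similarity steps above, using scipy/sklearn on a toy data frame (the real user vectors have ~17k dimensions, hence the TruncatedSVD step; component counts are illustrative):

    import pandas as pd
    from scipy.sparse import csr_matrix
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.decomposition import TruncatedSVD

    # Toy stand-in for the train split: one (user, movie, rating) row per rating.
    train_df = pd.DataFrame({"user":   [1, 1, 2, 3, 3],
                             "movie":  [10, 20, 10, 20, 30],
                             "rating": [4, 3, 5, 2, 4]})

    # User x Movie sparse matrix: row index = user id, column index = movie id.
    train_sparse = csr_matrix((train_df.rating.values,
                               (train_df.user.values, train_df.movie.values)))

    # Global average rating over all non-zero entries.
    g_avg = train_sparse.sum() / train_sparse.count_nonzero()

    # Movie-Movie similarity: cosine similarity between movie (column) vectors.
    movie_sim = cosine_similarity(train_sparse.T, dense_output=False)

    # User-User similarity after TruncatedSVD (e.g. 500 components on the full data).
    svd = TruncatedSVD(n_components=2, random_state=15)
    user_reduced = svd.fit_transform(train_sparse)
    user_sim = cosine_similarity(user_reduced)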
Machine Learning Steps
- Sampling Data
  - Build a sample train set from the train data
  - Build a sample test set from the test data
- Finding the global average of all movie ratings, the average rating per user, and the average rating per movie (from the sampled train data)
- Featurizing data
  - Featurizing data for the regression problem
Featurizing train and test data with the following features (a sketch of the neighbour-rating features follows this list):
- GAvg : average of all ratings in the train data
- Similar users' ratings of this movie: sur1, sur2, sur3, sur4, sur5 (ratings by the top 5 users most similar to this user who rated this movie)
- Similar movies rated by this user: smr1, smr2, smr3, smr4, smr5 (this user's ratings of the top 5 movies most similar to this movie)
- UAvg : this user's average rating
- MAvg : the average rating of this movie
- rating : the rating of this movie by this user (the prediction target)
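A hedged sketch of how the neighbour-rating features (sur1..sur5, smr1..smr5) can be pulled out of a similarity matrix; the helper name and the padding choice are illustrative, not the original implementation:

    import numpy as np

    def top_k_neighbor_ratings(sim_vec, rating_vec, k=5, pad_value=0.0):
        # Ratings of the k most similar neighbours that actually rated;
        # padded with pad_value (e.g. the movie/user average) when fewer exist.
        rated = np.flatnonzero(rating_vec)                  # neighbours with a rating
        order = rated[np.argsort(sim_vec[rated])[::-1]]     # sort them by similarity
        vals = list(rating_vec[order[:k]])
        return vals + [pad_value] * (k - len(vals))

    # sur1..sur5 for (user u, movie m): similar users' ratings of movie m, e.g.
    #   top_k_neighbor_ratings(user_sim[u], ratings_of_movie_m, pad_value=movie_avg)
    # smr1..smr5 for (user u, movie m): this user's ratings of movies similar to m, e.g.
    #   top_k_neighbor_ratings(movie_sim_row_m, ratings_by_user_u, pad_value=user_avg)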
This problem can be solved using the following machine learning techniques:
- Recommendation Model
- Regression Model
RECOMMENDATION MODEL
Transforming data for Surprise models:
Transforming train data:
- We can't give the raw data (movie, user, rating) directly to train a model in the Surprise library.
- Surprise has its own format for TRAIN and TEST data, which is used when training models such as SVD, BaselineOnly, KNNBaseline, etc.
- Form the trainset from a file, or from a Pandas DataFrame.
Transforming test data:
- The testset is just a list of (user, movie, rating) tuples. (The order within the tuple is important.)
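A minimal sketch of both transformations with scikit-surprise; reg_train and reg_test stand for the featurized train/test data frames and are illustrative names:

    from surprise import Dataset, Reader

    # TRAIN: Surprise builds its internal trainset from a (user, item, rating) frame.
    reader = Reader(rating_scale=(1, 5))
    train_data = Dataset.load_from_df(reg_train[["user", "movie", "rating"]], reader)
    trainset = train_data.build_full_trainset()

    # TEST: just a list of (user, movie, rating) tuples -- the order matters.
    testset = list(zip(reg_test.user.values,
                       reg_test.movie.values,
                       reg_test.rating.values))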
Applying Machine Learning models
- A global dictionary stores the RMSE and MAPE for all the models.
- It is a dictionary of dictionaries: keys are model names (strings), and each value is a dict mapping metric name to metric value (see the sketch below).
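For illustration, the structure could look like this (names are placeholders, not results):

    # Global store: model name -> {"rmse": ..., "mape": ...}
    models_evaluation_test = {}

    def store_result(model_name, rmse_val, mape_val):
        # One nested dict per model, keyed by metric name.
        models_evaluation_test[model_name] = {"rmse": rmse_val, "mape": mape_val}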
XGBoost with initial 13 features
TEST DATA
RMSE : 1.0890322448240302
MAPE : 35.13968692492444
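A hedged sketch of this model using xgboost's scikit-learn API; x_train/y_train stand for the 13 featurized columns and the rating column, the model name and hyperparameters are illustrative, and rmse/mape/store_result are the helpers sketched earlier:

    import xgboost as xgb

    # x_train: the 13 hand-crafted features, y_train: the actual ratings.
    first_xgb = xgb.XGBRegressor(n_estimators=100, random_state=15, n_jobs=-1)
    first_xgb.fit(x_train, y_train)

    y_pred = first_xgb.predict(x_test)
    store_result("first_algo", rmse(y_test, y_pred), mape(y_test, y_pred))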
Surprise Baseline model (BaselineOnly)
TEST DATA
RMSE : 1.0865215481719563
MAPE : 34.9957270093008
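A hedged sketch of Surprise's BaselineOnly model on the trainset/testset built above (the SGD options are illustrative):

    from surprise import BaselineOnly

    # Baseline estimate: r_hat(u, i) = mu + b_u + b_i, biases fit with SGD here.
    bsl_options = {"method": "sgd", "learning_rate": 0.001}
    bsl_algo = BaselineOnly(bsl_options=bsl_options)
    bsl_algo.fit(trainset)

    # .test returns Prediction objects; .est is the predicted rating.
    test_preds = [p.est for p in bsl_algo.test(testset)]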
XGBoost with initial 13 features + Surprise Baseline predictor
TEST DATA
RMSE : 1.0891181427027241
MAPE : 35.13135164276489
Surprise KNNBaseline with User-User similarities
TEST DATA
RMSE : 1.0865005562678032
MAPE : 35.02325234274119
Surprise KNNBaseline with Movie-Movie similarities
TEST DATA
RMSE : 1.0868914468761874
MAPE : 35.02725521759712
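A hedged sketch of both KNNBaseline variants; only the user_based flag differs, and the k/shrinkage values are illustrative:

    from surprise import KNNBaseline

    # User-User similarities (pearson_baseline is a shrunk Pearson correlation).
    knn_bsl_u = KNNBaseline(k=20,
                            sim_options={"name": "pearson_baseline",
                                         "user_based": True, "shrinkage": 100},
                            bsl_options={"method": "sgd"})
    knn_bsl_u.fit(trainset)

    # Movie-Movie similarities: the same model with user_based=False.
    knn_bsl_m = KNNBaseline(k=20,
                            sim_options={"name": "pearson_baseline",
                                         "user_based": False, "shrinkage": 100},
                            bsl_options={"method": "sgd"})
    knn_bsl_m.fit(trainset)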
XGBoost with initial 13 features + Surprise Baseline predictor + KNNBaseline predictor
- First we run XGBoost with the predictions from both KNN models (which use User-User and Item-Item similarities) along with our previous features.
- Then we run XGBoost with just the predictions from both KNN models and the predictions from our baseline model.
TEST DATA
RMSE : 1.088749005744821
MAPE : 35.188974153659295
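A hedged sketch of feeding the Surprise predictions into XGBoost as extra feature columns (column and variable names are illustrative):

    # Each Surprise model's estimate for (user, movie) becomes one more feature.
    train_tuples = list(zip(reg_train.user, reg_train.movie, reg_train.rating))
    for col, algo in [("bslpr", bsl_algo), ("knn_bsl_u", knn_bsl_u), ("knn_bsl_m", knn_bsl_m)]:
        x_train[col] = [p.est for p in algo.test(train_tuples)]

    xgb_knn_bsl = xgb.XGBRegressor(n_estimators=100, random_state=15, n_jobs=-1)
    xgb_knn_bsl.fit(x_train, y_train)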
REGRESSION MODEL
Matrix Factorization Techniques
SVD Matrix Factorization on User-Movie interactions
TEST DATA
RMSE : 1.0860031195730506
MAPE : 34.94819349312387
SVD Matrix Factorization with implicit feedback from the user (the movies the user has rated)
TEST DATA
RMSE : 1.0862780572420558
MAPE : 34.909882014758175
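A hedged sketch of both factorization models in Surprise (the implicit-feedback variant corresponds to Surprise's SVDpp; factor counts are illustrative):

    from surprise import SVD, SVDpp

    # Plain matrix factorization on the explicit user-movie ratings.
    svd = SVD(n_factors=100, biased=True, random_state=15)
    svd.fit(trainset)

    # SVD++ also uses implicit feedback: the set of movies each user rated.
    svdpp = SVDpp(n_factors=50, random_state=15)
    svdpp.fit(trainset)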
XGBoost with 13 features + Surprise Baseline + Surprise KNNBaseline + MF Techniques
TEST DATA
RMSE : 1.0891599523508655
MAPE : 35.12646240961147
XGBoost with Surprise Baseline + Surprise KNNBaseline + MF Techniques
TEST DATA
RMSE : 1.095123189648495
MAPE : 35.54329712868095