Academic data science project on the IMDB 5000 movie dataset featuring EDA, visualization, and ML (regression and classification) using the R language.

LT-Ripjaws/imdb-movie-data-science-project

🎬 IMDB Movie Analysis & Prediction

📖 Overview

My first academic data analysis and machine learning project, written in R. It uses the IMDB 5000 Movies dataset to predict movie ratings and to analyze the factors that influence box office success. It also covers data preprocessing, exploratory data analysis (EDA), and visualization.

This project performs:

  • 🧹 Data Cleaning & Preprocessing: Handling missing values, removing duplicates, and performing feature engineering.
  • 📊 Exploratory Data Analysis (EDA): Identifying patterns and insights in movie data through 15+ visualizations.
  • 🤖 Machine Learning: Predicting IMDb scores with Linear Regression and Random Forest models, and classifying movies as profitable or not profitable with Logistic Regression and Random Forest classifiers.
  • 📈 Model Evaluation: Assessing performance with metrics like RMSE, MAE, and R².

🎯 Main Objectives

  • Primary Task: Predict IMDb movie ratings (imdb_score) based on features such as budget, genre, director, duration, and social metrics.
  • Secondary Task: Classify movies as profitable or not profitable, predicting box office success before production.

🧠 Analysis Goals

  • Which factors most influence movie ratings?
  • Can we predict box office success accurately?
  • How have movie trends evolved over time?
  • Which directors consistently produce high-quality films?
  • What's the probability a movie will be profitable?

📁 Project Structure

imdb-movie-data-science-project/
├── README.md                    # Project documentation (you are here!)
├── Required_dependencies.R      # Setup script
├── main.R                       # Master script to run everything
├── .gitignore                   # Git ignore rules
├── EXAMPLE_USAGE.R              # Example usage: functions for prediction and analysis
├── LICENSE
│
├── movie-metadata/
│   └── movie-metadata.csv       # The movie dataset, downloaded if needed
│
├── data/
│   ├── raw/                     # Original dataset (auto-downloaded)
│   └── processed/               # Cleaned datasets with timestamps
│
├── R/
│   ├── 01_data_loading.R        # Load data with caching
│   ├── 02_data_cleaning.R       # Complete cleaning pipeline
│   ├── 03_eda.R                 # 15+ visualizations & insights
│   ├── 04_modeling.R            # ML Regression models (LR + RF)
│   └── 05_classification.R      # ML Classification
│
├── utils/
│   └── helper_functions.R       # 12 reusable utility functions
│
└── output/
    ├── figures/                 # All visualizations (auto-generated)
    └── models/                  # Saved ML models (.rds files)

🚀 Quick Start

# 1. Clone or download this repository
# 2. Open R or RStudio and set the working directory to the project root (the path is set in main.R)

# 3. Run setup (installs required packages and creates folders)
source("Required_dependencies.R")

# 4. Run the complete analysis pipeline
source("main.R")

# 5. Load the example functions and make predictions
source("EXAMPLE_USAGE.R")

# Regression: Predict IMDb score
predict_movie_score(100000000, 140, "Action")

# Classification: Predict profitability
predict_profitability(100000000, 140, "Action", 7.5)

# Investment decision
investment_decision(100000000, 140, "Action", 7.5)

📦 Requirements

R Packages

# Core data manipulation
- dplyr          # Data wrangling
- janitor        # Data cleaning

# Visualization
- ggplot2        # Plotting
- reshape2       # Data reshaping

# Machine Learning
- caret          # ML framework
- randomForest   # Random Forest algorithm
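
The package list above can be checked against the local library before running the pipeline. This is a minimal base-R sketch and makes no claim about how Required_dependencies.R actually works; `find_missing_packages()` and `install_missing_packages()` are hypothetical helper names.

```r
# Packages named in the Requirements list.
required_packages <- c("dplyr", "janitor", "ggplot2", "reshape2",
                       "caret", "randomForest")

# Hypothetical helper: return the required packages that are not installed.
find_missing_packages <- function(required,
                                  installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# Hypothetical helper: install whatever is missing (needs internet access).
install_missing_packages <- function(required = required_packages) {
  to_install <- find_missing_packages(required)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}
```

Calling `find_missing_packages(required_packages)` first shows what would be installed without changing anything.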

📊 Dataset

  • Source: IMDB 5000 Movies Dataset
  • Original Size: ~5,000 movies
  • After Cleaning: ~3,800 movies
  • Time Period: Various years (focus on recent decades)

Taken from Kaggle: IMDB 5000 Movie Dataset

🔑 Key Variables

| Variable | Description | Type |
|---|---|---|
| imdb_score | IMDb rating (1–10) | 🎯 Target |
| budget | Production budget (USD) | 🔢 Numeric |
| gross | Box office revenue (USD) | 🔢 Numeric |
| duration | Movie length (minutes) | ⏱️ Numeric |
| primary_genre | Main genre | 🏷️ Categorical |
| director_name | Director name | 🎬 Categorical |
| num_voted_users | Number of IMDb votes | 🔢 Numeric |

🧠 Engineered Features

| Feature | Description |
|---|---|
| profit | gross - budget |
| roi | profit / budget (Return on Investment) |
| budget_category | Categorized as Low / Medium / High |
| is_profitable | Binary success indicator (1 = profitable) |
| rating_category | Grouped as Excellent / Good / Average / Poor |
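
The engineered features listed here can be sketched in base R as follows. The budget cut points reuse the $10M–$50M "medium" range quoted under Financial Insights; the rating cut points (5, 6.5, 8) and the function name `add_engineered_features()` are assumptions for illustration, not values taken from 02_data_cleaning.R.

```r
# Hypothetical helper: add the derived columns to a movie data frame
# that has budget, gross, and imdb_score columns.
add_engineered_features <- function(movies) {
  movies$profit <- movies$gross - movies$budget
  movies$roi <- movies$profit / movies$budget
  movies$is_profitable <- as.integer(movies$profit > 0)

  # Low: up to $10M, Medium: above $10M up to $50M, High: above $50M.
  movies$budget_category <- cut(
    movies$budget,
    breaks = c(-Inf, 10e6, 50e6, Inf),
    labels = c("Low", "Medium", "High")
  )

  # Assumed rating bands; adjust to match the project's own definition.
  movies$rating_category <- cut(
    movies$imdb_score,
    breaks = c(-Inf, 5, 6.5, 8, Inf),
    labels = c("Poor", "Average", "Good", "Excellent")
  )
  movies
}
```

`cut()` treats each interval as closed on the right, so a budget of exactly $10M lands in "Low" under these assumed breaks.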

🔧 Data Processing Pipeline

Data Loading (01_data_loading.R)

✅ Download from GitHub
✅ Local caching (avoid re-downloading)
✅ Column name cleaning
✅ Initial quality assessment
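
A download-once loader along these lines might look like the sketch below. The function name `load_movie_data()` and its arguments are hypothetical, and the name cleaning is a base-R stand-in for `janitor::clean_names()` rather than the code in 01_data_loading.R.

```r
# Hypothetical loader: download once, then reuse the cached copy.
load_movie_data <- function(url, cache_path) {
  if (!file.exists(cache_path)) {
    dir.create(dirname(cache_path), recursive = TRUE, showWarnings = FALSE)
    download.file(url, cache_path, mode = "wb")
  }
  movies <- read.csv(cache_path, stringsAsFactors = FALSE)

  # Base-R stand-in for janitor::clean_names(): lower-case snake_case names.
  names(movies) <- tolower(gsub("[^A-Za-z0-9]+", "_", trimws(names(movies))))

  # Initial quality check: report size and missing values per column.
  message(nrow(movies), " rows, ", ncol(movies), " columns")
  print(colSums(is.na(movies)))
  movies
}
```

Because the download only happens when the cached file is absent, repeated runs of the pipeline read from disk.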


Data Cleaning (02_data_cleaning.R)

✅ Remove duplicate entries (~200 rows)
✅ Filter critical missing values (budget, gross, imdb_score)
✅ Clean text columns (trim whitespace)
✅ Feature engineering (profit, ROI, categories)
✅ Outlier detection and flagging
✅ Select 18 relevant columns
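
Several of these cleaning steps can be sketched in base R. The function name `clean_movies()` is hypothetical, and the 1.5 × IQR rule on budget is an assumed outlier criterion, since the README does not say which rule the project uses. Feature engineering and column selection are left out here.

```r
# Hypothetical helper: de-duplicate, drop incomplete rows, trim text,
# and flag budget outliers in a movie data frame.
clean_movies <- function(movies) {
  # Remove exact duplicate rows.
  movies <- movies[!duplicated(movies), , drop = FALSE]

  # Keep only rows where the critical columns are present.
  critical <- c("budget", "gross", "imdb_score")
  movies <- movies[complete.cases(movies[, critical]), , drop = FALSE]

  # Trim leading and trailing whitespace in every text column.
  is_text <- vapply(movies, is.character, logical(1))
  movies[is_text] <- lapply(movies[is_text], trimws)

  # Flag (do not drop) outliers using an assumed 1.5 * IQR rule.
  q <- quantile(movies$budget, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  movies$budget_outlier <- movies$budget < q[1] - fence |
    movies$budget > q[2] + fence
  movies
}
```

Flagging outliers in a separate column keeps the rows available, so later steps can decide whether to exclude them.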


Exploratory Data Analysis (03_eda.R)

📊 15+ Visualizations Generated:

  • Distribution plots (IMDb scores, budget, ROI)
  • Scatter plots (budget vs gross, score vs profit)
  • Box plots by genre and content rating
  • Time trends (scores, profit, ROI over years)
  • Correlation heatmap
  • Top 10 rankings and leaderboards
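
As one example from this list, a correlation heatmap could be produced roughly as follows. This is a sketch, not the code in 03_eda.R: both function names are hypothetical, and the matrix is reshaped with base R's `as.table()` rather than reshape2.

```r
# Long-format correlation table for the numeric columns of a data frame.
correlation_long <- function(movies) {
  numeric_cols <- movies[vapply(movies, is.numeric, logical(1))]
  cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
  long <- as.data.frame(as.table(cor_matrix), stringsAsFactors = FALSE)
  names(long) <- c("var_x", "var_y", "correlation")
  long
}

# Draw the heatmap with ggplot2 (the package must be installed).
plot_correlation_heatmap <- function(movies) {
  library(ggplot2)
  ggplot(correlation_long(movies), aes(var_x, var_y, fill = correlation)) +
    geom_tile() +
    scale_fill_gradient2(limits = c(-1, 1)) +
    labs(x = NULL, y = NULL, title = "Correlation heatmap")
}
```

The plot object returned by `plot_correlation_heatmap()` can be written to output/figures/ with `ggsave()`.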

Modeling (04_modeling.R + 05_classification.R)

⚙️ Functions (20+)

| Function | Description |
|---|---|
| prepare_modeling_data() | Feature selection |
| build_linear_model() | Linear Regression |
| build_random_forest() | Random Forest (500 trees) |
| build_logistic_model() | Logistic Regression |
| build_rf_classifier() | Random Forest Classifier |
| evaluate_model() | Regression metrics |
| evaluate_classifier() | Classification metrics |
| save_model_with_metadata() | Save model + metadata |
| run_modeling_pipeline() | Complete regression workflow |
| run_classification_pipeline() | Complete classification workflow |

🧩 Models Implemented

🎬 Linear Regression (IMDb Score)

  • Baseline regression model
  • Interpretable coefficients
  • Fast training
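
A linear baseline of this kind can be fitted with base R's `lm()`. In the sketch below, the helper name, the 80/20 split, and the choice of predictors (taken from the Key Variables list) are assumptions rather than the project's actual configuration.

```r
# Hypothetical helper: fit a linear baseline on a random 80/20 split.
fit_linear_baseline <- function(movies, seed = 42) {
  set.seed(seed)
  train_idx <- sample(nrow(movies), size = floor(0.8 * nrow(movies)))
  train <- movies[train_idx, ]
  test <- movies[-train_idx, ]

  # Each coefficient is the expected change in imdb_score per unit of
  # that predictor, holding the other predictors fixed.
  model <- lm(imdb_score ~ budget + duration + num_voted_users, data = train)

  list(model = model, test = test,
       predicted = predict(model, newdata = test))
}
```

`summary(fit$model)` prints the coefficients, which is where the interpretability of this baseline comes from.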

🌲 Random Forest Regression (IMDb Score)

  • 500 trees
  • Feature importance ranking
  • Lower error than the linear baseline (RMSE 0.65 vs. 0.77)

💰 Logistic Regression (Profitability)

  • Binary classification
  • Probability-based predictions
  • Ideal for investment decision support
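
Probability-based predictions of this sort come from `glm()` with a binomial family. The sketch below assumes a data frame with an `is_profitable` column coded 0/1; the helper names and the predictor set are hypothetical.

```r
# Hypothetical helper: logistic regression for the binary is_profitable label.
fit_profit_model <- function(movies) {
  glm(is_profitable ~ budget + duration + imdb_score,
      data = movies, family = binomial())
}

# Predicted probability of profit for new movies (values between 0 and 1).
predict_profit_probability <- function(model, new_movies) {
  predict(model, newdata = new_movies, type = "response")
}
```

Thresholding the returned probability, for example at 0.5, turns it into a profitable / not profitable label; the threshold itself is a choice the README does not specify.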

🧠 Random Forest Classifier (Profitability)

  • Ensemble classification approach
  • ROC curve and AUC analysis
  • Confusion matrix evaluation

📈 Evaluation Metrics

🔹 Regression

  • RMSE (Root Mean Square Error)
  • MAE (Mean Absolute Error)
  • R² (Coefficient of Determination)
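
These three metrics follow directly from the residuals. A minimal base-R sketch, using a hypothetical helper name rather than the project's `evaluate_model()`:

```r
# Regression metrics from vectors of actual and predicted values.
regression_metrics <- function(actual, predicted) {
  errors <- actual - predicted
  c(
    rmse = sqrt(mean(errors^2)),        # penalizes large errors more
    mae = mean(abs(errors)),            # average absolute error
    r_squared = 1 - sum(errors^2) / sum((actual - mean(actual))^2)
  )
}

regression_metrics(actual = c(6, 7, 8), predicted = c(6.5, 7, 7.5))
# rmse is about 0.408, mae about 0.333, r_squared is 0.75
```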

🔸 Classification

  • Accuracy, Precision, Recall
  • F1-Score, Specificity
  • AUC (Area Under ROC Curve)
  • Confusion Matrix
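
All of the threshold-based metrics above derive from the four confusion-matrix counts. The sketch below assumes labels coded 1 (profitable) and 0 (not profitable); the helper name is hypothetical, and AUC is omitted because it needs predicted probabilities rather than labels.

```r
# Classification metrics from 0/1 vectors of actual and predicted labels.
classification_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # true positives
  tn <- sum(actual == 0 & predicted == 0)  # true negatives
  fp <- sum(actual == 0 & predicted == 1)  # false positives
  fn <- sum(actual == 1 & predicted == 0)  # false negatives

  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(
    accuracy = (tp + tn) / length(actual),
    precision = precision,
    recall = recall,
    specificity = tn / (tn + fp),
    f1 = 2 * precision * recall / (precision + recall)
  )
}
```

`table(actual, predicted)` prints the same four counts as a 2 × 2 confusion matrix.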

🗂️ Model Outputs

  • Actual vs Predicted plots
  • Residual plots (regression)
  • Confusion matrices (classification)
  • ROC curves (classification)
  • Feature importance plots
  • Model comparison tables
  • Saved models with metadata (.rds)
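
Saving a model together with its metadata in a single .rds file can be done with `saveRDS()`. The sketch below is illustrative only: `save_model_bundle()` and the metadata fields are assumptions and may differ from the project's `save_model_with_metadata()`.

```r
# Hypothetical helper: bundle a fitted model with descriptive metadata.
save_model_bundle <- function(model, path, metrics = list()) {
  bundle <- list(
    model = model,
    metadata = list(
      saved_at = Sys.time(),
      r_version = R.version.string,
      model_class = class(model),
      metrics = metrics
    )
  )
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(bundle, path)
  invisible(path)
}
```

`readRDS(path)` restores the bundle, so `bundle$model` can be passed straight to `predict()` in a later session.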

📈 KEY INSIGHTS

📊 Model Performance Summary

🎬 Regression Models (IMDb Score Prediction)

| Model | RMSE | MAE | R² | Status |
|---|---|---|---|---|
| Random Forest | 0.65 | 0.49 | NA | ✅ Best |
| Linear Regression | 0.77 | 0.58 | NA | ✅ Baseline |

📖 Interpretation:
The Random Forest's typical prediction error (RMSE) is about 0.65 points on the 1–10 scale, with a mean absolute error of 0.49 points.

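💰 Classification Models (Profitability Prediction)

| Model | Accuracy | Precision | Recall | F1-Score | Status |
|---|---|---|---|---|---|
| Random Forest | 82% | 0.84 | 0.86 | 0.85 | ✅ Best |
| Logistic Regression | 75% | 0.78 | 0.82 | 0.80 | ✅ Baseline |

📖 Interpretation:
The Random Forest classifier predicts movie profitability correctly 82% of the time. For context, about 70% of movies in the dataset are profitable, so always predicting "profitable" would already be right roughly 70% of the time.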

💰 Financial Insights

  • Budget Sweet Spot: Medium-budget films ($10M–$50M) achieve 40% higher ROI than blockbusters
  • Profitability Rate: ~70% of movies are profitable overall
  • Quality Matters: Movies with IMDb > 7.0 have an 85% profitability rate
  • High Budget ≠ Profit: Correlation between budget and profit is -0.95
  • Genre: Comedies are the most numerous, but horror movies tend to be more profitable even though their IMDb scores are lower

🎬 Quality Insights

  • Genre Impact: Drama/Biography films rate 0.5 points higher than Horror
  • Director Effect: Top directors maintain a 7.0+ average vs. 6.5 overall
  • Score Trend: IMDb scores have declined slightly over the past decade
  • Duration: Optimal movie length is 90–150 minutes

🔮 Predictive Insights

  • Top Predictor (Regression): Budget (financial scale)
  • Top Predictor (Classification): Number of voted users (engagement)

🐛 Troubleshooting

Common Issues

  • ❌ Issue: "Cannot open file"
    ✅ Solution: Check the working directory with getwd(), and use setwd() to set the correct path

  • ❌ Issue: "Package not found"
    ✅ Solution: Run source("Required_dependencies.R") to install all dependencies

  • ❌ Issue: "Download failed"
    ✅ Solution: Check your internet connection or download the dataset manually from the URL in main.R

  • ❌ Issue: "Not enough memory"
    ✅ Solution: Reduce dataset size or use data.table for large datasets

📄 License

This project is open source and available under the MIT License.

👤 Author

Chinmoy Guha

📊 Project Status

🚧 Development completed: Version 1.0.0

⭐ If you found this project helpful, please consider giving it a star, and please excuse any mistakes! :)
