This project is a robust data scraper that collects comprehensive information about movies from The Movie Database (TMDB) API. The dataset spans from the earliest recorded films (1874) to recent releases (2024), providing a rich historical record of cinema's evolution, production details, popularity metrics, and financial data.
The collected data can be used for various analytical and research purposes:
- Film history and evolution analysis
- Box office performance and ROI studies
- Genre popularity trends over time
- Production company dominance analysis
- Country-specific film industry research
- Movie rating and reception analysis
- Certification and content rating studies
- Data science projects and machine learning models
- Visualization projects examining film industry trends
- Academic research on cinema economics and cultural impact
The dataset (MovieData-Raw.csv) contains extensive information about each movie, including:
- Movie Identifiers
  - TMDB ID
  - Title
  - Original Title
- Release Information
  - Release Date
- Financial Data
  - Budget
  - Revenue
- Popularity Metrics
  - Popularity Score
  - Runtime (minutes)
  - Vote Average (rating)
  - Vote Count (number of votes)
- Content Attributes
  - Adult Content Flag
  - Status (Released, Announced, etc.)
  - US Certification (PG, PG-13, R, etc.)
- Categorization
  - Genres (Action, Comedy, Drama, etc.)
- Production Details
  - Production Companies
  - Production Countries
The scraper is built in Python and uses the following libraries:
- tmdbsimple: Core API wrapper for The Movie Database
- pandas: Data handling and CSV file operations
- numpy: Numerical processing
- time: Managing API rate limits and retries
The implementation follows these key processes:
- API Authentication: Uses a TMDB API key to access the service
- Systematic Data Collection: Iterates through years (1873-2024) to discover movies
- Pagination Handling: Processes multiple pages of results per year
- Detailed Extraction: Retrieves comprehensive movie data beyond basic discovery data
- Data Storage: Incrementally saves data to CSV to prevent data loss
- Error Handling: Implements retry mechanisms for API failures
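The incremental storage step can be illustrated with a small helper. This is a minimal sketch using only the standard library, not the project's actual code: the function name `append_movie` and the `COLUMNS` constant are hypothetical, and the column order is taken from the column table in this README.

```python
import csv
import os

# Column names as listed in this README's column table (order assumed).
COLUMNS = [
    "id", "title", "original_title", "release_date", "budget", "revenue",
    "popularity", "runtime", "vote_average", "vote_count", "adult", "status",
    "certification_US", "genres", "production_companies", "production_countries",
]


def append_movie(path, movie, columns=COLUMNS):
    """Append one movie dict as a CSV row; write the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not in `columns`;
        # missing keys are written as empty cells (restval defaults to "").
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(movie)
```

Because each record is flushed to disk when the file is closed, an interrupted run loses at most the movie being processed at that moment.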
The data collection process follows this workflow:
- Set up the DataFrame structure with appropriate columns
- For each year from 1873 to 2024:
  - Query the TMDB Discover API to find movies released that year
  - Calculate the total number of pages to process
  - For each page of results:
    - Extract each movie's ID
    - Use the ID to make a detailed API request for complete movie information
    - Extract the US certification if available
    - Append the comprehensive movie data to the CSV file
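The workflow above can be sketched as a pair of functions. This is an illustration under stated assumptions rather than the code in scrapper.py: `collect_movies` and `extract_us_certification` are hypothetical names, and the API calls are passed in as plain callables so the loop can be read and tested without network access. With tmdbsimple, those callables would likely wrap `tmdb.Discover().movie(primary_release_year=year, page=page)`, `tmdb.Movies(movie_id).info()`, and `tmdb.Movies(movie_id).release_dates()`. The payload shape assumed for release dates is a `results` list of per-country entries, each with an `iso_3166_1` code and a `release_dates` list.

```python
def extract_us_certification(release_dates):
    """Return the first non-empty US certification, or None if unavailable."""
    for country in release_dates.get("results", []):
        if country.get("iso_3166_1") == "US":
            for release in country.get("release_dates", []):
                if release.get("certification"):
                    return release["certification"]
    return None


def collect_movies(years, discover, fetch_details, fetch_release_dates, save):
    """Walk every year and page, fetch details per movie ID, and save each record.

    discover(year, page)          -> dict with 'results' and 'total_pages'
    fetch_details(movie_id)       -> dict of detailed movie fields
    fetch_release_dates(movie_id) -> dict in the assumed release-dates shape
    save(movie)                   -> persists one record (e.g. appends to a CSV)
    """
    saved = 0
    for year in years:
        first_page = discover(year, 1)
        total_pages = first_page.get("total_pages", 1)
        for page in range(1, total_pages + 1):
            # Reuse the first response instead of requesting page 1 twice.
            data = first_page if page == 1 else discover(year, page)
            for item in data.get("results", []):
                movie = dict(fetch_details(item["id"]))
                movie["certification_US"] = extract_us_certification(
                    fetch_release_dates(item["id"])
                )
                save(movie)
                saved += 1
    return saved
```

Passing the fetchers in as arguments also makes it straightforward to wrap each one in a retry helper before starting a long run.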
The dataset is stored in CSV format with the following columns:
| Column | Description |
|---|---|
| id | TMDB unique identifier |
| title | Movie title (localized) |
| original_title | Original movie title |
| release_date | Official release date (YYYY-MM-DD) |
| budget | Production budget in USD |
| revenue | Box office revenue in USD |
| popularity | TMDB popularity score |
| runtime | Movie duration in minutes |
| vote_average | Average rating (0-10) |
| vote_count | Number of votes received |
| adult | Whether the movie is flagged as adult content |
| status | Release status (Released, In Production, etc.) |
| certification_US | US content rating (G, PG, PG-13, R, etc.) |
| genres | List of genre classifications |
| production_companies | Companies involved in production |
| production_countries | Countries where produced |
The raw dataset requires several cleaning operations before analysis:
- Handling Missing Values: Many films have incomplete data (e.g., budget, revenue)
- JSON Parsing: Genre, production companies, and countries are stored as JSON strings
- Date Normalization: Converting release_date to proper datetime format
- Numeric Validation: Ensuring budget/revenue values are properly formatted
- Deduplication: Removing any duplicate entries
- Text Normalization: Handling special characters in titles
- Currency Standardization: Ensuring all financial values are in the same currency
- Outlier Detection: Identifying and potentially addressing statistical outliers
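A few of these cleaning steps can be sketched with pandas. The block below is a starting point under assumptions, not a complete pipeline: `parse_names` and `clean_movies` are hypothetical helpers, list-valued cells are assumed to hold objects with a `name` key, and a budget or revenue of 0 is assumed to mean "unknown" rather than a true zero. Since the exact serialization of the list columns is not confirmed here, the parser tries JSON first and falls back to Python literal syntax.

```python
import ast
import json

import pandas as pd

LIST_COLUMNS = ["genres", "production_companies", "production_countries"]


def parse_names(cell):
    """Turn a serialized list of {'name': ...} objects into a list of names."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        try:
            # Fallback for Python-style literals such as "[{'name': 'Drama'}]".
            items = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return []
    if not isinstance(items, list):
        return []
    return [
        item["name"] if isinstance(item, dict) and "name" in item else str(item)
        for item in items
    ]


def clean_movies(df):
    """Deduplicate by id, normalize dates and money columns, parse list columns."""
    df = df.drop_duplicates(subset="id").copy()
    # Unparseable dates become NaT instead of raising.
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    for col in ["budget", "revenue"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        # Assumption: non-positive values mean "missing", so mask them as NaN.
        df[col] = df[col].where(df[col] > 0)
    for col in LIST_COLUMNS:
        df[col] = df[col].apply(parse_names)
    return df
```

A typical call would be `clean_movies(pd.read_csv("MovieData-Raw.csv"))`. Currency standardization and outlier handling are left out because they depend on choices specific to each analysis.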
- Python 3.6 or higher
- pip (Python package manager)
- Clone this repository

      git clone https://github.com/prashantkoirala465/TMDB-Movie-Data-Scrapper.git
      cd TMDB-Movie-Data-Scrapper

- Create and activate a virtual environment (optional but recommended)

      python -m venv myenv
      # On Windows
      myenv\Scripts\activate
      # On macOS/Linux
      source myenv/bin/activate

- Install required packages

      pip install -r requirements.txt

To run the scraper and collect movie data:

    python scrapper.py

This will start collecting data from the TMDB API and save it to MovieData-Raw.csv. The process may take several hours depending on your internet connection and API rate limits.
- The TMDB API has rate limits (40 requests per 10 seconds)
- The scraper includes a retry mechanism with a 30-second wait on failures
- If you want to use your own API key, replace it in the scrapper.py file
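A retry mechanism of the kind described in these notes might look like the following. This is a generic sketch, not the code in scrapper.py: the name `call_with_retry`, the default of three attempts, and the injectable `sleep` parameter (useful for testing without real waiting) are all assumptions. Only the 30-second wait comes from the notes above.

```python
import time


def call_with_retry(func, *args, retries=3, wait_seconds=30, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), waiting and retrying when it raises."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # broad on purpose: any API failure triggers a retry
            last_error = exc
            if attempt < retries:
                sleep(wait_seconds)
    # All attempts failed; surface the most recent error to the caller.
    raise last_error
```

For example, `call_with_retry(fetch_details, movie_id)` would retry a hypothetical `fetch_details` function up to three times before giving up.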
Here are some potential analyses you could perform with this dataset:
- Financial Performance Trends
  - ROI analysis across decades
  - Budget vs. Revenue correlation studies
  - Identifying financially successful genres
- Temporal Patterns
  - Film production volume by year/decade
  - Evolution of movie runtimes over time
  - Seasonal release patterns and success correlations
- Content Analysis
  - Genre distribution and evolution
  - Adult content prevalence over time
  - Certification rating distribution by year
- Geographic Insights
  - Production country shifts over film history
  - Regional genre preferences
  - International co-production trends
- Reception Analysis
  - Vote count vs. vote average relationships
  - Popularity metrics analysis
  - Budget impact on audience ratings
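As one concrete example, ROI across decades could be computed roughly as follows. This is a sketch under assumptions: `roi_by_decade` is a hypothetical helper, ROI is defined here as (revenue - budget) / budget, rows without a positive budget and revenue are excluded as missing, and the median is used because a handful of extreme films can dominate the mean.

```python
import pandas as pd


def roi_by_decade(df):
    """Median ROI per release decade, using rows with positive budget and revenue."""
    data = df.copy()
    data["release_date"] = pd.to_datetime(data["release_date"], errors="coerce")
    data["budget"] = pd.to_numeric(data["budget"], errors="coerce")
    data["revenue"] = pd.to_numeric(data["revenue"], errors="coerce")
    # Keep only rows where both money columns are usable and the date parsed.
    data = data[(data["budget"] > 0) & (data["revenue"] > 0)]
    data = data.dropna(subset=["release_date"]).copy()
    data["roi"] = (data["revenue"] - data["budget"]) / data["budget"]
    data["decade"] = (data["release_date"].dt.year // 10) * 10
    return data.groupby("decade")["roi"].median()
```

The result is a Series indexed by decade, which can be passed directly to a plotting call such as `roi_by_decade(df).plot(kind="bar")` when Matplotlib is installed.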
- TMDB API Documentation
- Pandas Documentation
- Data Visualization with Matplotlib
- Data Analysis with Seaborn
This project is for educational and research purposes. All data is sourced from TMDB API and is subject to their terms of use.
- Data provided by The Movie Database (TMDB)
- Thanks to TMDB for their comprehensive API
- This project does not store or redistribute TMDB's content in a manner that would violate their terms of service
Note: The data collected requires cleaning and processing before analysis. Consider implementing data cleaning pipelines to prepare the dataset for analytical use.