This repository contains code and solutions for the MLOps Zoomcamp course homeworks, covering the full ML lifecycle: data processing, model training, hyperparameter optimization, experiment tracking, and model registration using MLflow.
mlops_zoomcamps/
├── artifacts/ # MLflow artifacts (models, plots, etc.)
├── data/
│ ├── homework_1/ # Raw data for Homework 1
│ ├── homework_2/ # Raw data for Homework 2
│ └── homework_3/ # Raw data for Homework 3
├── mlruns/ # MLflow tracking directory
├── output/ # Processed data and outputs
├── homework_1.ipynb # Homework 1 notebook
├── homework_2.ipynb # Homework 2 notebook
├── preprocess_data.py # Data preprocessing script
├── train.py # Model training script
├── hpo.py # Hyperparameter optimization script
├── register_model.py # Model registration script
├── mlflow.db # SQLite DB for MLflow tracking
├── requirements.txt # Project dependencies
├── README.md # This file
├── homework_4/ # Homework 4: Dockerized prediction
│ ├── predict.py # Prediction script (CLI)
│ ├── Dockerfile # Dockerfile for building the container
│ └── model.bin # Pre-trained model and vectorizer (provided in base image)
- Clone the repository:
  git clone <your-repo-url>
  cd mlops_zoomcamps
- Install dependencies:
  pip install -r requirements.txt
- Goal: Predict taxi trip durations using linear regression.
- Steps:
- Load and clean the NYC Yellow Taxi trip data.
  - Perform feature engineering and remove outliers.
- Train a linear regression model.
- Evaluate model performance using RMSE.
- How to run:
  - Open and run `homework_1.ipynb` in Jupyter Notebook.
  - Follow the notebook cells for data processing, feature engineering, model training, and evaluation (a rough sketch of the core flow is shown below).
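The core of the notebook boils down to something like the following sketch (illustrative only; the file name, features, and outlier thresholds are assumptions and may differ from the actual notebook):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load raw trips and compute trip duration in minutes (hypothetical file name)
df = pd.read_parquet("data/homework_1/yellow_tripdata_2023-01.parquet")
df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60

# drop outliers: keep trips between 1 and 60 minutes
df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

# one-hot encode pickup/dropoff location IDs
categorical = ["PULocationID", "DOLocationID"]
df[categorical] = df[categorical].astype(str)
dv = DictVectorizer()
X_train = dv.fit_transform(df[categorical].to_dict(orient="records"))
y_train = df["duration"].values

# fit linear regression and report RMSE
lr = LinearRegression()
lr.fit(X_train, y_train)
rmse = mean_squared_error(y_train, lr.predict(X_train)) ** 0.5
print(f"Train RMSE: {rmse:.2f}")
```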
- Goal: Manage the full ML lifecycle for a RandomForestRegressor using MLflow, including hyperparameter optimization and model registration.
- Preprocess the data:
  python preprocess_data.py --raw_data_path ./data/homework_2 --dest_path ./output
- Start the MLflow tracking server:
  mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./artifacts \
    --host 127.0.0.1 \
    --port 5001
- Train the model:
  python train.py --data_path ./output
- Run hyperparameter optimization:
  python hpo.py
  - This script uses Hyperopt to search for the best RandomForestRegressor hyperparameters.
  - Each run logs the hyperparameters and validation RMSE to MLflow.
- Register the best model:
  python register_model.py --data_path ./output --top_n 5
  - This script selects the top 5 models from the hyperopt runs, evaluates them on the test set, and registers the best one in the MLflow Model Registry (see the sketch below).
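For reference, the registration step typically looks something like this sketch (illustrative only; the experiment name, metric key, and registered model name are assumptions and may differ from what register_model.py actually uses):

```python
import mlflow
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient

TRACKING_URI = "http://127.0.0.1:5001"       # the server started above
HPO_EXPERIMENT = "random-forest-hyperopt"    # assumed experiment name
TOP_N = 5

mlflow.set_tracking_uri(TRACKING_URI)
client = MlflowClient(tracking_uri=TRACKING_URI)

# pick the TOP_N hyperopt runs with the lowest validation RMSE
experiment = client.get_experiment_by_name(HPO_EXPERIMENT)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=TOP_N,
    order_by=["metrics.rmse ASC"],
)

# after re-evaluating the candidates on the test set, register the best one
best_run = runs[0]
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="random-forest-best-model",
)
```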
- Open the MLflow UI in your browser: http://localhost:5001
  - View experiments, runs, and the model registry.
- Goal: Containerize the ML application and deploy MLflow tracking server using Docker.
mlops/
├── mlflow.dockerfile # Dockerfile for MLflow server
├── docker-compose.yml # Docker Compose configuration
├── mlflow_data/ # MLflow data storage
└── scripts/ # Database initialization scripts
# Create necessary directories
mkdir -p mlflow_data
chmod 777 mlflow_data
# Start all services
docker-compose up --build
- MLflow UI: http://localhost:5001
- Mage UI: http://localhost:6789
- Containerized MLflow tracking server
- Persistent storage for MLflow data
- Integration with Mage platform
- PostgreSQL database with pgvector support
- Automatic service restart on failure
- Goal: Package the prediction script in a Docker container using a pre-built base image with the model and vectorizer.
homework_4/
├── predict.py # CLI script for making predictions
├── Dockerfile # Dockerfile for building the container
Make sure you are in the homework_4 directory:
cd homework_4
docker build -t taxi-prediction .
To predict the mean duration for a specific year and month (e.g., May 2023):
docker run -it taxi-prediction --year 2023 --month 5
- The script will output the mean predicted duration for the specified month.
- The model and vectorizer are already included in the base image (`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`).
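A minimal sketch of what `predict.py` might look like (assuming `model.bin` holds a pickled `(DictVectorizer, model)` pair and public NYC TLC Yellow Taxi parquet files; exact column handling may differ in the real script):

```python
import argparse
import pickle

import pandas as pd


def read_data(filename, categorical):
    df = pd.read_parquet(filename)
    df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    df[categorical] = df[categorical].fillna(-1).astype("int").astype("str")
    return df


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    args = parser.parse_args()

    # model and vectorizer come from the base image
    with open("model.bin", "rb") as f_in:
        dv, model = pickle.load(f_in)

    categorical = ["PULocationID", "DOLocationID"]
    url = (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"yellow_tripdata_{args.year:04d}-{args.month:02d}.parquet"
    )
    df = read_data(url, categorical)

    X = dv.transform(df[categorical].to_dict(orient="records"))
    y_pred = model.predict(X)
    print(f"Mean predicted duration: {y_pred.mean():.2f}")


if __name__ == "__main__":
    main()
```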
- Goal: Monitor data quality and model metrics for NYC Taxi trip duration prediction using Evidently, PostgreSQL, and Grafana.
- Steps:
- Download and prepare the March 2024 Green Taxi dataset.
- Expand data quality metrics using Evidently (e.g., add quantile and custom metrics).
- Store metrics in a PostgreSQL database for batch monitoring.
- Visualize metrics and data quality in Grafana dashboards.
- Save and manage dashboard configurations for reproducibility.
- How to run:
  - Open and run `homework_5.ipynb` in Jupyter Notebook.
  - Follow the notebook cells to process data, compute metrics, and interact with dashboards.
- Main tools:
- Evidently (for data quality and drift metrics)
- PostgreSQL (for metrics storage)
- Grafana (for dashboard visualization)
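The metric-expansion and storage steps come down to something like the sketch below (illustrative only; it assumes an Evidently 0.4.x-style API, a `fare_amount` quantile metric, and a PostgreSQL table named `dummy_metrics` with these columns, all of which may differ in the notebook):

```python
import pandas as pd
import psycopg2
from evidently.report import Report
from evidently.metrics import ColumnQuantileMetric, DatasetMissingValuesMetric

reference = pd.read_parquet("data/reference.parquet")  # hypothetical reference slice
current = pd.read_parquet("data/green_tripdata_2024-03.parquet")

# build a report with an extra quantile metric on fare_amount
report = Report(metrics=[
    ColumnQuantileMetric(column_name="fare_amount", quantile=0.5),
    DatasetMissingValuesMetric(),
])
report.run(reference_data=reference, current_data=current)
result = report.as_dict()

# pull computed values out of the report (result layout may vary by Evidently version)
fare_median = result["metrics"][0]["result"]["current"]["value"]
share_missing = result["metrics"][1]["result"]["current"]["share_of_missing_values"]

# store one row of metrics for this batch in PostgreSQL
conn = psycopg2.connect(host="localhost", port=5432, dbname="test",
                        user="postgres", password="example")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO dummy_metrics (ts, fare_median, share_missing) VALUES (now(), %s, %s)",
        (fare_median, share_missing),
    )
conn.close()
```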
Goal: Refactor the batch inference code, add unit and integration tests, and mock S3 using Localstack for robust ML batch pipelines.
- Move all logic into a `main(year, month)` function.
- Extract a `prepare_data(df, categorical)` function for data preprocessing.
- Remove all global variables; pass parameters explicitly.
- Add `get_input_path` and `get_output_path` functions to support environment-based input/output paths for both testing and production (one possible shape is sketched below).
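A minimal sketch of the refactored `batch.py`, assuming Yellow Taxi data, the public TLC download URL, and the file patterns used in the Localstack steps further down (names and defaults are illustrative):

```python
import os
import sys

import pandas as pd


def get_input_path(year, month):
    default_pattern = (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        "yellow_tripdata_{year:04d}-{month:02d}.parquet"
    )
    return os.getenv("INPUT_FILE_PATTERN", default_pattern).format(year=year, month=month)


def get_output_path(year, month):
    default_pattern = "s3://nyc-duration/out/{year:04d}-{month:02d}.parquet"
    return os.getenv("OUTPUT_FILE_PATTERN", default_pattern).format(year=year, month=month)


def prepare_data(df, categorical):
    # compute duration in minutes, drop outliers, normalise categorical IDs
    df = df.copy()
    df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    df[categorical] = df[categorical].fillna(-1).astype("int").astype("str")
    return df


def main(year, month):
    categorical = ["PULocationID", "DOLocationID"]
    df = pd.read_parquet(get_input_path(year, month))
    df = prepare_data(df, categorical)
    # ... load the model, predict, and write results to get_output_path(year, month)


if __name__ == "__main__":
    main(int(sys.argv[1]), int(sys.argv[2]))
```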
- Create a `homework_6/tests/` directory with a `test_batch.py` file.
- Write a test for `prepare_data` using a sample dataframe, checking that only valid rows are kept (see the example below).
- Run the test:
  cd homework_6
  pytest tests/
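A sketch of such a test, assuming the column names above, that `prepare_data` keeps only trips between 1 and 60 minutes, and that `batch.py` is importable from the tests (e.g. via an empty `conftest.py` next to it):

```python
# homework_6/tests/test_batch.py
from datetime import datetime

import pandas as pd

from batch import prepare_data


def dt(hour, minute, second=0):
    return datetime(2023, 1, 1, hour, minute, second)


def test_prepare_data():
    columns = ["PULocationID", "DOLocationID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]
    data = [
        (None, None, dt(1, 1), dt(1, 10)),     # 9 min   -> kept, IDs filled with -1
        (1, 1, dt(1, 2), dt(1, 10)),           # 8 min   -> kept
        (1, None, dt(1, 2, 0), dt(1, 2, 59)),  # < 1 min -> dropped
        (3, 4, dt(1, 2, 0), dt(2, 2, 1)),      # > 60 min -> dropped
    ]
    df = pd.DataFrame(data, columns=columns)

    actual = prepare_data(df, categorical=["PULocationID", "DOLocationID"])

    # only the first two rows survive the duration filter
    assert len(actual) == 2
    assert actual["PULocationID"].tolist() == ["-1", "1"]
```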
- Create a `docker-compose.yaml` in `homework_6`:
  services:
    localstack:
      image: localstack/localstack:latest
      ports:
        - "4566:4566"
      environment:
        - SERVICES=s3
        - DEBUG=1
        - DATA_DIR=/tmp/localstack/data
      volumes:
        - "/var/run/docker.sock:/var/run/docker.sock"
- Start Localstack:
  cd homework_6
  docker-compose up -d
- Create a mock S3 bucket:
  export AWS_ACCESS_KEY_ID=test
  export AWS_SECRET_ACCESS_KEY=test
  export AWS_DEFAULT_REGION=us-east-1
  aws s3 mb s3://nyc-duration --endpoint-url http://localhost:4566
- Use the `S3_ENDPOINT_URL` environment variable and `storage_options` when reading/writing Parquet files with pandas.
- Add a `save_data(df, filename)` function to handle saving to S3 or local files (a possible helper pair is sketched below).
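One possible pair of helpers, assuming the `S3_ENDPOINT_URL` convention described above (when the variable is unset, pandas falls back to real AWS or local paths):

```python
import os

import pandas as pd


def get_storage_options():
    # point pandas/s3fs at Localstack when S3_ENDPOINT_URL is set
    endpoint_url = os.getenv("S3_ENDPOINT_URL")
    if endpoint_url:
        return {"client_kwargs": {"endpoint_url": endpoint_url}}
    return None


def read_data(filename):
    return pd.read_parquet(filename, storage_options=get_storage_options())


def save_data(df, filename):
    df.to_parquet(filename, engine="pyarrow", index=False,
                  storage_options=get_storage_options())
```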
- Create `integration_test.py` (a sketch follows below):
  - Generate a test dataframe (same as the unit test).
  - Save it to the mock S3 bucket (Localstack).
  - Set environment variables for `batch.py`.
  - Call `batch.py` to process the batch.
  - Read the output file from S3 and compute the sum of predicted durations.
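Putting that together, `integration_test.py` might look roughly like this (a sketch that assumes the file patterns from the refactoring step, the AWS test credentials exported above, that `batch.py` takes year and month as positional arguments, and that it writes a `predicted_duration` column):

```python
import os
import subprocess
from datetime import datetime

import pandas as pd

S3_ENDPOINT_URL = "http://localhost:4566"
options = {"client_kwargs": {"endpoint_url": S3_ENDPOINT_URL}}


def dt(hour, minute, second=0):
    return datetime(2023, 1, 1, hour, minute, second)


# same small dataframe as in the unit test
columns = ["PULocationID", "DOLocationID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]
data = [
    (None, None, dt(1, 1), dt(1, 10)),
    (1, 1, dt(1, 2), dt(1, 10)),
    (1, None, dt(1, 2, 0), dt(1, 2, 59)),
    (3, 4, dt(1, 2, 0), dt(2, 2, 1)),
]
df_input = pd.DataFrame(data, columns=columns)

# write the input to the mock S3 bucket on Localstack
input_file = "s3://nyc-duration/in/2023-01.parquet"
df_input.to_parquet(input_file, engine="pyarrow", index=False, storage_options=options)

# point batch.py at the mock bucket and run it for January 2023
env = os.environ.copy()
env["INPUT_FILE_PATTERN"] = "s3://nyc-duration/in/{year:04d}-{month:02d}.parquet"
env["OUTPUT_FILE_PATTERN"] = "s3://nyc-duration/out/{year:04d}-{month:02d}.parquet"
env["S3_ENDPOINT_URL"] = S3_ENDPOINT_URL
subprocess.run(["python", "batch.py", "2023", "1"], env=env, check=True)

# read the predictions back and check the aggregate
df_output = pd.read_parquet("s3://nyc-duration/out/2023-01.parquet", storage_options=options)
print("Sum of predicted durations:", df_output["predicted_duration"].sum())
```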
- Install the required package:
pip install s3fs
- Run the integration test:
python integration_test.py
- Expected output:
Sum of predicted durations: 36.28
- Refactor code, write unit tests, configure Localstack, and test the batch pipeline with mock S3 data.
- Ensure all steps are automated and easily repeatable.
Quick Start for Homework 6:
# Install required packages
pip install -r requirements.txt
pip install s3fs
# Start Localstack
cd homework_6
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
export AWS_DEFAULT_REGION=us-east-1
docker-compose up -d
aws s3 mb s3://nyc-duration --endpoint-url http://localhost:4566
# Run the integration test
python integration_test.py
Expected final result:
Sum of predicted durations: 36.28
All dependencies are listed in `requirements.txt`.