This repository provides a set of scripts to train anomaly detection models on door sensor data and then run inference (predict anomalies) on new data. The workflow relies on Python 3.11 and several common data science libraries. All dependencies and environment setup are managed by Poetry via a `pyproject.toml` file.
```
.
├── Doors_example/                      # Dataset
│   ├── csv_labeled/                    # Folder for labeled CSV files
│   ├── csv_unlabeled/                  # Folder for unlabeled CSV files
│   ├── label_json_mini/                # Dataset in JSON format
│   └── doors_anomaly_detector_config   # Configuration file for the system
├── my_utils.py                         # Common utility functions: CSV loading, resampling, wagon filtering, metrics plotting, etc.
├── my_anomaly_train.py                 # Main script to train anomaly detection models
├── my_anomaly_inference.py             # Main script to run anomaly detection inference using trained models
├── trained_models/                     # Directory where trained model files (.pkl) are saved
├── training_stats/                     # Directory for saving training-related statistics (if needed)
├── result/                             # Directory for saving inference results (.csv)
├── pyproject.toml                      # Poetry configuration for environment dependencies
└── README.md                           # This file
```

Follow the steps below to create and activate the environment using Poetry:
```bash
# 1. Install Poetry if not already installed:
pip install poetry

# 2. Install dependencies from pyproject.toml:
poetry install

# 3. Activate the environment shell:
poetry shell
```

From within this Poetry environment, you can run all scripts without conflicts.
`my_anomaly_train.py` handles reading multiple CSV files, optionally resampling them, filtering for a specific wagon, and training one or more anomaly detection models. The script:

- Reads multiple CSV files (paths are currently hardcoded as examples).
- Filters columns for a specific wagon if needed.
- Resamples the data (e.g., every `10s`).
- Splits the data into train/validation sets.
- Trains models (e.g., `IsolationForest`, `OneClassSVM`), either one per sensor or a single multivariate model.
- Saves trained model files (`.pkl`) in `trained_models/` (a minimal sketch of the pipeline follows this list).
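Under the hood, the pipeline is roughly like the following minimal sketch, assuming `pandas`, `scikit-learn`, and `joblib`. The glob path, the `timestamp` column name, and the 10-second resample rule are illustrative placeholders, not the script's actual values:

```python
import glob

import joblib
import pandas as pd
from sklearn.ensemble import IsolationForest

# Gather training CSVs (hypothetical path; the real script hardcodes its own).
paths = glob.glob("Doors_example/csv_labeled/*.csv")
frames = [pd.read_csv(p, parse_dates=["timestamp"], index_col="timestamp")
          for p in paths]
data = pd.concat(frames).sort_index()

# Optional resampling, e.g. onto a 10-second grid (mean per bin).
data = data.resample("10s").mean().dropna()

# One model per sensor column, each persisted as a .pkl file.
for sensor in data.columns:
    model = IsolationForest(random_state=42)
    model.fit(data[[sensor]])
    joblib.dump(model, f"trained_models/model_isolation_forest_{sensor}.pkl")
```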
To run:

```bash
python my_anomaly_train.py
```

By default, you'll see multiple `.pkl` files in `trained_models/` (e.g., `model_one_class_svm_ALL.pkl` or `model_isolation_forest_01_1_0040.pkl`).
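To sanity-check a trained model, you can load a saved `.pkl` back and inspect it. This assumes the models were serialized with `joblib`; plain `pickle` works analogously if that is what the script uses:

```python
import joblib

# Load one of the persisted models (filename from the examples above)
# and inspect which estimator and hyperparameters it contains.
model = joblib.load("trained_models/model_one_class_svm_ALL.pkl")
print(type(model).__name__)   # e.g., OneClassSVM
print(model.get_params())
```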
`my_anomaly_inference.py` loads those `.pkl` model files from `trained_models/`, reads new CSV test data, does the same optional resampling/filtering, and applies each model to predict anomalies. The script:

- Reads CSV files for test data (hardcoded in the script).
- Resamples/filters similarly to training.
- Loads each `.pkl` from `trained_models/`.
- Produces a DataFrame with, for each sensor `<sensor_name>`, the columns:
  - `anomaly_score` (the negative of `decision_function`; higher means more anomalous)
  - `anomaly_probability` (a basic sigmoid transform of the score)
  - `prediction` (`"anomaly"` or `"normal"`)
- If the test data includes an `anomaly` column, it computes and optionally plots Precision/Recall/F1/etc.
- Saves results to `result/` as CSV (a scoring sketch follows this list).
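Conceptually, the per-sensor scoring works like this minimal sketch. The exact sigmoid scaling and column naming are assumptions for illustration, not the script's verbatim implementation:

```python
import numpy as np
import pandas as pd

def score_sensor(model, values: pd.DataFrame, sensor: str) -> pd.DataFrame:
    """Score one sensor's test data with one trained model."""
    out = pd.DataFrame(index=values.index)
    # Negate decision_function so that higher scores mean "more anomalous".
    score = -model.decision_function(values)
    out[f"{sensor}_anomaly_score"] = score
    # Basic sigmoid transform squashing the score into (0, 1).
    out[f"{sensor}_anomaly_probability"] = 1.0 / (1.0 + np.exp(-score))
    # scikit-learn outlier detectors return -1 for outliers, +1 for inliers.
    out[f"{sensor}_prediction"] = np.where(
        model.predict(values) == -1, "anomaly", "normal")
    return out
```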
To run:

```bash
python my_anomaly_inference.py
```

Outputs:

- Trained model files (`.pkl`) are saved in `trained_models/`.
- Training stats (if any) go to `training_stats/`.
- Inference results (`.csv`) appear in `result/`.
- Dates/paths: Hardcoded in `my_anomaly_train.py` and `my_anomaly_inference.py`. Change them to suit your data structure.
- Rolling features: If `use_features=True`, a range of rolling stats and correlations is computed in `_add_features()`. Tweak the rolling windows or remove that logic to reduce complexity (see the sketch after this list).
- Individual vs. multivariate: If `individual_model=True`, you'll get multiple `.pkl` files (one per sensor). If `False`, you get just one file per model type (`ALL` in the filename).
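For reference, rolling features of the kind `_add_features()` computes might look like the following. This is a hedged sketch: the window size, feature names, and helper name are assumptions, since the actual implementation lives in the scripts:

```python
import pandas as pd

def add_rolling_features(df: pd.DataFrame, window: str = "60s") -> pd.DataFrame:
    """Illustrative rolling stats per sensor (not the repo's exact _add_features())."""
    out = df.copy()
    for col in df.columns:
        roll = df[col].rolling(window)
        out[f"{col}_roll_mean"] = roll.mean()
        out[f"{col}_roll_std"] = roll.std()
    # Pairwise rolling correlation between the first two sensors, as an example.
    if len(df.columns) >= 2:
        a, b = df.columns[:2]
        out[f"{a}_{b}_roll_corr"] = df[a].rolling(window).corr(df[b])
    return out.dropna()
```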