A selection of Python scripts for file processing, validation, and splitting.
- Python 3.x
- pandas
- openpyxl
- tqdm
- colorama
- tabulate
It is recommended to use a virtual environment to manage dependencies. Follow these steps to create and activate a virtual environment:
-
Create a virtual environment:
python3 -m venv venv
-
Activate the virtual environment:
-
On Linux and macOS:
source venv/bin/activate -
On Windows:
.\venv\Scripts\activate
-
-
Install the required packages:
pip install pandas openpyxl tqdm colorama tabulate
This script splits large CSV or Excel files into smaller files compatible with spreadsheet applications like Google Sheets, Excel, or LibreOffice.
Basic usage:
python file-splitter.py large_data_file.csvAdvanced usage:
python file-splitter.py large_data_file.xlsx --output-dir=split_files --target-app=google_sheets --output-format=csv --max-rows=30000 --max-size-mb=15input_file: Path to the large CSV or Excel file to split--output-dir: Directory to save output files (optional)--target-app: Target application (excel, google_sheets, or libreoffice)--output-format: Format for output files (csv or excel)--max-rows: Maximum number of rows per output file--max-size-mb: Maximum file size in MB per output file--chunk-size: Number of rows to process at a time (for memory efficiency)
This script validates CSV and Excel files before database ingestion by checking for various issues like missing values, duplicate primary keys, data type consistency, and more.
Basic usage:
python data-validator.py input_file.csv --pk idAdvanced usage:
python data-validator.py input_file.xlsx --pk "id,email" --required "name,email,phone" --config data-validator-config.jsoninput_file: Path to the CSV or Excel file to validate--pk: Column name(s) to use as primary key(s) for duplicate detection--required: Comma-separated list of columns that must not contain empty values--config: Path to JSON configuration file with validation rules
This script processes input CSV or Excel files by reading input files, validating column headers, renaming specified columns, transforming data, generating unique identifiers, and outputting the processed data to a new file.
Basic usage:
python csv-excel-processor.py input_file.csvAdvanced usage:
python csv-excel-processor.py input_file.csv --output-file=processed_file.csv --output-format=csv --chunk-size=5000input_file: Path to the input CSV or Excel file--output-file: Path to the output file (optional)--output-format: Format of the output file (csv or excel)--chunk-size: Number of rows to process at a time
This repository is licensed under the MIT License. See the LICENSE file for more information.