dataviper is a SQL-based tool that gets basic data preparation done the easy way. It can:
- Create "Data Profile" report of a table
- One-hot encode for "Categorical Columns" and create a "one-hot" table
- Check "Joinability" between 2 tables
- // TODO: and more
```sh
pip install dataviper
```

Your main.py will look like this:

```python
from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()
```

```sh
python3 main.py
# Then you will get 'profile_Your_Table.xlsx' 🤗
```

"Data Profiling" is known to require scanning every row of a table. If you try to do this naively with pandas, or with any library that uses pandas internally, you can't avoid eating up a lot of your local machine's memory and freezing your work.
dataviper is a SQL-based Data Profiling tool: it simply and dynamically generates SQL and lets the database server do the heavy calculation.
With dataviper, you don't need a massive local machine. All you need is a stable network and a reachable SQL database.
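For intuition, the kind of per-column aggregate query a SQL-based profiler can push down to the database looks roughly like the hand-written sketch below. It is not the exact SQL dataviper emits, just an illustration of why only a tiny summary row has to travel back to your machine.

```python
# Hand-written sketch (NOT the exact SQL dataviper generates): a per-column
# aggregate that lets the database do the full-table scan and returns only
# one small summary row to the client.
table, column = "Sales", "price"

profile_sql = f"""
SELECT
    COUNT(*)                                          AS total_count,
    SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count,
    COUNT(DISTINCT {column})                          AS unique_count,
    MIN({column})                                     AS min_value,
    MAX({column})                                     AS max_value,
    AVG({column})                                     AS avg_value
FROM {table}
"""

print(profile_sql)  # hand this to any DB driver; the heavy work stays on the server
```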
You can choose your data source from the following (see the sketch after this list for switching sources):
- SQL Server
  - profile
  - pivot
  - joinability
  - histogram
- MySQL
  - profile
  - pivot
  - joinability
  - histogram
- PostgreSQL
- CSV
  - profile
  - pivot
  - joinability
  - histogram
- Excel
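Switching sources should only change the object you pass to Client. Below is a minimal sketch, assuming the other source classes follow the same constructor pattern as MySQL above; the class name SQLServer and its argument keys are assumptions, so check which classes your installed version actually exports.

```python
from dataviper import Client
# Assumed import: the class name is a guess modelled on dataviper.source.MySQL above.
from dataviper.source import SQLServer

client = Client(source=SQLServer({
    'host': 'localhost',
    'user': 'sa',
    'password': 'xxxxxx',
    'database': 'your_database',
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()
```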
Create "Data Profile" excel file of a specified table.
When you have Sales table like this
| id | region | sales_type | price | rep_id |
|---|---|---|---|---|
| 1 | jp | phone | 240 | 115723 |
| 2 | us | web | 90 | 125901 |
| 3 | jp | web | 560 | 8003 |
| 4 | us | shop | 920 | 182234 |
| 5 | jp | NULL | 90 | 92231 |
| 6 | us | shop | 180 | 100425 |
| 7 | us | shop | 70 | 52934 |
do:

```python
with client.connect() as conn:
    table_name = 'Sales'
    profile = client.profile(table_name, example_count=3)
    profile.to_excel()
```

then you will get a profile_Sales.xlsx file with:
| column_name | data_type | null_count | null_% | unique_count | unique_% | min | max | avg | std | example_top_3 | example_last_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | bigint | 0 | 0 | 7 | 100.00 | 1 | 7 | 4.0 | 2.0 | [1,2,3] | [5,6,7] |
| region | varchar | 0 | 0 | 2 | 28.57 | | | | | [jp,us,jp] | [jp,us,us] |
| sales_type | varchar | 1 | 14.28 | 3 | 42.85 | | | | | [phone,web,web] | [None,shop,shop] |
| price | int | 0 | 0 | 6 | 85.71 | 70 | 920 | 307.1428 | 295.379 | [240,90,560] | [90,180,70] |
| rep_id | int | 0 | 0 | 7 | 100.00 | 8003 | 182234 | 96778.7142 | 51195.79065 | [115723,125901,8003] | [92231,100425,52934] |
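As a sanity check on what the columns mean, the numeric row for price can be reproduced by hand from the seven sample values. dataviper computes these on the database side; the snippet below is only a local check, and the std column matches the population standard deviation.

```python
import statistics

# The price column from the Sales example above.
prices = [240, 90, 560, 920, 90, 180, 70]

print(len(set(prices)))           # 6 -> unique_count, i.e. 85.71 % of 7 rows
print(min(prices), max(prices))   # 70 920
print(sum(prices) / len(prices))  # 307.142857... -> avg
print(statistics.pstdev(prices))  # 295.379...    -> std (population std dev)
```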
Spread categorical columns into N binary columns.
When you have the Sales table above, do:

```python
with client.connect() as conn:
    table_name = 'Sales'
    key = 'id'
    categorical_columns = ['region', 'sales_type']
    profile = client.get_schema(table_name)
    client.pivot(profile, key, categorical_columns)
```

then you will get a Sales_PIVOT_YYYYmmddHHMM table with:
| id | region_jp | region_us | sales_type_phone | sales_type_web | sales_type_shop |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 |
| 5 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 1 |
| 7 | 0 | 1 | 0 | 0 | 1 |
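The SQL behind a one-hot pivot is typically one CASE expression per distinct category value. Below is a hand-written sketch of the idea, not the exact statement dataviper generates, and the output table name is only mentioned in a comment:

```python
# Hand-written sketch of a SQL-side one-hot encode for Sales.region.
# dataviper builds this kind of statement from the distinct category values it
# finds; the SQL it actually emits (and the pivot table name) may differ.
categories = ["jp", "us"]  # distinct values of Sales.region in the example above

case_columns = ",\n    ".join(
    f"CASE WHEN region = '{value}' THEN 1 ELSE 0 END AS region_{value}"
    for value in categories
)

pivot_sql = f"""
SELECT
    id,
    {case_columns}
FROM Sales
"""

print(pivot_sql)  # the result of this SELECT would be stored as the pivot table
```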
Check how well 2 tables can be joined, by counting matched and dropped rows.
When you have the Sales table above and a Reps table like this:
| id | name | tenure |
|---|---|---|
| 8003 | Hiromu | 9 |
| 8972 | Ochiai | 6 |
| 52934 | Taro | 1 |
| 92231 | otiai10 | 2 |
| 100425 | Hanako | 7 |
| 125901 | Chika | 3 |
| 182234 | Mary | 5 |
| 199621 | Jack | 1 |
do:

```python
with client.connect() as conn:
    report = client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'})
    report.to_excel()
```

then you will get a join_Sales_Reps.xlsx file with:
| table | key | total | match | match_% | drop | drop_% |
|---|---|---|---|---|---|---|
| Sales | [rep_id] | 7 | 6 | 85.714 | 1 | 14.285 |
| Reps | [id] | 8 | 6 | 75.00 | 2 | 25.00 |
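The match / drop figures are ordinary join counts. Here is a hand-written sketch of the idea for the Sales side of the report, again not the exact SQL dataviper runs:

```python
# Hand-written sketch (not dataviper's actual SQL): count how many Sales rows
# find a partner in Reps via a LEFT JOIN. With the sample data above this
# returns total=7, match_count=6, drop_count=1 (rep_id 115723 has no Reps.id).
joinability_sql = """
SELECT
    COUNT(*)                                      AS total,
    SUM(CASE WHEN r.id IS NULL THEN 0 ELSE 1 END) AS match_count,
    SUM(CASE WHEN r.id IS NULL THEN 1 ELSE 0 END) AS drop_count
FROM Sales AS s
LEFT JOIN Reps AS r ON s.rep_id = r.id
"""

print(joinability_sql)
```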
