Add run script and config template #63
Draft
mary-cleaton wants to merge 29 commits into develop from feat/run-scripts
Changes from 17 of 29 commits

Commits
All commits are by mary-cleaton:

26fab28 feat: add config file template
c5db1d4 feat: add run script
de35f65 docs: update configs template
7e7946b fix: correct imports, comment out checkpoint tidy-up, add docstring
0351f0e feat: update run script to use boto3 for checkpoint tidy-up
0ac9467 fix: correct spark session function
00d021a docs: update changelog
21668ff Merge branch 'develop' into feat/run-scripts
7823f9c fix: change pandas version requirements
939d95d Merge remote-tracking branch 'refs/remotes/origin/feat/run-scripts' i…
67e66e1 fix: stop specifying pandas version
39e9ce7 docs: correct changelog
d050444 fix: update setup.cfg
da4e43a refactor: update tests to reflect spark session and config changes
ec603bf docs: update changelog
fe388ae ci: stop applying check-yaml pre-commit to github actions
a80700f fix: add new job to github action
0f054ab ci: update install-librkb5 github action job
1186920 Revert "docs: update changelog"
5656a83 fix: correct issues caused by last few commits
49bde5a fix: add new job to github action
607daaf feat: add import to permit easy install
62c8086 ci: update required packages
7370983 ci: change required numpy version
736ea96 ci: add missing required package to install instructions
03bdb52 ci: update minimum version of pyspark needed
af70935 ci: remove import line as it seems to have broken things
663b0a3 ci: remove pyspark from required packages, might this be breaking pyt…
36aba7c ci: restore pyspark to required packages
scalelink/configs_template.ini (new file, 28 lines):

[run_spec]
# Session size can be s, m, l or xl
spark_session_size =

[filepaths]
# df1_path, df2_path and df_candidates_path must be to parquet files
# checkpoint_path must be to a unique location as it will be deleted at the end of the run
bucket_name =
ssl_file =
df1_path =
df2_path =
df_candidates_path =
checkpoint_path =
output_path =

[variables]
df1_id =
df2_id =
linkage_vars =
df1_suffix =
df2_suffix =

[cutpoints]
day_cutpoints =
month_cutpoints =
year_cutpoints =
sex_cutpoints =
surname_cutpoints =
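To make the template concrete, here is a minimal sketch of a filled-in copy being read back with Python's standard-library configparser. Every value is hypothetical, the cutpoint format is a guess, and whether the package's own get_input_variables helper uses configparser internally is an assumption; the run script below is the authoritative consumer of these values.

# Minimal sketch: parse a hypothetical filled-in copy of configs_template.ini.
# All values below are invented for illustration only.
import configparser

example = """
[run_spec]
# Session size can be s, m, l or xl
spark_session_size = m

[filepaths]
bucket_name = my-team-bucket
ssl_file = /certs/ca-bundle.crt
df1_path = /linkage/df1.parquet
df2_path = /linkage/df2.parquet
df_candidates_path =
checkpoint_path = /linkage/checkpoints/run-01
output_path = /linkage/output/run-01

[variables]
df1_id = person_id_a
df2_id = person_id_b
linkage_vars = day,month,year,sex,surname
df1_suffix = _a
df2_suffix = _b

[cutpoints]
# The expected cutpoint format is a guess; check the package docs.
day_cutpoints = 0,1
month_cutpoints = 0,1
year_cutpoints = 0,1
sex_cutpoints = 0,1
surname_cutpoints = 0.0,0.8,1.0
"""

config = configparser.ConfigParser()
config.read_string(example)

# An empty df_candidates_path tells the run script (below) to fall back to a
# full Cartesian join of df1 and df2.
print(config["filepaths"]["df_candidates_path"] == "")  # True
print(config["run_spec"]["spark_session_size"])  # m

Note that the paths are concatenated onto "s3a://" plus the bucket name in the run script, so each path should start with a forward slash.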
The run script (new file, 124 lines):

"""
Run script for Scalelink.
"""

import boto3
import raz_client
from rdsa_utils.cdp.helpers.s3_utils import delete_folder

from scalelink.indicator_matrix import indicator_matrix as im
from scalelink.match_scores import match_scores as ms
from scalelink.matrix_a_star import matrix_a_star as ma
from scalelink.utils import utils as ut


def run_scalelink(config_path="scalelink/configs.ini"):
    """
    Takes a path for the location of the config file. From this, runs the
    entire scaling method as per Goldstein et al. (2017) on the specified
    datasets, using the specified linkage variables.

    Args:
        config_path (str):
            The filepath for the config file. The default is a file called
            configs.ini in the head of this repo. The contents should follow
            the template found at scalelink/configs_template.ini in this repo.

    Dependencies:
        boto3, raz_client, rdsa_utils

    Returns:
        df_weights_match_scores (Spark DataFrame):
            A dataframe either derived from the Cartesian join of the two
            input dataframes or derived from the dataset located at
            df_candidates_path. The columns present on this dataset are:
            - The ID column of each dataset.
            - Weight columns (named after the linkage variables, suffixed
              with '_weight') containing weights for each linkage variable
              and row.
            - A column called match_score which contains the sum of the
              weights, row-wise.
            This dataframe is also written to the output_path specified in
            the config file.
    """
    input_variables = ut.get_input_variables(config_path=config_path)

    spark = ut.create_spark_session(
        spark_session_name="Scalelink",
        spark_session_size=input_variables["spark_session_size"],
    )

    # Build the candidate pairs: a full Cartesian join of the two input
    # datasets if no candidate dataset is configured, otherwise the
    # pre-computed candidates at df_candidates_path.
    if input_variables["df_candidates_path"] == "":
        df_candidate_pairs = ut.cartesian_join_dataframes(
            df1_path="s3a://"
            + input_variables["bucket_name"]
            + input_variables["df1_path"],
            df2_path="s3a://"
            + input_variables["bucket_name"]
            + input_variables["df2_path"],
            spark=spark,
        )
    else:
        df_candidate_pairs = spark.read.parquet(
            "s3a://"
            + input_variables["bucket_name"]
            + input_variables["df_candidates_path"]
        )

    input_variables = ut.get_s(
        input_variables=input_variables, df_cartesian_join=df_candidate_pairs
    )

    # Checkpoints written here are deleted at the end of the run, so
    # checkpoint_path must point at a unique, disposable location.
    spark.sparkContext.setCheckpointDir(
        "s3a://" + input_variables["bucket_name"] + input_variables["checkpoint_path"]
    )

    df_deltas = im.get_deltas(
        df_cartesian_join=df_candidate_pairs, input_variables=input_variables
    )

    df_delta_comparisons = im.compare_deltas(
        df=df_deltas,
        linkage_vars=input_variables["linkage_vars"],
        delta_col_prefix="di_",
    )

    print("Deltas have been calculated")

    matrix_a_star = ma.get_matrix_a_star(
        df_delta_comparisons=df_delta_comparisons, input_variables=input_variables
    )

    print("Matrix A* has been calculated")

    x_star_scaled_labelled = ma.get_scaled_labelled_x_star(
        matrix_a_star=matrix_a_star, input_variables=input_variables
    )

    df_weights_match_scores = ms.get_match_scores(
        df_deltas=df_deltas,
        x_star_scaled_labelled=x_star_scaled_labelled,
        input_variables=input_variables,
        spark=spark,
    )

    print("Match scores have been calculated")

    df_weights_match_scores.write.mode("overwrite").parquet(
        "s3a://" + input_variables["bucket_name"] + input_variables["output_path"]
    )

    print("Your linked dataset has been written to:", input_variables["output_path"])

    # Tidy up: delete the checkpoint folder via boto3, authenticating the
    # client through raz_client.
    client = boto3.client("s3")
    raz_client.configure_ranger_raz(client, ssl_file=input_variables["ssl_file"])
    delete_folder(
        client, input_variables["bucket_name"], input_variables["checkpoint_path"]
    )

    print("Your checkpoint files have been tidied up")

    return df_weights_match_scores


if __name__ == "__main__":
    run_scalelink()
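For completeness, a minimal usage sketch. The import path scalelink.run is an assumption (the diff does not show where the run script lives in the package); everything else follows the docstring above.

# Minimal usage sketch. NOTE: the module path "scalelink.run" is an
# assumption; import run_scalelink from wherever this run script is
# installed in the package.
from scalelink.run import run_scalelink

# Run end-to-end with a filled-in copy of the config template. The default
# config_path is scalelink/configs.ini, per the docstring above.
df_weights_match_scores = run_scalelink(config_path="scalelink/configs.ini")

# The result has already been written to output_path as parquet; here we
# just peek at the per-pair match scores.
df_weights_match_scores.select("match_score").show(10)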
There was an error while loading. Please reload this page.