Add run script and config template #63
base: develop
Conversation
Add more comments. Remove config that is no longer used.
Checkpoint tidy-up doesn't work on S3 buckets. Need to find a method that does.
This requires: a) updating the required packages in setup.cfg; b) updating configs_template.ini so it includes the additional configs (bucket_name, ssl_file) needed for the raz client and boto3; c) updating the read_configs method in utils.py so that these configs are read in properly; and d) updating main.py so that it contains the new checkpoint tidy-up code and uses the new packages and configs correctly.
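For illustration only, a minimal sketch of how the new keys might be read with configparser; the section name, the exact layout of configs_template.ini and the real read_configs in utils.py are assumptions and may differ:

```python
import configparser

def read_configs(config_path: str) -> dict:
    """Read the run configs, including the new S3/raz keys (hedged sketch only)."""
    parser = configparser.ConfigParser()
    parser.read(config_path)
    section = parser["inputs"]  # section name is an assumption
    return {
        "bucket_name": section["bucket_name"],  # S3 bucket used instead of HDFS paths
        "ssl_file": section["ssl_file"],        # certificate file passed to the raz client
        # ...the existing config keys would be read here as before
    }
```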
Update config spark.shuffle.service.enabled to equal false. This allows the code to run using Spark 3.5.1.
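As a hedged illustration of that change (the app name is a placeholder and the real create_spark_session sets more options than this):

```python
from pyspark.sql import SparkSession

# Disabling the external shuffle service allows the job to run on Spark 3.5.1
# without requiring a shuffle service on the cluster.
spark = (
    SparkSession.builder
    .appName("scalelink")  # placeholder name
    .config("spark.shuffle.service.enabled", "false")
    .getOrCreate()
)
```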
@ONSdigital/scalelink-maintainers and @ONSdigital/scalelink-developers
Cannot specify Pandas >= 2.1.4 as this does not work with Python 3.8. Instead, downgrade to specify Pandas >= 2.0.3, which should work with Python 3.8.
…nto feat/run-scripts
Specifying version 2.0.3 still did not work when the GitHub Actions were run. Instead, do not specify a version until Scalelink stops using Python 3.8. The issue is due to pandas no longer supporting Python 3.8.
The configs template was added, not updated, by this branch.
Try again...
Update test_get_input_variables to incorporate the new config variables bucket_name and ssl_file and to remove the superseded config variable hdfs_test_path. Update test_create_spark_session so spark.shuffle.service.enabled is always expected to be false.
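A hedged sketch of the expectation added to test_create_spark_session; the real test, the import path and create_spark_session's signature are assumptions here:

```python
from scalelink.utils import create_spark_session  # assumed location

def test_create_spark_session_shuffle_service_disabled():
    spark = create_spark_session()  # assumed to take no required arguments
    # The shuffle service should always be disabled so the job runs on Spark 3.5.1.
    assert spark.conf.get("spark.shuffle.service.enabled") == "false"
```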
This should hopefully get round the `subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127` error that I keep getting for this branch.
As we are always using Ubuntu here, I removed the if-clause checking whether Ubuntu (or macOS) was being used. This if-clause had a typo in it, and as it wasn't needed, removing it was better than trying to fix the typo.
This reverts commit ec603bf.
Another attempt at installing libkrb5 to get round the `subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127` error encountered when trying to run the pytests as a GitHub Action, now that the code contains a dependency on raz_client.
ayomide2021
left a comment
I'm happy with all the changes that have been made to the run script and config template.
Is it possible to carry out the checks using the synthetic test datasets from the Scalelink R package?
Have put the relevant files in your Dev workspace. :)
Add import line to scalelink folder's __init__.py so the run_scalelink function can easily be imported from scalelink once it's a package.
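For illustration, assuming run_scalelink is defined in scalelink/main.py, the added line could look like this:

```python
# scalelink/__init__.py
# Re-export run_scalelink so `from scalelink import run_scalelink` works once packaged.
from scalelink.main import run_scalelink

__all__ = ["run_scalelink"]
```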
Version 1.26.4 is not compatible with Python 3.8, so it can't be used.
Two minor updates made since last QA.
Further investigation suggests the failure of the GitHub Actions that run the pytests may be due to Python 3.9 having become unsupported since the last run. Will fix this in another branch and then come back to this branch.
After a lot of back-and-forth, I think the reason the most recent commits have had failing GitHub Actions for the pytests is that Python 3.9 was discontinued whilst this branch was waiting for QA. I'm going to revert this to draft and open up the branch that bumps the Python versions. Once that's updated, this branch can be rebased onto it and then we can try again to get the GitHub Actions to pass.
Description
Add run script and config template. Minor changes were made to the run script to allow it to work on Spark 3.5.1 and with S3 buckets instead of HDFS, namely:
- Added an `if __name__ == "__main__":` idiom to the end of `main.py` so it can be run directly. Thus, `run.py` (which called it) is no longer needed for encapsulation purposes. This also ensures everything is within the `scalelink` folder.
- Added an `__init__.py` file within the `scalelink` folder. This allows the `run_scalelink` function in `main.py` to be imported directly in a script using `from scalelink import run_scalelink` once the code is packaged.
- Changed the checkpoint clean-up from a `subprocess` call to a checkpoint clean-up method for S3 bucket files using `raz_client` and `boto3` (see the sketch below this list).
- Updated the `read_configs` function to facilitate this new checkpoint clean-up method.
- Updated `create_spark_session` so it works with Spark 3.5.1. Also updated its test.
- Updated `main.py` so it uses the new packages and configs correctly.
- Updated `test_get_input_variables` to reflect changes in configs.

As this is a main script, there are no unit tests associated with it. However, I have tested that it runs using the synthetic test datasets from the Scalelink R package. Whoever picks this up for review, please contact me so that you can carry out the same checks.
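To show the shape of the new checkpoint clean-up, here is a hedged sketch; the function name, its parameters and the exact raz_client call are assumptions rather than the code actually in main.py:

```python
import boto3
import raz_client  # assumed: configures the S3 client for Ranger RAZ; exact call may differ

def tidy_up_checkpoints(bucket_name: str, ssl_file: str, checkpoint_prefix: str) -> None:
    """Delete checkpoint files written to an S3 bucket (replaces the old subprocess call)."""
    client = boto3.client("s3")
    # Assumption: raz_client exposes a helper to route the boto3 client through RAZ
    # using the certificate from the ssl_file config.
    raz_client.configure_ranger_raz(client, ssl_file=ssl_file)

    # List everything under the checkpoint prefix and delete it.
    response = client.list_objects_v2(Bucket=bucket_name, Prefix=checkpoint_prefix)
    objects = [{"Key": obj["Key"]} for obj in response.get("Contents", [])]
    if objects:
        client.delete_objects(Bucket=bucket_name, Delete={"Objects": objects})
```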
Type of change
Checklist:
Peer review
Any new code includes all the following:
Review comments
Suggestions should be tailored to the code that you are reviewing. Provide context.
Be critical and clear, but not mean. Ask questions and set actions.
These might include:
- bugs that need fixing (does it work as expected, and does it work with other code that it is likely to interact with?)
- additional tests that should be implemented (do the tests effectively assure that it works correctly? Are there additional edge cases/negative tests to be considered?)
Further reading: code review best practices