
Conversation


@mary-cleaton mary-cleaton commented Oct 2, 2025

Description

Add run script and config template. Minor changes were made to the run script so that it works on Spark 3.5.1 and with S3 buckets instead of HDFS, namely:

  • Added the if __name__ == "__main__": idiom to the end of main.py so it can be run directly. Thus, run.py (which called it) is no longer needed for encapsulation purposes. This also ensures everything is within the scalelink folder.
  • Added import to the __init__.py file within the scalelink folder. This allows the run_scalelink function in main.py to be imported directly in a script using from scalelink import run_scalelink once the code is packaged.
  • Switched the HDFS checkpoint clean-up (which used subprocess) to a checkpoint clean-up method for S3 bucket files using raz_client and boto3.
  • Updated the required packages list, config template and read_configs function to facilitate this new checkpoint clean-up method.
  • Updated create_spark_session so it works with Spark 3.5.1. Also updated its test.
  • Added a configs template and updated it so it will work with the updated main.py.
  • Updated test_get_input_variables to reflect changes in configs.
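
The two packaging changes above (the `__main__` guard and the `__init__.py` re-export) might look like this as a minimal sketch; `run_scalelink` here is a stub standing in for the real entry point in `scalelink/main.py`:

```python
def run_scalelink():
    """Stub standing in for the real pipeline entry point in scalelink/main.py."""
    return "run complete"


# In scalelink/__init__.py, a single re-export makes the packaged import work:
#     from scalelink.main import run_scalelink
# so users can write `from scalelink import run_scalelink` once packaged.

# At the end of main.py, this guard lets the module be executed directly,
# making the old run.py wrapper unnecessary:
if __name__ == "__main__":
    run_scalelink()
```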

As this is a main script, there are no unit tests associated with it. However, I have tested that it runs using the synthetic test datasets from the Scalelink R package. Whoever picks this up for review, please contact me so that you can carry out the same checks.

Type of change

  • Bug fix - non-breaking change
  • New feature - non-breaking change
  • Breaking change - backwards incompatible change, changes expected behaviour
  • Non-user facing change, structural change, dev functionality, docs ...

Checklist:

  • I have performed a self-review of my own code.
  • I have commented my code appropriately, focusing on explaining my design decisions (explain why, not how).
  • I have made corresponding changes to the documentation (comments, docstring, etc.. )
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have updated the change log.

Peer review

Any new code includes all the following:

  • Documentation: docstrings, comments have been added/ updated.
  • Style guidelines: New code conforms to the project's contribution guidelines.
  • Functionality: The code works as expected, handles expected edge cases and exceptions are handled appropriately.
  • Complexity: The code is not overly complex, logic has been split into appropriately sized functions, etc..
  • Test coverage: Unit tests cover essential functions for a reasonable range of inputs and conditions. Added and existing tests pass on my machine.

Review comments

Suggestions should be tailored to the code that you are reviewing. Provide context.
Be critical and clear, but not mean. Ask questions and set actions.

These might include:
  • bugs that need fixing (does it work as expected? and does it work with other code
    that it is likely to interact with?)
  • alternative methods (could it be written more efficiently or with more clarity?)
  • documentation improvements (does the documentation reflect how the code actually works?)
  • additional tests that should be implemented
    • Do the tests effectively assure that it
      works correctly? Are there additional edge cases/ negative tests to be considered?
  • code style improvements (could the code be written more clearly?)

Further reading: code review best practices

@mary-cleaton mary-cleaton self-assigned this Oct 2, 2025
@mary-cleaton mary-cleaton added the python and pyspark labels Oct 2, 2025
Add more comments. Remove config that is no longer used.
Checkpoint tidy-up doesn't work on S3 buckets. Need to find a method that does.
@mary-cleaton mary-cleaton requested a review from a team October 9, 2025 14:26
This required: a) updating the required packages in setup.cfg; b) updating configs_template.ini to include additional configs (bucket_name, ssl_file) for raz_client and boto3; c) updating the read_configs method in utils.py so that these configs are read in properly; and d) updating main.py so that it contains the new checkpoint tidy-up code and uses the new packages and configs correctly.
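
The S3 checkpoint tidy-up described above could be sketched roughly as follows. `bucket_name` and `ssl_file` come from the updated config template; `checkpoint_prefix` is a hypothetical name for wherever main.py writes its checkpoints, and the `configure_ranger_raz` call shape is an assumption about the raz_client API used on the platform, not confirmed by this PR:

```python
try:
    import boto3  # only needed when talking to a real bucket
except ImportError:
    boto3 = None

try:
    import raz_client  # Ranger-authorised S3 access; platform-specific package
except ImportError:
    raz_client = None


def build_s3_client(ssl_file):
    """Build an S3 client; ssl_file comes from the updated configs_template.ini."""
    client = boto3.client("s3")
    if raz_client is not None:
        # Assumed raz_client call shape; check the platform docs for the
        # exact signature used in main.py.
        raz_client.configure_ranger_raz(client, ssl_file=ssl_file)
    return client


def tidy_checkpoints(client, bucket_name, checkpoint_prefix):
    """Delete every object under checkpoint_prefix; returns the count deleted."""
    deleted = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=checkpoint_prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

Taking the client as a parameter keeps the deletion logic testable without a live bucket or the Ranger setup.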
Update config spark.shuffle.service.enabled to equal false. This allows the code to run using Spark 3.5.1.
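
As a hedged sketch, the updated session builder might look like this; only the shuffle-service flag is taken from this PR, and the app name is illustrative:

```python
# Settings applied by the updated create_spark_session. Disabling the external
# shuffle service is the change this PR makes for Spark 3.5.1.
SPARK_CONFIGS = {
    "spark.shuffle.service.enabled": "false",
}


def create_spark_session(app_name="scalelink"):
    """Sketch of the updated session builder for Spark 3.5.1."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONFIGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```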
@mary-cleaton mary-cleaton changed the title Add run script and config template Add run script and update config template Oct 23, 2025
@mary-cleaton mary-cleaton marked this pull request as ready for review October 23, 2025 13:02
@mary-cleaton mary-cleaton requested a review from a team as a code owner October 23, 2025 13:02

mary-cleaton commented Oct 23, 2025

@ONSdigital/scalelink-maintainers and @ONSdigital/scalelink-developers
See comment in description - contact me for files to run this on synthetic data as a test.

Cannot specify Pandas >= 2.1.4 as this does not work with Python 3.8. Instead, downgrade to specify Pandas >= 2.0.3, which should work with Python 3.8.
@mary-cleaton mary-cleaton changed the title Add run script and update config template Add run script and config template Oct 23, 2025
mary-cleaton and others added 3 commits October 23, 2025 13:26
Specifying version 2.0.3 still did not work when the GitHub Actions were run. Instead, do not specify a version until Scalelink stops supporting Python 3.8. The issue is due to pandas no longer supporting Python 3.8.
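
The resulting dependency entry might look like this in setup.cfg (section layout assumed from a standard setuptools configuration):

```ini
[options]
install_requires =
    # pandas left unpinned: >=2.1 dropped Python 3.8, and pinning >=2.0.3
    # still failed in GitHub Actions, so no version is specified until
    # Scalelink stops supporting Python 3.8.
    pandas
```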
Configs template was added not updated by this branch.
@mary-cleaton mary-cleaton marked this pull request as draft October 23, 2025 14:11
Update test_get_input_variables to incorporate the new config variables bucket_name and ssl_file and remove the superseded config variable hdfs_test_path. Update test_create_spark_session so spark.shuffle.service.enabled is always expected to be false.
@mary-cleaton mary-cleaton marked this pull request as ready for review October 23, 2025 14:21
This should hopefully get round the subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127 error that I keep getting for this branch.
As we are always using Ubuntu here, removed the if-clause checking whether Ubuntu (or macOS) was being used. This if-clause had a typo in it and, as it wasn't needed, removing it was better than trying to fix the typo.
@mary-cleaton mary-cleaton marked this pull request as draft October 23, 2025 15:41
Another attempt at installing libkrb5 to get round the 'subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127' error encountered when trying to run the pytests as a GitHub Action now that the code contains a dependency for raz_client.
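
A workflow step along these lines would install the Kerberos headers (on Ubuntu, the libkrb5-dev package provides krb5-config) before pip builds the raz_client dependency chain; the workflow file name and step placement here are assumptions, not taken from the PR:

```yaml
# .github/workflows/tests.yml (hypothetical name) -- run before `pip install`
# so that building the gssapi dependency can find `krb5-config`.
- name: Install Kerberos build dependencies
  run: sudo apt-get update && sudo apt-get install -y libkrb5-dev
```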
@mary-cleaton mary-cleaton marked this pull request as ready for review October 28, 2025 14:08

@ayomide2021 ayomide2021 left a comment

I'm happy with all the changes that have been made to the run script and config template.
Is it possible to carry out the checks using the synthetic test datasets from the Scalelink R package?

@mary-cleaton

I'm happy with all the changes that have been made to the run script and config template. Is it possible to carry out the checks using the synthetic test datasets from the Scalelink R package?

Have put the relevant files in your Dev workspace. :)

Add import line to scalelink folder's __init__.py so the run_scalelink function can easily be imported from scalelink once it's a package.
Version 1.26.4 is not compatible with Python 3.8 so can't be used.
@mary-cleaton

Two minor updates made since last QA.

Further investigation suggests failure of GitHub Actions that run pytests may be due to Python 3.9 having become unsupported since last run. Will fix this in another branch and then come back to this branch.
@mary-cleaton

After a lot of back-and-forth, I think the reason the most recent commits have had failing GitHub Actions for running the pytests is that Python 3.9 has been discontinued whilst this branch was waiting for QA.

I'm going to revert this to draft and open up the branch that bumps the Python versions. Once that's updated, this branch can be rebased to it and then we can try again to get the GitHub Actions to pass.

@mary-cleaton mary-cleaton marked this pull request as draft December 4, 2025 16:50