
Conversation


@mary-cleaton mary-cleaton commented Oct 2, 2025

Description

Add run script and config template. Minor changes were made to the run script so that it works on Spark 3.5.1 and with S3 buckets instead of HDFS, namely:

  • Added the if __name__ == "__main__": idiom to the end of main.py so it can be run directly. Thus, run.py (which called it) is no longer needed for encapsulation purposes. This also ensures everything is within the scalelink folder.
  • Added import to the __init__.py file within the scalelink folder. This allows the run_scalelink function in main.py to be imported directly in a script using from scalelink import run_scalelink once the code is packaged.
  • Switched the HDFS checkpoint clean-up (which used subprocess) to a checkpoint clean-up method for S3 bucket files using raz_client and boto3.
  • Updated the required packages list, config template and read_configs function to facilitate this new checkpoint clean-up method.
  • Updated create_spark_session so it works with Spark 3.5.1. Also updated its test.
  • Added a configs template and updated it so it will work with the updated main.py.
  • Updated test_get_input_variables to reflect changes in configs.
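
The two packaging changes above (the `__main__` guard and the `__init__.py` re-export) might look like this as a minimal sketch; `run_scalelink` here is a stub standing in for the real entry point in `scalelink/main.py`:

```python
def run_scalelink():
    """Stub standing in for the real pipeline entry point in scalelink/main.py."""
    return "run complete"


# In scalelink/__init__.py, a single re-export makes the packaged import work:
#     from scalelink.main import run_scalelink
# so users can write `from scalelink import run_scalelink` once packaged.

# At the end of main.py, this guard lets the module be executed directly,
# making the old run.py wrapper unnecessary:
if __name__ == "__main__":
    run_scalelink()
```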

As this is a main script, there are no unit tests associated with it. However, I have tested that it runs using the synthetic test datasets from the Scalelink R package. Whoever picks this up for review, please contact me so that you can carry out the same checks.

Type of change

  • Bug fix - non-breaking change
  • New feature - non-breaking change
  • Breaking change - backwards incompatible change, changes expected behaviour
  • Non-user facing change, structural change, dev functionality, docs ...

Checklist:

  • I have performed a self-review of my own code.
  • I have commented my code appropriately, focusing on explaining my design decisions (explain why, not how).
  • I have made corresponding changes to the documentation (comments, docstring, etc.. )
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have updated the change log.

Peer review

Any new code includes all the following:

  • Documentation: docstrings, comments have been added/ updated.
  • Style guidelines: New code conforms to the project's contribution guidelines.
  • Functionality: The code works as expected, handles expected edge cases and exceptions are handled appropriately.
  • Complexity: The code is not overly complex, logic has been split into appropriately sized functions, etc..
  • Test coverage: Unit tests cover essential functions for a reasonable range of inputs and conditions. Added and existing tests pass on my machine.

Review comments

Suggestions should be tailored to the code that you are reviewing. Provide context.
Be critical and clear, but not mean. Ask questions and set actions.

These might include:
  • bugs that need fixing (does it work as expected? and does it work with other code
    that it is likely to interact with?)
  • alternative methods (could it be written more efficiently or with more clarity?)
  • documentation improvements (does the documentation reflect how the code actually works?)
  • additional tests that should be implemented
    • Do the tests effectively assure that it
      works correctly? Are there additional edge cases/ negative tests to be considered?
  • code style improvements (could the code be written more clearly?)

Further reading: code review best practices

@mary-cleaton mary-cleaton self-assigned this Oct 2, 2025
@mary-cleaton mary-cleaton added the python and pyspark labels Oct 2, 2025
Add more comments. Remove config that is no longer used.
Checkpoint tidy-up doesn't work on S3 buckets. Need to find a method that does.
@mary-cleaton mary-cleaton requested a review from a team October 9, 2025 14:26
This required: a) updating the required packages in setup.cfg; b) updating configs_template.ini to include additional configs (bucket_name, ssl_file) for raz_client and boto3; c) updating the read_configs method in utils.py so that these configs are read in properly; and d) updating main.py so that it contains the new checkpoint tidy-up code and uses the new packages and configs correctly.
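
The S3 checkpoint tidy-up described above could be sketched roughly as follows. `bucket_name` and `ssl_file` come from the updated config template; `checkpoint_prefix` is a hypothetical name for wherever main.py writes its checkpoints, and the `configure_ranger_raz` call shape is an assumption about the raz_client API used on the platform, not confirmed by this PR:

```python
try:
    import boto3  # only needed when talking to a real bucket
except ImportError:
    boto3 = None

try:
    import raz_client  # Ranger-authorised S3 access; platform-specific package
except ImportError:
    raz_client = None


def build_s3_client(ssl_file):
    """Build an S3 client; ssl_file comes from the updated configs_template.ini."""
    client = boto3.client("s3")
    if raz_client is not None:
        # Assumed raz_client call shape; check the platform docs for the
        # exact signature used in main.py.
        raz_client.configure_ranger_raz(client, ssl_file=ssl_file)
    return client


def tidy_checkpoints(client, bucket_name, checkpoint_prefix):
    """Delete every object under checkpoint_prefix; returns the count deleted."""
    deleted = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=checkpoint_prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

Taking the client as a parameter keeps the deletion logic testable without a live bucket or the Ranger setup.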
Update config spark.shuffle.service.enabled to equal false. This allows the code to run using Spark 3.5.1.
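
As a hedged sketch, the updated session builder might look like this; only the shuffle-service flag is taken from this PR, and the app name is illustrative:

```python
# Settings applied by the updated create_spark_session. Disabling the external
# shuffle service is the change this PR makes for Spark 3.5.1.
SPARK_CONFIGS = {
    "spark.shuffle.service.enabled": "false",
}


def create_spark_session(app_name="scalelink"):
    """Sketch of the updated session builder for Spark 3.5.1."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONFIGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```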
@mary-cleaton mary-cleaton changed the title Add run script and config template Add run script and update config template Oct 23, 2025
@mary-cleaton mary-cleaton marked this pull request as ready for review October 23, 2025 13:02
@mary-cleaton mary-cleaton requested a review from a team as a code owner October 23, 2025 13:02

mary-cleaton commented Oct 23, 2025

@ONSdigital/scalelink-maintainers and @ONSdigital/scalelink-developers
See comment in description - contact me for files to run this on synthetic data as a test.

Cannot specify Pandas >= 2.1.4 as this does not work with Python 3.8. Instead, downgrade to specify Pandas >= 2.0.3, which should work with Python 3.8.
@mary-cleaton mary-cleaton changed the title Add run script and update config template Add run script and config template Oct 23, 2025
mary-cleaton and others added 3 commits October 23, 2025 13:26
Specifying version 2.0.3 still did not work when the GitHub Actions were run. Instead, do not specify a version until Scalelink stops supporting Python 3.8. The issue is due to pandas no longer supporting Python 3.8.
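
The resulting dependency entry might look like this in setup.cfg (section layout assumed from a standard setuptools configuration):

```ini
[options]
install_requires =
    # pandas left unpinned: >=2.1 dropped Python 3.8, and pinning >=2.0.3
    # still failed in GitHub Actions, so no version is specified until
    # Scalelink stops supporting Python 3.8.
    pandas
```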
Configs template was added not updated by this branch.
@mary-cleaton mary-cleaton marked this pull request as draft October 23, 2025 14:11
Update test_get_input_variables to incorporate the new config variables bucket_name and ssl_file and remove the superseded config variable hdfs_test_path. Update test_create_spark_session so spark.shuffle.service.enabled is always expected to be false.
@mary-cleaton mary-cleaton marked this pull request as ready for review October 23, 2025 14:21
This should hopefully get round the subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127 error that I keep getting for this branch.
As we are always using Ubuntu here, removed the if-clause checking whether Ubuntu (or macOS) was being used. This if-clause had a typo in it and, as it wasn't needed, removing it was better than trying to fix the typo.
@mary-cleaton mary-cleaton marked this pull request as draft October 23, 2025 15:41
Another attempt at installing libkrb5 to get round the 'subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127' error encountered when trying to run the pytests as a GitHub Action now that the code contains a dependency for raz_client.
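
A workflow step along these lines would install the Kerberos headers (on Ubuntu, the libkrb5-dev package provides krb5-config) before pip builds the raz_client dependency chain; the workflow file name and step placement here are assumptions, not taken from the PR:

```yaml
# .github/workflows/tests.yml (hypothetical name) -- run before `pip install`
# so that building the gssapi dependency can find `krb5-config`.
- name: Install Kerberos build dependencies
  run: sudo apt-get update && sudo apt-get install -y libkrb5-dev
```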
@mary-cleaton mary-cleaton marked this pull request as ready for review October 28, 2025 14:08

@ayomide2021 ayomide2021 left a comment

I'm happy with all the changes that have been made to the run script and config template.
Is it possible to carry out the checks using the synthetic test datasets from the Scalelink R package?

@mary-cleaton

I'm happy with all the changes that have been made to the run script and config template. Is it possible to carry out the checks using the synthetic test datasets from the Scalelink R package?

Have put the relevant files in your Dev workspace. :)

Add import line to scalelink folder's __init__.py so the run_scalelink function can easily be imported from scalelink once it's a package.
Version 1.26.4 is not compatible with Python 3.8 so can't be used.
@mary-cleaton

Two minor updates made since last QA.

Further investigation suggests failure of GitHub Actions that run pytests may be due to Python 3.9 having become unsupported since last run. Will fix this in another branch and then come back to this branch.
@mary-cleaton

After a lot of back-and-forth, I think the reason the most recent commits have had failing GitHub Actions for running the pytests is that Python 3.9 has been discontinued whilst this branch was waiting for QA.

I'm going to revert this to draft and open up the branch that bumps the Python versions. Once that's updated, this branch can be rebased to it and then we can try again to get the GitHub Actions to pass.

@mary-cleaton mary-cleaton marked this pull request as draft December 4, 2025 16:50