
Conversation


@marklondon999 marklondon999 commented Jun 20, 2025

Title: Introduce JsonPreprocessor for Data Processing

Summary

This pull request introduces a configuration-driven workflow for processing raw JSON outputs from Survey Assist, preparing the data for analysis, and running evaluation metrics.

Background

A labeled set of clerical coder assignments has been run through Survey Assist, and the JSON output file(s) need to be processed.
Currently, there is no standardised method for processing the JSON files produced by batch runs of Survey Assist, which makes it difficult to evaluate the performance of different models or prompts. This change introduces a pipeline to solve that problem.

Motivation

This PR links the data set outputs of the Docker batch runs to the module that calculates the final metrics. The JsonPreprocessor class allows a developer to take the raw JSON output from a model run, flatten it into a structured format, merge it with the human-coded input SIC data, and generate a set of performance metrics, all controlled by a configuration file.

Changes Introduced

JsonPreprocessor Class:

A new class responsible for fetching and processing raw JSON files directly from Google Cloud Storage (GCS). It can operate in two modes:

  • Single File Mode: Processes a single, specified JSON file.
  • Directory Mode: Scans a GCS directory and processes all files created on or after a specified date, on the assumption that these are the files required. This is used when a run has been interrupted, or is too large to run in one go. A sketch of this mode selection is shown below.
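
As a rough illustration, a mode-selection routine matching this description might look like the sketch below. The google-cloud-storage calls are standard and the config keys (single_file, named_file, gcs_bucket_name, gcs_json_dir, date_since) come from this PR's description, but the function itself and its exact filtering rules are assumptions, not the actual implementation:

```python
# Illustrative sketch only: how the two modes described above could work.
# The google.cloud.storage calls are real; the function and filtering rules
# are assumptions based on this PR's description.
from datetime import datetime, timezone

from google.cloud import storage


def select_json_files(config: dict) -> list[str]:
    """Return the GCS paths of the JSON files to process, per the config."""
    params = config.get("parameters", {})
    paths = config["paths"]

    if str(params.get("single_file", "False")) == "True":
        # Single File Mode: just the one file named in the config.
        return [paths["named_file"]]

    # Directory Mode: list every blob under gcs_json_dir created on or after
    # date_since (YYYYMMDD), assuming those are the files from the run.
    cutoff = datetime.strptime(str(params["date_since"]), "%Y%m%d").replace(
        tzinfo=timezone.utc
    )
    client = storage.Client()
    blobs = client.list_blobs(paths["gcs_bucket_name"], prefix=paths["gcs_json_dir"])
    return [
        f"gs://{paths['gcs_bucket_name']}/{blob.name}"
        for blob in blobs
        if blob.name.endswith(".json") and blob.time_created >= cutoff
    ]
```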

Runner Script (run_json_processor.py):

A Jupytext notebook (a Jupyter Notebook stored in .py format) that uses a .toml configuration file to manage the entire workflow, from fetching data to producing the final metrics.

  • Note: the config also allows a list of prepared files to be processed through the LabelAccuracy class, so that different experiments can be compared (see the sketch below).
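
A hypothetical sketch of that comparison step is shown below. The LabelAccuracy class and the merged_file_list config key come from this PR's description; the import path (not shown here), constructor signature, and the way metrics are reported are assumptions made purely for illustration:

```python
# Hypothetical sketch of the comparison step driven by merged_file_list.
# LabelAccuracy and the merged_file_list key come from this PR; the
# constructor signature and reporting are illustrative assumptions.
import tomllib

import pandas as pd

with open("prepare_config.toml", "rb") as f:
    config = tomllib.load(f)

for merged_file in config["paths"]["merged_file_list"]:
    merged_df = pd.read_csv(merged_file)   # reading a gs:// path requires gcsfs to be installed
    evaluator = LabelAccuracy(merged_df)   # import not shown in this PR; constructor assumed
    print(merged_file, evaluator)          # compare the experiments' metrics side by side
```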

Control

How the prepare_config.toml Works

The process is controlled by the prepare_config.toml file, which is broken down into two main sections:

[paths]

This section defines all the input and output locations in GCS.

  • batch_filepath: The location of the original, human-coded input data. This is a source location.
  • gcs_bucket_name: The name of the GCS bucket where data is stored.
  • gcs_json_dir: The directory within the bucket where the raw JSON output files from model runs are located. This is a source location.
  • analysis_csv: The location where the human-coded data, enriched with quality flags by the prepare_evaluation_data_for_analysis.py script, is written. This is a destination location.
  • named_file: The full GCS path to a single JSON output file to be processed. This is only used if single_file is set to "True". This is a source location.
  • merged_file_list: A list of pre-processed CSV files that can be run through the final evaluation metrics. These are source locations.
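
For illustration, a [paths] section with this shape might look like the following. The bucket name and file names are placeholders modelled on the test instructions later in this description, not real locations; it is parsed here with Python's standard tomllib (3.11+):

```python
# Illustrative only: placeholder values showing the shape of the [paths]
# section described above, parsed with the standard-library tomllib.
import tomllib

example = """
[paths]
batch_filepath   = "gs://<my-bucket-name>/evaluation_data/DSC_Rep_Sample.csv"
gcs_bucket_name  = "<my-bucket-name>"
gcs_json_dir     = "analysis_outputs"
analysis_csv     = "gs://<my-bucket-name>/analysis_outputs/added_columns/flags_example.csv"
named_file       = "gs://<my-bucket-name>/analysis_outputs/example_output.json"
merged_file_list = ["gs://<my-bucket-name>/analysis_outputs/merged_files/merged_flags_example.csv"]
"""

paths = tomllib.loads(example)["paths"]
print(paths["gcs_bucket_name"])
```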

[parameters]

This section controls the behaviour of the JsonPreprocessor.

  • single_file: If set to "True", the processor will only use the file specified in named_file. If set to "False" or omitted, it will scan the gcs_json_dir directory.
  • date_since: When single_file is false, this tells the processor to only consider files in the directory created on or after this date (in YYYYMMDD format).
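
For example, a [parameters] section like the one below (the values are examples only, not the PR's settings) would put the processor into directory mode and make it consider only files created on or after 1 June 2025:

```python
# Illustrative [parameters] section; values are examples only.
import tomllib

params = tomllib.loads("""
[parameters]
single_file = "False"
date_since  = "20250601"
""")["parameters"]

scan_directory = params.get("single_file") != "True"   # True -> directory mode
print(scan_directory, params["date_since"])
```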

Prerequisites & How to Run

  1. Authenticate with Google Cloud:
    gcloud auth application-default login
    
  2. Update the Configuration: First, replace the placeholder <my-bucket-name> in the config with the correct bucket name.
  3. If the notebook run_json_processor.ipynb is run against the current JSON output file, the paths defined in the config are already set. To run it against a different data set, put the GCS location of that input file into batch_filepath and the location of the JSON output into named_file.
  4. Convert and Run the Notebook:
    The runner script is a Jupytext file. To run it, first convert it to a .ipynb notebook:
    jupytext --to ipynb notebooks/run_json_processor.py
    
    Then, open and run run_json_processor.ipynb in Jupyter.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • [✅] Code is formatted using Black
  • [✅] Imports are sorted using isort
  • [✅] Code passes linting with Ruff, Pylint, and Mypy
  • [✅] Security checks pass using Bandit
  • [✅] API and Unit tests are written and pass using pytest
  • [NA] Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • [✅] DocStrings follow Google-style and are added as per Pylint recommendations
  • [NA] Documentation has been updated if needed

🔍 How to Test

  1. In the file prepare_config.toml, do a global replace of the placeholder text <my-bucket-name> with the actual GCS bucket name (I have deliberately changed this for security reasons).
  2. Create the notebook using jupytext --to ipynb notebooks/run_json_processor.py
  3. The file batch_filepath = "gs://<my-bucket-name>/evaluation_data/DSC_Rep_Sample.csv" is the source of the original data that was run through the batch script.
  4. The batch process, when it was run, wrote the JSON output to this file: named_file = "gs://<my-bucket-name>/analysis_outputs/20250620_153641_output.json"
    Since this PR is processing these files, do not change these filenames. For a future evaluation, we will change the filenames to the appropriate locations.
  5. analysis_csv = "gs://<my-bucket-name>/analysis_outputs/added_columns/flags_20250620_153641.csv" is the target location of the output from the script prepare_evaluation_data_for_analysis.py. The exact location is arbitrary.
  6. This file has already been created by the process:
    merged_file = "gs://<my-bucket-name>/analysis_outputs/merged_files/merged_flags_20250620_153641.csv"
    This is the file that is created by full_output_df = preprocessor.merge_eval_data(llm_processed_df) in the notebook. I have already written it to the bucket.

Running run_json_processor.ipynb will perform a full end-to-end test.
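
For orientation, the core of the notebook's flow is roughly sketched below. Only JsonPreprocessor, merge_eval_data, and the merged-file output location are named in this PR; the process_json_files() name and the write step are hypothetical placeholders:

```python
# Rough outline of the notebook's end-to-end flow, for orientation only.
# merge_eval_data and the merged-file location come from this PR; the
# process_json_files() name and the to_csv write are hypothetical.
preprocessor = JsonPreprocessor(config)

llm_processed_df = preprocessor.process_json_files()             # hypothetical: flatten the raw JSON output
full_output_df = preprocessor.merge_eval_data(llm_processed_df)  # merge with the human-coded SIC data

# Writing to a gs:// path requires gcsfs; merged_file is the location quoted
# in step 6 above.
full_output_df.to_csv(merged_file, index=False)
```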

Unit Tests:

Run the new unit tests to verify the class logic in isolation:

pytest tests/test_coder_alignment.py -vv -s

…a_for_analysis.py to allow full end to end run
@marklondon999 marklondon999 marked this pull request as ready for review July 23, 2025 14:07
@marklondon999 marklondon999 requested a review from djyldyz July 24, 2025 10:14
@@ -0,0 +1,356 @@
# ---

Why does this need to be a notebook rather than a Python script? What are the advantages of having it as a notebook?

@@ -0,0 +1,356 @@
# ---
# jupyter:

There should be a docstring explaining the purpose of the script at the top.
