validator.py - the remake #132
Draft: harisont wants to merge 157 commits into master from remake (+9,916 −941)
…cies/tools into infrastructure
Utils module (with some but not all planned tests)
Use new crex module & more tests for utils module
This was referenced Sep 15, 2025
This (draft) PR is a substantial refactoring/upgrade of validate.py according to the goals listed in #113.
At the moment, this code is not ready to be merged. The point of opening a PR this early is to give @dan-zeman a chance to take a look at our code before more time is put into it.
Summary of changes
New configuration file
(an example is available in docs/example_config.yaml)
The file is divided into 8 parts: file, block, line, token_lines, comment_lines, cols, tree, node (this is still subject to changes).
Each part lists, in the right order, the checks that should be performed while reading and subsequently parsing the file.
Each entry has the following format:
This means that this check is performed at level n only if none of the test_ids mentioned in the depends_on list has failed for the same block/line/tree/...
In case the check cannot be performed because of previous failures, a Warning is added to the incidents.
This is not the case in the current validation script, but we think it is a rather uncontroversial improvement.
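As a rough illustration of the dependency rule above, here is a minimal sketch. The entry layout is an assumption: the key names test_id, level and depends_on, the check ids, and the plan_checks helper are all hypothetical and do not reproduce the PR's actual schema. The sketch also assumes that only failed checks block their dependents, so a skipped check does not block others.

```python
# Hypothetical config fragment mirrored as a Python dict. Key names and
# check ids are illustrative assumptions, not the PR's real schema.
CONFIG = {
    "line": [
        {"test_id": "check-a", "level": 1, "depends_on": []},
        {"test_id": "check-b", "level": 1, "depends_on": ["check-a"]},
        {"test_id": "check-c", "level": 2, "depends_on": ["check-a"]},
        {"test_id": "check-d", "level": 2, "depends_on": []},
    ],
}


def plan_checks(entries, failing):
    """Walk the entries in order for one unit (block, line, tree, ...).

    `failing` is the set of test ids that report an error when run.
    Returns (ran, skipped): ids that were executed, and ids that were
    skipped because one of their dependencies had already failed.
    """
    failed, ran, skipped = set(), [], []
    for entry in entries:
        if any(dep in failed for dep in entry["depends_on"]):
            skipped.append(entry["test_id"])  # would become a Warning
            continue
        ran.append(entry["test_id"])
        if entry["test_id"] in failing:
            failed.add(entry["test_id"])
    return ran, skipped


ran, skipped = plan_checks(CONFIG["line"], failing={"check-a"})
```

Here check-b and check-c are skipped because check-a failed, while check-d still runs.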
Modules
(there are also a few other Python files but they will not be part of the final PR)
cli.py
A command-line interface similar to validator.py's (not fully functional yet), with a few additional flags:
- --data-folder: folder where all the json data concerning specifics of UD are stored (i.e., currently data)
- --config-file: path of the .yaml file containing the configuration for a specific run of the validator
- --format: "LOG" (the current format) or "json" (agreed upon with @bguil). LOG is the default and will mimic the current behavior, but makes it more customizable, as everything is dumped to a log file and a level can be set for the console
- --dest: destination for the output (stdin/out or path to file)
- --explanations: flag to enable longer explanations in the output
- --lines-content: flag to display the portions of the files that triggered the errors in the output (requested by @bguil) (please suggest a better name for this!)
validate.py
This module contains:
- [...] (validate_xxx). The most important of these functions is validate_file() (see below)
- [...] (validate_xxx, now renamed to check_xxx), all returning a list of Incidents (an empty list means that the input passed the check)
The function validate_file relies on the run_checks function, which takes as parameters:
- [...] check_xxx function and the test IDs of the checks it depends on (from the configuration file)
- [...] check_xxx function
- [...] Incidents and State
It runs the appropriate library function by passing it the parameters and extends the list of incidents.
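To make the description of run_checks more concrete, here is a minimal sketch of how such a dispatcher could look. It is not the PR's implementation: the signature, the fields of Incident and State, the two toy checks and their test ids are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    testid: str   # hypothetical field names
    message: str


class Error(Incident):
    pass


class Warning(Incident):  # named as in the PR; shadows the builtin Warning here
    pass


@dataclass
class State:
    line_no: int = 0  # hypothetical field


def check_not_empty(line, state):
    """Toy check: an empty list means the input passed."""
    if line.strip():
        return []
    return [Error("empty-line", f"line {state.line_no} is empty")]


def check_has_tab(line, state):
    """Toy check that only makes sense on non-empty lines."""
    if "\t" in line:
        return []
    return [Error("no-tab", f"line {state.line_no} has no tab-separated columns")]


def run_checks(checks, params, incidents, state):
    """Run each check unless one of the test ids it depends on has failed.

    `checks` maps a check function to the list of test ids it depends on.
    """
    failed = set()  # failures for the current unit only
    for check, depends_on in checks.items():
        blocked = sorted(set(depends_on) & failed)
        if blocked:
            incidents.append(
                Warning(check.__name__, f"skipped, failed dependencies: {blocked}")
            )
            continue
        new = check(*params, state)
        failed.update(i.testid for i in new if isinstance(i, Error))
        incidents.extend(new)
    return incidents


CHECKS = {check_not_empty: [], check_has_tab: ["empty-line"]}
incidents = run_checks(CHECKS, ("",), [], State(line_no=3))
```

With an empty line as input, the first check yields an Error and the second is replaced by a Warning instead of being run.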
validate_file opens a single filepath, reads blocks of data (sentence candidates) from it and performs the checks.
These are defined to be either file-level, block-level, line-level, token-level, tree-level etc. (this is still subject to changes), and they are run in order (i.e., block-level checks are run before line-level ones).
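Reading blocks of sentence candidates could be sketched as a small generator. This is an assumption about the approach, not the PR's code: the read_blocks name is hypothetical, and the sketch simply treats any run of non-empty lines as one block, whereas the actual reader may handle whitespace-only lines or a missing final empty line differently.

```python
import io


def read_blocks(lines):
    """Yield (first_line_number, block) pairs from an iterable of lines.

    A block is a maximal run of non-empty lines; empty lines separate blocks.
    """
    block, start = [], None
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line:
            if not block:
                start = lineno  # remember where this block begins
            block.append(line)
        elif block:
            yield start, block
            block = []
    if block:  # last block without a trailing empty line
        yield start, block


sample = "# text = a\n1\ta\n\n# text = b\n1\tb\n2\tc\n"
blocks = list(read_blocks(io.StringIO(sample)))
```

Block-level checks would then run once per yielded block, before the line-level checks on its lines.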
Checks are not selected by level yet (based on the command-line argument), but we kept the information both in each Incident and in the configuration file.
incident.py
This module contains the "abstract" class Incident and its two subclasses Error and Warning, as well as the enum TestClass, which replaces testclass strings.
specifications.py
Contains a UDSpecs class that is meant to store all UD-specific information, such as the list of admitted upos or language-specific constraints.
In the future, we might consider adding an abstract class so that more format-specific classes with the same behavior could be integrated (e.g., SUDSpecs).
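The abstract-class idea mentioned above could look roughly like this. Everything in the sketch is hypothetical: the Specs base class, the is_valid_upos method and the constructor arguments are illustrative names, not part of the PR, and the tag list is only a small sample.

```python
from abc import ABC, abstractmethod


class Specs(ABC):
    """Hypothetical common interface for format-specific specifications."""

    @abstractmethod
    def is_valid_upos(self, tag: str) -> bool:
        """Return True if `tag` is an admitted part-of-speech tag."""


class UDSpecs(Specs):
    def __init__(self, upos):
        self.upos = frozenset(upos)

    def is_valid_upos(self, tag: str) -> bool:
        return tag in self.upos


class SUDSpecs(Specs):
    """Would carry SUD-specific data behind the same interface."""

    def __init__(self, upos):
        self.upos = frozenset(upos)

    def is_valid_upos(self, tag: str) -> bool:
        return tag in self.upos


specs = UDSpecs(["NOUN", "VERB", "ADJ"])  # illustrative subset of tags
```

Checks written against Specs would not need to know which concrete class they receive.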
loaders.py
Library of functions to load data, typically stored in json or yaml format. These are used to load all UD-specific information, which is stored in a UDSpecs object.
compiled_regex.py
Compiled regular expressions (mostly unchanged from the CompiledRegexes class).
output_utils.py
Helpers for output, including functions to generate extended explanations (originally part of the State class).
utils.py
Other helper functions.
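Coming back to the additional cli.py flags listed earlier in this section, a minimal argparse sketch of them might look as follows. Only the flag names, the two format values and the LOG default come from the description; the prog name, the other defaults and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the additional flags only (a sketch, not cli.py)."""
    parser = argparse.ArgumentParser(prog="validate")  # hypothetical prog name
    parser.add_argument("--data-folder", default="data",
                        help="folder with the json data on UD specifics")
    parser.add_argument("--config-file", default=None,
                        help="path of the .yaml configuration for this run")
    parser.add_argument("--format", choices=["LOG", "json"], default="LOG",
                        help="output format")
    parser.add_argument("--dest", default=None,
                        help="destination for the output")  # default is assumed
    parser.add_argument("--explanations", action="store_true",
                        help="enable longer explanations in the output")
    parser.add_argument("--lines-content", action="store_true",
                        help="show the file portions that triggered the errors")
    return parser


args = build_parser().parse_args(["--format", "json", "--lines-content"])
```

argparse exposes dashed flags with underscores, so the last flag is read as args.lines_content.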
Other stuff
pyproject.toml
File used to manage the Python package, listing dependencies, metadata, command-line entry points and more to come. Relies on hatchling.
logs/
Output folder for log files.
docs/
Currently used to store various notes; destination folder for documentation in the future.
tests/
Test scripts are meant to be run with the pytest command:
- test_cases.py, which executes the tests stored in the subfolder test-cases/ (which includes all pre-existing tests, plus a few new ones)
- test_regex.py: new tests for compiled regular expressions
- test_utils.py: utils functions
- test_validate.py: new tests for some atomic checks. These are meant to be replaced by new, ad-hoc test cases
At the moment, level 3+ tests are disabled and there are some expected failures in level 1-2 tests. This is because we have not yet reintroduced all checks.
TODO before this PR is merged
(feel free to add to this list and/or let us know if you feel up to lending us a hand!)
- [...] conllu_spec.yaml to data folder
- [...] (--lines-content and --explanations are not used yet)
- [...] validate.py with the missing check_xxxs
- [...] State a dataclass
Future work
(feel free to edit this list as well, but maybe not too much, unless you want to help with the implementation!)
- [...] validate_sentence entry point for validation of individual sentences (requested by @bguil)
- [...] testids and list which tests should fail as metadata
-- @ellepannitto & @harisont