validator.py - the remake #132

harisont · 2025-09-05T21:29:32Z

This (draft) PR is a substantial refactoring/upgrade of validate.py according to the goals listed in #113.

At the moment, this code is not ready to be merged. The point of opening a PR this early is to give @dan-zeman a chance to take a look at our code before more time is put into it.

Summary of changes

New configuration file

(an example is available in docs/example_config.yaml)

The file is divided into 8 parts: file, block, line, token_lines, comment_lines, cols, tree, node (this is still subject to changes).

Each part lists the checks to be performed while reading and subsequently parsing the file, in the order in which they should run.
Each entry has the following format:

    check_name:            # function name
      level: n             # integer from 0 to 5
      depends_on:
        - test_id_1        # string that uniquely identifies test
        - test_id_2
        - ...
        - test_id_n

This means that this check is performed at level n only if none of the test_ids mentioned in the depends_on list has failed for the same block/line/tree/...

In case the check cannot be performed because of previous failures, a Warning is added to the incidents.
This is not the case in the current validation script, but we think it is a rather uncontroversial improvement.
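As a rough illustration of the format described above, the sketch below turns an already-parsed configuration (a plain dict, as a YAML loader would return it) into ordered lists of checks and decides whether a check may run. `CheckSpec`, `parse_config`, `is_runnable` and the test ID `invalid-line` are hypothetical names used only for this example; they are not taken from the PR.

```python
from dataclasses import dataclass, field

# The eight parts of the configuration file, in the order listed above
# (the PR notes that this is still subject to change).
PARTS = ["file", "block", "line", "token_lines",
         "comment_lines", "cols", "tree", "node"]


@dataclass
class CheckSpec:
    """One configuration entry (hypothetical helper, not from the PR)."""
    name: str                                   # name of the check function
    level: int                                  # integer from 0 to 5
    depends_on: list[str] = field(default_factory=list)


def parse_config(raw: dict) -> dict[str, list[CheckSpec]]:
    """Convert the parsed YAML mapping into ordered CheckSpec lists per part."""
    parsed = {}
    for part in PARTS:
        entries = raw.get(part) or {}
        specs = []
        for name, opts in entries.items():      # dicts preserve file order
            opts = opts or {}
            specs.append(CheckSpec(name,
                                   int(opts.get("level", 0)),
                                   list(opts.get("depends_on") or [])))
        parsed[part] = specs
    return parsed


def is_runnable(spec: CheckSpec, failed_ids: set[str]) -> bool:
    """A check runs only if none of the tests it depends on has failed."""
    return not any(test_id in failed_ids for test_id in spec.depends_on)


# Hypothetical configuration, as it might look after YAML parsing.
raw_config = {
    "block": {"check_invalid_lines": {"level": 1}},
    "line": {"check_columns_format": {"level": 1,
                                      "depends_on": ["invalid-line"]}},
}
config = parse_config(raw_config)
spec = config["line"][0]
```

With this layout, `is_runnable(spec, {"invalid-line"})` is false, which is the situation in which the PR adds a Warning instead of running the check.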

Modules

(there are also a few other Python files but they will not be part of the final PR)

`cli.py`

A command-line interface similar to validator.py's (not fully functional yet), with a few additional flags:

--data-folder > folder where all the json data concerning specifics of UD are stored (i.e., currently data)
--config-file > path of .yaml file containing the configuration for a specific run of the validator
--format > "LOG" (the current format) or "json" (agreed upon with @bguil). LOG is the default and mimics the current behavior, but is more customizable: everything is written to a log file, and a separate level can be set for the console
--dest > destination for the output (stdout or path to file)
--explanations > flag to enable longer explanations in the output
--lines-content > flag to display the portions of the files that triggered the errors in the output (requested by @bguil) (please suggest a better name for this!)
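A minimal `argparse` sketch of the additional flags listed above. Only the flag names and the two `--format` values come from this description; the program name, the defaults for `--data-folder` and `--dest`, and the help texts are assumptions, and the arguments inherited from the existing command-line interface are left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "validator" as program name is an assumption.
    parser = argparse.ArgumentParser(prog="validator")
    parser.add_argument("--data-folder", default="data",
                        help="folder with the JSON data on UD specifics")
    parser.add_argument("--config-file",
                        help="path of the .yaml configuration for this run")
    parser.add_argument("--format", choices=["LOG", "json"], default="LOG",
                        help="output format (LOG mimics the current behavior)")
    # Using "-" to mean standard output is an assumption of this sketch.
    parser.add_argument("--dest", default="-",
                        help="destination for the output")
    parser.add_argument("--explanations", action="store_true",
                        help="include longer explanations in the output")
    parser.add_argument("--lines-content", action="store_true",
                        help="show the file portions that triggered errors")
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--format", "json", "--explanations"])
```

Note that `argparse` exposes dashed flags with underscores, so the last flag is read as `args.lines_content`.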

`validate.py`

This module contains:

the entry point(s) for the validator (named validate_xxx). The most important of these functions is validate_file() (see below)
a library of more or less atomic checks (originally called validate_xxx, now renamed to check_xxx), all returning a list of Incidents (an empty list means that the input passed the check)

The function validate_file relies on the run_checks function, which takes as parameters:

the name of a specific check_xxx function and the test IDs of the checks it depends on (from the configuration file)
the required parameters for the check_xxx function
the current list of Incidents, and
the validation State

It runs the appropriate library function by passing it the parameters and extends the list of incidents.
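The behavior described above could look roughly like the following sketch. It is not the PR's implementation: the simplified `Incident` and `State` classes, the two toy checks, their test IDs, and the `CHECKS` lookup table are all hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:                       # simplified stand-in for incident.py
    testid: str
    message: str
    is_error: bool = True


@dataclass
class State:                          # simplified stand-in for the real State
    failed: set = field(default_factory=set)   # test IDs failed so far


def check_not_empty(line: str) -> list:
    """Toy check (hypothetical): an empty list means the input passed."""
    return [] if line.strip() else [Incident("empty-line", "Line is empty")]


def check_ten_columns(line: str) -> list:
    """Toy check (hypothetical): a token line must have 10 columns."""
    if len(line.split("\t")) == 10:
        return []
    return [Incident("number-of-columns", "Expected 10 columns")]


CHECKS = {f.__name__: f for f in (check_not_empty, check_ten_columns)}


def run_checks(check_name, depends_on, params, incidents, state):
    """Run one check unless a test it depends on has already failed."""
    blocking = [t for t in depends_on if t in state.failed]
    if blocking:
        # The check cannot be performed: record a warning instead.
        incidents.append(Incident(
            f"skipped-{check_name}",
            f"Not run because of earlier failures: {', '.join(blocking)}",
            is_error=False))
        return
    new_incidents = CHECKS[check_name](**params)
    state.failed.update(i.testid for i in new_incidents if i.is_error)
    incidents.extend(new_incidents)


incidents, state = [], State()
run_checks("check_not_empty", [], {"line": ""}, incidents, state)
run_checks("check_ten_columns", ["empty-line"], {"line": ""}, incidents, state)
```

After these two calls, `incidents` holds one error (`empty-line`) and one warning saying that `check_ten_columns` was skipped.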

validate_file opens a single filepath, reads blocks of data (sentence candidates) from it and performs the checks.
These are defined to be file-level, block-level, line-level, token-level, tree-level, etc. (this is still subject to changes), and they are run in order (i.e., block-level checks are run before line-level ones).
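To make the notion of a block (sentence candidate) concrete, here is a small sketch of a reader that splits a file into blocks separated by empty lines. `read_blocks` is a hypothetical helper, not the function used in the PR, and it silently tolerates situations (repeated empty lines, a missing final empty line) that the real checks are expected to report.

```python
import io
from typing import Iterator, TextIO


def read_blocks(handle: TextIO) -> Iterator[tuple[int, list[str]]]:
    """Yield (number of first line, lines) for each block of non-empty lines."""
    block: list[str] = []
    start = 0
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if line == "":
            if block:                 # an empty line closes the current block
                yield start, block
                block = []
        else:
            if not block:             # remember where the block begins
                start = lineno
            block.append(line)
    if block:                         # last block without a closing empty line
        yield start, block


sample = "# sent_id = 1\n1\tHello\n\n# sent_id = 2\n1\tWorld\n"
blocks = list(read_blocks(io.StringIO(sample)))
```

Block-level checks would then receive each list of lines as a whole, and line-level checks would iterate over the lines inside it.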

Checks are not selected by level yet (based on the command-line argument), but we kept the information both in each Incident and in the configuration file.

`incident.py`

This module contains the "abstract" class Incident and its two subclasses Error and Warning, as well as the enum TestClass, which replaces testclass strings.
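One possible shape for such a class hierarchy is sketched below. The class names come from the description above; the fields, the enum members, and the string format are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class TestClass(Enum):
    # Illustrative members only; the actual enum may differ.
    FORMAT = "Format"
    METADATA = "Metadata"
    SYNTAX = "Syntax"
    MORPHO = "Morpho"


@dataclass
class Incident:
    testid: str
    testclass: TestClass
    level: int
    message: str
    lineno: int = 0

    def __post_init__(self):
        # "Abstract" in spirit: only the subclasses should be instantiated.
        if type(self) is Incident:
            raise TypeError("use Error or Warning instead of Incident")

    def __str__(self):
        # The output format is an assumption of this sketch.
        return (f"[Line {self.lineno}]: [L{self.level} "
                f"{self.testclass.value} {self.testid}] {self.message}")


class Error(Incident):
    pass


class Warning(Incident):   # note: shadows the built-in Warning in this module
    pass


error = Error("invalid-line", TestClass.FORMAT, 1, "Spurious line", lineno=3)
```

Using an enum instead of free-form strings means that a mistyped test class fails at import time rather than producing a silently different label in the output.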

`specifications.py`

Contains a UDSpecs class that is meant to store all UD-specific information, such as the list of admitted UPOS tags or language-specific constraints.

In the future, we might consider adding an abstract class so that more format-specific classes with the same behavior could be integrated (e.g., SUDSpecs).
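A minimal sketch of what such an abstract class could look like, using Python's `abc` module. Apart from the class names UDSpecs and SUDSpecs, everything here (the base class name `Specs`, the `is_valid_upos` method, the shape of the data) is an assumption and not part of the PR.

```python
from abc import ABC, abstractmethod


class Specs(ABC):
    """Hypothetical common interface for format-specific specifications."""

    def __init__(self, data: dict):
        self.data = data              # e.g. the content loaded by loaders.py

    @abstractmethod
    def is_valid_upos(self, tag: str) -> bool:
        """Return True if the tag is admitted by this specification."""


class UDSpecs(Specs):
    def is_valid_upos(self, tag: str) -> bool:
        return tag in self.data.get("upos", [])


class SUDSpecs(Specs):
    # Placeholder: a real SUD class would encode its own constraints.
    def is_valid_upos(self, tag: str) -> bool:
        return tag in self.data.get("upos", [])


ud = UDSpecs({"upos": ["NOUN", "VERB"]})
```

Checks written against the `Specs` interface would then work unchanged with either class.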

`loaders.py`

Library of functions to load data, typically stored in JSON or YAML format. These are used to load all UD-specific information, which is stored in a UDSpecs object.

`compiled_regex.py`

Compiled regular expressions (mostly unchanged from the CompiledRegexes class).

`output_utils.py`

Helpers for output, including functions to generate extended explanations (originally part of the State class).

`utils.py`

Other helper functions.

Other stuff

`pyproject.toml`

File used to manage the Python package, listing dependencies, metadata, command-line entry points and more to come. Relies on hatchling.

`logs/`

Output folder for log files.

`docs/`

Currently used to store various notes; in the future, it will be the destination folder for documentation.

`tests/`

Test scripts are meant to be run with the pytest command:

test_cases.py: executes the tests stored in the subfolder test-cases/ (which includes all pre-existing tests, plus a few new ones)
test_regex.py: new tests for compiled regular expressions
test_utils.py: tests for the utils functions
test_validate.py: new tests for some atomic checks. These are meant to be replaced by new, ad-hoc test cases

At the moment, level 3+ tests are disabled and there are some expected failures in level 1-2 tests. This is because we have not yet reintroduced all checks.

TODO before this PR is merged

(feel free to add to this list and/or let us know if you feel up to lending us a hand!)

move conllu_spec.yaml to data folder
restore all currently disabled CLI functionalities
fully support new options (at least --lines-content and --explanations are not used yet)
complete validate.py with the missing check_xxx functions
finalize the default config file (so that the new validator runs the same checks as the old one)
make State a dataclass
configure logging so that it is still possible to display errors as they happen, especially in large treebanks (requested by @LeonieWeissweiler)

Future work

(feel free to edit this list as well, but maybe not too much, unless you want to help with the implementation!)

produce PyPI package
implement an additional validate_sentence entry point for validation of individual sentences (requested by @bguil)
add missing test cases
reorganize the test cases so that they match testids and list which tests should fail as metadata
set up CI (both to run the tests and potentially to publish a new version of the package on PyPI for each new tag)
keep track of multiple error indices for errors involving several nodes (suggested by @AngledLuffa; see Multiple node_ids in a single error? #137 for details)

-- @ellepannitto & @harisont


ellepannitto and others added 30 commits September 1, 2025 10:31

first working validate.py

72308ae

add folder to keep logs

f1cbee3

ignoring actual logs

963f022

add logging utils

8f2b692

add dotenv to dependencies

78b8e55

add regex and unicodedata as dependencies

f6ad3f2

add basic logger

4aaafae

add args pretty print

c6034fc

move test cases into new tests folder

e67205e

fix import

e6997da

pytest infrastructure

d289ad5

Merge branch 'infrastructure' into tests

30aaf44

Merge pull request #120 from UniversalDependencies/tests

e1853d7

Tests

minor changes

4084968

Merge branch 'infrastructure' of https://github.com/UniversalDependen…

96b75e6

…cies/tools into infrastructure

semi-auto-generated docs

a21c0c6

add files for modularization

06f4255

started refactoring regex

81f7c1a

utils module

3e15718

micro whitespace changes

840fa2d

move compiled regex to dedicated module

d493fc3

WIP tests for utils

4455b3d

minor changes

f99265c

Merge pull request #121 from UniversalDependencies/utils

47e1de9

Utils module (with some but not all planned tests)

Merge branch 'infrastructure' into regex

17ddf0f

Merge pull request #122 from UniversalDependencies/regex

3f63f51

Regex

use the new crex module

133020a

minor comments

e51507f

Merge pull request #123 from UniversalDependencies/utils

9a4f7d0

Use new crex module & more tests for utils module

rm outdated notes file

79d217f

harisont and others added 22 commits September 7, 2025 15:33

refactor validate->check_features_level4

00d76b5

refactor lv 5 checks

a997c57

WIP validate_annotation and the myriad of functions it calls (top-dow…

c27c0bd

…n, depth-first)

add table with description of checks

8111834

minor fixes

acdb792

update table with description of checks

d845b78

minor fixes

9a1b2ac

add minimal test case

0a45061

add minimal test case

5f8baa9

update table with description of checks

ba2f89a

better error representation

22aee51

fix check_invalid_lines and check_columns_format

57fe163

fix validate behaviour

770e9f6

add minimal testing scenarios

700ec8e

update table with description of checks

52ab7bc

done refactoring misplaced-comment

aede17e

better handling of line number

df789cd

tests for pseudo-empty-line and extra-empty-line

6a344aa

add tests for 'unicode-normalization', 'mwt-empty-vals' and 'empty-no…

e3ab7b3

…de-empty-vals'

finish level 1 tests

ef835c2

test for 'check_sent_id' and support for kwargs

056b225

minor changes

1f89f4b

This was referenced Sep 15, 2025

Multiple node_ids in a single error? #137

Open

Save the errors in the state, not just the error counts. Makes it ea… #138

Merged

ellepannitto added 3 commits September 16, 2025 15:24

'check_parallel_id' and 'check_test_meta' + add dataclass for state

7117c9a

minor fix

c119cc7

pull master validator for testing purposes

39c61d5

harisont mentioned this pull request Nov 21, 2025

pip installable version of validator and scorer? #147

Open

ellepannitto mentioned this pull request Nov 24, 2025

Nonstandard DEPRELs for nonstandard syntax UniversalDependencies/docs#1178

Open

dan-zeman added a commit that referenced this pull request Dec 1, 2025

Enumerate test classes and incident types (as in #132).

2e32e83
