Skip to content

Conversation

@m4rch3n1ng
Copy link
Contributor

add a --attribute-filter cli flag to icu4x-datagen, that implements a regex search like suggested in this review comment chain.

as i have implemented something extremely similarly sometime in the past i have some thoughts:

  • i have implemented a way to negate a regex match by prefixing it with a - as suggested in this review, but i wanted to share an alternative: in my previous implementation i used regex "flags" similarly to ecmascript regexes, so i can do a case-insensitive match like /regex/i and an inverted like /regex/v (-v flag from grep) or i can combine them.
  • i have implemented an anchored (i.e. whole-string) match as suggested in this review, by just wrapping the regex in ^(?:)$. this does however have one side effect: the error messages are worse, as the error message now point to the wrapped regex and )x( is now allowed as an input. this is fixable by parsing the regex twice, which i can do if you want me to. however, this is also an issue in ripgrep, so it is probably fine for a cli.

resolves #6851
supersedes/closes #7166

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@m4rch3n1ng m4rch3n1ng force-pushed the attribute-filter branch 2 times, most recently from fce061b to 49aab1b Compare November 12, 2025 17:00
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

("vi", "", "Xin chào thế giới"),
("zh", "", "你好世界"),
// tests for the attribute filter tests
("de", "de", "Hallo Welt"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional Suggestion: Make new strings for the data marker attributes instead of copying the locale strings. There is already one, "reverse". How about:

  • rotate1 (rotate the string 1 character, like "dHello Worl")
  • rotate2 (rotate the string 2 characters, like "ldHello Wor")
  • rotate3 (rotate the string 3 characters, like "rldHello Wo")
  • uppercase (like "HELLO WORLD")
  • lowercase (like "hello world")

I think that should be enough to make an interesting test case. Interesting regexes could be:

  1. /^r.*$/
  2. /^uppercase$/
  3. -/^.*3$/

which should resolve to: "reverse", "rotate1", "rotate2", and "uppercase", but exclude "rotate3" and "lowercase"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok then maybe my implementation is incorrect: when specifying multiple filter for the same domain, then all of them have to match to keep the element, doing /r.*/ and /uppercase/ means to let through all that start with an "r" and are equal to the string "uppercase", so nothing would be generated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion, i am always pretty bad at coming up with examples for tests, but yours are pretty good! for the filters i have chosen to do /r.*?|.*?case/, -/lowercase/ and -/.*3/, which results in the same things as yours.

@m4rch3n1ng m4rch3n1ng force-pushed the attribute-filter branch 2 times, most recently from cec660d to 7dc2d3c Compare November 13, 2025 14:26
@robertbastian
Copy link
Member

I pulled the changes to HelloWorld out into a different PR, I'd prefer merging that separately because it affects a lot of generated files.

@m4rch3n1ng
Copy link
Contributor Author

ok i am a little confused about what to do with the tests: do you want me to only test for the argument parsing in the icu4x-datagen crate or do you want me to also test the with_marker_attributes_filter function? if so, where should i put these tests?

@robertbastian
Copy link
Member

I'll push something

robertbastian added a commit that referenced this pull request Nov 13, 2025
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @m4rch3n1ng for the contribution and @robertbastian for the test improvement

@Manishearth Manishearth removed their request for review November 13, 2025 20:16
@Manishearth
Copy link
Member

Broadly in favor, removing myself from the blocking review path.

@sffc sffc merged commit 18a0068 into unicode-org:main Nov 15, 2025
30 checks passed
@m4rch3n1ng m4rch3n1ng deleted the attribute-filter branch November 15, 2025 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add attributes filtering to CLI

4 participants