Skip to content

Conversation

@gleb-lobov
Copy link

@gleb-lobov gleb-lobov commented Dec 4, 2025

Would clean up this enrich PR

@gleb-lobov gleb-lobov changed the base branch from master to develop December 4, 2025 09:04
@istreeter
Copy link
Contributor

@gleb-lobov I don't quite understand why these are the two methods you want to make public.

Comparing with the Enrich PR snowplow/enrich-private#151, it seems to me the functionality you need to import/replace is this part where you convert from circe Json into networknet JsonSchema. So the corresponding function in iglu-scala-client is provideNewJsonSchema, this function.

@gleb-lobov
Copy link
Author

gleb-lobov commented Dec 5, 2025

@istreeter I was following the public method CirceValidator#validate, and it never calls provideNewJsonSchema. provideNewJsonSchema is part of the private object, and is called from the CirceValidator.WithCaching#validate, that is used internally to cache things. Since we are doing our own caching, and exposing a private object is awkward, I've just made public the method used in the main object.

I can move the provideNewJsonSchema to the main object and make is public, if you see it fit. Seems that it does not rely on caching and logically can belong to the main object

@istreeter
Copy link
Contributor

I was following the public method CirceValidator#validate

OK I think I get what you're saying.

In your Enrich PR snowplow/enrich-private#151 there are two different places where you have written code which could be replaced by iglu-scala-client:

1. Validating an entity against the schema

This function here where you validate the entity against the schema. I agree that CirceValidator#validateOnReadySchema gives you the functionality you need to replace the implementation in Enrich.

But... I don't think that implementation is a problem in Enrich. I mean, it's just two lines of code. I don't think you benefit much from importing the implementation from iglu-scala-client

    (for {
      jacksonJson <- circeToJackson(data, maxJsonDepth).toOption
    } yield schema.validate(jacksonJson).isEmpty).getOrElse(false)

2. Compiling the json schema

This function here where you compile the json schema. In this case the Enrich implementation adds a lot of overhead. e.g. you were forced to re-define SchemaValidatorsConfig and IgluMetaschemaFactory and fakeUrlFetcher etc.

So this is the functionality which I think we should look to import from iglu-scala-client. And this is why I think you need provideNewJsonSchema to be public.

@gleb-lobov
Copy link
Author

  1. Validating an entity against the schema

Yeah, I can skip it, but it would be nice to not even call circeToJackson. But I can skip making validateOnReadySchema public if you don't think it makes sense.

  1. Compiling the json schema

You are absolutely right that this is more problematic, and I have to bring a lot. My idea was that if the evaluateSchema is public and validateOnReadySchema is public, the implementation for my compileJsonSchema would be just

  def compileJsonSchema(data: Json, schema: Json): Either[ValidatorError, Unit] =
    for {
      jacksonJson <- circeToJackson(schema, Int.MaxValue).leftMap(_.toInvalidSchema)
      schema      <- evaluateSchema(jacksonJson)
      _           <- validateOnReadySchema(schema, data, Int.MaxValue)
    } yield ()

that is just a copy of a CirceValidator#validate

I think the confusion is because Im looking at CirceValidator#validate and you are looking at CirceValidator.WithCaching#validate

Form the looks of it, I don't see any difference, since we'll not use caching. Let me know if I should make provideNewJsonSchema public instead

@istreeter
Copy link
Contributor

But for compiling the jsonschema you cannot have this signature

def compileJsonSchema(data: Json, schema: Json): Either[ValidatorError, Unit]

...because you don't have the data yet. The data comes later when you start enriching events. Whereas you compile the jsonschema during application start up.

@gleb-lobov
Copy link
Author

gleb-lobov commented Dec 5, 2025

Yeah, you're right, sorry, blind copy-paste

This is what I've meant(ignore the left channel, just converting everything to string so it compiles)

  def compileJsonSchema(circe: Json): Either[String, JsonSchema] =
    for {
      jacksonJson <- circeToJackson(circe, Int.MaxValue).leftMap(_.message)
      schema      <- evaluateSchema(jacksonJson).leftMap(_.toString)
    } yield schema

@istreeter
Copy link
Contributor

istreeter commented Dec 5, 2025

But I don't think that works either, because in your example schema is a jackson JsonNode, whereas you need to compile to a networknt JsonSchema.

Sorry ignore that comment I wrote the wrong thing.

@istreeter
Copy link
Contributor

OK I understand what you're saying now. So evaluateSchema and provideNewSchema are very similar functions. They both have the same return type. The former accepts a jackson json, whereas the latter accepts a circe json. From Enrich point of view I think we could make either one public. But from point of view of designing the public API of iglu-scala-client, I think it makes slightly more sense to expose a function that accepts a circe json.

And from point of view of designing the public API... it should probably live in the top-level object (like you said) and have a name like compileJsonSchema.

@gleb-lobov
Copy link
Author

@istreeter yeah, I see now what you mean. I've moved the provideNewSchema to main object, made it public, and renamed to compileJsonSchema. I've also moved the validateSchema to the main object, since it has nothing to do with caching. I've also changed the implementation of it to re-use the validateSchemaAgainstV4.

The WithCaching now looks clean to me, as it only deals with the caching logic. Let me know if the public API also looks fine from your side

@istreeter
Copy link
Contributor

Thanks @gleb-lobov I think the compileJsonSchema looks perfect now.

Regarding validateOnReadySchema I still lean towards not making this one public. My reason is, validateOnReadySchema validates the data AND then it calls fromValidationMessage to convert from ValidationMessage to ValidatorReport. This seems wasteful because:

  1. In your Enrich PR you don't need the ValidatorReport, you only need to know if the result is valid or invalid. i.e. you just call .isEmpty on this line.
  2. fromValidationMessage does some string manipulation, so it is not completely cheap
  3. For event spec inference I guess we expect lots of data to be invalid. So we will hit this fromValidationMessage quite often. Because the whole point of inference is we don't know in advance if some data is expected to validate.

I think your function isValidAgainstSchema in your Enrich PR is fine already. It is concise and easy to read. I don't feel like you need to shift the implementation over to iglu-scala-client.

@gleb-lobov
Copy link
Author

@istreeter good note about the error string manipulation, I didn't think about this.

I've made the validateOnReadySchema private, you are correct that the compileJsonSchema is enough to make enrich PR look decent, and we also save up some performance by doing the isValidAgainstSchema the way it is now

@gleb-lobov gleb-lobov changed the title Open up more methods for public usage Create new public method CirceValidator#compileJsonSchema Dec 8, 2025
Copy link
Contributor

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good now. I will merge this into develop for you, and I will tag a pre-release version which you can import into Enrich. We can turn it into a proper release after everything is tested in Enrich.

@istreeter istreeter merged commit 54a30ff into develop Dec 8, 2025
5 checks passed
@istreeter istreeter deleted the publish-circe-methods branch December 8, 2025 12:59
@istreeter
Copy link
Contributor

@gleb-lobov 4.1.0-M2 is now published to Sonatype, so you can import it into Enrich.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants