Use regex patterns to build structured json from text files
Parse creates two artifacts:
- a normal jar for java development
- an uber jar to use with `java -jar` without specifying a classpath.
- `-s, --source` specifies a single source folder or archive (zip, gzip, etc.) to parse
- `-d, --destination` used with `--source` to specify where to save the resulting json
- `-b, --batch` a `,` separated list of sources to parse (separately). This will use a default output file pattern based on the source name (e.g. `./test/` or `./test.zip` will be saved as `./test.json`)
- `-t, --threads` set the maximum number of threads to use for parsing sources. Each source will only use 1 thread (because of parser state) but with `--batch` this can limit concurrent heap demand. Default = # of cores
- `-r, --rules` a `,` separated list of rule files (`yaml` or `json`) to use instead of the default parse rules
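For example, a hypothetical invocation (the uber jar name is illustrative, not the actual artifact name):

```
# parse one source with custom rules
java -jar parse-uber.jar --source ./test.zip --destination ./test.json --rules ./myRules.yaml
# parse two sources concurrently while capping the thread count
java -jar parse-uber.jar --batch ./run1.zip,./run2.zip --threads 2
```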
The library was created as a java utility but also supports yaml or json rule definitions with javascript taking the place of java code.
There is a collection of default rules in `resources/defaultRules.yaml` which are loaded unless `--disableDefault` is used.
There is also `resources/filerule-schema.json` to use with the vscode Red Hat yaml plugin to validate any rule files.
Each rule works in 3 phases:
- find matching files in the source
- scan each file to create json
- merge the json into the result json for the source
```yaml
path: java regex with named capture groups e.g. /(?<name>[^/]+)/
headerLines: number of lines to scan for findHeader and avoidHeader
findHeader: java regex (or list of regex). Will match if at least one regex matches
avoidHeader: java regex (or list of regex). Will NOT match if any match (overrides findHeader)
nest: json.path.${{name}}
```

Parse creates a single json object from all of the rules that match a source.
Rules specify the key path to where the rule's output should be located. The above example
will create `{json: {path: {foo: ...}}}` if the path matched `/foo/` from the source.
Using `${{name}}` will merge the value from the rule with the value from any previous match.
If a previous rule created `{json: {path: {foo: {a: 100}}}}` and the current rule has `{b: 200}`
then using `${{...}}` will result in `{json: {path: {foo: {a: 100, b: 200}}}}`.
Creating the json technically happens before it is merged, but it is by far the more complicated step so we save it for last.
The `nest:` key can also depend on `path:` so it is normally shortly after `path:` in the rule definition.
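Putting those keys together, a minimal sketch of a rule (the folder layout, header text, and key names are hypothetical):

```yaml
name: hostMetrics
path: /(?<host>[^/]+)/metrics\.log # capture the folder name as host
headerLines: 5                     # scan the first 5 lines for findHeader
findHeader: "^#metrics"            # only parse files with this header
nest: metrics.${{host}}            # merge output under {metrics: {<host>: ...}}
# plus one of the content options described below, e.g. asText
```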
The first step in creating the json is to specify how the rule should handle the file content. We currently have the following options:
- `asText` process the text file line by line. This is the most common for logs and custom formats
- `asXml` read the file as an xml document then convert it to json
- `asJson` read the entire document as json
- `asPath` use the fully qualified file path as an argument to a javascript method `(path)=>{...}`
- `asContent` read the entire content of the file into a single string. Use this sparingly, e.g. for reading `java -version > /tmp/java-version.log`
- `asJbossCli` read the entire document as the output of a jboss-cli command. Jboss-cli can output as json so this will be deprecated
`asContent` reads the entire file into a string and adds it to the rule json under the key:

```yaml
asContent: key
```

`asPath` expects a javascript function that accepts the fully qualified path to the file as the only argument.
The example below could also be created with exp but it illustrates creating 'java' objects in 'javascript':
```yaml
asPath: |
  (path)=>{
    //do something in javascript with access to StringUtil, FileUtility, Exp, ExpMerge, Xml, Json
    const rtrn = new Json(); //const because we are in javascript
    FileUtility.stream(path).forEach(line => { //javascript uses => not ->
      if(line.trim() === ""){
        //ignore empty lines
      } else if (line.match( /^\d+$/ )) { // lines from date +%s >> log
        rtrn.add(new Json()); //start a new entry
        rtrn.getJson(rtrn.size() - 1).set("timestamp", parseInt(line)); //javascript parseInt()
      } else {
        rtrn.getJson(rtrn.size() - 1).set("data", Json.fromString(line)); //logged json output from curl to 1 line
      }
    }); //note the ); closing the forEach call
    return rtrn; //this is the output of the rule for the merge step
  }
```

`asXml` and `asJson` read the entire document into xml or json. If the value of the key is an empty string
then the entire document is merged according to `nest:` as json.
For xml that means attributes become `@`-prefixed keys and the node value will be `text()`.
```yaml
asXml: '' # empty string means merge the entire document
```

The other option is to filter the document to create a new json output for the rule.
Filters have the following options:

```yaml
name: just a useful name for logging / debugging
path: either jsonpath or xpath to the value or sub-document
nest: where to place the resulting json
# choose between children, regex, or exp
children: an array of filters to apply to each match from path before using the result
regex: java regex applied to the string version of the current output
exp: an array of exp to use on the string version of the current output
# finally
result: either a javascript function or json using ${{key.path}} to substitute values from the current output
```

`path` can be jsonpath for both asXml and asJson but xpath will only work with asXml;
jsonpath must start with `$.` and xpath must start with `/`.

`nest` specifies the key path to where the result should be merged.
This nest does not support `${{name}}` substitution.

`children` are subsequent filters that will run on each match from `path`.
This is useful to extract values from one level in the document then use children
to extract values further down the document structure.

`exp` see asText for how to create exp.
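As a sketch, an asJson rule that filters a document such as `{"tests": [{"name": ..., "time": ...}]}` (the document shape is hypothetical, and the filter-list shape is assumed from the options above):

```yaml
asJson:
- name: testTimes
  path: $.tests # jsonpath, so it starts with $.
  nest: times   # place the filter output under {times: ...}
  children:     # run on each match from path
  - name: time
    path: $.time
    nest: time
```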
Most rules scan each line of a file looking for patterns to create structured json.
asText can be used as either the name of an existing file parser factory or a list of parse expressions (exp)
The existing file parser factories:
- `csvFactory` each row will become `{header1: value, header2: ...}`
- `dstatFactory` identify the groups and stats then each row is `{group.stat: value, ...}`
- `jep271Factory` gc logs from jdk11+
- `jmapHistoFactory` parses `jmap -histo <pid>` output
- `jstackFactory` parses `jstack <pid>` output
- `printGcFactory` gc logs from jdk < 11
- `serverLogFactory` parses the default `server.log` format
- `substrateGcFactory` parses the gc logs from substrateVm (quarkus)
- `wrkFactory` parses wrk output files (e.g. `wrk http://localhost:8080/api`)
- `xanFactory` parses faban xan files (e.g. `xan.details`)
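For example, a sketch of a rule that hands matching files to a factory instead of listing exp (the rule name and file name are hypothetical):

```yaml
name: wrkOutput
path: wrk\.log     # hypothetical file name
nest: wrk
asText: wrkFactory # the factory name instead of a list of exp
```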
If you need to parse a different format line by line then you need to create a list of exp
Exp are the key to how parse creates structured json from text files.
Each exp specifies a java regex pattern to match against part of the input line
then what to do when the pattern matches. We start with the initial settings:
```yaml
name: exp name used for debug / logging #optional but helpful
pattern: java regex pattern with optional capture groups
```

exp need to be told where to try and match the input line.
The top level exp will start with the first character of the line. By default their
children start from where the parent match finished but they can also match before the parent or the entire line:
```yaml
range: EntireLine | BeforeParent | AfterParent # default is AfterParent
```
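For example, a sketch of an exp whose child continues from where the parent match finished (the log format is hypothetical):

```yaml
# hypothetical: match lines like "12:00:05 INFO server started"
- name: logLine
  pattern: "(?<time>\\d{2}:\\d{2}:\\d{2}) (?<level>[A-Z]+)"
  children:
  - name: message # AfterParent by default, so this starts after the level
    pattern: "\\s?(?<message>.*)"
```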
exp have a set of rules that define common actions to take when an exp matches the line.
The first group of rules are those that are invoked before the exp tries to add anything to the result json.
These rules follow the PreX naming convention.

- `PreClose` close (finish) the current json object and start a new json object
- `PrePopTarget` remove the current target or remove targets until they match the name for the rule
```yaml
rules:
- PrePopTarget #remove the current target (unless it is the root of the json)
- PrePopTarget: namedTarget # remove targets until target or until the root of the json
```

- `PreClearTarget` reset the target json to the root of the current json
- `TargetRoot` temporarily target the root for this exp (and children) without changing the targets
The next phase is to determine where to merge the values from the pattern matching.
The default location is the current target json but that can be changed with `nest`:

```yaml
nest: json.path.${{name}}.$[[otherName]]
```

nest for an exp is similar to nest from the rule and filter with the addition of the `$[[...]]` syntax.
The usual `${{name}}` syntax will merge keys if two objects have the same name
but `$[[name]]` will treat each object as a separate entry in an array.
For example: if `{a: 100}` and `{b: 200}` both used `nest: ${{name}}`
then the result would be `{"name": {"a": 100, "b": 200}}`
but if they both used `nest: $[[name]]` then the result would be
`{"name": [{"a": 100}, {"b": 200}]}`.

At this point any named capture groups from the pattern are merged into the current json target
along with any `with:` name and value pairs. The name from the capture group is the key and the value is automatically converted based on the string value:
- integer numbers `-?\\d{1,16}` are converted to `Long`
- decimal numbers are converted to `Double`
- memory size patterns `(\\d+\\.?\\d*[bBkKmMgGtT])` are converted into a `Long` number of bytes
- json like patterns `{...}` or `[...]` are converted into `json`

You can override the default by setting a type in the `pattern` or by setting the field `type` under `fields`:
```yaml
name: exampleExp
pattern: /(?<name:type>[^/]+)/(?<otherName>[^/]+)/
with:
  anotherName: 10 #add this to the json when merging exampleExp
fields:
  otherName:
    type: Auto | String | KMG | Integer | Decimal | Json
    merge: Auto # default is Auto but read on for merge options
```

exp will often try to add the same field name to a json object that already has a value.
In this case the merge setting controls how multiple matches are handled.
- `Auto` the first match is set to `{key: value}` but multiple matches yield `{key: [value, value2]}`
- `BooleanKey` set `{[name]: true}` where name is the name of the capture group from `pattern`
- `BooleanValue` set `{[value]: true}` where value is the matched value as a string
- `TargetId` starts a new json object if `currentTarget[name] !== value`. Useful with eventId in multi-line logging
- `Count` the number of times the capture group matched
- `Add` convert the value to a number and add it to `currentTarget[name]`
- `List` create a list of values even if there is only one value, e.g. `{[name]: [value]}`
- `Set` like List but only adds unique values
- `Key` use the value as the name for the referenced capture group
- `First` only save the first value
- `Last` only save the last value
- `TreeSibling` use the string length of the value to create a tree where children are under name; successive matches of the same length are treated as sibling branches
- `TreeMerging` same as TreeSibling except successive matches are merged into one branch
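For example, a sketch that counts warnings and collects the unique logger names (the log format and names are hypothetical):

```yaml
- name: warnLine
  pattern: "(?<warnings>WARN)\\s+(?<logger>\\S+)"
  fields:
    warnings:
      merge: Count # how many times the group matched
    logger:
      merge: Set   # only unique logger names
```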
Once the values are merged into the json the exp can enable and disable tags.
Tags are just string values used to turn exp on or off so they only try and match lines
when appropriate. For example, the verboseGC parsers turn off the g1gc parser support if the log says it is `Using Shenandoah`:
requires: ["g1gc"]
enables: ["g1gc-verbose"]
disables: ["parallelGc","shenandoah"]We didn't mention requires earlier but exp will first check that the all the requires tags
are enabled before attempting to match the line. This lets us disable exp that would otherwise
match lines from the wrong part of the source file.
The next step is for the exp to modify the line. exp remove the matched part of the line by default so both children
exp and any subsequent sibling exp do not have to construct regex to avoid that part of the line.
The line modification is configured through `eat` and has the following options:
```yaml
name: eatExample
eat: 10 #will eat 10 characters from the start of the match
# or one of the named options
eat: None | Match | ToMatch | Line
```

- `None` do not modify the line
- `Match` the default behaviour of removing the section the exp matched
- `ToMatch` removes everything up to the end of the match
- `Line` remove the entire line. This prevents other exp from parsing the line but will take effect after children exp try and parse the line
The next step is to allow all of the children exp to parse the remaining line but first
all pre-children rules are invoked. At the moment those are:
- `PushTarget` set the current json object (from `nest`) as the target json object for subsequent exp
```yaml
rules:
- PushTarget # pushes the target without a name
- PushTarget: name # uses a name for later PopTarget calls
```

Now the children exp parse the remaining line in the order they are defined.
All of the children will run and can be re-run if the exp has the RepeatChildren rule.
RepeatChildren will repeat all of the children if any one of the children matched the line
Be sure to modify the line when using RepeatChildren or parsing will never end
At this point the exp can re-run along with everything up to and including the children exp
if the exp has the Repeat rule.
```yaml
rules: [ Repeat ]
```

Repeat re-runs everything from checking requires and matching the remaining line
up to and including the children matching the line.
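Since the default `eat: Match` removes each match, a Repeat exp naturally terminates once the line is consumed. A sketch that parses a delimited list (the line format is hypothetical):

```yaml
# hypothetical: consume "key1=1, key2=2, ..." one pair per repetition
- name: keyValue
  pattern: "(?<key>\\w+)=(?<value>\\d+),?\\s*"
  rules: [ Repeat ]
```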
Finally any post-children rules are applied. Those rules follow the PostX naming convention.

- `PostClose` consider the current json finished and start a new json for the next exp
- `PostPopTarget` change back to the previous target for the next exp. This can accept a `name` just like `PrePopTarget`
- `PostClearTarget` clear all targets back to the root json
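Assuming the Post rules accept the same yaml shapes as the Pre rules shown earlier, a sketch:

```yaml
rules:
- PostClose
- PostPopTarget: name # pop back to the target pushed with PushTarget: name
```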
The process is repeated for each exp until no exp matches or the input line is empty.
Building the native cli utility
```
sdk use java 22-graalce
export GRAALVM_HOME=${JAVA_HOME}
mvn clean
quarkus build --native --no-tests -Dquarkus.native.container-build=false
```

Releasing is currently separate from building the CLI utility. Ideally we can do both in one command in the future.
```
mvn release:prepare -Prelease
mvn release:perform -Prelease
```