Create a new ElasticsearchDatastreamWriter to more efficiently store data in elasticsearch. #10577
Conversation
mcodato left a comment
I'm not sure about putting this feature behind a build-flag that is OFF by default.
IMHO, having it compiled by default when the PR is merged could make adoption easier and faster. The option to enable or disable the feature already exists.
WDYT @lippserd ?
jschmidt-icinga left a comment
I just gave this a quick look, so this is not a complete review.
Firstly, I agree with @mcodato that the build flag is unnecessary. It can just be on by default, the same as all the other perfdata writers.
Secondly, I'd suggest you squash down your commits, because they're all operating on the same new component added by this PR. Also, there's some whitespace back-and-forth cluttering up the diff.
See below for some additional comments on the code, mostly things the linter complained about:
#10577 (review)
Hi, thanks for having a look at it, I really appreciate it.
I left the commits separate deliberately, so it's easier to review. Each commit can be looked at on its own, without having to reason about all changes at once.
Force-pushed from 4a9699b to 9a83f44
Thank you for this PR. I've briefly spoken to @lippserd about this and I'm going to take another look when I can. I know next to nothing about Elasticsearch yet, so it might take some time until I can test this thoroughly.
Hi, just a hint: I recently addressed another issue with the ElasticsearchWriter format here: #10511. I provided a small but breaking change here: #10518. If the ElasticsearchDatastreamWriter is introduced, both Writers should use the same format, right? @jschmidt-icinga let me know if you need help with Elasticsearch know-how, you know where my office is.
FYI, I think this would also (somewhat) address issue #9837. With datastreams the backing indices are managed by Elasticsearch.
if (!pdv->GetCrit().IsEmpty() && GetEnableSendThresholds())
	metric->Set("crit", pdv->GetCrit());

pd_fields->Set(pdv->GetLabel(), metric);
Hi,
Am I reading this correctly?
This would result in a schema where the perfdata label is in the field name?
Like so:
"_source": {
"timestamp": "1761225482",
"perfdata.rta": {
"value": 0.091,
"warn": 100,
"min": 0,
"crit": 200,
"unit": "ms"
},
"perfdata.pl": {
"value": 0,
"warn": 5,
"min": 0,
"crit": 15,
"unit": "%"
}
}I think this might cause issues in the longterm, as described here: #6805 and #10511
Since each new field will cause a new mapping in the index. Correct me if I'm wrong.
The issue with adding new fields for every label is this: https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion
In the change to the ElasticsearchWriter I proposed, I used a field called "label" in a list of objects.
Like so:
"_source": {
"timestamp": "1761225482",
"perfdata": [
{
"value": 0.091,
"label": "rta"
"warn": 100,
"min": 0,
"crit": 200,
"unit": "ms"
},
{
"value": 0,
"label": "pl"
"warn": 5,
"min": 0,
"crit": 15,
"unit": "%"
}
]
}This would just create a single field. And there is an expected field name where to find the label.
Otherwise you need to fetch the entire document, scan for all of the fields that start with "perfdata." to find the perfdata fields... and would still need to split the key name at the . to get to the label, right?
This field can then be mapped as either object or nested:
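For illustration, a minimal sketch of the nested variant, assuming the field names from the example above (the index name is made up; dropping "type": "nested" would give the default object mapping instead):

# hypothetical test index; field names taken from the perfdata example above
curl -X PUT "localhost:9200/metrics-icinga2-test" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "check_result": {
        "properties": {
          "perfdata": {
            "type": "nested",
            "properties": {
              "label": { "type": "keyword" },
              "value": { "type": "double" },
              "warn":  { "type": "double" },
              "crit":  { "type": "double" },
              "min":   { "type": "double" },
              "max":   { "type": "double" },
              "unit":  { "type": "keyword" }
            }
          }
        }
      }
    }
  }
}
'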
Just to make things a bit clearer: when using an array, using nested here might be better, since it maintains the independence of each object.
For example, when searching all perfdata for an object:
curl -X POST "http://localhost:9200/icinga2/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "host": "myhost" }},
{ "match": { "service": "ping6" }}
]
}
},
"fields": [
"check_result.perfdata.*"
],
"_source": false
}
'
An object mapping returns this:
{
"_id": "AZoRQYZWWEqPxFmVW73R",
"fields": {
"check_result.perfdata.unit": [ "ms", "%" ],
"check_result.perfdata.label": [ "rta", "pl" ],
"check_result.perfdata.warn": [ 100, 5 ],
"check_result.perfdata.min": [ 0, 0 ],
"check_result.perfdata.value": [ 0, 0 ],
"check_result.perfdata.crit": [ 200, 15 ]
}
}
And a nested mapping returns:
{
"_id": "AZoRUkSmucmeEsXVMUwm",
"fields": {
"check_result.perfdata": [
{
"warn": [ 100 ],
"unit": [ "ms" ],
"min": [ 0 ],
"crit": [200 ],
"label": [ "rta" ],
"value": [ 0 ]
},
{
"warn": [ 5 ],
"unit": [ "%" ],
"min": [ 0 ],
"crit": [ 15 ],
"label": ["pl" ],
"value": [ 0 ]
}
]
}
I think I know the puzzle piece you are missing:
We are indexing the documents based on the check, similar to what elasticsearch already does (e.g. beats might store data in metrics-system.diskio-master).
If you read the config file, you might realize that we are doing something similar with metrics-icinga2.<check>-<namespace>. Since the same check should always produce the same output, we won't have an explosion of fields.
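To see which per-check datastreams actually get created on a test instance, something like this should work (assuming jq is installed and the naming scheme from the config file mentioned above):

curl -s "localhost:9200/_data_stream/metrics-icinga2.*" | jq -r '.data_streams[].name'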
Also, neither object nor nested work with timeseries datastreams. (If you know any better, feel free to correct me on that.)
Ah ok, I see. Thanks for the clarification. So each check would get its own datastream and thus indices. 👍 I haven't had the time to compile and test the new writer yet, just skimmed over the code.
Are there any things we need to consider when having a datastream per check? There is also a limit on the number of shards per node. https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/miscellaneous-cluster-settings
Might not be an issue, just wanted to mention it. And users can always add more data nodes to scale up.
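For reference, a rough way to keep an eye on this on a test cluster (cluster.max_shards_per_node defaults to 1000 open shards per data node):

# shards currently allocated per node
curl -s "localhost:9200/_cat/allocation?v"
# effective limit, shown under "defaults" unless it has been overridden
curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node"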
Finally had some time to test it. Looks good.
curl localhost:9200/_cat/indices
yellow open .ds-metrics-icinga2.swap-default-2025.10.30-000001 CBEhsKj5QweXa12A4PFafQ 1 1 17 0 224.4kb 224.4kb 224.4kb
yellow open .ds-metrics-icinga2.ping4-default-2025.10.30-000001 fe8buP9QROOWv5xt6qxI5A 1 1 16 0 221.3kb 221.3kb 221.3kb
yellow open .ds-metrics-icinga2.ping6-default-2025.10.30-000001 kqhy-33dRl2CtNw2tWe7MQ 1 1 15 0 214.6kb 214.6kb 214.6kb
yellow open .ds-metrics-icinga2.ssh-default-2025.10.30-000001 LKgw--JYSF66eAEZNhZFIQ 1 1 15 0 213.8kb 213.8kb 213.8kb
I still wonder how the amount of indices/shards might affect larger setups.
But the search does what it should do:
curl -X POST "localhost:9200/metrics-icinga2.load-default/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "service.name": "3ecab54cb896!load" }}
]
}
},
"fields": [
"perfdata.*"
],
"_source": false
}
'
It's still a bit tricky to fetch the labels for each metric (load1, load5, etc), since it's in a key, but it's an improvement over the current ElasticsearchWriter. Example result:
{
"_index": ".ds-metrics-icinga2.load-default-2025.10.30-000001",
"_id": "Cj19dEiEhn-MNA9xAAABmjWiFKQ",
"_score": 0.03390155,
"fields": {
"perfdata.load15.min": [
0.0
],
"perfdata.load1.min": [
0.0
],
"perfdata.load5.value": [
2.22
],
"perfdata.load5.min": [
0.0
],
"perfdata.load15.value": [
2.63
],
"perfdata.load1.value": [
2.16
]
}
}
When I don't query for the fields, it's a bit simpler:
...
"perfdata": {
"load1": {
"min": 0,
"value": 2.3
},
"load15": {
"min": 0,
"value": 2.2
},
"load5": {
"min": 0,
"value": 2
}
}
...
Overall very nice solution.
| "mode": "time_series", | ||
| "routing_path": "check.checkable", | ||
| "mapping": { | ||
| "total_fields.limit": 2000 |
I don't think we want to do this. The default limit of 1000 is high enough and instead we should avoid writing so many fields.
By that I mean, instead of using the perfdata label in a field name, we should have it in a field "label"
We should not get much above this limit, but in my tests I found the icingadb check in particular to write a lot of performance data into its datastream metrics-icinga2.icingadb-default. I don't think any other check will go far beyond that.
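For anyone who wants to sanity-check this, a rough field count per datastream can be pulled from the field caps API (the count includes metadata fields and multi-fields, so it slightly overestimates; assumes jq is installed):

curl -s "localhost:9200/metrics-icinga2.icingadb-default/_field_caps?fields=*" | jq '.fields | length'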
}
}
},
"mappings": {
I was wondering if we could use a mapping closer to the current ElasticsearchWriter?
It should contain everything we need for a timeseries. Instead of a new "checkable" field, we could add "time_series_dimension": true to both the host and service fields, probably also the checkcommand.
For example:
{
"template": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"check_command": {
"type": "keyword",
"time_series_dimension": true
},
"check_result": {
"properties": {
"check_source": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"command": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"execution_end": {
"type": "date"
},
"execution_start": {
"type": "date"
},
"execution_time": {
"type": "float"
},
"exit_status": {
"type": "long"
},
"latency": {
"type": "float"
},
"output": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"perfdata": {
"type": "nested",
"properties": {
"crit": {
"type": "long"
},
"label": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"max": {
"type": "long"
},
"min": {
"type": "long"
},
"unit": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"value": {
"type": "long"
},
"warn": {
"type": "long"
}
}
},
"schedule_end": {
"type": "date"
},
"schedule_start": {
"type": "date"
},
"state": {
"type": "long"
},
"vars_after": {
"properties": {
"attempt": {
"type": "long"
},
"reachable": {
"type": "boolean"
},
"state": {
"type": "long"
},
"state_type": {
"type": "long"
}
}
},
"vars_before": {
"properties": {
"attempt": {
"type": "long"
},
"reachable": {
"type": "boolean"
},
"state": {
"type": "long"
},
"state_type": {
"type": "long"
}
}
}
}
},
"current_check_attempt": {
"type": "long"
},
"host": {
"type": "keyword",
"time_series_dimension": true
},
"last_hard_state": {
"type": "long"
},
"last_state": {
"type": "long"
},
"max_check_attempts": {
"type": "long"
},
"reachable": {
"type": "boolean"
},
"service": {
"type": "keyword",
"time_series_dimension": true
},
"state": {
"type": "long"
},
"state_type": {
"type": "long"
},
"timestamp": {
"type": "date"
},
"type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This could be a _component_template that is used for both Writers
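Sketching what that could look like (the template names here are made up and the mapping is cut down to a few fields from the example above):

curl -X PUT "localhost:9200/_component_template/icinga2-check-result" -H 'Content-Type: application/json' -d'
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "host":       { "type": "keyword" },
        "service":    { "type": "keyword" }
      }
    }
  }
}
'
Both writers' index templates could then pull it in via "composed_of":
curl -X PUT "localhost:9200/_index_template/metrics-icinga2" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["metrics-icinga2.*"],
  "data_stream": {},
  "composed_of": ["icinga2-check-result"]
}
'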
You are right on the checkable. We can make the host and service name timeseries dimensions. But since we are already writing checks into their own datastreams, I don't think the check_command needs to be a dimension as well.
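In other words, the relevant part of the mapping would boil down to something like this (just a sketch of the direction, not the final template):

"host":          { "type": "keyword", "time_series_dimension": true },
"service":       { "type": "keyword", "time_series_dimension": true },
"check_command": { "type": "keyword" }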
jschmidt-icinga left a comment
I've also commented a few more nitpicks, but for the moment I want to primarily focus on two areas:
- Finalize a format/index-template that can also be used by the basic ElasticsearchWriter in the future. (Please see @martialblog's detailed comments in addition to mine)
- Minimizing the impact this has on ProcessCheckResult execution times by moving most stuff into the work queue.
I'm open to suggestions, especially regarding the data format.
Force-pushed from 9e06023 to 5c270a5
Thanks @martialblog for the insight. I'd encourage you to read the commit messages, as I tried to lay out my reasoning for some decisions there. Also, take a look at the example config file, as I think it clears up some of your concerns.
I took a look at the filter option today and I think that's a very good idea to add something like this to a perfdata writer. We should definitely update the other writers with something similar when we get the chance.
doc/14-features.md (Outdated)
> **Note**
>
> This feature requires Elasticsearch to support timeseries datastreams and have the ecs component template installed.
> This feature was tested successfully with Elasticsearch 8.12 and 9.0.8.
Do we need to explicitly mention that the ElasticsearchDatastreamWriter will not work with OpenSearch? Since the current ElasticsearchWriter works with OpenSearch, users might expect the same here.
Did some testing with OpenSearch; luckily there is little that needs to change for this implementation to work with OpenSearch.
- The index-template.json needs to be changed; the users can do that themselves
- OpenSearch responds with charset=UTF-8, not charset=utf-8
@w1ll-i-code Can we loosen the charset requirement? Modern JSON should use UTF-8 anyways. https://www.rfc-editor.org/rfc/rfc8259
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8
A "tolower()" might also work, but maybe @jschmidt-icinga has some ideas on how the codebase should handle it.
I can take care of the documentation regarding OpenSearch once this PR is merged.
@w1ll-i-code Maybe we need to handle the fact that CheckCommand objects can have names that are not allowed as index names. For example: This will result in:
Not sure yet what a good solution might be. Maybe some form of normalization, maybe just letting the users know they need to be careful with the object names, maybe both.
Force-pushed from 6621df7 to e348fd2
Force-pushed from a628338 to b61955f
Force-pushed from b61955f to 4bb3ab0
Copy the elasticsearchwriter to be adapted for the new datastreamwriter. Setup the build system and hide the feature behind a build flag for now.
Restructure the data sent to elasticsearch to align with the Elastic Common Schema specification and separate document indices by check to reduce the number of distinct fields per index.
Handle the errors returned by elasticsearch gracefully to let the user know what went wrong during execution. Do not discard data as soon as a request to elasticsearch fails, but keep retrying until the data can be sent, relying on the WorkQueue for back pressure. Improve the handling of the flush timer, by rescheduling the timer after each flush, making sure that there are no needless flushes under heavy load.
Re-add support for tags, but make them conform to the ECS specification. Also add support for labels, which are the former tags in the elasticsearchwriter. ref: https://www.elastic.co/docs/reference/ecs/ecs-base
Allow the user to filter which data should be sent to elasticsearch. This allows the user to use multiple datastreams to send the data to different namespaces, for multi-tenancy or different retention policies.
Also accept an ApiKey for authentication in elasticsearch, in addition to username + password and certificates.
Icinga2 should manage its own index template if possible. This allows us to ship potential additions to the document later without requiring further user input. However the user has the option to disable the feature, should they want to manage the template manually. For user customization, we ship the `icinga2@custom` component template, so that users can change the behaviour of the template without having to edit the managed one, making updates easier.
Add a template config with all possible config options for the user as well as a short description on what the parameter does and how it can be used. This allows a user to quickly configure the writer without having to look up lots of documentation online.
Update the documentation to give a clearer overview of the new elasticsearchdatastreamwriter feature and add the object and its fields to the syntax highlighting of nano and vim.
Drop messages in the data buffer if the connection to elasticsearch fails. This guarantees that the icinga2 process can still shut down, even with a misconfigured writer or if elasticsearch is down or not reachable, without stalling.
Allow the 'datastream_namespace' variable to contain macros and expand them properly. This allows data to be written into different datastreams based on object properties, e.g. by zone or custom var.
General improvements to code and documentation. Fixing comments by:
- Mattia Codato
- Johannes Schmidt
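To illustrate the macro-expanded datastream_namespace described in the commits above, a configuration could look roughly like this (host and port are assumed to carry over from the existing ElasticsearchWriter, and the tenant custom variable is purely hypothetical):

object ElasticsearchDatastreamWriter "elasticsearch-datastream" {
  host = "127.0.0.1"
  port = 9200

  /* Hypothetical custom variable; each tenant's checks would end up in
     metrics-icinga2.<check>-<tenant> instead of the "default" namespace. */
  datastream_namespace = "$host.vars.tenant$"
}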
@w1ll-i-code I've resolved all the conversations solved by your last push, but there are still some outstanding comments not yet addressed. Can you please fix those? Then I think we're good from my side.
Force-pushed from 4bb3ab0 to b732f47
String ctLower = String(contentType.data());
boost::to_lower(ctLower);
if (!(ctLower == "application/json"
	|| boost::starts_with(ctLower, "application/json;"))) {
I think the semicolon needs to be removed here
Suggested change:
-	|| boost::starts_with(ctLower, "application/json;"))) {
+	|| boost::starts_with(ctLower, "application/json"))) {
I didn't even see that and now I wonder when it was changed. The "to_lower()" was supposed to simplify the check on "UTF-8" vs "utf-8"; this just checks against the prefix, which makes both the to_lower and the first half of the if condition pointless. I'd suggest something like this instead:
if (!(ctLower == "application/json" || ctLower == "application/json; charset=utf-8")) {
Because those are the only two values that make sense. Technically it doesn't matter much, since if you send anything other than UTF-8 it's invalid anyway for JSON, but we might as well be correct.
Just tested 3090881, still seeing this error:
[2025-11-18 12:06:45 +0000] critical/ElasticsearchDatastreamWriter: Unexpected Content-Type: 'application/json
Strange, since:
curl -i localhost:9200
HTTP/1.1 200 OK
X-elastic-product: Elasticsearch
content-type: application/json
content-length: 540
boost::starts_with(ctLower, "application/json") works though
- Namespace: The datastream_namespace string may contain macros. If a macro is missing or resolves to an empty value, the writer falls back to the default namespace "default".
- Validation: A template string with an unterminated '$' (e.g. "$host.name") raises a configuration validation error referencing the original string.
- Macros never partially substitute: either all macros in the string resolve and the rendered value is used, or (for tags/labels) the entry is skipped.
- Normalization: Performance data metric labels and the resolved datastream namespace undergo normalization: any leading whitespace and leading special characters are trimmed; all remaining special (non-alphanumeric) characters are replaced with an underscore; consecutive underscores are collapsed; leading/trailing underscores are removed. This ensures stable, Elasticsearch-friendly field and namespace names.
Normalization doesn't seem to work or maybe I'm doing something wrong? I just tested with commit b732f4723.
check_command = "example_FOOBAR_bar-foo"
I still see:
[2025-11-17 14:16:10 +0000] warning/ElasticsearchDatastreamWriter:
Error during document creation: illegal_argument_exception: data_stream
[metrics-icinga2.example_FOOBAR_bar-foo-default] must be lowercase
Just tested 309088156, looks good now:
yellow open .ds-metrics-icinga2.i_like_icinga-default-2025.11.18-000001 ZmMtvAD0RJaX2LBGHG3ktA 1 1 2 0 36.2kb 36.2kb 36.2kb
yellow open .ds-metrics-icinga2.example_foobar_bar_foo-default-2025.11.18-000001 7F9DZE7VRSa0yT8kJZ0Jtg 1 1 2 0 227b 227b 227b
Force-pushed from b732f47 to 3090881
auto& contentType (response[http::field::content_type]);
/* Accept application/json with optional charset (any variant), case-insensitive. */
String ctLower = String(contentType.data());
You can't just initialize a string with a string_view's data method, because it is not null-terminated, which means the constructor will read random memory until the next null-byte into the string, unless it segfaults first, which is why the comparison subsequently fails. You either need to use the pointer + size constructor of String/std::string or construct it directly from the string_view. I know why you probably did it, because icinga::String doesn't have that constructor, but it's fine to just use std::string in this case:
Suggested change:
-auto& contentType (response[http::field::content_type]);
-/* Accept application/json with optional charset (any variant), case-insensitive. */
-String ctLower = String(contentType.data());
+/* Accept application/json with optional charset (any variant), case-insensitive. */
+std::string ctLower = response[http::field::content_type];
Reviewers:
- jschmidt-icinga
- martialblog
Force-pushed from 3090881 to d39b63b
Hi, I had to change the
@w1ll-i-code Please also address my last review comment above. Edit: This might also be a good point where you squash down your PR to one or two commits. I'd suggest one commit for the entire DatastreamWriter and one for the changes to remote/endpoint. Documentation changes could also be a separate commit if you prefer.
PR for #10576. For now, this will be hidden behind a build-flag until the feature is sufficiently tested and ready for production use. For all the changes made to the original Elasticsearch writer, please consult the individual commits.