Commit df0de20

ADR about security-responder
Signed-off-by: manuelbuil <[email protected]>

# security-responder client

Date: 2025-10-27

## Status

Proposed

## Context

### Background

RKE2 currently lacks two capabilities that are essential for its maintainability and for user security. First, there is no structured mechanism for users to voluntarily share cluster metadata (such as the Kubernetes version or the CNI plugin in use). Maintainers need this data to understand real-world adoption and to correctly prioritize future development and testing efforts.

Second, users lack a fast, in-cluster way to learn about security threats: an immediate list of the CVEs affecting their current components, plus a recommended secure version to upgrade to.

### Current State

- RKE2 collects no telemetry, so the team lacks insight into deployment patterns, version adoption and selected configurations.
- It is difficult for users to find out which existing CVEs affect their current RKE2 version.

### Requirements

- Collect only non-personally identifiable cluster metadata
- Provide an opt-out mechanism with clear documentation
- Keep resource overhead minimal
- Fail gracefully in disconnected environments
- Avoid retry mechanisms and persistent daemons: the data is non-critical, losing a few data points is harmless, and saving resources on the nodes matters more
- Integrate well with RKE2
- Provide useful information back to users (for example, the CVEs affecting their version)
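
The graceful-failure and no-retry requirements could translate into a single, time-bounded upload attempt whose failure is logged and then ignored. The Go sketch below assumes a plain HTTPS POST of a JSON body; the endpoint URL, the `SECURITY_RESPONDER_ENDPOINT` variable, the 10-second timeout and the `sendOnce`/`run` helpers are all hypothetical, since this ADR does not specify the transport.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

// requestTimeout bounds the single upload attempt (value chosen for illustration).
const requestTimeout = 10 * time.Second

// sendOnce makes exactly one POST attempt with a hard deadline.
// There is deliberately no retry loop: a lost data point is acceptable.
func sendOnce(endpoint string, payload []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return fmt.Errorf("building request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("sending report: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

// run logs any failure and always returns exit code 0, so a disconnected
// cluster simply skips this data point.
func run(endpoint string, payload []byte) int {
	if err := sendOnce(endpoint, payload); err != nil {
		fmt.Fprintf(os.Stderr, "security-responder: skipping report: %v\n", err)
	}
	return 0
}

func main() {
	// Hypothetical override variable and placeholder endpoint.
	endpoint := os.Getenv("SECURITY_RESPONDER_ENDPOINT")
	if endpoint == "" {
		endpoint = "https://telemetry.example.invalid/v1/report"
	}
	os.Exit(run(endpoint, []byte(`{}`)))
}
```

Exiting with status 0 even on failure also matters at the Kubernetes level: a Job whose pod exits non-zero is retried up to its `backoffLimit` (6 by default), which would reintroduce the retries the requirements rule out.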

## Decision

Implement a `security-responder` client at `github.com/rancher/rke2-security-responder` (following the pattern of existing components) as a separate, optional component that is deployed through the RKE2 manifest system and triggered periodically.

### Architecture

- **Deployment Method**: `CronJob` in the `kube-system` namespace
- **Location**: `/var/lib/rancher/rke2/server/manifests/security-responder.yaml`
- **Scheduling**: the CronJob runs three times a day (`0 */8 * * *`, i.e. at minute 0 of hours 0, 8 and 16)
- **Configuration**: ConfigMap-based, with environment variables overriding ConfigMap values
- **Default State**: enabled by default, with a well-documented opt-out
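
One way to read "ConfigMap-based, with environment variable override" is that the job loads its settings from a ConfigMap and then lets any environment variable that is set replace the matching key. The Go sketch below shows that precedence; the ConfigMap keys (`enabled`, `endpoint`), the environment variable names and the `loadConfig` helper are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config is the job's effective configuration (hypothetical fields).
type Config struct {
	Enabled  bool
	Endpoint string
}

// envOverrides maps ConfigMap keys to the environment variables that
// override them. Both sets of names are hypothetical.
var envOverrides = map[string]string{
	"enabled":  "SECURITY_RESPONDER_ENABLED",
	"endpoint": "SECURITY_RESPONDER_ENDPOINT",
}

// loadConfig starts from the ConfigMap data and lets any non-empty
// environment variable replace the corresponding key.
func loadConfig(cm map[string]string, getenv func(string) string) (Config, error) {
	merged := make(map[string]string, len(cm))
	for k, v := range cm {
		merged[k] = v
	}
	for key, env := range envOverrides {
		if v := getenv(env); v != "" {
			merged[key] = v
		}
	}

	cfg := Config{Enabled: true, Endpoint: merged["endpoint"]} // enabled unless opted out
	if raw, ok := merged["enabled"]; ok {
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return Config{}, fmt.Errorf("invalid value %q for enabled: %w", raw, err)
		}
		cfg.Enabled = enabled
	}
	return cfg, nil
}

func main() {
	// A literal map stands in for the ConfigMap data read by the job.
	cm := map[string]string{"enabled": "true"}
	cfg, err := loadConfig(cm, os.Getenv)
	if err != nil {
		// Log and exit cleanly, in line with the no-retry requirement.
		fmt.Fprintln(os.Stderr, "security-responder: invalid config:", err)
		return
	}
	if !cfg.Enabled {
		fmt.Println("security-responder disabled; nothing to do")
		return
	}
	fmt.Printf("effective config: %+v\n", cfg)
}
```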
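
### Data Collection

The collected data includes the following information:

- RKE2 version (`appVersion`)
- Kubernetes version (`kubernetesVersion`)
- Cluster identifier (`clusteruuid`)
- Total node count (`nodeCount`)
- Server node count (`serverNodeCount`)
- Agent node count (`agentNodeCount`)
- CNI plugin (`cni-plugin`)
- Operating system (`os`)
- SELinux status (`selinux`)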
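
The three node counts could be derived from the node list in a single pass. This sketch assumes that server nodes can be recognised by the presence of the `node-role.kubernetes.io/control-plane` or `node-role.kubernetes.io/etcd` label and that every other node is an agent. The `Node` type is a hypothetical stand-in for the Kubernetes Node object, which a real client would fetch through the API.

```go
package main

import "fmt"

// Node is a trimmed-down stand-in for a Kubernetes Node: only labels matter here.
type Node struct {
	Name   string
	Labels map[string]string
}

// serverRoleLabels is an assumption: a node carrying either label counts as a server.
var serverRoleLabels = []string{
	"node-role.kubernetes.io/control-plane",
	"node-role.kubernetes.io/etcd",
}

// isServer reports whether any server role label is present on the node.
func isServer(n Node) bool {
	for _, label := range serverRoleLabels {
		if _, ok := n.Labels[label]; ok {
			return true
		}
	}
	return false
}

// countNodes returns the total, server and agent node counts.
func countNodes(nodes []Node) (total, servers, agents int) {
	for _, n := range nodes {
		if isServer(n) {
			servers++
		} else {
			agents++
		}
	}
	return len(nodes), servers, agents
}

func main() {
	nodes := []Node{
		{Name: "server-1", Labels: map[string]string{
			"node-role.kubernetes.io/control-plane": "true",
			"node-role.kubernetes.io/etcd":          "true",
		}},
		{Name: "etcd-only", Labels: map[string]string{"node-role.kubernetes.io/etcd": "true"}},
		{Name: "agent-1", Labels: map[string]string{}},
	}
	total, servers, agents := countNodes(nodes)
	// prints: nodeCount=3 serverNodeCount=2 agentNodeCount=1
	fmt.Printf("nodeCount=%d serverNodeCount=%d agentNodeCount=%d\n", total, servers, agents)
}
```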

Example payload structure:
```json
{
  "appVersion": "v1.31.6+rke2r1",
  "extraTagInfo": {
    "kubernetesVersion": "v1.31.6",
    "clusteruuid": "53741f60-f208-48fc-ae81-8a969510a598"
  },
  "extraFieldInfo": {
    "nodeCount": 5,
    "serverNodeCount": 3,
    "agentNodeCount": 2,
    "cni-plugin": "flannel",
    "os": "ubuntu",
    "selinux": "enabled"
  }
}
```
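
To keep the wire format stable, the payload could be modelled as typed Go structs whose JSON tags match the example keys exactly, including the hyphenated `cni-plugin`. The type and function names here are hypothetical, and deriving `kubernetesVersion` by cutting `appVersion` at `+` is an assumption based on the example values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// TagInfo holds the identifiers sent under "extraTagInfo".
type TagInfo struct {
	KubernetesVersion string `json:"kubernetesVersion"`
	ClusterUUID       string `json:"clusteruuid"`
}

// FieldInfo holds the cluster metadata sent under "extraFieldInfo".
type FieldInfo struct {
	NodeCount       int    `json:"nodeCount"`
	ServerNodeCount int    `json:"serverNodeCount"`
	AgentNodeCount  int    `json:"agentNodeCount"`
	CNIPlugin       string `json:"cni-plugin"`
	OS              string `json:"os"`
	SELinux         string `json:"selinux"`
}

// Payload mirrors the example payload structure field for field.
type Payload struct {
	AppVersion     string    `json:"appVersion"`
	ExtraTagInfo   TagInfo   `json:"extraTagInfo"`
	ExtraFieldInfo FieldInfo `json:"extraFieldInfo"`
}

// newPayload derives kubernetesVersion by dropping the "+rke2rN" suffix.
func newPayload(rke2Version, clusterUUID string, fields FieldInfo) Payload {
	k8sVersion, _, _ := strings.Cut(rke2Version, "+")
	return Payload{
		AppVersion:     rke2Version,
		ExtraTagInfo:   TagInfo{KubernetesVersion: k8sVersion, ClusterUUID: clusterUUID},
		ExtraFieldInfo: fields,
	}
}

// encode produces the compact JSON body that would be sent.
func encode(p Payload) ([]byte, error) {
	return json.Marshal(p)
}

func main() {
	p := newPayload("v1.31.6+rke2r1", "53741f60-f208-48fc-ae81-8a969510a598", FieldInfo{
		NodeCount: 5, ServerNodeCount: 3, AgentNodeCount: 2,
		CNIPlugin: "flannel", OS: "ubuntu", SELinux: "enabled",
	})
	out, err := encode(p)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(string(out))
}
```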

The `clusteruuid` field distinguishes one deployment from another; its value is the UID of the `kube-system` namespace. Because that UID is randomly generated, it reveals nothing about the cluster's owner or users and raises no privacy concerns.
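
If the collector should enforce that only a random identifier ever leaves the cluster, it could check the value before sending it. The sketch below assumes the namespace UID is a version-4 (random) RFC 4122 UUID, as the example value is, and rejects anything else; `validateClusterUUID` is a hypothetical helper.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// validateClusterUUID checks that s is a canonical RFC 4122 UUID of
// version 4 (randomly generated) before it is included in a report.
func validateClusterUUID(s string) error {
	if len(s) != 36 || s[8] != '-' || s[13] != '-' || s[18] != '-' || s[23] != '-' {
		return errors.New("not in canonical 8-4-4-4-12 form")
	}
	raw, err := hex.DecodeString(strings.ReplaceAll(s, "-", ""))
	if err != nil {
		return fmt.Errorf("invalid hex: %w", err)
	}
	// The high nibble of byte 6 holds the version.
	if version := raw[6] >> 4; version != 4 {
		return fmt.Errorf("UUID version %d, want 4 (random)", version)
	}
	// The top two bits of byte 8 must be 10 for the RFC 4122 variant.
	if raw[8]&0xC0 != 0x80 {
		return errors.New("not an RFC 4122 variant UUID")
	}
	return nil
}

func main() {
	for _, id := range []string{
		"53741f60-f208-48fc-ae81-8a969510a598", // value from the example payload
		"53741f60-f208-18fc-ae81-8a969510a598", // same digits, but version 1
	} {
		if err := validateClusterUUID(id); err != nil {
			fmt.Printf("%s: rejected (%v)\n", id, err)
		} else {
			fmt.Printf("%s: ok\n", id)
		}
	}
}
```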

### Configuration Interface Example

```yaml
# /etc/rancher/rke2/config.yaml
security-responder-enabled: true # default; set to false to opt out
```
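
On the server side, the flag would decide whether the manifest is present in the manifests directory. The Go sketch below assumes `security-responder-enabled` has already been parsed into a boolean; `syncManifest`, `manifestExists` and `demo` are hypothetical helpers, and the manifest content is a placeholder. It only manages the file on disk; whether removing the file also deletes an already-deployed CronJob depends on RKE2's manifest deployment and is outside the scope of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// manifestName matches the file name given in the Architecture section.
const manifestName = "security-responder.yaml"

// syncManifest writes the manifest when enabled and removes it when the
// user has opted out. A missing file on opt-out is not an error.
func syncManifest(manifestsDir string, enabled bool, manifest []byte) error {
	path := filepath.Join(manifestsDir, manifestName)
	if !enabled {
		if err := os.Remove(path); err != nil && !errors.Is(err, fs.ErrNotExist) {
			return fmt.Errorf("removing %s: %w", path, err)
		}
		return nil
	}
	if err := os.MkdirAll(manifestsDir, 0o755); err != nil {
		return fmt.Errorf("creating %s: %w", manifestsDir, err)
	}
	return os.WriteFile(path, manifest, 0o644)
}

// manifestExists reports whether the manifest file is present in dir.
func manifestExists(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, manifestName))
	return err == nil
}

// demo exercises both branches in a throwaway directory and reports whether
// the manifest existed after enabling and after opting out.
func demo() (afterEnable, afterDisable bool, err error) {
	dir, err := os.MkdirTemp("", "rke2-manifests-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	manifest := []byte("# CronJob manifest placeholder\n")
	if err = syncManifest(dir, true, manifest); err != nil {
		return false, false, err
	}
	afterEnable = manifestExists(dir)
	if err = syncManifest(dir, false, manifest); err != nil {
		return false, false, err
	}
	afterDisable = manifestExists(dir)
	return afterEnable, afterDisable, nil
}

func main() {
	on, off, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("present after enable: %v, present after opt-out: %v\n", on, off)
}
```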

## Alternatives Considered

### Agent-based Implementation

This would require an agent on every node. A periodic CronJob is more efficient for collecting cluster-level metadata.

### Instrumenting/leveraging update.rke2.io

There is no easy access to the CDN logs, the logs give no insight into the versions actually deployed, and the approach is less privacy-preserving.

## Consequences

RKE2 gains basic telemetry coverage and analytics, which inform project decisions and improve project visibility.
