Description
Describe the bug
In a healthy 5-node cluster, after restarting one node, that node is not detected by the rest (dkron raft list-peers) and it cannot detect any peer (errors: context deadline exceeded and runtime error: invalid memory address or nil pointer dereference). After 5 minutes or so, the node is detected again.
The node is visible in the web UI, but without any RPC port
There is no active firewall between nodes
These are direct servers with direct visibility
All servers are provisioned automatically with the same configuration
The cluster was already established and working with 5 nodes
This started happening right after upgrading to 3.2.7 and still happens in 4.0.4. It never happened in 3.1.10 with the same config and the same servers.
To Reproduce
I simply stopped the dkron service on one node of the cluster; when it was started again, without upgrading or changing anything, the node was not detected. Rebooting the node does not fix the issue.
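To put a number on the delay, something like the following could be run on one of the healthy servers right after restarting the affected node. It is only a rough sketch: it shells out to dkron raft list-peers (the command mentioned above) and checks whether the node name appears anywhere in the output. The function names, the polling interval, the 15-minute cap, and the plain substring match are all assumptions for illustration, not part of dkron.

```python
import subprocess
import time
from typing import Callable, Optional


def list_peers_output() -> str:
    """Return the raw output of `dkron raft list-peers` on this host."""
    result = subprocess.run(
        ["dkron", "raft", "list-peers"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    # Errors such as "context deadline exceeded" may land on stderr,
    # so both streams are returned together.
    return result.stdout + result.stderr


def seconds_until_listed(
    node_name: str,
    fetch: Callable[[], str] = list_peers_output,
    poll_interval: float = 5.0,
    max_wait: float = 900.0,
) -> Optional[float]:
    """Poll until node_name shows up in the peer list.

    Returns the elapsed seconds, or None if max_wait is reached first.
    The check is a plain substring match (an assumption), so a name like
    "node1" would also match "node10".
    """
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        try:
            if node_name in fetch():
                return elapsed
        except (OSError, subprocess.SubprocessError):
            # The command failed or timed out; treat it as "not listed yet".
            pass
        if elapsed >= max_wait:
            return None
        time.sleep(poll_interval)


# Example (hypothetical node name):
#   print(seconds_until_listed("dkron-server-3"))
```

The fetch parameter is there only so the polling logic can be tried without a running cluster.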
Expected behavior
After a simple restart of the service, the cluster should recognize all 5 nodes again, without having to wait several minutes for the node to reconnect.
Specifications:
- OS: Linux, LXC containers on Proxmox servers
- Version: 3.2.7 and 4.0.4
Additional context
Although the original setup for 3.1.10 included an iptables firewall with port 6969 opened for TCP and UDP, the setup was tested with and without the firewall and the problem occurs in both cases. The cluster has about 15 nodes, including servers and agents.