
Server not promptly detected after restart on a healthy 5-node cluster. #1702

@EmilioMoreno

Description


Describe the bug
In a healthy 5-node cluster, after restarting one node, that node is not detected by the rest (dkron raft list-peers) and it cannot detect any peer. The errors reported are "context deadline exceeded" and "runtime error: invalid memory address or nil pointer dereference". After 5 minutes or so, the node is detected again.

The node is visible in the web UI, but without any RPC port
There is no active firewall between nodes

These are direct servers with direct visibility
All servers are provisioned automatically with the same configuration
The cluster was already established and working with 5 nodes
This started happening right after upgrading to 3.2.7 and still happens in 4.0.4. It never happened in 3.1.10 with the same config and the same servers.

To Reproduce
I simply stopped the dkron service on one node of the cluster. When it was started again, without upgrading or changing anything, the node was not detected. Rebooting the node does not fix the issue.
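
To put a number on the delay, a small polling script along these lines could be run on another server right after the restart. It is only a sketch: it assumes the dkron binary is on PATH and that the restarted node's name appears somewhere in the output of dkron raft list-peers. The function names, the 600-second timeout, and the 5-second interval are illustrative choices, not part of dkron.

```python
import subprocess
import time


def list_peers_output():
    # Runs the command quoted in this report; assumes `dkron` is on PATH.
    result = subprocess.run(
        ["dkron", "raft", "list-peers"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


def seconds_until_listed(node, fetch=list_peers_output, timeout=600.0,
                         interval=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until `node` shows up in the peer list.

    Returns the elapsed seconds, or None if `timeout` is reached first.
    Assumption: a plain substring match on the node name is enough.
    """
    start = clock()
    while True:
        try:
            if node in fetch():
                return clock() - start
        except (OSError, subprocess.SubprocessError):
            pass  # command missing, failed, or timed out; try again
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Calling seconds_until_listed("node5") (node name hypothetical) right after starting the service would return roughly how long the rest of the cluster took to list the node again, or None if it never appeared within the timeout.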

Expected behavior
After a simple restart of the service, the cluster should recognize all 5 nodes again, without having to wait several minutes for the node to reconnect.


Specifications:

  • OS: Linux, LXC containers on Proxmox servers
  • Version: 3.2.7 and 4.0.4

Additional context
Although the original setup for 3.1.10 included an iptables firewall with port 6969 open for TCP and UDP, the setup has been tested with and without the firewall and the problem occurs in both cases. The cluster has about 15 nodes, including servers and agents.
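
For anyone wanting to double-check connectivity independently of the firewall rules, a minimal TCP reachability check could look like the sketch below. It covers TCP only, not UDP, and the host names in the usage note are hypothetical; port 6969 is the one mentioned above.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running tcp_reachable("dkron-server-2", 6969) from the restarted node against each of the other servers, and the reverse, would show whether the port is reachable in both directions while the node is still undetected.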
