
Conversation

@gsaudade99
Contributor

First draft for SSH user key management.

Feel free to:

  • Change the naming convention of the role name
  • Comment / question / change / add / completely purge and nuke these changes out of existence

I will leave this here for now until I have better visibility into the servers.

Role-based SSH key management:

  • Added group_vars/ssh.yml to define ssh_users with associated keys and roles, supporting both direct key strings and URLs for fetching keys.
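
For illustration, a minimal sketch of what such a group_vars/ssh.yml entry could look like, based on the description above (the field names and the example user are assumptions, not the exact contents of the PR):

ssh_users:
  - name: example-user            # hypothetical user, for illustration only
    roles: [root, stats]
    keys:
      # a key given directly as a string
      - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... example-user@laptop
      # a URL from which keys are fetched (e.g. a GitHub .keys endpoint)
      - https://github.com/example-user.keys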

Root user SSH key handling:

  • Added roles/ssh-manager/tasks/root.yml to manage the root account: ensures .ssh directory exists, collects keys from both direct input and URLs, deduplicates, and writes the authorized_keys file for root.

Non-root (role-based) user SSH key handling:

  • Refactored roles/ssh-manager/tasks/generic.yml to dynamically discover non-root role accounts, ensure their .ssh directories exist, collect keys from both direct input and URLs, deduplicate, and update their authorized_keys files.

Task orchestration:

  • Updated roles/ssh-manager/tasks/main.yml to orchestrate the inclusion of the root and generic user SSH key tasks (a minimal sketch follows below).
  • Ensure that the variable list is a unique list of keys.
  • Copy those keys to the authorized_keys file.
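
For reference, a minimal sketch of what the orchestration in roles/ssh-manager/tasks/main.yml could look like, based on the description above (the task names are assumptions):

# roles/ssh-manager/tasks/main.yml (sketch)
- name: Manage SSH keys for the root account
  ansible.builtin.include_tasks: root.yml

- name: Manage SSH keys for role-based (non-root) accounts
  ansible.builtin.include_tasks: generic.yml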

Questions:
Are there any servers without a root account? The root tasks would probably need to be changed.
There is also this variable galaxy_user_public_key which I haven't taken into account yet.

usegalaxy-eu/issues#785

@gsaudade99 gsaudade99 requested a review from kysrpex October 15, 2025 09:10
@gsaudade99 gsaudade99 self-assigned this Oct 15, 2025
@gsaudade99 gsaudade99 changed the title First Draft Ssh key management first draft (for follow ups) Oct 15, 2025
@gsaudade99 gsaudade99 changed the title Ssh key management first draft (for follow ups) Ssh key management first draft Oct 15, 2025
Contributor

@kysrpex kysrpex left a comment


Thanks for submitting the draft. Let's iteratively evolve the PR through several rounds of a discussion plus one or several commits.

Regarding how users and roles are defined in group_vars/ssh.yml, what I meant by role-based access control is closer to this approach:

roles:
  - name: admin
    hosts: all
    users:
      # operating system users
      - centos
      - stats
      - root  # can be omitted for now, centos can `sudo su`
      - "{{ galaxy_user.name }}"

  - name: stats
    hosts: sn09
    users:
      - stats

  ...

users:
  - name: gsaudade
    keys:
      - sk-ssh-ed25519@openssh.com AAAAGnNrLXNzaC1lZDI1NTE5QG9wZW5zc2guY29tAAAAIAXgFT+ynWEgCCAKepVgkSHauy+mLo+KlIGIIj0dkAXeAAAABHNzaDo= [email protected]
    roles:
      - admin
      - stats

  ...

One of the advantages of RBAC is avoiding this kind of repetition: roles: [root, stats, centos, {{ galaxy_user.name }}]. It also avoids having to, for example when creating one more account that admins need access to, add that account to every one of those repetitions.

Besides that, I hope the tasks in root.yml and generic.yml can be joined into a single file. After all, they are essentially the same tasks, and generic.yml already loops over a list of several users; "root" can probably be included in that list.

Notice also that in the approach I have proposed above, roles have a key hosts (e.g. hosts: all) that determines which hosts are managed by the role, because it may be interesting to grant someone permission to manage one specific host or service without granting access to all the infrastructure. This is what I am talking about when I speak about "hooking into the same logic that Ansible uses to resolve the hosts a play applies to".
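
For illustration, one way to hook into that same resolution logic is the ansible.builtin.inventory_hostnames lookup, which expands a host pattern (e.g. all, sn09, grafana) against the inventory just like the hosts keyword of a play does. A minimal sketch, assuming a variable named ssh_roles shaped like the roles list above:

- name: Determine which SSH roles apply to the current host
  ansible.builtin.set_fact:
    applicable_ssh_roles: "{{ (applicable_ssh_roles | default([])) + [item] }}"
  loop: "{{ ssh_roles | default([]) }}"
  loop_control:
    label: "{{ item.name }}"
  # inventory_hostnames resolves the pattern in item.hosts against the inventory
  when: inventory_hostname in lookup('inventory_hostnames', item.hosts, wantlist=True)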

Another suggestion I have is using the Ansible module dedicated to managing SSH keys, ansible.posix.authorized_key (with exclusive: true). In fact, you already thought of this yourself. This way you can avoid parsing keys from GitHub, and possibly even avoid ensuring that the .ssh directory exists.

EDIT: Copilot suggests using exclusive: false as the default for the first run, then switching to true after confirming that the generated key list is complete (nice catch, this is one more safety measure to avoid getting locked out, although I would think of using --check mode too).
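
To make the suggestion concrete, a minimal sketch of the module call (the variable names are assumptions; the exclusive flag is templated so it can start as false and be flipped to true once the generated key list has been verified):

- name: Deploy SSH keys for a user
  ansible.posix.authorized_key:
    user: "{{ target_user }}"
    # newline-separated list of public keys collected for this user
    key: '{{ collected_keys | unique | join("\n") }}'
    # start with false, switch to true after verifying the key list (e.g. via --check)
    exclusive: "{{ authorized_keys_exclusive | default(false) }}"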

@gsaudade99
Contributor Author

Refactored the way the keys are managed according to the role-based group vars.

1 - Identifies which ssh_roles apply to the current target host
2 - Only processes roles that include the target host in their hosts list

3 - For each applicable role, collects SSH keys from all ssh_users who belong to that role
4 - Fetches keys from URL sources (GitHub, etc.) and processes them properly
5 - Handles both newline-separated and comma-separated key formats

6 - Combines keys from multiple roles for each target user
7 - Automatically deduplicates identical keys
8 - Maintains proper key formatting and validation

9 - Discovers actual home directories for each target user
10 - Creates .ssh directories with proper permissions (700)
11 - Deploys keys using ansible.posix.authorized_key with exclusive management
12 - Sets correct ownership and permissions (600) for authorized_keys files (a minimal sketch of these deployment steps follows below)
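
A minimal sketch of the deployment steps (items 9-12), with hypothetical variable names such as target_user and combined_keys:

- name: Look up the passwd entry of {{ target_user }}
  ansible.builtin.getent:
    database: passwd
    key: "{{ target_user }}"

- name: Ensure the .ssh directory exists with mode 0700
  ansible.builtin.file:
    # field 4 of the getent passwd result is the home directory
    path: "{{ getent_passwd[target_user][4] }}/.ssh"
    state: directory
    owner: "{{ target_user }}"
    mode: "0700"

- name: Deploy the combined keys exclusively (authorized_keys ends up with mode 0600)
  ansible.posix.authorized_key:
    user: "{{ target_user }}"
    key: '{{ combined_keys | unique | join("\n") }}'
    exclusive: true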

Contributor

@kysrpex kysrpex left a comment


The last refactor was imo a large step in the right direction. Let's go for a second review round to refine it further.

Please call out any bullshit I may be spitting out. I know it's very annoying to go through several rounds of reviews, but it is especially important to get this PR right (due to potential pitfalls such as getting locked out).

@kysrpex kysrpex changed the title Ssh key management first draft Centralized SSH key management Oct 21, 2025
@kysrpex
Contributor

kysrpex commented Oct 23, 2025

I do not have time left today to complete a review, but I still want to comment that generally this is looking good (although I still want to "check the fine print" and make some suggestions). There's a very high probability that we can get this merged next week :)

roles:
- admin

- user: dominguj
Contributor


Suggested change
- user: dominguj
- user: kysrpex

If this is referring to GH user handles, this would be @kysrpex

Contributor Author


It's more for identification purposes. It's not used in the playbook (:

Contributor


Indeed, this name is not even used in the playbook.

Imo this field should contain something that can be easily traced to a person, and a full name would do wonders (after all, our full names are already all here: https://usegalaxy-eu.github.io/people).

@mira-miracoli
Contributor

I would like to discuss this and I think this could need an ADR

@kysrpex
Contributor

kysrpex commented Oct 28, 2025

I added some commits yesterday doing two things:

  • Try to integrate this solution more deeply with the other contents of the repository.
  • Modifications that address some of the review comments that got no answer last time (you can of course review/reject such modifications).

There are still two points open from @mira-miracoli:

Can we rename the users to e.g. unix_users and the human users to persons?
The current state is a bit confusing to me.

Sounds good to me, makes it obvious what you're dealing with.

I think we need to be careful with using the GitHub keys API because we ran into the rate limit in the past, and then Ansible will fail at this step.

I guess we can't do anything about the rate limit itself, but imo we should make sure lookup('url', item) below fails if we hit the rate limit, so that the error does not silently propagate to the task that applies the keys.

- name: "Fetch keys from URLs for role: {{ current_role.name }}"
  ansible.builtin.set_fact:
    fetched_keys: >-
      {{ fetched_keys | default([]) + (
           lookup('url', item)
           | default('')
           | replace(',', '\n')
           | split('\n')
           | map('trim')
           | reject('equalto', '')
           | select('match', '^(ssh-rsa|ssh-dss|ssh-ed25519|ecdsa-sha2-|sk-ssh-ed25519@openssh.com|sk-ecdsa-sha2-)')
           | list
         ) }}
  loop: "{{ role_url_keys | default([]) | unique }}"
  when: role_url_keys | default([]) | length > 0
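
One hedged way to achieve that (an assumption, not the PR's current implementation) is fetching with ansible.builtin.uri instead of the lookup, so that a rate-limit response (e.g. HTTP 403 from GitHub) fails the task loudly rather than yielding an empty key list:

- name: "Fetch keys from URLs for role: {{ current_role.name }}"
  ansible.builtin.uri:
    url: "{{ item }}"
    return_content: true
    status_code: 200   # anything else (e.g. a 403 rate-limit response) fails the task
  register: url_key_responses
  loop: "{{ role_url_keys | default([]) | unique }}"
  when: role_url_keys | default([]) | length > 0

- name: Collect the fetched keys
  ansible.builtin.set_fact:
    fetched_keys: >-
      {{ url_key_responses.results | default([])
         | map(attribute='content')
         | map('replace', ',', '\n')
         | map('split', '\n')
         | flatten
         | map('trim')
         | reject('equalto', '')
         | list }}
  when: role_url_keys | default([]) | length > 0

The same key-type filter as above (select('match', ...)) can then be applied to fetched_keys before the keys are deployed.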

I want to add two extra points that I did not notice before but I think need to be addressed:

  • If a UNIX user was mentioned as part of a role definition and is then removed, its SSH keys won't be removed from the server (unless it is also mentioned as part of a different role definition that applies to the same user on the same host). This is a security issue.
  • I think it would definitely be very positive to include some "linting" GitHub workflow based on the expected diff, so that if some user's authorized_keys file would have zero lines after merging, the linting fails (a sketch of such a check is shown below).
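
A sketch of a safety check in that spirit (an assumption, not an existing workflow; combined_keys and target_user are hypothetical names) that could run before the keys are applied, or be wired into such a linting workflow:

- name: Refuse to continue if a user would end up with zero authorized keys
  ansible.builtin.assert:
    that:
      - combined_keys | length > 0
    fail_msg: "No SSH keys resolved for {{ target_user }}; refusing to write an empty authorized_keys file"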

I would like to discuss this and I think this could need an ADR

Definitely needs another round of discussion, it's quite high-stakes.

@kysrpex
Contributor

kysrpex commented Oct 28, 2025

I forgot to add that I ran 78f89ec in --diff --check mode (no changes applied) against stats.galaxyproject.eu (hosts: all -> hosts: grafana in all.yml) and it works as expected.

With this I mean that the last commit is in an "executable" state.

Adjust the authorized persons configuration based on an exhaustive examination of the contents of the authorized keys files for each server. It does not match the current status of each server (that would be a mistake), but provides a reasonably similar (and updated) access policy that can be tweaked over the coming months.

In addition, use `key` instead of `keys` (the latter is a reserved Ansible keyword) and encrypt the public keys of all users.
@kysrpex kysrpex force-pushed the feat/ssh-key-managment branch from fad60e7 to a07df90 Compare November 10, 2025 16:43
@kysrpex
Contributor

kysrpex commented Nov 11, 2025

I added a Jenkins project for the playbook all.yml and executed a test run (using --diff --check).

Besides a couple of hosts being unreachable, it looks like it's working as intended. Keep in mind that 03e7462 needs to be reverted before merging (this commit is what enables the test run right now).

Ideally, we'd also fix the connection issues and clean up the nodes that no longer exist or are not in use.

@kysrpex
Contributor

kysrpex commented Nov 11, 2025

We still need to keep in mind that, if Tailscale is running on the machines, this button and changing the ACLs can grant SSH access to people in the tailnet: [image]

Are all hosts connected to Tailscale? If that's the case, to further reduce the risk of being locked out due to a miscalculation, I'd enable this until two real playbook runs are complete and successful (that implies it would be possible to log in to all servers from the Jenkins worker, so we'd not be locked out).

…r temporarily (for initial test)"

This reverts commit 03e7462.
@gsaudade99
Contributor Author

gsaudade99 commented Nov 11, 2025

Don't we have IPMI access to at least sn09? That probably already enables access to most of the VMs...
We could also save the shared pub keys locally, just in case something goes wrong.

edit: as I wrote, we don't need to save them lol, they won't be overwritten ...

@mira-miracoli
Contributor

Are all hosts connected to Tailscale? If that's the case, to further reduce the risk of being locked out due to a miscalculation, I'd enable this until two real playbook runs are complete and successful (that implies it would be possible to log in to all servers from the Jenkins worker, so we'd not be locked out).

No, not right now, but I think it would be more secure in general to have SSH access only via Tailscale... but that is for the future.
I can add it :)

Jenkins worker nodes cannot be managed via Jenkins because their host keys are not trusted (they do not seem to be signed by the usegalaxy.eu cert authority).

Although the solution is to sign them, as a workaround add them to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
Node sn10 cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority).

Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
Node sn12 cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority).

Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
@kysrpex kysrpex force-pushed the feat/ssh-key-managment branch from af17a4b to 4a1de84 Compare November 14, 2025 15:28
Node osiris cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority).

Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
Node dnbd3-primary cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority).

Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
@kysrpex
Contributor

kysrpex commented Nov 14, 2025

FYI, this is the current status of host unreachability from Jenkins (all other hosts are reachable 🎉).

# connection refused
fatal: [apps.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"apps.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host apps.galaxyproject.eu port 22: Connection refused\r\n", "unreachable": true}

# connection timed out
fatal: [beacon.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"beacon.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host beacon.galaxyproject.eu port 22: Connection timed out\r\n", "unreachable": true}
fatal: [telescope.internal.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"telescope.internal.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host telescope.internal.galaxyproject.eu port 22: Connection timed out\r\n", "unreachable": true}
fatal: [sn05.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"sn05.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host sn05.galaxyproject.eu port 22: Connection timed out\r\n", "unreachable": true}
fatal: [test.internal.usegalaxy.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"test.internal.usegalaxy.eu\". Make sure this host can be reached over ssh: ssh: connect to host test.internal.usegalaxy.eu port 22: Connection timed out\r\n", "unreachable": true}

# no route to host
fatal: [cvmfs-stratum0-test.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"cvmfs-stratum0-test.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host cvmfs-stratum0-test.galaxyproject.eu port 22: No route to host\r\n", "unreachable": true}

I assume sn05, test.internal.usegalaxy.eu and cvmfs-stratum0-test.galaxyproject.eu are "off" (physically off? cloud node destroyed? clarification would help). But I would expect apps.galaxyproject.eu, beacon.galaxyproject.eu and telescope.internal.galaxyproject.eu to be reachable.

If anybody has explanations for the hosts being unreachable and the type of error because of having worked with them I'd be grateful @mira-miracoli, @bgruening.

Let admins login to all machines using the user ubuntu if it exists
@kysrpex
Contributor

kysrpex commented Nov 27, 2025

Just for the record and FYI, if you ever find yourself locked out of cloud servers, use this: openstack --os-cloud=freiburg_galaxy server rescue --image "Debian 13" apps.galaxyproject.eu. That will shut down the VM and reboot it using the specified image (and the keypairs configured for the VM). The root disk of the original VM will be accessible as a device, and you can use that to chroot and fix whatever is needed.

After you're done, run openstack --os-cloud=freiburg_galaxy server unrescue apps.galaxyproject.eu.

Due to how our cloud is configured, you MUST specify an image, otherwise rescue mode fails (I guess because nobody configured a default image).

kysrpex added a commit to gsaudade99/infrastructure-playbook that referenced this pull request Nov 27, 2025
Node apps cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority).

Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
@kysrpex kysrpex force-pushed the feat/ssh-key-managment branch from 4d75e16 to af64aac Compare November 27, 2025 16:24