Centralized SSH key management #1673

base: master

Conversation
Thanks for submitting the draft. Let's iteratively evolve the PR through several rounds of a discussion plus one or several commits.
Regarding how users and roles are defined in group_vars/ssh.yml, what I meant with role-based access control is closer to this approach:
```yaml
roles:
  - name: admin
    hosts: all
    users:
      # operating system users
      - centos
      - stats
      - root  # can be omitted for now, centos can `sudo su`
      - "{{ galaxy_user.name }}"
  - name: stats
    hosts: sn09
    users:
      - stats
  # ...
users:
  - name: gsaudade
    keys:
      - sk-ssh-ed25519@openssh.com AAAAGnNrLXNzaC1lZDI1NTE5QG9wZW5zc2guY29tAAAAIAXgFT+ynWEgCCAKepVgkSHauy+mLo+KlIGIIj0dkAXeAAAABHNzaDo= [email protected]
    roles:
      - admin
      - stats
  # ...
```

One of the advantages of RBAC is avoiding this kind of repetition (`roles: [root, stats, centos, "{{ galaxy_user.name }}"]`), as well as avoiding, for example, having to update every repetition when one more user is created that admins need to have access as.
Besides that, I have the hope that the tasks for root.yml and generic.yml can be joined into a single file. After all, they're kind of the same tasks and generic.yml is already looping over a list of several users, probably "root" can be included in that list.
Notice also that in the approach I have proposed above, roles have a key hosts (e.g. hosts: all) that determines which hosts are managed by the role, because it may be interesting to grant someone permission to manage one specific host or service without granting access to all the infrastructure. This is what I am talking about when I speak about "hooking into the same logic that Ansible uses to resolve the hosts a play applies to".
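One way to hook into that host-resolution logic, sketched below under assumptions (a play per role, with illustrative names like `ssh-manager` and `current_role_name`; this is not code from the PR), is to let each role's `hosts` value become the `hosts` pattern of a play:

```yaml
# Sketch: one play per role lets Ansible resolve the role's `hosts`
# pattern (a group name, a glob, or "all") exactly as it does for any
# other play. `current_role_name` is an illustrative variable.
- name: Apply SSH keys granted by the admin role
  hosts: all  # taken from the role's `hosts` key in group_vars/ssh.yml
  roles:
    - role: ssh-manager
      vars:
        current_role_name: admin
```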
Another suggestion I have is using the Ansible module dedicated to managing SSH keys ansible.posix.authorized_key (with exclusive: true). In fact you already thought of this yourself. This way you can avoid parsing keys from GitHub, and possibly even avoid ensuring the .ssh directory exists.
EDIT: Copilot suggests using exclusive: false as default for the first run; then switching to true after confirming the generated keys list is complete (nice catch, this is one more safety measure to avoid getting locked out, although I would think of using --check mode too).
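A minimal sketch of the `ansible.posix.authorized_key` suggestion (variable names such as `target_users` and `combined_keys` are illustrative assumptions, not the PR's actual code):

```yaml
- name: Manage authorized keys for each target user
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.combined_keys | join('\n') }}"
    manage_dir: true  # creates ~/.ssh with correct ownership and permissions
    exclusive: true   # removes unmanaged keys; start with false, as noted above
  loop: "{{ target_users }}"
```

Note that `exclusive` is applied per loop item, so passing all of a user's keys joined into a single `key` string (as above) is what keeps exclusivity safe here.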
Refactored the way the keys are managed according to the role-based group vars:

1 - Identifies which …
3 - For each applicable role, collects SSH keys from all …
6 - Combines keys from multiple roles for each target user
9 - Discovers actual home directories for each target user
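The combining step can be sketched in Ansible terms like this (illustrative variable names such as `applicable_roles` and `keys_list`; an assumption rather than the PR's actual tasks):

```yaml
# Sketch: accumulate a deduplicated union of keys across all roles
# that apply to the current target user.
- name: Combine keys from all applicable roles for the target user
  ansible.builtin.set_fact:
    combined_keys: "{{ ((combined_keys | default([])) + (current_role.keys_list | default([]))) | unique }}"
  loop: "{{ applicable_roles }}"
  loop_control:
    loop_var: current_role
```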
kysrpex left a comment
The last refactor was imo a large step in the right direction. Let's go for a second review round to refine it further.
Please call out any bullshit I may be spitting out. I know it's very annoying to go through several rounds of reviews, but it is especially important to get this PR right (due to the potential pitfalls such as getting locked out).
I do not have time left today to complete a review, but I still want to comment that generally this is looking good (although I still want to "check the fine print" and make some suggestions). There's a very high probability that we can get this merged next week :)
group_vars/ssh.yml (Outdated)

```yaml
roles:
  - admin
...
- user: dominguj
```
Suggested change:

```diff
- - user: dominguj
+ - user: kysrpex
```
If this is referencing GH user handles, this would be @kysrpex.
It's more for identification purposes. It's not used in the playbook (:
Indeed, this name is not even used in the playbook.
Imo this field should contain something that can be easily traced to a person and a full name would do wonders (after all, our full names are all already here https://usegalaxy-eu.github.io/people).
Co-authored-by: Mira <[email protected]>
Broaden the scope of the playbook to all hosts in the inventory. If something else needs to be standardized across all hosts in the future, it could also be deployed using this playbook.
The variable `target_host` is not defined since commit b59df72.
I would like to discuss this and I think this could need an ADR.
I added some commits yesterday doing two things:
There are still two points open from @mira-miracoli:
Sounds good to me, makes it obvious what you're dealing with.
I guess we can't do anything with the ratelimit itself, but imo we should make sure …

```yaml
- name: "Fetch keys from URLs for role: {{ current_role.name }}"
  ansible.builtin.set_fact:
    fetched_keys: >-
      {{ fetched_keys | default([]) + (
           lookup('url', item)
           | default('')
           | replace(',', '\n')
           | split('\n')
           | map('trim')
           | reject('equalto', '')
           | select('match', '^(ssh-rsa|ssh-dss|ssh-ed25519|ecdsa-sha2-|sk-ssh-ed25519@openssh.com|sk-ecdsa-sha2-)')
           | list
         ) }}
  loop: "{{ role_url_keys | default([]) | unique }}"
  when: role_url_keys | default([]) | length > 0
```

I want to add two extra points that I did not notice before but I think need to be addressed:
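One extra safety measure in that spirit (a sketch reusing the illustrative variable names from the task above, not code from the PR): fail fast when a URL fetch comes back empty, e.g. due to a ratelimit hit, so an empty result can never overwrite `authorized_keys`:

```yaml
# Sketch: abort the play before any authorized_keys file is rewritten
# if the fetched key list turned out empty.
- name: Abort if no keys could be fetched (e.g. GitHub ratelimit hit)
  ansible.builtin.assert:
    that:
      - fetched_keys | default([]) | length > 0
    fail_msg: "Fetched key list is empty; refusing to rewrite authorized_keys."
  when: role_url_keys | default([]) | length > 0
```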
Definitely needs another round of discussion, it's quite high-stakes.
I forgot to add that I ran 78f89ec in … With this I mean that the last commit is in an "executable" state.
Adjust the authorized persons configuration based on an exhaustive examination of the contents of the authorized keys files for each server. It does not match the current status of each server (that would be a mistake), but provides a reasonably similar (and updated) access policy that can be tweaked within the coming months. In addition, use `key` instead of `keys` (the latter is a reserved Ansible keyword) and encrypt the public keys of all users.
fad60e7 to a07df90 (force-pushed)
… feat/ssh-key-managment

# Conflicts:
#	requirements.yaml
I added a Jenkins project for the playbook all.yml and executed a test run (using …). Besides a couple of hosts being unreachable, it looks like it's working as intended. Keep in mind that 03e7462 needs to be reverted before merging (this commit is what is enabling the test run right now). Ideally, we'd also fix the connection issues and clean up the nodes that no longer exist or are not in use.
…r temporarily (for initial test)" This reverts commit 03e7462.
Don't we have IPMI access to at least sn09? That probably already enables access to most of the VMs... edit: as I wrote, we don't need to save them lol, they won't be overwritten ...
No, not now, but I think it would be more secure in general to have only Tailscale access to SSH... but this is for the future.
Jenkins worker nodes cannot be managed via Jenkins because their host keys are not trusted (they do not seem to be signed by the usegalaxy.eu cert authority). Although the solution is to sign them, as a workaround add them to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
Node sn10 cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority). Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
Node sn12 cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority). Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
af17a4b to 4a1de84 (force-pushed)
Node osiris cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority). Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
Node dnbd3-primary cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority). Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
FYI this is the current status of the unreachability of the hosts from Jenkins (all other hosts are reachable 🎉).

```
# connection refused
fatal: [apps.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"apps.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host apps.galaxyproject.eu port 22: Connection refused\r\n", "unreachable": true}

# connection timed out
fatal: [beacon.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"beacon.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host beacon.galaxyproject.eu port 22: Connection timed out\r\n", "unreachable": true}
fatal: [telescope.internal.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"telescope.internal.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host telescope.internal.galaxyproject.eu port 22: Connection timed out\r\n", "unreachable": true}
fatal: [sn05.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"sn05.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host sn05.galaxyproject.eu port 22: Connection timed out\r\n", "unreachable": true}
fatal: [test.internal.usegalaxy.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"test.internal.usegalaxy.eu\". Make sure this host can be reached over ssh: ssh: connect to host test.internal.usegalaxy.eu port 22: Connection timed out\r\n", "unreachable": true}

# no route to host
fatal: [cvmfs-stratum0-test.galaxyproject.eu]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"cvmfs-stratum0-test.galaxyproject.eu\". Make sure this host can be reached over ssh: ssh: connect to host cvmfs-stratum0-test.galaxyproject.eu port 22: No route to host\r\n", "unreachable": true}
```

I assume sn05, test.internal.usegalaxy.eu and cvmfs-stratum0-test.galaxyproject.eu are "off" (physically off? cloud node destroyed? clarification would help), but I would expect apps.galaxyproject.eu, beacon.galaxyproject.eu and telescope.internal.galaxyproject.eu to be reachable. If anybody has explanations for these hosts being unreachable (and the type of error) from having worked with them, I'd be grateful @mira-miracoli, @bgruening.
Let admins login to all machines using the user ubuntu if it exists
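A hedged sketch of the "if it exists" check (an assumed implementation, not necessarily what the commit does; `admin_keys` is an illustrative variable):

```yaml
# Sketch: probe for the ubuntu account first, then only manage its
# authorized keys when the lookup succeeded.
- name: Check whether the ubuntu user exists
  ansible.builtin.getent:
    database: passwd
    key: ubuntu
  register: ubuntu_user
  ignore_errors: true

- name: Let admins log in as ubuntu (only if the user exists)
  ansible.posix.authorized_key:
    user: ubuntu
    key: "{{ admin_keys | join('\n') }}"
  when: ubuntu_user is succeeded
```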
Just for the record and FYI, if you ever find yourself locked out of cloud servers, use this: … After you're done, run … Due to how our cloud is configured, you MUST specify an image, otherwise rescue mode fails (I guess because nobody configured a default image).
Node apps cannot be managed via Jenkins because its host key is not trusted (it is not signed by the usegalaxy.eu cert authority). Although the solution is to sign it, as a workaround add it to the `known_hosts` Makefile target so that PR usegalaxy-eu#1673 (Centralized SSH key management, usegalaxy-eu#1673) can be merged.
4d75e16 to af64aac (force-pushed)

First draft for ssh user key management.
Feel free to:
I will leave this here for now until I have a better visibility on the servers.
Role-based SSH key management:
`group_vars/ssh.yml` to define `ssh_users` with associated keys and roles, supporting both direct key strings and URLs for fetching keys.

Root user SSH key handling:
`roles/ssh-manager/tasks/root.yml` to manage the root account: ensures the `.ssh` directory exists, collects keys from both direct input and URLs, deduplicates, and writes the `authorized_keys` file for root.

Non-root (role-based) user SSH key handling:
`roles/ssh-manager/tasks/generic.yml` to dynamically discover non-root role accounts, ensure their `.ssh` directories exist, collect keys from both direct input and URLs, deduplicate, and update their `authorized_keys` files.

Task orchestration:
`roles/ssh-manager/tasks/main.yml` to orchestrate the inclusion of root and generic user SSH key tasks.

Questions:
Are there any servers without a root account? The root tasks would probably need to be changed.
There is this variable `galaxy_user_public_key` which I haven't taken into account yet. usegalaxy-eu/issues#785