-
Notifications
You must be signed in to change notification settings - Fork 5
Open
Description
Anecdotally, @plotnick and I (at least) have observed that a4x2 launch works pretty reliably a few times after a system has rebooted, but after a while, it starts failing. I dug into one of these failures today. It looks like:
- reason TBD: propolis failed to allocate memory for the VM
- then propolis crashed due to failing to start a VM produces a "cannot drop a runtime in a context where blocking is not allowed" panic propolis#838. This was fixed recently. But I don't think we can use the fixed Propolis until Move to latest propolis #126.
What a4x2/falcon reported was:
dap@ivanova a4x2 $ pfexec ./a4x2 launch
Apr 09 16:13:39.145 INFO creating links
Apr 09 16:13:39.145 INFO creating simnet link 'a4x2_ce_vn_sim0'
Apr 09 16:13:39.146 INFO creating vnic link 'a4x2_ce_vn_vnic0'
Apr 09 16:13:39.154 INFO creating simnet link 'a4x2_cr1_vn_sim0'
Apr 09 16:13:39.155 INFO creating vnic link 'a4x2_cr1_vn_vnic0'
Apr 09 16:13:39.159 INFO creating simnet link 'a4x2_ce_vn_sim1'
Apr 09 16:13:39.160 INFO creating vnic link 'a4x2_ce_vn_vnic1'
Apr 09 16:13:39.164 INFO creating simnet link 'a4x2_cr2_vn_sim0'
Apr 09 16:13:39.164 INFO creating vnic link 'a4x2_cr2_vn_vnic0'
Apr 09 16:13:39.168 INFO creating simnet link 'a4x2_g0_sn_sim0'
Apr 09 16:13:39.169 INFO creating vnic link 'a4x2_g0_sn_vnic0'
Apr 09 16:13:39.173 INFO creating simnet link 'a4x2_g0_vn_sim0'
Apr 09 16:13:39.174 INFO creating vnic link 'a4x2_g0_vn_vnic0'
Apr 09 16:13:39.179 INFO creating simnet link 'a4x2_g3_sn_sim0'
Apr 09 16:13:39.180 INFO creating vnic link 'a4x2_g3_sn_vnic0'
Apr 09 16:13:39.185 INFO creating simnet link 'a4x2_g0_vn_sim2'
Apr 09 16:13:39.186 INFO creating vnic link 'a4x2_g0_vn_vnic2'
Apr 09 16:13:39.190 INFO creating simnet link 'a4x2_g0_sn_sim4'
Apr 09 16:13:39.191 INFO creating vnic link 'a4x2_g0_sn_vnic4'
Apr 09 16:13:39.195 INFO creating simnet link 'a4x2_g1_vn_sim0'
Apr 09 16:13:39.196 INFO creating vnic link 'a4x2_g1_vn_vnic0'
Apr 09 16:13:39.200 INFO creating simnet link 'a4x2_g3_sn_sim1'
Apr 09 16:13:39.201 INFO creating vnic link 'a4x2_g3_sn_vnic1'
Apr 09 16:13:39.205 INFO creating simnet link 'a4x2_g1_vn_sim1'
Apr 09 16:13:39.206 INFO creating vnic link 'a4x2_g1_vn_vnic1'
Apr 09 16:13:39.210 INFO creating simnet link 'a4x2_g0_sn_sim5'
Apr 09 16:13:39.211 INFO creating vnic link 'a4x2_g0_sn_vnic5'
Apr 09 16:13:39.215 INFO creating simnet link 'a4x2_g2_vn_sim0'
Apr 09 16:13:39.216 INFO creating vnic link 'a4x2_g2_vn_vnic0'
Apr 09 16:13:39.222 INFO creating simnet link 'a4x2_g3_sn_sim2'
Apr 09 16:13:39.222 INFO creating vnic link 'a4x2_g3_sn_vnic2'
Apr 09 16:13:39.227 INFO creating simnet link 'a4x2_g2_vn_sim1'
Apr 09 16:13:39.227 INFO creating vnic link 'a4x2_g2_vn_vnic1'
Apr 09 16:13:39.232 INFO creating simnet link 'a4x2_g0_sn_sim6'
Apr 09 16:13:39.233 INFO creating vnic link 'a4x2_g0_sn_vnic6'
Apr 09 16:13:39.237 INFO creating simnet link 'a4x2_g3_vn_sim3'
Apr 09 16:13:39.238 INFO creating vnic link 'a4x2_g3_vn_vnic3'
Apr 09 16:13:39.244 INFO creating simnet link 'a4x2_g3_sn_sim4'
Apr 09 16:13:39.245 INFO creating vnic link 'a4x2_g3_sn_vnic4'
Apr 09 16:13:39.251 INFO creating simnet link 'a4x2_g3_vn_sim4'
Apr 09 16:13:39.251 INFO creating vnic link 'a4x2_g3_vn_vnic4'
Apr 09 16:13:39.256 INFO creating simnet link 'a4x2_g0_sn_sim7'
Apr 09 16:13:39.257 INFO creating vnic link 'a4x2_g0_sn_vnic7'
Apr 09 16:13:39.261 INFO creating simnet link 'a4x2_cr1_vn_sim1'
Apr 09 16:13:39.261 INFO creating vnic link 'a4x2_cr1_vn_vnic1'
Apr 09 16:13:39.266 INFO creating simnet link 'a4x2_g3_sn_sim7'
Apr 09 16:13:39.267 INFO creating vnic link 'a4x2_g3_sn_vnic7'
Apr 09 16:13:39.271 INFO creating simnet link 'a4x2_cr1_vn_sim2'
Apr 09 16:13:39.272 INFO creating vnic link 'a4x2_cr1_vn_vnic2'
Apr 09 16:13:39.276 INFO creating simnet link 'a4x2_g0_sn_sim8'
Apr 09 16:13:39.277 INFO creating vnic link 'a4x2_g0_sn_vnic8'
Apr 09 16:13:39.282 INFO creating simnet link 'a4x2_cr2_vn_sim1'
Apr 09 16:13:39.283 INFO creating vnic link 'a4x2_cr2_vn_vnic1'
Apr 09 16:13:39.288 INFO creating simnet link 'a4x2_g3_sn_sim8'
Apr 09 16:13:39.289 INFO creating vnic link 'a4x2_g3_sn_vnic8'
Apr 09 16:13:39.293 INFO creating simnet link 'a4x2_cr2_vn_sim2'
Apr 09 16:13:39.294 INFO creating vnic link 'a4x2_cr2_vn_vnic2'
Apr 09 16:13:39.298 INFO creating external links
Apr 09 16:13:39.298 INFO creating external link a4x2_ce_vn_vnic2
Apr 09 16:13:39.299 INFO creating external link a4x2_g0_vn_vnic3
Apr 09 16:13:39.300 INFO creating external link a4x2_g1_vn_vnic2
Apr 09 16:13:39.301 INFO creating external link a4x2_g2_vn_vnic2
Apr 09 16:13:39.302 INFO creating external link a4x2_g3_vn_vnic6
Apr 09 16:13:39.303 INFO creating external link a4x2_cr1_vn_vnic3
Apr 09 16:13:39.303 INFO creating external link a4x2_cr2_vn_vnic3
Apr 09 16:13:39.304 INFO creating nodes
Apr 09 16:13:39.346 INFO launched instance cr1 with pid 3878 on port 52418
Apr 09 16:13:39.417 INFO instance ensure: cr1
Apr 09 16:13:39.428 INFO launched instance cr2 with pid 3879 on port 51401
Apr 09 16:13:39.465 INFO instance ensure: cr2
Apr 09 16:13:39.465 INFO launched instance g0 with pid 3880 on port 38084
Apr 09 16:13:39.501 INFO instance ensure: g0
Apr 09 16:13:39.502 INFO launched instance g1 with pid 3881 on port 59666
Apr 09 16:13:39.538 INFO instance ensure: g1
Apr 09 16:13:39.539 INFO launched instance g2 with pid 3882 on port 64792
Apr 09 16:13:39.575 INFO instance ensure: g2
Apr 09 16:13:39.575 INFO launched instance g3 with pid 3883 on port 48711
Apr 09 16:13:39.612 INFO instance ensure: g3
Apr 09 16:13:39.612 INFO launched instance ce with pid 3877 on port 45437
Apr 09 16:13:39.648 INFO instance ensure: ce
Apr 09 16:13:43.003 INFO instance run: cr2
Apr 09 16:13:43.240 INFO instance run: ce
Apr 09 16:13:43.287 INFO instance run: cr1
Apr 09 16:13:45.907 INFO instance run: g0
Apr 09 16:13:45.913 INFO instance run: g3
Apr 09 16:13:48.760 INFO instance ensure: g1
Apr 09 16:13:48.760 INFO instance ensure: g2
Apr 09 16:13:49.939 INFO instance ensure: g2
Apr 09 16:13:49.943 INFO instance ensure: g1
Apr 09 16:13:50.940 INFO instance ensure: g2
Apr 09 16:13:50.945 INFO instance ensure: g1
Apr 09 16:13:51.941 INFO instance ensure: g2
Apr 09 16:13:51.946 INFO instance ensure: g1
Apr 09 16:13:52.944 INFO instance ensure: g2
Apr 09 16:13:52.948 INFO instance ensure: g1
Apr 09 16:13:53.945 INFO instance ensure: g2
Apr 09 16:13:53.950 INFO instance ensure: g1
Apr 09 16:13:54.948 INFO instance ensure: g2
Apr 09 16:13:54.952 INFO instance ensure: g1
Apr 09 16:13:55.949 INFO instance ensure: g2
Apr 09 16:13:55.953 INFO instance ensure: g1
Apr 09 16:13:56.952 INFO instance ensure: g2
Apr 09 16:13:56.956 INFO instance ensure: g1
Apr 09 16:13:57.953 INFO instance ensure: g2
Apr 09 16:13:57.957 INFO instance ensure: g1
Apr 09 16:13:58.955 INFO instance ensure: g2
Apr 09 16:13:58.959 INFO instance ensure: g1
Apr 09 16:13:59.957 INFO instance ensure: g2
Apr 09 16:13:59.961 INFO instance ensure: g1
Apr 09 16:14:00.958 INFO instance ensure: g2
Apr 09 16:14:00.962 INFO instance ensure: g1
Apr 09 16:14:01.960 INFO instance ensure: g2
Apr 09 16:14:01.965 INFO instance ensure: g1
Apr 09 16:14:02.962 INFO instance ensure: g2
Apr 09 16:14:02.967 INFO instance ensure: g1
Apr 09 16:14:03.964 INFO instance ensure: g2
Apr 09 16:14:03.968 INFO instance ensure: g1
Apr 09 16:14:04.965 INFO instance ensure: g2
Apr 09 16:14:04.969 INFO instance ensure: g1
Apr 09 16:14:05.967 INFO instance ensure: g2
Apr 09 16:14:05.970 INFO instance ensure: g1
Apr 09 16:14:06.968 INFO instance ensure: g2
Apr 09 16:14:06.971 INFO instance ensure: g1
Apr 09 16:14:07.970 INFO instance ensure: g2
Apr 09 16:14:07.972 INFO instance ensure: g1
Apr 09 16:14:08.973 INFO instance ensure: g2
Apr 09 16:14:08.975 INFO instance ensure: g1
Apr 09 16:14:09.974 INFO instance ensure: g2
Apr 09 16:14:09.976 INFO instance ensure: g1
Apr 09 16:14:10.977 INFO instance ensure: g2
Apr 09 16:14:10.978 INFO instance ensure: g1
Apr 09 16:14:11.576 INFO g0: mounting /opt/cargo-bay
Apr 09 16:14:11.606 INFO g3: mounting /opt/cargo-bay
Apr 09 16:14:11.978 INFO instance ensure: g2
Apr 09 16:14:11.979 INFO instance ensure: g1
Apr 09 16:14:12.981 INFO instance ensure: g1
Apr 09 16:14:12.981 INFO instance ensure: g2
Apr 09 16:14:13.983 INFO instance ensure: g1
Apr 09 16:14:13.983 INFO instance ensure: g2
Apr 09 16:14:14.984 INFO instance ensure: g1
Apr 09 16:14:14.985 INFO instance ensure: g2
Apr 09 16:14:15.987 INFO instance ensure: g1
Apr 09 16:14:15.987 INFO instance ensure: g2
Apr 09 16:14:16.989 INFO instance ensure: g1
Apr 09 16:14:16.990 INFO instance ensure: g2
Apr 09 16:14:31.137 INFO g0: finished mounting /opt/cargo-bay
Apr 09 16:14:31.178 INFO g3: finished mounting /opt/cargo-bay
Apr 09 16:14:31.182 INFO g0: logging out
Apr 09 16:14:31.223 INFO g3: logging out
Apr 09 16:14:31.296 INFO g0: logged out
Apr 09 16:14:31.335 INFO g3: logged out
Apr 09 16:14:59.943 INFO ce: mounting /opt/cargo-bay
Apr 09 16:14:59.997 INFO ce: finished mounting /opt/cargo-bay
Apr 09 16:15:00.041 INFO ce: logging out
Apr 09 16:15:00.676 INFO cr2: mounting /opt/cargo-bay
Apr 09 16:15:00.678 INFO cr1: mounting /opt/cargo-bay
Apr 09 16:15:00.710 INFO cr1: finished mounting /opt/cargo-bay
Apr 09 16:15:00.720 INFO cr2: finished mounting /opt/cargo-bay
Apr 09 16:15:00.755 INFO cr1: logging out
Apr 09 16:15:00.764 INFO cr2: logging out
Apr 09 16:15:01.177 INFO ce: logged out
Apr 09 16:15:01.996 INFO cr1: logged out
Apr 09 16:15:02.033 INFOerror: Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146)
cr2: logged out
Apr 09 16:15:02.034 ERRO launch failed: Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146)
Apr 09 16:15:02.034 INFO ce: init start
Apr 09 16:15:02.035 INFO cr1: init start
Apr 09 16:15:02.035 INFO cr2: init start
Apr 09 16:15:02.035 INFO g0: init start
Apr 09 16:15:02.036 INFO g1: init start
Apr 09 16:15:02.036 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:02.036 INFO g2: init start
Apr 09 16:15:02.036 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:02.037 INFO g3: init start
Apr 09 16:15:03.038 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:03.038 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:04.040 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:04.040 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:04.532 INFO cr2: init finish
Apr 09 16:15:04.759 INFO cr1: init finish
Apr 09 16:15:05.042 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:05.042 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:06.043 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:06.044 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:07.045 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:07.045 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:08.046 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:08.046 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:09.047 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:09.047 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:10.049 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:10.049 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:10.921 INFO ce: init finish
Apr 09 16:15:11.050 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:11.050 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:12.052 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:12.052 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:13.053 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:13.053 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:14.054 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:14.054 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:15.055 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:15.055 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:16.057 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:16.057 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:17.058 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:17.058 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:18.059 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:18.059 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:19.060 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:19.061 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:20.061 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:20.062 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:21.065 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:21.065 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:22.067 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:22.067 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:23.070 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:23.070 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:24.073 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:24.073 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:25.075 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:25.075 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:26.076 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:26.076 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:27.078 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:27.078 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:28.079 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:28.079 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:29.081 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:29.081 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:30.083 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:30.083 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:31.083 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:31.084 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:32.084 INFO g1: init finish
Apr 09 16:15:32.085 INFO g2: init finish
Apr 09 16:17:55.058 INFO g0: init finish
Apr 09 16:17:55.389 INFO g3: init finish
Error: g1: init IO error: Connection refused (os error 146)
Mostly through code inspection and grep'ing for error messages, I pieced together this sequence:
a4x2invokes falcon'srun()run()invokeslaunch()- that invokes
r.launch()onRunner- that runs preflight, then
r.do_launch()- that iterates the nodes and kicks off
n.launch()for each one
- this callslaunch_vm
- this usesclient.instance_ensure()on a Propolis (HTTP API) client. It retries 30 times. That's what we're seeing for all the "instance ensure" info-level messages.- for the successful nodes (most of them), this completes relatively quickly
- for g1 and g2: they get stuck
- then it uses futures'
join_allto wait for them, presumably in the order that they finish. it bails out on the first failure, which in this case I think is g1. - it immediately prints
error: Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146), but this gets injected into the middle of an unrelated slog message in the output above
- that iterates the nodes and kicks off
- that runs preflight, then
- back in
launch(), it prints an error-level log message
- that invokes
run()does not return an error
a4x2, seeing no error, proceeds to init- for g1 and g2, this gets stuck in a retry loop trying to connect to the serial console (this is the "[sc]" in the messages).
I'm guessing it's a bug that we don't propagate the error from launch()? That's here:
Line 230 in 0412e97
| launch(r).await; |
I notice that some of the other code paths (like info and reboot) do propagate the error while others don't (like preflight).
Some suggestions for making this more debuggable:
- some context on the
[sc] g1: IO error: Connection refused (os error 146)messages would be useful (e.g., what was trying to make a connection and why?) -- something likeerror attempting to connect to serial console of g1: connect ws://.../: Connection refused (os error 146). - some context on the message
Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146)saying what it was doing and why (e.g., which VM it was, that it was trying to talk to Propolis to configure the VM). - maybe even the "instance ensure" message could say "configuring Propolis with VM details" or something?
Metadata
Metadata
Assignees
Labels
No labels