Skip to content

failure to launch propolis could be easier to debug #136

@davepacheco

Description

@davepacheco

Anecdotally, @plotnick and I (at least) have observed that a4x2 launch works pretty reliably a few times after a system has rebooted, but after a while, it starts failing. I dug into one of these failures today. It looks like:

What a4x2/falcon reported was:

dap@ivanova a4x2 $ pfexec ./a4x2 launch
Apr 09 16:13:39.145 INFO creating links
Apr 09 16:13:39.145 INFO creating simnet link 'a4x2_ce_vn_sim0'
Apr 09 16:13:39.146 INFO creating vnic link 'a4x2_ce_vn_vnic0'
Apr 09 16:13:39.154 INFO creating simnet link 'a4x2_cr1_vn_sim0'
Apr 09 16:13:39.155 INFO creating vnic link 'a4x2_cr1_vn_vnic0'
Apr 09 16:13:39.159 INFO creating simnet link 'a4x2_ce_vn_sim1'
Apr 09 16:13:39.160 INFO creating vnic link 'a4x2_ce_vn_vnic1'
Apr 09 16:13:39.164 INFO creating simnet link 'a4x2_cr2_vn_sim0'
Apr 09 16:13:39.164 INFO creating vnic link 'a4x2_cr2_vn_vnic0'
Apr 09 16:13:39.168 INFO creating simnet link 'a4x2_g0_sn_sim0'
Apr 09 16:13:39.169 INFO creating vnic link 'a4x2_g0_sn_vnic0'
Apr 09 16:13:39.173 INFO creating simnet link 'a4x2_g0_vn_sim0'
Apr 09 16:13:39.174 INFO creating vnic link 'a4x2_g0_vn_vnic0'
Apr 09 16:13:39.179 INFO creating simnet link 'a4x2_g3_sn_sim0'
Apr 09 16:13:39.180 INFO creating vnic link 'a4x2_g3_sn_vnic0'
Apr 09 16:13:39.185 INFO creating simnet link 'a4x2_g0_vn_sim2'
Apr 09 16:13:39.186 INFO creating vnic link 'a4x2_g0_vn_vnic2'
Apr 09 16:13:39.190 INFO creating simnet link 'a4x2_g0_sn_sim4'
Apr 09 16:13:39.191 INFO creating vnic link 'a4x2_g0_sn_vnic4'
Apr 09 16:13:39.195 INFO creating simnet link 'a4x2_g1_vn_sim0'
Apr 09 16:13:39.196 INFO creating vnic link 'a4x2_g1_vn_vnic0'
Apr 09 16:13:39.200 INFO creating simnet link 'a4x2_g3_sn_sim1'
Apr 09 16:13:39.201 INFO creating vnic link 'a4x2_g3_sn_vnic1'
Apr 09 16:13:39.205 INFO creating simnet link 'a4x2_g1_vn_sim1'
Apr 09 16:13:39.206 INFO creating vnic link 'a4x2_g1_vn_vnic1'
Apr 09 16:13:39.210 INFO creating simnet link 'a4x2_g0_sn_sim5'
Apr 09 16:13:39.211 INFO creating vnic link 'a4x2_g0_sn_vnic5'
Apr 09 16:13:39.215 INFO creating simnet link 'a4x2_g2_vn_sim0'
Apr 09 16:13:39.216 INFO creating vnic link 'a4x2_g2_vn_vnic0'
Apr 09 16:13:39.222 INFO creating simnet link 'a4x2_g3_sn_sim2'
Apr 09 16:13:39.222 INFO creating vnic link 'a4x2_g3_sn_vnic2'
Apr 09 16:13:39.227 INFO creating simnet link 'a4x2_g2_vn_sim1'
Apr 09 16:13:39.227 INFO creating vnic link 'a4x2_g2_vn_vnic1'
Apr 09 16:13:39.232 INFO creating simnet link 'a4x2_g0_sn_sim6'
Apr 09 16:13:39.233 INFO creating vnic link 'a4x2_g0_sn_vnic6'
Apr 09 16:13:39.237 INFO creating simnet link 'a4x2_g3_vn_sim3'
Apr 09 16:13:39.238 INFO creating vnic link 'a4x2_g3_vn_vnic3'
Apr 09 16:13:39.244 INFO creating simnet link 'a4x2_g3_sn_sim4'
Apr 09 16:13:39.245 INFO creating vnic link 'a4x2_g3_sn_vnic4'
Apr 09 16:13:39.251 INFO creating simnet link 'a4x2_g3_vn_sim4'
Apr 09 16:13:39.251 INFO creating vnic link 'a4x2_g3_vn_vnic4'
Apr 09 16:13:39.256 INFO creating simnet link 'a4x2_g0_sn_sim7'
Apr 09 16:13:39.257 INFO creating vnic link 'a4x2_g0_sn_vnic7'
Apr 09 16:13:39.261 INFO creating simnet link 'a4x2_cr1_vn_sim1'
Apr 09 16:13:39.261 INFO creating vnic link 'a4x2_cr1_vn_vnic1'
Apr 09 16:13:39.266 INFO creating simnet link 'a4x2_g3_sn_sim7'
Apr 09 16:13:39.267 INFO creating vnic link 'a4x2_g3_sn_vnic7'
Apr 09 16:13:39.271 INFO creating simnet link 'a4x2_cr1_vn_sim2'
Apr 09 16:13:39.272 INFO creating vnic link 'a4x2_cr1_vn_vnic2'
Apr 09 16:13:39.276 INFO creating simnet link 'a4x2_g0_sn_sim8'
Apr 09 16:13:39.277 INFO creating vnic link 'a4x2_g0_sn_vnic8'
Apr 09 16:13:39.282 INFO creating simnet link 'a4x2_cr2_vn_sim1'
Apr 09 16:13:39.283 INFO creating vnic link 'a4x2_cr2_vn_vnic1'
Apr 09 16:13:39.288 INFO creating simnet link 'a4x2_g3_sn_sim8'
Apr 09 16:13:39.289 INFO creating vnic link 'a4x2_g3_sn_vnic8'
Apr 09 16:13:39.293 INFO creating simnet link 'a4x2_cr2_vn_sim2'
Apr 09 16:13:39.294 INFO creating vnic link 'a4x2_cr2_vn_vnic2'
Apr 09 16:13:39.298 INFO creating external links
Apr 09 16:13:39.298 INFO creating external link a4x2_ce_vn_vnic2
Apr 09 16:13:39.299 INFO creating external link a4x2_g0_vn_vnic3
Apr 09 16:13:39.300 INFO creating external link a4x2_g1_vn_vnic2
Apr 09 16:13:39.301 INFO creating external link a4x2_g2_vn_vnic2
Apr 09 16:13:39.302 INFO creating external link a4x2_g3_vn_vnic6
Apr 09 16:13:39.303 INFO creating external link a4x2_cr1_vn_vnic3
Apr 09 16:13:39.303 INFO creating external link a4x2_cr2_vn_vnic3
Apr 09 16:13:39.304 INFO creating nodes
Apr 09 16:13:39.346 INFO launched instance cr1 with pid 3878 on port 52418
Apr 09 16:13:39.417 INFO instance ensure: cr1
Apr 09 16:13:39.428 INFO launched instance cr2 with pid 3879 on port 51401
Apr 09 16:13:39.465 INFO instance ensure: cr2
Apr 09 16:13:39.465 INFO launched instance g0 with pid 3880 on port 38084
Apr 09 16:13:39.501 INFO instance ensure: g0
Apr 09 16:13:39.502 INFO launched instance g1 with pid 3881 on port 59666
Apr 09 16:13:39.538 INFO instance ensure: g1
Apr 09 16:13:39.539 INFO launched instance g2 with pid 3882 on port 64792
Apr 09 16:13:39.575 INFO instance ensure: g2
Apr 09 16:13:39.575 INFO launched instance g3 with pid 3883 on port 48711
Apr 09 16:13:39.612 INFO instance ensure: g3
Apr 09 16:13:39.612 INFO launched instance ce with pid 3877 on port 45437
Apr 09 16:13:39.648 INFO instance ensure: ce
Apr 09 16:13:43.003 INFO instance run: cr2
Apr 09 16:13:43.240 INFO instance run: ce
Apr 09 16:13:43.287 INFO instance run: cr1
Apr 09 16:13:45.907 INFO instance run: g0
Apr 09 16:13:45.913 INFO instance run: g3
Apr 09 16:13:48.760 INFO instance ensure: g1
Apr 09 16:13:48.760 INFO instance ensure: g2
Apr 09 16:13:49.939 INFO instance ensure: g2
Apr 09 16:13:49.943 INFO instance ensure: g1
Apr 09 16:13:50.940 INFO instance ensure: g2
Apr 09 16:13:50.945 INFO instance ensure: g1
Apr 09 16:13:51.941 INFO instance ensure: g2
Apr 09 16:13:51.946 INFO instance ensure: g1
Apr 09 16:13:52.944 INFO instance ensure: g2
Apr 09 16:13:52.948 INFO instance ensure: g1
Apr 09 16:13:53.945 INFO instance ensure: g2
Apr 09 16:13:53.950 INFO instance ensure: g1
Apr 09 16:13:54.948 INFO instance ensure: g2
Apr 09 16:13:54.952 INFO instance ensure: g1
Apr 09 16:13:55.949 INFO instance ensure: g2
Apr 09 16:13:55.953 INFO instance ensure: g1
Apr 09 16:13:56.952 INFO instance ensure: g2
Apr 09 16:13:56.956 INFO instance ensure: g1
Apr 09 16:13:57.953 INFO instance ensure: g2
Apr 09 16:13:57.957 INFO instance ensure: g1
Apr 09 16:13:58.955 INFO instance ensure: g2
Apr 09 16:13:58.959 INFO instance ensure: g1
Apr 09 16:13:59.957 INFO instance ensure: g2
Apr 09 16:13:59.961 INFO instance ensure: g1
Apr 09 16:14:00.958 INFO instance ensure: g2
Apr 09 16:14:00.962 INFO instance ensure: g1
Apr 09 16:14:01.960 INFO instance ensure: g2
Apr 09 16:14:01.965 INFO instance ensure: g1
Apr 09 16:14:02.962 INFO instance ensure: g2
Apr 09 16:14:02.967 INFO instance ensure: g1
Apr 09 16:14:03.964 INFO instance ensure: g2
Apr 09 16:14:03.968 INFO instance ensure: g1
Apr 09 16:14:04.965 INFO instance ensure: g2
Apr 09 16:14:04.969 INFO instance ensure: g1
Apr 09 16:14:05.967 INFO instance ensure: g2
Apr 09 16:14:05.970 INFO instance ensure: g1
Apr 09 16:14:06.968 INFO instance ensure: g2
Apr 09 16:14:06.971 INFO instance ensure: g1
Apr 09 16:14:07.970 INFO instance ensure: g2
Apr 09 16:14:07.972 INFO instance ensure: g1
Apr 09 16:14:08.973 INFO instance ensure: g2
Apr 09 16:14:08.975 INFO instance ensure: g1
Apr 09 16:14:09.974 INFO instance ensure: g2
Apr 09 16:14:09.976 INFO instance ensure: g1
Apr 09 16:14:10.977 INFO instance ensure: g2
Apr 09 16:14:10.978 INFO instance ensure: g1
Apr 09 16:14:11.576 INFO g0: mounting /opt/cargo-bay
Apr 09 16:14:11.606 INFO g3: mounting /opt/cargo-bay
Apr 09 16:14:11.978 INFO instance ensure: g2
Apr 09 16:14:11.979 INFO instance ensure: g1
Apr 09 16:14:12.981 INFO instance ensure: g1
Apr 09 16:14:12.981 INFO instance ensure: g2
Apr 09 16:14:13.983 INFO instance ensure: g1
Apr 09 16:14:13.983 INFO instance ensure: g2
Apr 09 16:14:14.984 INFO instance ensure: g1
Apr 09 16:14:14.985 INFO instance ensure: g2
Apr 09 16:14:15.987 INFO instance ensure: g1
Apr 09 16:14:15.987 INFO instance ensure: g2
Apr 09 16:14:16.989 INFO instance ensure: g1
Apr 09 16:14:16.990 INFO instance ensure: g2
Apr 09 16:14:31.137 INFO g0: finished mounting /opt/cargo-bay
Apr 09 16:14:31.178 INFO g3: finished mounting /opt/cargo-bay
Apr 09 16:14:31.182 INFO g0: logging out
Apr 09 16:14:31.223 INFO g3: logging out
Apr 09 16:14:31.296 INFO g0: logged out
Apr 09 16:14:31.335 INFO g3: logged out
Apr 09 16:14:59.943 INFO ce: mounting /opt/cargo-bay
Apr 09 16:14:59.997 INFO ce: finished mounting /opt/cargo-bay
Apr 09 16:15:00.041 INFO ce: logging out
Apr 09 16:15:00.676 INFO cr2: mounting /opt/cargo-bay
Apr 09 16:15:00.678 INFO cr1: mounting /opt/cargo-bay
Apr 09 16:15:00.710 INFO cr1: finished mounting /opt/cargo-bay
Apr 09 16:15:00.720 INFO cr2: finished mounting /opt/cargo-bay
Apr 09 16:15:00.755 INFO cr1: logging out
Apr 09 16:15:00.764 INFO cr2: logging out
Apr 09 16:15:01.177 INFO ce: logged out
Apr 09 16:15:01.996 INFO cr1: logged out
Apr 09 16:15:02.033 INFOerror: Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146)
 cr2: logged out
Apr 09 16:15:02.034 ERRO launch failed: Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146)
Apr 09 16:15:02.034 INFO ce: init start
Apr 09 16:15:02.035 INFO cr1: init start
Apr 09 16:15:02.035 INFO cr2: init start
Apr 09 16:15:02.035 INFO g0: init start
Apr 09 16:15:02.036 INFO g1: init start
Apr 09 16:15:02.036 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:02.036 INFO g2: init start
Apr 09 16:15:02.036 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:02.037 INFO g3: init start
Apr 09 16:15:03.038 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:03.038 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:04.040 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:04.040 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:04.532 INFO cr2: init finish
Apr 09 16:15:04.759 INFO cr1: init finish
Apr 09 16:15:05.042 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:05.042 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:06.043 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:06.044 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:07.045 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:07.045 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:08.046 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:08.046 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:09.047 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:09.047 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:10.049 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:10.049 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:10.921 INFO ce: init finish
Apr 09 16:15:11.050 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:11.050 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:12.052 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:12.052 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:13.053 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:13.053 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:14.054 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:14.054 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:15.055 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:15.055 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:16.057 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:16.057 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:17.058 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:17.058 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:18.059 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:18.059 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:19.060 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:19.061 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:20.061 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:20.062 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:21.065 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:21.065 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:22.067 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:22.067 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:23.070 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:23.070 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:24.073 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:24.073 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:25.075 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:25.075 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:26.076 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:26.076 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:27.078 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:27.078 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:28.079 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:28.079 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:29.081 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:29.081 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:30.083 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:30.083 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:31.083 WARN [sc] g1: IO error: Connection refused (os error 146)
Apr 09 16:15:31.084 WARN [sc] g2: IO error: Connection refused (os error 146)
Apr 09 16:15:32.084 INFO g1: init finish
Apr 09 16:15:32.085 INFO g2: init finish
Apr 09 16:17:55.058 INFO g0: init finish
Apr 09 16:17:55.389 INFO g3: init finish
Error: g1: init IO error: Connection refused (os error 146)

Mostly through code inspection and grep'ing for error messages, I pieced together this sequence:

  • a4x2 invokes falcon's run()
    • run() invokes launch()
      • that invokes r.launch() on Runner
        • that runs preflight, then r.do_launch()
          • that iterates the nodes and kicks off n.launch() for each one
            - this calls launch_vm
            - this uses client.instance_ensure() on a Propolis (HTTP API) client. It retries 30 times. That's what we're seeing for all the "instance ensure" info-level messages.
            • for the successful nodes (most of them), this completes relatively quickly
            • for g1 and g2: they get stuck
          • then it uses futures' join_all to wait for them, presumably in the order that they finish. it bails out on the first failure, which in this case I think is g1.
          • it immediately prints error: Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146), but this gets injected into the middle of an unrelated slog message in the output above
      • back in launch(), it prints an error-level log message
    • run() does not return an error
  • a4x2, seeing no error, proceeds to init
    • for g1 and g2, this gets stuck in a retry loop trying to connect to the serial console (this is the "[sc]" in the messages).

I'm guessing it's a bug that we don't propagate the error from launch()? That's here:

launch(r).await;

I notice that some of the other code paths (like info and reboot) do propagate the error while others don't (like preflight).

Some suggestions for making this more debuggable:

  • some context on the [sc] g1: IO error: Connection refused (os error 146) messages would be useful (e.g., what was trying to make a connection and why?) -- something like error attempting to connect to serial console of g1: connect ws://.../: Connection refused (os error 146).
  • some context on the message Communication Error: error sending request for url (http://[::1]:59666/instance): error trying to connect: tcp connect error: Connection refused (os error 146) saying what it was doing and why (e.g., which VM it was, that it was trying to talk to Propolis to configure the VM).
  • maybe even the "instance ensure" message could say "configuring Propolis with VM details" or something?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions