Conversation

@ggevay ggevay commented Nov 2, 2025

This is just @aljoscha's #29673 rebased. Gonna run a new Nightly.

There was a somewhat scary conflict resolution in sequence_create_source: #29673 deleted the closure of the catalog_transact_with_ddl_transaction call and moved its functionality into handle_create_source, but in the meantime the closure was changed on main by 6dd815f (around the coord.catalog().get_entry(&item_id).latest_global_id(); line of this commit). I applied the same change to handle_create_source: I copied the line from 6dd815f that calls latest_global_id(), made it use the item_id argument given to handle_create_source (this argument was previously unused), and made the IngestionDescription::new call use the new item_global_id instead of source.global_id. I hope this is correct.
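
For illustration, here is a minimal toy model of the shape of that change. Everything below is a simplified stand-in, not the actual Materialize code; only the names `handle_create_source`, `item_id`, `latest_global_id`, `item_global_id`, and `IngestionDescription::new` come from the description above.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CatalogItemId(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct GlobalId(u64);

struct CatalogEntry {
    global_ids: Vec<GlobalId>,
}

impl CatalogEntry {
    // Toy stand-in for the latest_global_id() call copied over from 6dd815f.
    fn latest_global_id(&self) -> GlobalId {
        *self.global_ids.last().expect("entry has at least one GlobalId")
    }
}

struct Catalog {
    entries: HashMap<CatalogItemId, CatalogEntry>,
}

impl Catalog {
    fn get_entry(&self, item_id: &CatalogItemId) -> &CatalogEntry {
        &self.entries[item_id]
    }
}

#[derive(Debug, PartialEq, Eq)]
struct IngestionDescription {
    ingestion_id: GlobalId,
}

impl IngestionDescription {
    fn new(ingestion_id: GlobalId) -> Self {
        Self { ingestion_id }
    }
}

// The shape of the fix: use the previously unused `item_id` argument to look
// up the latest GlobalId, and pass that to IngestionDescription::new instead
// of a possibly stale source.global_id.
fn handle_create_source(catalog: &Catalog, item_id: CatalogItemId) -> IngestionDescription {
    let item_global_id = catalog.get_entry(&item_id).latest_global_id();
    IngestionDescription::new(item_global_id)
}

fn main() {
    let mut entries = HashMap::new();
    entries.insert(
        CatalogItemId(3),
        CatalogEntry { global_ids: vec![GlobalId(7), GlobalId(9)] },
    );
    let catalog = Catalog { entries };
    // The ingestion uses the latest GlobalId of the item, GlobalId(9).
    assert_eq!(
        handle_create_source(&catalog, CatalogItemId(3)),
        IngestionDescription::new(GlobalId(9)),
    );
}
```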

Nightly runs:

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

aljoscha and others added 29 commits November 2, 2025 14:06

NOTE: This is the first of two commits. Here we add the basic scaffolding
for deriving implications from catalog changes, the parsing logic, and all
the plumbing. In the next commit we move code over from closures in the
sequencer and from the drop logic in ddl.rs.

First step towards MaterializeInc/database-issues#8488

This change introduces a new architectural pattern for handling
implications for in-memory state, controller changes, or anything else
that might have to be done as a result of updating the catalog. Instead
of directly updating state or instructing controllers within the
sequencer logic (via closures) or in the drop logic of ddl.rs, we now
derive these implications from catalog state updates and apply them in a
separate phase.

Key architectural changes:
- Introduces `ParsedStateUpdate`, which represents catalog changes that
  need to be applied
- Adds a new `catalog_update_implications` module containing logic to
  derive and apply implications from consolidated catalog updates
- Implements `apply_catalog_update_implications`, which processes catalog
  updates and derives implications for items, clusters, replicas, ongoing
  SELECTs, subscribes, etc.
- Adds consolidation logic to handle "fluctuating" updates (e.g., from
  DROP OWNED BY) where intermediate states cancel out, ensuring only net
  changes are applied
- Moves update logic from inline closures in sequencer operations
  (particularly in `create_source_inner` and related methods) to the new
  centralized implication processing

The refactoring specifically handles:
- Table creation/deletion
- Source creation/deletion
- Storage policy initialization for newly created collections
- Proper handling of temporary items alongside durable ones

This approach provides better separation of concerns between catalog
mutations and applying their implications, making the system more
maintainable and laying the foundation for eventually handling all
implications through this pattern. (A toy sketch of the consolidate /
parse / apply pipeline follows the commit list below.)

Future work will extend this pattern to handle all catalog item types
including indexes, materialized views, sinks, and other database
objects.

NOTE: This is the second of the two commits; it moves code over from
closures in the sequencer and from the drop logic in ddl.rs.

Picking the op_id, as before, seems bogus: it is just an increasing number
within the transaction, so the temporary updates wouldn't be sorted into
one batch with other updates when chunking up durable and temporary
catalog updates.

When applying updates derived from catalog changes to the controllers, we
want the updates to be consolidated. Previously we consolidated directly on
the parsed controller updates, but that required bespoke Ord/Eq impls that
ignore the in-memory fields.

Now we consolidate earlier in the transact/apply "pipeline", right when we
have the StateUpdates. This allows us to remove the bespoke Ord/Eq impls,
but it comes with other downsides, as explained in code comments.

which ... temporary sources are not a thing, but the types support it
and this way the code is the same as for non-temp items
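
As a rough illustration of the pattern described in the commit messages above, here is a minimal, self-contained sketch of the consolidate / parse / apply pipeline. The names `StateUpdate`, `ParsedStateUpdate`, and `apply_catalog_update_implications` follow the commit messages, but every definition below is an illustrative toy, not the actual Materialize code.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a catalog state update: which item changed, with a
/// +1 (created) / -1 (dropped) diff.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum CatalogChange {
    Table { id: u64 },
    Source { id: u64 },
}

#[derive(Clone, Debug)]
struct StateUpdate {
    change: CatalogChange,
    diff: i64,
}

/// Consolidation: sum the diffs per change and drop anything that nets to
/// zero. This is what makes "fluctuating" updates (e.g. from DROP OWNED BY),
/// where an item is added and removed within one transaction, cancel out.
fn consolidate(updates: Vec<StateUpdate>) -> Vec<StateUpdate> {
    let mut totals: BTreeMap<CatalogChange, i64> = BTreeMap::new();
    for StateUpdate { change, diff } in updates {
        *totals.entry(change).or_insert(0) += diff;
    }
    totals
        .into_iter()
        .filter(|(_, diff)| *diff != 0)
        .map(|(change, diff)| StateUpdate { change, diff })
        .collect()
}

/// Toy stand-in for `ParsedStateUpdate`: the implication derived from a net
/// catalog change, to be applied to controllers / in-memory state later.
#[derive(Debug)]
enum ParsedStateUpdate {
    CreateTableCollection { id: u64 },
    DropTableCollection { id: u64 },
    CreateSourceIngestion { id: u64 },
    DropSourceIngestion { id: u64 },
}

fn parse(update: &StateUpdate) -> ParsedStateUpdate {
    match (&update.change, update.diff > 0) {
        (CatalogChange::Table { id }, true) => ParsedStateUpdate::CreateTableCollection { id: *id },
        (CatalogChange::Table { id }, false) => ParsedStateUpdate::DropTableCollection { id: *id },
        (CatalogChange::Source { id }, true) => ParsedStateUpdate::CreateSourceIngestion { id: *id },
        (CatalogChange::Source { id }, false) => ParsedStateUpdate::DropSourceIngestion { id: *id },
    }
}

/// The separate "apply" phase: instead of inline closures in the sequencer or
/// the drop logic of ddl.rs, all implications flow through one place.
fn apply_catalog_update_implications(updates: &[StateUpdate]) {
    for parsed in updates.iter().map(parse) {
        // The real system would instruct the storage/compute controllers,
        // initialize storage policies, cancel subscribes, etc.
        println!("applying implication: {parsed:?}");
    }
}

fn main() {
    // A DROP OWNED BY-style transaction: source 2 is created and dropped
    // within the same transaction, so it must produce no implication at all.
    let raw = vec![
        StateUpdate { change: CatalogChange::Table { id: 1 }, diff: 1 },
        StateUpdate { change: CatalogChange::Source { id: 2 }, diff: 1 },
        StateUpdate { change: CatalogChange::Source { id: 2 }, diff: -1 },
        StateUpdate { change: CatalogChange::Source { id: 3 }, diff: -1 },
    ];
    let net = consolidate(raw);
    apply_catalog_update_implications(&net);
}
```

Consolidating before deriving implications is what lets a transaction that both creates and drops the same item (as DROP OWNED BY can) produce no controller work at all, matching the "fluctuating updates" behavior described in the commit messages.
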
@ggevay ggevay added the A-ADAPTER Topics related to the ADAPTER layer label Nov 2, 2025
ggevay commented Nov 2, 2025

Looking into the Nightly failures:

  • K8s recovery failed again, but it also failed on main here. On main it passed after a retry, so I'm retrying here as well. Edit: It failed on a retry too. Retrying one more time. Edit 2: I just noticed that on main a different K8s recovery test failed, not the envd one.
    • Note that the `Validation failed in try 11/33 with isolation STRICT SERIALIZABLE, retrying.` message also appears in the last successful run on main, so this is not the issue.
    • The test actually fails with a timeout, so something is taking more time than usual. The last several runs of K8s recovery: envd on failing node on main all succeeded with run times of 26-27 minutes, so timing out after 1h 3m is a significant increase.
      • I'm wondering if this is related to the slight DDL regression that is expected from this PR. Probably not, because there is only a little DDL in this test; it doesn't seem to do DDL in a loop. (Also, the Scalability benchmark (DDL) against merge base or 'latest' failure shows only around a 20% regression, which is much less than the more-than-2x slowdown here.)
      • I'm not sure how we could figure out which part is slower, because I don't see timestamps in run.log. Edit: There is actually a "Show timestamps" button!!
        • In the new retry, we are at `Validation failed in try 9/33 with isolation STRICT SERIALIZABLE, retrying.` after 22 minutes. (This is the retry loop in def validate_state.)
        • We are here after 39 minutes (in a successful run, the next log message should be something like `Succeeded to achieve 'node recovery and live data' within 349.0 seconds (limit: 660s)`, where the `within` value is fairly stable between 340 and 360):
Validation failed in try 16/33 with isolation STRICT SERIALIZABLE, retrying.
waiting for kubectl wait --for condition=Ready pod/testdrive --timeout 300s --context kind-mzcloud --namespace default ... success!
error: error connecting to server: Connection refused (os error 111): Connection refused (os error 111)
Validation failed in try 17/33 with isolation STRICT SERIALIZABLE, retrying.
Unable to use a TTY - input is not a terminal or the right kind of file
!!! Error Report
1 errors were encountered during execution
files involved: -
command terminated with exit code 1
waiting for kubectl wait --for condition=Ready pod/testdrive --timeout 300s --context kind-mzcloud --namespace default ... success!

We are still at the same spot after 48m. Edit: We timed out at this spot. Edit 2: I don't understand how we can be at the exact same try 17/33 for more than 20 minutes. Shouldn't the retry loop in def validate_state move on to the next try or give up?

kubectl-get-all.log has some further info: envd is not able to connect to most of the replicas (but it is able to connect to s2):

[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.183693Z  INFO mz_service::transport: ctp: connected to server address=cluster-u3-replica-u4-gen-0-0.cluster-u3-replica-u4-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.239656Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=s3 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.240320Z  INFO mz_service::transport: ctp: connected to server address=cluster-s3-replica-s3-gen-0-0.cluster-s3-replica-s3-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.250730Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=s1 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.251416Z  INFO mz_service::transport: ctp: connected to server address=cluster-s1-replica-s1-gen-0-0.cluster-s1-replica-s1-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.266643Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=u2 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.267311Z  INFO mz_service::transport: ctp: connected to server address=cluster-u1-replica-u2-gen-0-0.cluster-u1-replica-u2-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.278518Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=u3 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:06.279141Z  INFO mz_service::transport: ctp: connected to server address=cluster-u2-replica-u3-gen-0-0.cluster-u2-replica-u3-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:09.088920Z  WARN mz_server_core: error handling connection: connection closed server="pgwire" conn_uuid=019a45b8-5d12-7fa2-b4f1-e90c2ffef144
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.185569Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=u4 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.186263Z  INFO mz_service::transport: ctp: connected to server address=cluster-u3-replica-u4-gen-0-0.cluster-u3-replica-u4-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.241526Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=s3 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.242150Z  INFO mz_service::transport: ctp: connected to server address=cluster-s3-replica-s3-gen-0-0.cluster-s3-replica-s3-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.252164Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=s1 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.252722Z  INFO mz_service::transport: ctp: connected to server address=cluster-s1-replica-s1-gen-0-0.cluster-s1-replica-s1-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.268305Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=u2 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.268945Z  INFO mz_service::transport: ctp: connected to server address=cluster-u1-replica-u2-gen-0-0.cluster-u1-replica-u2-gen-0.default.svc.cluster.local:2101
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.280138Z  INFO mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=u3 next_backoff=1s
[pod/environmentd-0/environmentd] 2025-11-02T18:34:11.280672Z  INFO mz_service::transport: ctp: connected to server address=cluster-u2-replica-u3-gen-0-0.cluster-u2-replica-u3-gen-0.default.svc.cluster.local:2101

Can this be related to this PR? Maybe we are trying to connect to the replicas at the wrong addresses due to something broken in how the new code drives the compute controller? Edit: Or does `ctp: connected to server address=cluster-s3-replica-s3-gen-0-0.cluster-s3-replica-s3-gen-0.default.svc.cluster.local:2101` mean that the connection itself succeeded? But then what exactly does `mz_compute_client::controller::replica: error connecting to replica: timed out replica_id=s3 next_backoff=1s` mean?

Edit: Using the "Show timestamps" button, it's clearly visible that all the runs hang at this same point for 30+ minutes. It might be a test issue that the test hangs there, instead of some specific operation (e.g., a specific Testdrive query) timing out and failing the test. But the replica connection issue above is probably the real problem, independently of the test issue. Slack thread.

Other failures:

  • (Parallel Workload (0dt deploy) also failed on main a few days ago with the same error as here, but passed after a retry. Retrying here as well. Edit: Passed on a retry.)
  • The Zippy User Tables failure might be new. Retrying for now. Similar failures seem to have happened only a long time ago (on main, only in 2024). Edit: It passed on a retry. I'm hitting retry a few more times to get a feel for how often the failure happens; maybe it is a very rare flake that already existed before this PR. Edit 2: It has now passed 3 times in a row.
  • The Scalability benchmark (DDL) against merge base or 'latest' regression is expected, as discussed here.
