
Conversation

@ggevay
Contributor

@ggevay ggevay commented Nov 7, 2025

This PR adds QPS measurements to the Cluster Spec Sheet, so that we have automated and repeatable QPS benchmarks.

It runs for about 15 minutes. Later, we can add more QPS scenarios. (There are some ideas in a code comment at the end of QpsEnvdStrongScalingScenario.)

Btw. I saw buildkite.shard_list in workflow_default, but this would work only if we made the different shards use different envs. Currently, the buildkite-level concurrency is set to 1, so it's not an issue for now. (This is the situation both before and after this PR.)

Edit: CI run: https://buildkite.com/materialize/release-qualification/builds/981

@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch 4 times, most recently from d9c0271 to 23c64a8 on November 7, 2025 16:59
# Upload only cluster scaling results to Test Analytics for now, until the Test Analytics schema is extended.
# TODO: See slack discussion:
# https://materializeinc.slack.com/archives/C01LKF361MZ/p1762351652336819?thread_ts=1762348361.164759&cid=C01LKF361MZ
upload_results_to_test_analytics(c, cluster_path, not test_failed)
Contributor Author

This is the point where we skip the upload to the Test Analytics for the new scenario for now.

Contributor

So what this test is measuring is only QPS instead of size in bytes and time in milliseconds?

Contributor Author

Yes:

  • The size of the arrangement that we are querying from is very small and doesn't matter.
  • "Time" could in theory be the latency of the queries, but I haven't added a measurement of latency for now, and it will probably be better to call it "latency" when we add a measurement for it.

Contributor

@antiguru So, how should we solve this? Should I add another column to the same table or create a separate table for the envd runs?

Member

I'd prefer to have a different table, instead of forcing the consumer to distinguish between latency and throughput. It tests fundamentally different properties.

Is it hard to add a new table?

Contributor

Not hard, will do

@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch 2 times, most recently from 93687ac to 5305afb on November 7, 2025 17:16
@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch from 5305afb to 2113d83 on November 9, 2025 16:07
def cloud_hostname(
    self, quiet: bool = False, timeout_secs: int = 180, poll_interval: float = 2.0
) -> str:
    """Uses the mz command line tool to get the hostname of the cloud instance, waiting until the region is ready."""
Contributor Author

@ggevay ggevay Nov 9, 2025

I had to add a retry loop here, because sometimes it would fail despite mz region enable already having come back saying that the cloud side of things is done. I think this started to happen when I added an mz region disable before my mz region enables. (The reason for adding the disabling is explained in a code comment in cloud_disable_enable_and_wait.)
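For illustration, the retry loop has roughly this shape (a hedged sketch; `_cloud_hostname_once` is a hypothetical helper standing in for the single `mz`-CLI lookup the real method performs):

import time

from materialize.ui import UIError


def cloud_hostname(
    self, quiet: bool = False, timeout_secs: int = 180, poll_interval: float = 2.0
) -> str:
    """Get the cloud instance's hostname, retrying until the region is ready."""
    deadline = time.time() + timeout_secs
    last_error: Exception | None = None
    while time.time() < deadline:
        try:
            # Hypothetical helper: one attempt to look up the hostname via the `mz` CLI.
            return self._cloud_hostname_once(quiet=quiet)
        except UIError as e:
            # `mz region enable` may report success before the region is actually
            # queryable, so keep polling until the deadline.
            last_error = e
            time.sleep(poll_interval)
    raise UIError(f"failed to obtain cloud hostname within {timeout_secs}s: {last_error}")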

@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch 5 times, most recently from bfd9236 to 8f18497 on November 9, 2025 18:49
@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch from 8f18497 to d493ba0 on November 9, 2025 18:51
@ggevay ggevay marked this pull request as ready for review November 9, 2025 18:56
@ggevay ggevay requested review from antiguru and def- November 9, 2025 18:57
@ggevay ggevay added the T-testing Theme: tests or test infrastructure label Nov 9, 2025
@def-
Contributor

def- commented Nov 9, 2025

Do you have a sample run in CI? I didn't see one in https://buildkite.com/materialize/release-qualification/builds?branch=ggevay%3Aqps-cluster-spec-sheet

@ggevay
Contributor Author

ggevay commented Nov 10, 2025

Started a CI run now: https://buildkite.com/materialize/release-qualification/builds/981 (I wanted to start it yesterday evening, but somehow forgot.)

@ggevay
Contributor Author

ggevay commented Nov 10, 2025

The actual CI run has uncovered an issue that was not visible in my manual tests against my staging env: The new scenario wants to vary the number of CPU cores that we give to envd, which it tries to do with
mz region enable --environmentd-cpu-allocation
The problem is that this flag is not allowed on a prod env. (The decision to not allow this was made here, see dangerous_customizations. I didn't realize back then that the cluster spec sheet runs against a prod env.)
@def-, @antiguru, would it be OK to set up the new scenario to run against a staging env? Or, to keep things simple, maybe we could run the entire cluster spec sheet against a staging env? What are the constraints for choosing the env?

@def-
Contributor

def- commented Nov 10, 2025

We didn't want to set up a fresh environment for the Cluster Spec Sheet, so we instead opted to reuse one that was already created for tests, but not running anything at the moment. If @antiguru is fine with it, I can try to move us to Staging.
As an alternative, could we allow dangerous customizations for this specific org in production or is that too hacky?

Member

@antiguru antiguru left a comment

Just a few comments from a first read-through.

quoted_flags = " ".join(shlex.quote(x) for x in flags)
script = (
    "set -euo pipefail; "
    "cat > /tmp/run.ini; "
Member

It's better to use a temp file here instead of relying on /tmp/run.ini not being used otherwise.

Contributor Author

@ggevay ggevay Nov 10, 2025

Opened a Slack thread in #operations. Edit: I mixed this up with the thread below.


# Pin to specific commit for reproducibility
# TODO: Fork it under the MaterializeInc organization (like e.g. sqlsmith).
ADD https://api.github.com/repos/sjwiesman/dbbench/git/commits/10b2a0b5159f06945646fa1179bda4be51fe02b4 version.json
Member

We should clone this into Mz's GH organization.

Contributor Author

@ggevay ggevay Nov 10, 2025

Done (using mktemp). Edit: I mixed this up with the thread above.
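For reference, a minimal sketch of what the mktemp-based variant might look like (the flags and the dbbench invocation are illustrative; only the temp-file handling is the point here):

import shlex

flags = ["--host", "127.0.0.1", "--port", "6875"]  # illustrative flags
quoted_flags = " ".join(shlex.quote(x) for x in flags)
script = (
    "set -euo pipefail; "
    # Use a unique temp file instead of hard-coding /tmp/run.ini, so that
    # concurrent or leftover runs cannot clobber each other's config.
    'ini="$(mktemp)"; '
    'cat > "$ini"; '
    # Illustrative invocation; the real command line may differ.
    f'dbbench {quoted_flags} "$ini"'
)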

Member

Did we clone the repo into our org?

Contributor Author

@ggevay ggevay Nov 13, 2025

I'm not sure. @def-, I thought that in this thread we were talking about this, but maybe that was actually a different thing?

Contributor

@def- def- Nov 14, 2025

No, that was not the GitHub repo, but the GHCR and Docker container name.
I'll clone the repo into our org now and then change the commit. Edit: Done

@ggevay
Contributor Author

ggevay commented Nov 10, 2025

If @antiguru is fine with it, I can try to move us to Staging.
As an alternative, could we allow dangerous customizations for this specific org in production or is that too hacky?

I would prefer moving it to Staging, if it's not too much hassle. Hardcoding this org in the cloud code seems not so elegant. Handling this in Staging seems like a more flexible approach.

@antiguru
Member

I would prefer moving it to Staging, if it's not too much hassle. Hardcoding this org in the cloud code seems not so elegant. Handling this in Staging seems like a more flexible approach.

I'd prefer to keep the cluster spec sheet portion in prod to avoid any uncertainty that there's a difference between prod and staging. If we were to move it to staging, we'd need a test that ensures they're indeed equivalent.

However, I think there's an alternative: we could run the cluster spec sheet selectively, with the default set of operations on prod and just the envd portion in staging.

@ggevay
Contributor Author

ggevay commented Nov 11, 2025

Ok, then let's split it, and run the envd/QPS part in Staging (in the region close to the Hetzner machines).

What's the procedure for creating an organization in Staging? Just asking in #operations? Or if it's simpler to point it at my Staging organization, then we can also do that for now. In both cases, I'll need to put an app password in Pulumi in i2, right?

@def-
Contributor

def- commented Nov 11, 2025

Let me check if we have something suitable in Staging already.

Member

@antiguru antiguru left a comment

I think this is headed in the right direction, but we should iterate on what we're trying to record. Just a QPS value doesn't feel like enough; we also need the corresponding latency percentiles.

Comment on lines +254 to +257
We capture and parse dbbench's output (stderr or stdout) to find the last
occurrence of a "QPS" value that dbbench reports in its summary, and store that
as the 'qps' column in the results CSV. If no QPS is found, the test fails.
(We omit time_ms for these rows since wall-clock time is not meaningful here.)
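An illustrative sketch of that parsing step (hypothetical helper name; it only assumes the summary contains a line with a numeric QPS value):

import re


def parse_last_qps(output: str) -> float | None:
    """Return the last QPS value found in dbbench's output, or None if absent.

    Hypothetical helper for illustration; the exact summary format may differ.
    """
    qps = None
    for match in re.finditer(r"QPS[:=]?\s*([0-9]+(?:\.[0-9]+)?)", output):
        qps = float(match.group(1))
    return qps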
Member

It feels better to capture dbbench's output file (query-log-file) and compute percentiles from it.

Contributor Author

It feels better to capture dbbench's output file

(We do capture dbbench's output file, and save it in the dbbench-logs directory.)

query-log-file

Well, query-log-file seems to be an input file to dbbench, as far as I can tell from reading dbbench's tutorial. Did you mean to suggest query-log-file, or something else?

and compute percentiles from it

I was planning this for follow-up PRs, as mentioned in code comments, e.g.:

# TODO: Later, we'll want to also look at latency (avg and tail-latency), but seems too unstable for now.

Contributor Author

(Continuing here from #34058 (comment) )

query-results-file

and compute percentiles from it

Sorry, I still don't understand. From a quick read, it seems that query-results-file records the actual data that the queries return, but does not record any timing information. What percentiles do you mean?

Contributor Author

@ggevay ggevay Nov 11, 2025

Note that dbbench prints a histogram on its own, without any extra options or manual analysis, e.g.

 33.554432ms -  67.108864ms [    35]: ▏
 67.108864ms - 134.217728ms [  1234]: █
134.217728ms - 268.435456ms [  1111]: ▉
268.435456ms - 536.870912ms [ 60545]: ██████████████████████████████████████████████████
536.870912ms - 1.073741824s [   248]: ▏
1.073741824s - 2.147483648s [   431]: ▎
2.147483648s - 4.294967296s [   540]: ▍
4.294967296s - 8.589934592s [    70]: ▏
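(As an aside, approximate percentiles can be read off such bucket counts; a hedged sketch, assuming the buckets are given as (lower_bound, upper_bound, count) tuples in seconds:)

def approx_percentile(buckets: list[tuple[float, float, int]], p: float) -> float:
    """Approximate the p-th percentile (0 < p < 100) from histogram buckets.

    Each bucket is (lower_bound, upper_bound, count); values are assumed to be
    spread uniformly within a bucket. Illustrative only.
    """
    total = sum(count for _, _, count in buckets)
    target = total * p / 100.0
    seen = 0
    for lower, upper, count in buckets:
        if seen + count >= target:
            # Linear interpolation within the bucket.
            frac = (target - seen) / count if count else 0.0
            return lower + frac * (upper - lower)
        seen += count
    return buckets[-1][1]


# Example: approximate p99 latency from the histogram above.
buckets = [
    (0.033554432, 0.067108864, 35),
    (0.067108864, 0.134217728, 1234),
    (0.134217728, 0.268435456, 1111),
    (0.268435456, 0.536870912, 60545),
    (0.536870912, 1.073741824, 248),
    (1.073741824, 2.147483648, 431),
    (2.147483648, 4.294967296, 540),
    (4.294967296, 8.589934592, 70),
]
print(f"approx p99: {approx_percentile(buckets, 99):.3f}s")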

Member

Mind checking the contents of the results file? I think it should be name, start time, elapsed nanoseconds, rows, errors, but I might be mistaken.

Contributor Author

So, it turns out that query-results-file is broken in Seth's fork: the file comes out completely empty. But I've now tried running the original code from memsql, and it seems to print just the query results. There is also no CSV header.

This is my ini:

duration=10s

[setup]
query=create cluster qps (size='scale=1,workers=1')
query=create materialized view if not exists test2 as select generate_series as x from generate_series(1, 100)
query=create view if not exists tempview1 as select * from test2 where x % 2 = 1
query=create default index on tempview1

[teardown]
query=drop materialized view test2 cascade
query=drop cluster qps

[loadtest]
query=select * from tempview1
concurrency=16
start=3s
query-results-file=/tmp/results.csv

And these are the first few lines of the results file:

1
3
5
7
9
11
13
15
17
19
21
23
25
27

which is (a part of) the result of the above query.

Contributor Author

@ggevay ggevay Nov 11, 2025

I guess we could modify our fork to also output timings, but how about we get back to figuring out how exactly to extract detailed timing info from dbbench later? I'd like to consider this out of scope for this PR, as mentioned here.

Comment on lines +361 to +370
self.add_result(
    category,
    name,
    # we have `duration` instead of `repetitions` in QPS benchmarks currently (but maybe it's good to keep
    # `repetition` in the schema in case we ever want also multiple repetitions for QPS)
    0,
    None,
    None,
    qps=qps_val,
)
Member

I feel the raw QPS is less meaningful than a histogram over durations. Without the latency, the QPS doesn't tell us much as it could correlate with very high latency.

Contributor Author

I agree that we should record latency, but I'd like to do this later if that's ok. This is because I'd like to now urgently get back to the main part of the QPS work (#33713 and follow-up PRs). And then after the November deliverables are done for https://github.com/MaterializeInc/database-issues/issues/9593, I'd be happy to follow up on making the measurements here fancier.

Btw. for looking at the latency histogram, I think it would be good to modify our fork of dbbench to be able to print more machine-readable output. This would also make it kind of a separate thing from this initial PR.

Without the latency, the QPS doesn't tell us much as it could correlate with very high latency.

Well, when QPS does correlate with latency, that means that measuring QPS also gives us information about latency. So, latency would be particularly important to record when QPS does not correlate with latency, since in that case a regression in latency might go unnoticed by just looking at QPS.

(Btw. I can offer some anecdotal evidence for QPS being correlated with latency: I did have a situation during development where I had very bad tail latencies due to a glitch in the benchmark setup, and QPS also went down with it. Concretely, this happened when I didn't have the mz region disable before the enables, and a 0dt promotion sometimes came in the middle of dbbench measurements. This caused just a few queries (e.g., 4 queries out of thousands) to have a latency of over 1 minute, and this also made QPS very low.)

@def- def- requested a review from a team as a code owner November 11, 2025 13:58
@def- def- force-pushed the qps-cluster-spec-sheet branch from 9f89356 to 3ea08e6 on November 11, 2025 13:59
@def-
Contributor

def- commented Nov 11, 2025

I made the requested changes, new test run: https://buildkite.com/materialize/release-qualification/builds/985

@antiguru
Member

Sorry, it's called query-results-file!

@def- def- force-pushed the qps-cluster-spec-sheet branch 2 times, most recently from 54fe712 to 57af05f on November 11, 2025 15:24
@def-
Contributor

def- commented Nov 11, 2025

Failed because the dbbench repo on DockerHub/GHCR is not public, I asked Paul to help us out: https://buildkite.com/materialize/release-qualification/builds/985#019a7384-cfa9-4819-a0ee-fde99ea15a15
Edit: https://buildkite.com/materialize/release-qualification/builds/990

@ggevay
Contributor Author

ggevay commented Nov 11, 2025

Failed because the dbbench repo on DockerHub/GHCR is not public, I asked Paul to help us out

Thank you! So, is this thread superseded? Or do we still need the repo fork as well?

@def- def- force-pushed the qps-cluster-spec-sheet branch from 57af05f to 9bb0fae on November 11, 2025 16:57
@def-
Contributor

def- commented Nov 11, 2025

Indeed, I didn't see that you had already posted. For the future: I have permissions, so you can just ping me directly.

@def- def- force-pushed the qps-cluster-spec-sheet branch 2 times, most recently from 4499447 to d2c543d on November 11, 2025 23:05
@def- def- force-pushed the qps-cluster-spec-sheet branch from d2c543d to a600dbe on November 12, 2025 08:01
@def-
Contributor

def- commented Nov 12, 2025

Green in CI: https://buildkite.com/materialize/release-qualification/builds/990
Results made it to the database:
(screenshot: results in the database, 2025-11-12 14:06)

@ggevay
Contributor Author

ggevay commented Nov 12, 2025

Great, thank you very much!

Btw. how did you resolve this redness in the normal CI?

Obtaining hostname of cloud instance ...
==> mzcompose: test case workflow-default failed: materialize.ui.UIError: failed to obtain cloud hostname within 180s: Network error during a Materialize cloud API request: error sending request for url (https://api.us-east-1.aws.cloud.materialize.com/api/region): operation timed out

I think this was caused by create_test_analytics_config catching CommandFailureCausedUIError, but cloud_hostname throwing a more general exception, UIError, since the first commit of the PR (sorry). I'm asking just to check whether you resolved this in some way or whether the new run just randomly didn't hit this flake.
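For illustration, the mismatch looks roughly like this (the import path for CommandFailureCausedUIError and the function body are assumed; the real helper builds a config rather than returning the hostname):

from materialize.ui import CommandFailureCausedUIError, UIError  # assumed import path


def create_test_analytics_config(c) -> str | None:
    try:
        hostname = c.cloud_hostname()  # since this PR, this may raise the broader UIError
    except CommandFailureCausedUIError:
        # Only the narrower subclass is handled here, so a plain UIError (e.g.
        # the network timeout above) escapes and fails the workflow instead of
        # falling back. Catching UIError would cover both cases.
        hostname = None
    return hostname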

@def-
Contributor

def- commented Nov 12, 2025

This flake doesn't seem related to this PR. I think we hit it in other Nightly runs too. I haven't investigated further though.

Member

@antiguru antiguru left a comment

Looks great! Left some comments.

Comment on lines +378 to +379
# we have `duration` instead of `repetitions` in QPS benchmarks currently (but maybe it's good to keep
# `repetition` in the schema in case we ever want also multiple repetitions for QPS)
Member

Is this comment still relevant? You could just say that 0 is the repetition.

# We'll also want to measure latency, including tail latency.


def disable_region(c: Composition, hard: bool) -> None:
Member

Note, but don't act immediately: We should extract the disable_region, wait_for_cloud, etc. to a module as they're not specific to the cluster spec sheet. Especially since they've become bigger with your changes!

Comment on lines +2307 to +2309
# Upload only cluster scaling results to Test Analytics for now, until the Test Analytics schema is extended.
# TODO: See slack discussion:
# https://materializeinc.slack.com/archives/C01LKF361MZ/p1762351652336819?thread_ts=1762348361.164759&cid=C01LKF361MZ
Member

Is this comment still relevant?



class BenchTarget:
    c: Composition
Member

Do you mind renaming c to composition? We're also using cluster c, which might get confusing (that's my fault!).

Comment on lines +2660 to +2674
def workflow_plot(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Analyze the results of the workflow."""

    parser.add_argument(
        "files",
        nargs="*",
        default="results_*.csv",
        type=str,
        help="Glob pattern of result files to plot.",
    )

    args = parser.parse_args()

    for file in itertools.chain(*map(glob.iglob, args.files)):
        analyze_file(str(file))
        analyze_results_file(str(file))
Member

The runtime dispatch based on the file name isn't great. I often find myself concatenating multiple results into nondescript file names, and the mechanism would not work here. What about using two different targets instead, plot_cluster and plot_envd? You could then update the default to do the right globbing.
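A rough sketch of the two-target variant (names, default globs, and which analyzer each target calls are illustrative):

def workflow_plot_cluster(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Plot the cluster-scaling results (run times and arrangement sizes)."""
    _plot(parser, default_glob="results_cluster_*.csv", analyze=analyze_file)


def workflow_plot_envd(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Plot the envd-scaling results (QPS)."""
    _plot(parser, default_glob="results_envd_*.csv", analyze=analyze_results_file)


def _plot(parser: WorkflowArgumentParser, default_glob: str, analyze) -> None:
    parser.add_argument(
        "files",
        nargs="*",
        default=[default_glob],
        type=str,
        help="Glob pattern of result files to plot.",
    )
    args = parser.parse_args()
    for file in itertools.chain(*map(glob.iglob, args.files)):
        analyze(str(file))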

Comment on lines +2786 to +2788
for (benchmark, category, mode), sub in df.groupby(
    ["scenario", "category", "mode"]
):
Member

I think this will not work if you have repetitions, but might be OK for now.
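If multiple repetitions are ever recorded, one hedged way to keep the grouping meaningful would be to average over them first ("size" is an assumed name for the x-axis column):

# Collapse repetitions: one averaged qps per measured configuration.
aggregated = df.groupby(
    ["scenario", "category", "mode", "size"], as_index=False
)["qps"].mean()

for (benchmark, category, mode), sub in aggregated.groupby(
    ["scenario", "category", "mode"]
):
    # `sub` now holds one row per size, regardless of how many repetitions ran.
    ...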

Comment on lines +50 to +54
## Scenarios

There are two kinds of scenarios:
- cluster scaling: These measure run times and arrangement sizes.
- envd scaling: These measure QPS.
Member

Please document how to invoke each.

