
Conversation

@ggevay
Contributor

@ggevay ggevay commented Nov 7, 2025

This PR adds QPS measurements to the Cluster Spec Sheet, so that we have automated and repeatable QPS benchmarks.

It runs for about 15 minutes. Later, we can add more QPS scenarios. (There are some ideas in a code comment at the end of QpsEnvdStrongScalingScenario.)

Btw. I saw buildkite.shard_list in workflow_default, but this would work only if we made the different shards use different envs. Currently, the buildkite-level concurrency is set to 1, so it's not an issue for now. (This is the situation both before and after this PR.)

Edit: CI run: https://buildkite.com/materialize/release-qualification/builds/981

@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch 4 times, most recently from d9c0271 to 23c64a8 on November 7, 2025 16:59
# Upload only cluster scaling results to Test Analytics for now, until the Test Analytics schema is extended.
# TODO: See slack discussion:
# https://materializeinc.slack.com/archives/C01LKF361MZ/p1762351652336819?thread_ts=1762348361.164759&cid=C01LKF361MZ
upload_results_to_test_analytics(c, cluster_path, not test_failed)
Contributor Author

This is the point where we skip the upload to the Test Analytics for the new scenario for now.

Contributor

So what this test is measuring is only QPS instead of size in bytes and time in milliseconds?

Contributor Author

Yes:

  • The size of the arrangement that we are querying from is very small and doesn't matter.
  • "Time" could in theory be the latency of the queries, but I haven't added a measurement of latency for now, and it will probably be better to call it "latency" when we add a measurement for it.

Contributor

@antiguru So, how should we solve this? Should I add another column to the same table or create a separate table for the envd runs?

Member

I'd prefer to have a different table, instead of forcing the consumer to distinguish between latency and throughput. It tests fundamentally different properties.

Is it hard to add a new table?

Contributor

Not hard, will do

@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch 2 times, most recently from 93687ac to 5305afb on November 7, 2025 17:16
@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch from 5305afb to 2113d83 on November 9, 2025 16:07
def cloud_hostname(
    self, quiet: bool = False, timeout_secs: int = 180, poll_interval: float = 2.0
) -> str:
    """Uses the mz command line tool to get the hostname of the cloud instance, waiting until the region is ready."""
Contributor Author

@ggevay ggevay Nov 9, 2025

I had to add a retry loop here, because sometimes it would fail despite mz region enable already having come back saying that the cloud side of things is done. I think this started to happen when I added an mz region disable before my mz region enables. (The reason for adding the disabling is explained in a code comment in cloud_disable_enable_and_wait.)
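For illustration, the retry loop has roughly this shape (a hedged sketch; `_cloud_hostname_once` is a hypothetical helper standing in for the single `mz`-CLI lookup the real method performs):

import time

from materialize.ui import UIError


def cloud_hostname(
    self, quiet: bool = False, timeout_secs: int = 180, poll_interval: float = 2.0
) -> str:
    """Get the cloud instance's hostname, retrying until the region is ready."""
    deadline = time.time() + timeout_secs
    last_error: Exception | None = None
    while time.time() < deadline:
        try:
            # Hypothetical helper: one attempt to look up the hostname via the `mz` CLI.
            return self._cloud_hostname_once(quiet=quiet)
        except UIError as e:
            # `mz region enable` may report success before the region is actually
            # queryable, so keep polling until the deadline.
            last_error = e
            time.sleep(poll_interval)
    raise UIError(f"failed to obtain cloud hostname within {timeout_secs}s: {last_error}")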

@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch 5 times, most recently from bfd9236 to 8f18497 on November 9, 2025 18:49
@ggevay ggevay force-pushed the qps-cluster-spec-sheet branch from 8f18497 to d493ba0 on November 9, 2025 18:51
@ggevay ggevay marked this pull request as ready for review November 9, 2025 18:56
@ggevay ggevay requested review from antiguru and def- November 9, 2025 18:57
@ggevay ggevay added the T-testing Theme: tests or test infrastructure label Nov 9, 2025
@def-
Contributor

def- commented Nov 9, 2025

Do you have a sample run in CI? I didn't see one in https://buildkite.com/materialize/release-qualification/builds?branch=ggevay%3Aqps-cluster-spec-sheet

@ggevay
Contributor Author

ggevay commented Nov 10, 2025

Started a CI run now: https://buildkite.com/materialize/release-qualification/builds/981 (I wanted to start it yesterday evening, but somehow forgot.)

@ggevay
Contributor Author

ggevay commented Nov 10, 2025

The actual CI run has uncovered an issue that was not visible in my manual tests against my staging env: The new scenario wants to vary the number of CPU cores that we give to envd, which it tries to do with
mz region enable --environmentd-cpu-allocation
The problem is that this flag is not allowed on a prod env. (The decision to not allow this was made here, see dangerous_customizations. I didn't realize back then that the cluster spec sheet runs against a prod env.)
@def-, @antiguru, would it be OK to set up the new scenario to run against a staging env? Or, to keep things simple, maybe we could run the entire cluster spec sheet against a staging env? What are the constraints for choosing the env?

@def-
Contributor

def- commented Nov 10, 2025

We didn't want to set up a fresh environment for the Cluster Spec Sheet, so we instead opted to reuse one that was already created for tests, but not running anything at the moment. If @antiguru is fine with it, I can try to move us to Staging.
As an alternative, could we allow dangerous customizations for this specific org in production or is that too hacky?

Member

@antiguru antiguru left a comment

Just a few comments from a first read-through.

quoted_flags = " ".join(shlex.quote(x) for x in flags)
script = (
    "set -euo pipefail; "
    "cat > /tmp/run.ini; "
Member

It's better to use a temp file here instead of relying on /tmp/run.ini not being used otherwise.

Contributor Author

@ggevay ggevay Nov 10, 2025

Opened a Slack thread in #operations. Edit: I mixed this up with the thread below.


# Pin to specific commit for reproducibility
# TODO: Fork it under the MaterializeInc organization (like e.g. sqlsmith).
ADD https://api.github.com/repos/sjwiesman/dbbench/git/commits/10b2a0b5159f06945646fa1179bda4be51fe02b4 version.json
Member

We should clone this into Mz's GH organization.

Contributor Author

@ggevay ggevay Nov 10, 2025

Done (using mktemp). Edit: I mixed this up with the thread above.
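For reference, a minimal sketch of what the mktemp-based variant might look like (the flags and the dbbench invocation are illustrative; only the temp-file handling is the point here):

import shlex

flags = ["--host", "127.0.0.1", "--port", "6875"]  # illustrative flags
quoted_flags = " ".join(shlex.quote(x) for x in flags)
script = (
    "set -euo pipefail; "
    # Use a unique temp file instead of hard-coding /tmp/run.ini, so that
    # concurrent or leftover runs cannot clobber each other's config.
    'ini="$(mktemp)"; '
    'cat > "$ini"; '
    # Illustrative invocation; the real command line may differ.
    f'dbbench {quoted_flags} "$ini"'
)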

Member

Did we clone the repo into our org?

Contributor Author

@ggevay ggevay Nov 13, 2025

I'm not sure. @def-, I thought that in this thread we were talking about this, but maybe that was actually a different thing?

Contributor

@def- def- Nov 14, 2025

No, that was not the GitHub repo, but the GHCR and Docker container name.
I'll clone the repo into our org now and then change the commit. Edit: Done

@ggevay
Contributor Author

ggevay commented Nov 10, 2025

If @antiguru is fine with it, I can try to move us to Staging.
As an alternative, could we allow dangerous customizations for this specific org in production or is that too hacky?

I would prefer moving it to Staging, if it's not too much hassle. Hardcoding this org in the cloud code seems not so elegant. Handling this in Staging seems like a more flexible approach.

@antiguru
Member

I would prefer moving it to Staging, if it's not too much hassle. Hardcoding this org in the cloud code seems not so elegant. Handling this in Staging seems like a more flexible approach.

I'd prefer to keep the cluster spec sheet portion in prod to avoid any uncertainty that there's a difference between prod and staging. If we were to move it to staging, we'd need a test that ensures they're indeed equivalent.

However, I think there's an alternative: we could run the cluster spec sheet selectively, with the default set of operations on prod and just the envd portion in staging.

@ggevay
Contributor Author

ggevay commented Nov 11, 2025

Ok, then let's split it, and run the envd/QPS part in Staging (in the region close to the Hetzner machines).

What's the procedure for creating an organization in Staging? Just asking in #operations? Or if it's simpler to point it at my Staging organization, then we can also do that for now. In both cases, I'll need to put an app password in Pulumi in i2, right?

@def-
Contributor

def- commented Nov 11, 2025

Let me check if we have something suitable in Staging already.

Member

@antiguru antiguru left a comment

I think this is headed in the right direction, but we should iterate on what we're trying to record. Just a QPS value doesn't feel like enough; we also need the corresponding latency percentiles.

Comment on lines +254 to +257
We capture and parse dbbench's output (stderr or stdout) to find the last
occurrence of a "QPS" value that dbbench reports in its summary, and store that
as the 'qps' column in the results CSV. If no QPS is found, the test fails.
(We omit time_ms for these rows since wall-clock time is not meaningful here.)
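An illustrative sketch of that parsing step (hypothetical helper name; it only assumes the summary contains a line with a numeric QPS value):

import re


def parse_last_qps(output: str) -> float | None:
    """Return the last QPS value found in dbbench's output, or None if absent.

    Hypothetical helper for illustration; the exact summary format may differ.
    """
    qps = None
    for match in re.finditer(r"QPS[:=]?\s*([0-9]+(?:\.[0-9]+)?)", output):
        qps = float(match.group(1))
    return qps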
Member

It feels better to capture dbbench's output file (query-log-file) and compute percentiles from it.

Contributor Author

It feels better to capture dbbench's output file

(We do capture dbbench's output file, and save it in the dbbench-logs directory.)

query-log-file

Well, query-log-file seems to be an input file to dbbench, as far as I can tell from reading dbbench's tutorial. Did you mean to suggest query-log-file, or something else?

and compute percentiles from it

I was planning this for follow-up PRs, as mentioned in code comments, e.g.:

# TODO: Later, we'll want to also look at latency (avg and tail-latency), but seems too unstable for now.

Contributor Author

(Continuing here from #34058 (comment) )

query-results-file

and compute percentiles from it

Sorry, I still don't understand. From a quick read, it seems that query-results-file records the actual data that the queries return, but does not record any timing information. What percentiles do you mean?

Contributor Author

@ggevay ggevay Nov 11, 2025

Note that dbbench prints a histogram on its own, without any extra options or manual analysis, e.g.

 33.554432ms -  67.108864ms [    35]: ▏
 67.108864ms - 134.217728ms [  1234]: █
134.217728ms - 268.435456ms [  1111]: ▉
268.435456ms - 536.870912ms [ 60545]: ██████████████████████████████████████████████████
536.870912ms - 1.073741824s [   248]: ▏
1.073741824s - 2.147483648s [   431]: ▎
2.147483648s - 4.294967296s [   540]: ▍
4.294967296s - 8.589934592s [    70]: ▏
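(As an aside, approximate percentiles can be read off such bucket counts; a hedged sketch, assuming the buckets are given as (lower_bound, upper_bound, count) tuples in seconds:)

def approx_percentile(buckets: list[tuple[float, float, int]], p: float) -> float:
    """Approximate the p-th percentile (0 < p < 100) from histogram buckets.

    Each bucket is (lower_bound, upper_bound, count); values are assumed to be
    spread uniformly within a bucket. Illustrative only.
    """
    total = sum(count for _, _, count in buckets)
    target = total * p / 100.0
    seen = 0
    for lower, upper, count in buckets:
        if seen + count >= target:
            # Linear interpolation within the bucket.
            frac = (target - seen) / count if count else 0.0
            return lower + frac * (upper - lower)
        seen += count
    return buckets[-1][1]


# Example: approximate p99 latency from the histogram above.
buckets = [
    (0.033554432, 0.067108864, 35),
    (0.067108864, 0.134217728, 1234),
    (0.134217728, 0.268435456, 1111),
    (0.268435456, 0.536870912, 60545),
    (0.536870912, 1.073741824, 248),
    (1.073741824, 2.147483648, 431),
    (2.147483648, 4.294967296, 540),
    (4.294967296, 8.589934592, 70),
]
print(f"approx p99: {approx_percentile(buckets, 99):.3f}s")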

Member

Mind checking the contents of the results file? I think it should be name, start time, elapsed nanoseconds, rows, errors, but I might be mistaken.

Contributor Author

So, it turns out that query-results-file is broken in Seth's fork: the file comes out completely empty. But I've now tried running the original code from memsql, and it seems to print just the query results. There is also no CSV header.

This is my ini:

duration=10s

[setup]
query=create cluster qps (size='scale=1,workers=1')
query=create materialized view if not exists test2 as select generate_series as x from generate_series(1, 100)
query=create view if not exists tempview1 as select * from test2 where x % 2 = 1
query=create default index on tempview1

[teardown]
query=drop materialized view test2 cascade
query=drop cluster qps

[loadtest]
query=select * from tempview1
concurrency=16
start=3s
query-results-file=/tmp/results.csv

And these are the first few lines of the results file:

1
3
5
7
9
11
13
15
17
19
21
23
25
27

which is (a part of) the result of the above query.

Contributor Author

@ggevay ggevay Nov 11, 2025

I guess we could modify our fork to also output timings, but how about we get back to figuring out how exactly to extract detailed timing info from dbbench later? I'd like to consider this out of scope for this PR, as mentioned here.

Comment on lines +361 to +370
self.add_result(
    category,
    name,
    # we have `duration` instead of `repetitions` in QPS benchmarks currently (but maybe it's good to keep
    # `repetition` in the schema in case we ever want also multiple repetitions for QPS)
    0,
    None,
    None,
    qps=qps_val,
)
Member

I feel the raw QPS is less meaningful than a histogram over durations. Without the latency, the QPS doesn't tell us much as it could correlate with very high latency.

Contributor Author

I agree that we should record latency, but I'd like to do this later if that's ok. This is because I'd like to now urgently get back to the main part of the QPS work (#33713 and follow-up PRs). And then after the November deliverables are done for https://github.com/MaterializeInc/database-issues/issues/9593, I'd be happy to follow up on making the measurements here fancier.

Btw. for looking at the latency histogram, I think it would be good to modify our fork of dbbench to be able to print more machine-readable output. This would also make it kind of a separate thing from this initial PR.

Without the latency, the QPS doesn't tell us much as it could correlate with very high latency.

Well, when QPS does correlate with latency, that means that measuring QPS also gives us information about latency. So, latency would be particularly important to record when QPS does not correlate with latency, since in that case a regression in latency might go unnoticed by just looking at QPS.

(Btw. I can offer some anecdotal evidence for QPS being correlated with latency: I did have a situation during development where I had very bad tail latencies due to a glitch in the benchmark setup, and QPS also went down with it. Concretely, this happened when I didn't have the mz region disable before the enables, and a 0dt promotion sometimes came in the middle of dbbench measurements. This caused just a few queries (e.g., 4 queries out of thousands) to have a latency of over 1 minute, and this also made QPS very low.)

@def- def- requested a review from a team as a code owner November 11, 2025 13:58
@def- def- force-pushed the qps-cluster-spec-sheet branch from 9f89356 to 3ea08e6 on November 11, 2025 13:59
@def-
Contributor

def- commented Nov 11, 2025

I made the requested changes, new test run: https://buildkite.com/materialize/release-qualification/builds/985

@antiguru
Member

Sorry, it's called query-results-file!

@def- def- force-pushed the qps-cluster-spec-sheet branch 2 times, most recently from 54fe712 to 57af05f on November 11, 2025 15:24
@def-
Contributor

def- commented Nov 11, 2025

Failed because the dbbench repo on DockerHub/GHCR is not public, I asked Paul to help us out: https://buildkite.com/materialize/release-qualification/builds/985#019a7384-cfa9-4819-a0ee-fde99ea15a15
Edit: https://buildkite.com/materialize/release-qualification/builds/990

@ggevay
Contributor Author

ggevay commented Nov 11, 2025

Failed because the dbbench repo on DockerHub/GHCR is not public, I asked Paul to help us out

Thank you! So, is this thread superseded? Or do we still need the repo fork as well?

@def- def- force-pushed the qps-cluster-spec-sheet branch from 57af05f to 9bb0fae on November 11, 2025 16:57
@def-
Contributor

def- commented Nov 11, 2025

Indeed, I didn't see that you had already posted. For the future: I have permissions, so you can just ping me directly.

@def- def- force-pushed the qps-cluster-spec-sheet branch 2 times, most recently from 4499447 to d2c543d on November 11, 2025 23:05
@def- def- force-pushed the qps-cluster-spec-sheet branch from d2c543d to a600dbe on November 12, 2025 08:01
@def-
Contributor

def- commented Nov 12, 2025

Green in CI: https://buildkite.com/materialize/release-qualification/builds/990
Results made it to the database:
(screenshot: results in the database, 2025-11-12 14:06)

@ggevay
Contributor Author

ggevay commented Nov 12, 2025

Great, thank you very much!

Btw. how did you resolve this redness in the normal CI?

Obtaining hostname of cloud instance ...
==> mzcompose: test case workflow-default failed: materialize.ui.UIError: failed to obtain cloud hostname within 180s: Network error during a Materialize cloud API request: error sending request for url (https://api.us-east-1.aws.cloud.materialize.com/api/region): operation timed out

I think this was caused by create_test_analytics_config catching CommandFailureCausedUIError, but cloud_hostname throwing a more general exception, UIError, since the first commit of the PR (sorry). I'm asking just to check whether you resolved this in some way or whether the new run just randomly didn't hit this flake.
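For illustration, the mismatch looks roughly like this (the import path for CommandFailureCausedUIError and the function body are assumed; the real helper builds a config rather than returning the hostname):

from materialize.ui import CommandFailureCausedUIError, UIError  # assumed import path


def create_test_analytics_config(c) -> str | None:
    try:
        hostname = c.cloud_hostname()  # since this PR, this may raise the broader UIError
    except CommandFailureCausedUIError:
        # Only the narrower subclass is handled here, so a plain UIError (e.g.
        # the network timeout above) escapes and fails the workflow instead of
        # falling back. Catching UIError would cover both cases.
        hostname = None
    return hostname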

@def-
Contributor

def- commented Nov 12, 2025

This flake doesn't seem related to this PR. I think we hit it in other Nightly runs too. I haven't investigated further though.

Member

@antiguru antiguru left a comment

Looks great! Left some comments.

Comment on lines +378 to +379
# we have `duration` instead of `repetitions` in QPS benchmarks currently (but maybe it's good to keep
# `repetition` in the schema in case we ever want also multiple repetitions for QPS)
Member

Is this comment still relevant? You could just say that 0 is the repetition.

# We'll also want to measure latency, including tail latency.


def disable_region(c: Composition, hard: bool) -> None:
Member

Note, but don't act immediately: We should extract the disable_region, wait_for_cloud, etc. to a module as they're not specific to the cluster spec sheet. Especially since they've become bigger with your changes!

Comment on lines +2307 to +2309
# Upload only cluster scaling results to Test Analytics for now, until the Test Analytics schema is extended.
# TODO: See slack discussion:
# https://materializeinc.slack.com/archives/C01LKF361MZ/p1762351652336819?thread_ts=1762348361.164759&cid=C01LKF361MZ
Member

Is this comment still relevant?



class BenchTarget:
    c: Composition
Member

Do you mind renaming c to composition? We're also using cluster c, which might get confusing (that's my fault!).

Comment on lines +2660 to +2674
def workflow_plot(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Analyze the results of the workflow."""

    parser.add_argument(
        "files",
        nargs="*",
        default="results_*.csv",
        type=str,
        help="Glob pattern of result files to plot.",
    )

    args = parser.parse_args()

    for file in itertools.chain(*map(glob.iglob, args.files)):
        analyze_file(str(file))
        analyze_results_file(str(file))
Member

The runtime dispatch based on the file name isn't great. I often find myself concatenating multiple results into nondescript file names, and the mechanism would not work here. What about using two different targets instead, plot_cluster and plot_envd? You could then update the default to do the right globbing.
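A rough sketch of the two-target variant (names, default globs, and which analyzer each target calls are illustrative):

def workflow_plot_cluster(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Plot the cluster-scaling results (run times and arrangement sizes)."""
    _plot(parser, default_glob="results_cluster_*.csv", analyze=analyze_file)


def workflow_plot_envd(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Plot the envd-scaling results (QPS)."""
    _plot(parser, default_glob="results_envd_*.csv", analyze=analyze_results_file)


def _plot(parser: WorkflowArgumentParser, default_glob: str, analyze) -> None:
    parser.add_argument(
        "files",
        nargs="*",
        default=[default_glob],
        type=str,
        help="Glob pattern of result files to plot.",
    )
    args = parser.parse_args()
    for file in itertools.chain(*map(glob.iglob, args.files)):
        analyze(str(file))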

Comment on lines +2786 to +2788
for (benchmark, category, mode), sub in df.groupby(
    ["scenario", "category", "mode"]
):
Member

I think this will not work if you have repetitions, but might be OK for now.
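If multiple repetitions are ever recorded, one hedged way to keep the grouping meaningful would be to average over them first ("size" is an assumed name for the x-axis column):

# Collapse repetitions: one averaged qps per measured configuration.
aggregated = df.groupby(
    ["scenario", "category", "mode", "size"], as_index=False
)["qps"].mean()

for (benchmark, category, mode), sub in aggregated.groupby(
    ["scenario", "category", "mode"]
):
    # `sub` now holds one row per size, regardless of how many repetitions ran.
    ...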

Comment on lines +50 to +54
## Scenarios

There are two kinds of scenarios:
- cluster scaling: These measure run times and arrangement sizes.
- envd scaling: These measure QPS.
Member

Please document how to invoke each.

