Add QPS measurements to Cluster Spec Sheet #34058
Conversation
Force-pushed d9c0271 to 23c64a8
test/cluster-spec-sheet/mzcompose.py (Outdated)
# Upload only cluster scaling results to Test Analytics for now, until the Test Analytics schema is extended.
# TODO: See slack discussion:
# https://materializeinc.slack.com/archives/C01LKF361MZ/p1762351652336819?thread_ts=1762348361.164759&cid=C01LKF361MZ
upload_results_to_test_analytics(c, cluster_path, not test_failed)
This is the point where we skip the upload to Test Analytics for the new scenario, for now.
So what this test is measuring is only QPS instead of size in bytes and time in milliseconds?
Yes:
- The size of the arrangement that we are querying from is very small and doesn't matter.
- "Time" could in theory be the latency of the queries, but I haven't added a measurement of latency for now, and it will probably be better to call it "latency" when we add a measurement for it.
@antiguru So, how should we solve this? Should I add another column to the same table or create a separate table for the envd runs?
I'd prefer to have a different table, instead of forcing the consumer to distinguish between latency and throughput. It tests fundamentally different properties.
Is it hard to add a new table?
Not hard, will do
Force-pushed 93687ac to 5305afb
Force-pushed 5305afb to 2113d83
def cloud_hostname(
    self, quiet: bool = False, timeout_secs: int = 180, poll_interval: float = 2.0
) -> str:
    """Uses the mz command line tool to get the hostname of the cloud instance, waiting until the region is ready."""
I had to add a retry loop here, because sometimes it would fail despite mz region enable already having come back saying that the cloud side of things is done. I think this started to happen when I added an mz region disable before my mz region enables. (The reason for adding the disabling is explained in a code comment in cloud_disable_enable_and_wait.)
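Roughly the shape that retry loop takes (a sketch only; the function name, the exact `mz` invocation, and the output parsing are placeholders, not the actual implementation):

```python
import subprocess
import time


def cloud_hostname_with_retry(timeout_secs: int = 180, poll_interval: float = 2.0) -> str:
    """Poll `mz` until the region's hostname can be resolved or a timeout elapses."""
    deadline = time.time() + timeout_secs
    last_err: Exception | None = None
    while time.time() < deadline:
        try:
            # Placeholder for whatever `mz` invocation the real helper runs to
            # resolve the region's hostname; the parsing below is equally made up.
            out = subprocess.run(
                ["mz", "region", "show"], capture_output=True, text=True, check=True
            ).stdout
            for line in out.splitlines():
                if "hostname" in line.lower() and ":" in line:
                    return line.split(":", 1)[1].strip()
        except subprocess.CalledProcessError as e:
            last_err = e
        time.sleep(poll_interval)
    raise RuntimeError(f"hostname not available after {timeout_secs}s: {last_err}")
```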
Force-pushed bfd9236 to 8f18497
Force-pushed 8f18497 to d493ba0
Do you have a sample run in CI? I didn't see one in https://buildkite.com/materialize/release-qualification/builds?branch=ggevay%3Aqps-cluster-spec-sheet

Started a CI run now: https://buildkite.com/materialize/release-qualification/builds/981 (I wanted to start it yesterday evening, but somehow forgot.)

The actual CI run has uncovered an issue that was not visible in my manual tests against my staging env: The new scenario wants to vary the number of CPU cores that we give to envd, which it tries to do with

We didn't want to set up a fresh environment for the Cluster Spec Sheet, so we instead opted to reuse one that was already created for tests, but not running anything at the moment. If @antiguru is fine with it, I can try to move us to Staging.
antiguru left a comment:
Just a few comments from reading once.
test/cluster-spec-sheet/mzcompose.py (Outdated)
quoted_flags = " ".join(shlex.quote(x) for x in flags)
script = (
    "set -euo pipefail; "
    "cat > /tmp/run.ini; "
It's better to use a temp file here instead of relying on /tmp/run.ini not being used otherwise.
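For illustration, a sketch of what the mktemp-based variant could look like (`flags` is a stand-in for whatever flag list the real code assembles, and the dbbench invocation shape is an assumption; only the temp-file handling is the point here):

```python
import shlex

# Sketch of the suggested change: write the piped-in ini to a mktemp path
# instead of the fixed /tmp/run.ini.
flags: list[str] = []  # placeholder for the real dbbench flag list
quoted_flags = " ".join(shlex.quote(x) for x in flags)
script = (
    "set -euo pipefail; "
    'ini=$(mktemp); '
    'cat > "$ini"; '
    f'dbbench {quoted_flags} "$ini"; '  # assumed invocation shape
    'rm -f "$ini"'
)
```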
Opened a Slack thread in #operations. Edit: I mixed this up with the thread below.
test/dbbench/Dockerfile (Outdated)
# Pin to specific commit for reproducibility
# TODO: Fork it under the MaterializeInc organization (like e.g. sqlsmith).
ADD https://api.github.com/repos/sjwiesman/dbbench/git/commits/10b2a0b5159f06945646fa1179bda4be51fe02b4 version.json
We should clone this into Mz's GH organization.
Done (using mktemp). Edit: I mixed this up with the thread above.
Did we clone the repo into our org?
I'm not sure. @def-, I thought that in this thread we were talking about this, but maybe that was actually a different thing?
No, that was not the GitHub repo, but the GHCR and Docker container name.
I'll clone the repo into our org now and then change the commit. Edit: Done.
I would prefer moving it to Staging, if it's not too much hassle. Hardcoding this org in the cloud code seems not so elegant; handling this in Staging seems like a more flexible approach.

I'd prefer to keep the cluster spec sheet portion in prod to avoid any uncertainty that there's a difference between prod and staging. If we were to move it to staging, we'd need a test that ensures they're indeed equivalent. However, I think there's an alternative: We could run the cluster spec sheet selectively, and run the default set of operations on prod, and just the envd portion in staging.

Ok, then let's split it, and run the envd/QPS part in Staging (in the region close to the Hetzner machines). What's the procedure for creating an organization in Staging? Just asking in

Let me check if we have something suitable in Staging already.
antiguru left a comment:
I think this is headed in the right direction, but we should iterate on what we're trying to record. Just a QPS value doesn't feel like enough; we also need the corresponding latency percentiles.
We capture and parse dbbench's output (stderr or stdout) to find the last
occurrence of a "QPS" value that dbbench reports in its summary, and store that
as the 'qps' column in the results CSV. If no QPS is found, the test fails.
(We omit time_ms for these rows since wall-clock time is not meaningful here.)
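A minimal sketch of that parsing step (assuming the summary contains text like "1234.5 QPS"; the exact wording of dbbench's output may differ):

```python
import re

# Matches a floating-point number immediately followed by "QPS".
QPS_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s*QPS")


def parse_last_qps(output: str) -> float:
    """Return the last QPS value reported in dbbench's combined stdout/stderr."""
    matches = QPS_RE.findall(output)
    if not matches:
        raise RuntimeError("no QPS value found in dbbench output")
    return float(matches[-1])
```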
It feels better to capture dbbench's output file (query-log-file) and compute percentiles from it.
> It feels better to capture dbbench's output file

(We do capture dbbench's output file, and save it in the dbbench-logs directory.)

> query-log-file

Well, query-log-file seems to be an input file to dbbench, as far as I can tell from reading dbbench's tutorial. Did you mean to suggest query-log-file, or something else?

> and compute percentiles from it
I was planning this for follow-up PRs, as mentioned in code comments, e.g.:
# TODO: Later, we'll want to also look at latency (avg and tail-latency), but seems too unstable for now.
(Continuing here from #34058 (comment) )
> query-results-file

> and compute percentiles from it
Sorry, I still don't understand. From a quick read, it seems that query-results-file records the actual data that the queries return, but does not record any timing information. What percentiles do you mean?
Note that dbbench prints a histogram on its own, without any extra options or manual analysis, e.g.
33.554432ms - 67.108864ms [ 35]: ▏
67.108864ms - 134.217728ms [ 1234]: █
134.217728ms - 268.435456ms [ 1111]: ▉
268.435456ms - 536.870912ms [ 60545]: ██████████████████████████████████████████████████
536.870912ms - 1.073741824s [ 248]: ▏
1.073741824s - 2.147483648s [ 431]: ▎
2.147483648s - 4.294967296s [ 540]: ▍
4.294967296s - 8.589934592s [ 70]: ▏
Mind checking the contents of the results file? I think it should be name, start time, elapsed nanoseconds, rows, errors, but I might be mistaken.
So it turns out that query-results-file is broken in Seth's fork: the file comes out completely empty. But I've now tried running the original code from memsql, and it seems to print just the query results. There is also no CSV header.
This is my ini:
duration=10s
[setup]
query=create cluster qps (size='scale=1,workers=1')
query=create materialized view if not exists test2 as select generate_series as x from generate_series(1, 100)
query=create view if not exists tempview1 as select * from test2 where x % 2 = 1
query=create default index on tempview1
[teardown]
query=drop materialized view test2 cascade
query=drop cluster qps
[loadtest]
query=select * from tempview1
concurrency=16
start=3s
query-results-file=/tmp/results.csv
And these are the first few lines of the results file:
1
3
5
7
9
11
13
15
17
19
21
23
25
27
which is (a part of) the result of the above query.
I guess we could modify our fork to also output timings, but how about we get back to figuring out how exactly to extract detailed timing info from dbbench later? I'd like to consider this out of scope for this PR, as mentioned here.
self.add_result(
    category,
    name,
    # we have `duration` instead of `repetitions` in QPS benchmarks currently (but maybe it's good to keep
    # `repetition` in the schema in case we ever want also multiple repetitions for QPS)
    0,
    None,
    None,
    qps=qps_val,
)
I feel the raw QPS is less meaningful than a histogram over durations. Without the latency, the QPS doesn't tell us much as it could correlate with very high latency.
I agree that we should record latency, but I'd like to do this later if that's ok. This is because I'd like to now urgently get back to the main part of the QPS work (#33713 and follow-up PRs). And then, after the November deliverables are done for https://github.com/MaterializeInc/database-issues/issues/9593, I'd be happy to follow up on making the measurements here fancier.

Btw. for looking at the latency histogram, I think it would be good to modify our fork of dbbench to be able to print more machine-readable output. This would also make it kind of a separate thing from this initial PR.

> Without the latency, the QPS doesn't tell us much as it could correlate with very high latency.

Well, when QPS does correlate with latency, measuring QPS also gives us information about latency. So latency would be particularly important to record when QPS does not correlate with latency, since in that case a regression in latency might go unnoticed by just looking at QPS.

(Btw. I can offer some anecdotal evidence for QPS being correlated with latency: I did have a situation during development where I had very bad tail latencies due to a glitch in the benchmark setup, and QPS also went down with it. Concretely, this happened when I didn't have the mz region disable before the enables, and a 0dt promotion sometimes came in the middle of dbbench measurements. This caused just a few queries (e.g., 4 queries out of thousands) to have a latency of over 1 minute, and this also made QPS very low.)
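For reference, once per-query latencies are available in machine-readable form, computing the percentiles themselves is straightforward (the numbers below are made up):

```python
import numpy as np

# Hypothetical per-query latencies in milliseconds, as our dbbench fork might
# emit them in a machine-readable per-query log.
latencies_ms = np.array([12.3, 15.1, 230.4, 295.2, 310.0, 1400.0])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```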
Force-pushed 9f89356 to 3ea08e6
I made the requested changes; new test run: https://buildkite.com/materialize/release-qualification/builds/985

Sorry, it's called
Force-pushed 54fe712 to 57af05f
Failed because the dbbench repo on DockerHub/GHCR is not public; I asked Paul to help us out: https://buildkite.com/materialize/release-qualification/builds/985#019a7384-cfa9-4819-a0ee-fde99ea15a15

Thank you! So, is this thread superseded? Or do we still need the repo fork as well?
Force-pushed 57af05f to 9bb0fae
Indeed, I didn't see that you had already posted. For the future: I have permissions, so you can just ping me directly.
Force-pushed 4499447 to d2c543d
Force-pushed d2c543d to a600dbe
Green in CI: https://buildkite.com/materialize/release-qualification/builds/990

Great, thank you very much! Btw. how did you resolve this redness in the normal CI? I think this was caused by

This flake doesn't seem related to this PR. I think we hit it in other Nightly runs too. I haven't investigated further though.
antiguru left a comment:
Looks great! Left some comments.
# we have `duration` instead of `repetitions` in QPS benchmarks currently (but maybe it's good to keep
# `repetition` in the schema in case we ever want also multiple repetitions for QPS)
Is this comment still relevant? You could just say that 0 is the repetition.
# We'll also want to measure latency, including tail latency.


def disable_region(c: Composition, hard: bool) -> None:
Note, but don't act immediately: We should extract the disable_region, wait_for_cloud, etc. to a module as they're not specific to the cluster spec sheet. Especially since they've become bigger with your changes!
# Upload only cluster scaling results to Test Analytics for now, until the Test Analytics schema is extended.
# TODO: See slack discussion:
# https://materializeinc.slack.com/archives/C01LKF361MZ/p1762351652336819?thread_ts=1762348361.164759&cid=C01LKF361MZ
Is this comment still relevant?
class BenchTarget:
    c: Composition
Do you mind renaming c to composition? We're also using cluster c, which might get confusing (that's my fault!).
def workflow_plot(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Analyze the results of the workflow."""

    parser.add_argument(
        "files",
        nargs="*",
        default="results_*.csv",
        type=str,
        help="Glob pattern of result files to plot.",
    )

    args = parser.parse_args()

    for file in itertools.chain(*map(glob.iglob, args.files)):
        analyze_file(str(file))
        analyze_results_file(str(file))
The runtime dispatch based on the file name isn't great. I often find myself concatenating multiple results into non-descript file names, and the mechanism would not work here. What about using two different targets instead, plot_cluster and plot_envd? You could then update the default to do the right globbing.
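A sketch of what that split could look like (the target names and default globs are guesses, not the final interface; `Composition`, `WorkflowArgumentParser`, `analyze_file`, and `analyze_results_file` are the existing definitions in mzcompose.py):

```python
import glob
import itertools


def workflow_plot_cluster(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Plot cluster-scaling results."""
    _plot_matching(parser, default_glob="results_cluster_*.csv", analyze=analyze_file)


def workflow_plot_envd(c: Composition, parser: WorkflowArgumentParser) -> None:
    """Plot envd-scaling (QPS) results."""
    _plot_matching(parser, default_glob="results_envd_*.csv", analyze=analyze_results_file)


def _plot_matching(parser: WorkflowArgumentParser, default_glob: str, analyze) -> None:
    parser.add_argument(
        "files",
        nargs="*",
        default=[default_glob],
        type=str,
        help="Glob pattern of result files to plot.",
    )
    args = parser.parse_args()
    for file in itertools.chain(*map(glob.iglob, args.files)):
        analyze(str(file))
```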
for (benchmark, category, mode), sub in df.groupby(
    ["scenario", "category", "mode"]
):
I think this will not work if you have repetitions, but might be OK for now.
## Scenarios

There are two kinds of scenarios:
- cluster scaling: These measure run times and arrangement sizes.
- envd scaling: These measure QPS.
Please document how to invoke each.
test/dbbench/Dockerfile (Outdated)
# Pin to specific commit for reproducibility
# TODO: Fork it under the MaterializeInc organization (like e.g. sqlsmith).
ADD https://api.github.com/repos/sjwiesman/dbbench/git/commits/10b2a0b5159f06945646fa1179bda4be51fe02b4 version.json
Did we clone the repo into our org?

This PR adds QPS measurements to the Cluster Spec Sheet, so that we have automated and repeatable QPS benchmarks.
It runs for about 15 minutes. Later, we can add more QPS scenarios. (There are some ideas in a code comment at the end of QpsEnvdStrongScalingScenario.)

Btw. I saw buildkite.shard_list in workflow_default, but this would work only if we made the different shards use different envs. Currently, the buildkite-level concurrency is set to 1, so it's not an issue for now. (This is the situation both before and after this PR.)

Edit: CI run: https://buildkite.com/materialize/release-qualification/builds/981