-
Notifications
You must be signed in to change notification settings - Fork 319
Fix deadlock in dd-task-scheduler #10096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // Load JFR Handlers class early, if present (it has been moved and renamed in JDK23+). | ||
| // This prevents a deadlock. See PROF-13025. | ||
| try { | ||
| Class.forName("jdk.jfr.events.Handlers"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are touching this class only when we have JFR available, so loading the Handlers class should not be disturbing anything.
But - we have smap entry events disabled by default, yet we are getting these issues ... I have a suspicion that we are initing the support regardless of the enablement status. Could you check for PROFILING_SMAP_COLLECTION_ENABLED setting before merging this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be slighly more complex. The smap event is registered in datadog.trace.bootstrap.Agent and there we probably don't have easy access to PROFILING_SMAP_COLLECTION_ENABLED (not sure), and it might also affect other events. I filed PROF-13213 for a deeper investigation. I would merge this PR as-is. Is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
BenchmarksStartupParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 58 metrics, 7 unstable metrics. Startup time reports for petclinicgantt
title petclinic - global startup overhead: candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.079 s) : 0, 1079007
Total [baseline] (10.886 s) : 0, 10886271
Agent [candidate] (1.079 s) : 0, 1078682
Total [candidate] (10.849 s) : 0, 10848884
section appsec
Agent [baseline] (1.265 s) : 0, 1264719
Total [baseline] (11.183 s) : 0, 11183027
Agent [candidate] (1.264 s) : 0, 1264441
Total [candidate] (11.194 s) : 0, 11193692
section iast
Agent [baseline] (1.236 s) : 0, 1235624
Total [baseline] (11.289 s) : 0, 11289054
Agent [candidate] (1.222 s) : 0, 1222351
Total [candidate] (11.129 s) : 0, 11128717
section profiling
Agent [baseline] (1.203 s) : 0, 1202987
Total [baseline] (11.038 s) : 0, 11038378
Agent [candidate] (1.202 s) : 0, 1201745
Total [candidate] (11.089 s) : 0, 11089496
gantt
title petclinic - break down per module: candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.184 ms) : 0, 1184
crashtracking [candidate] (1.179 ms) : 0, 1179
BytebuddyAgent [baseline] (646.618 ms) : 0, 646618
BytebuddyAgent [candidate] (647.221 ms) : 0, 647221
GlobalTracer [baseline] (281.562 ms) : 0, 281562
GlobalTracer [candidate] (281.084 ms) : 0, 281084
AppSec [baseline] (32.477 ms) : 0, 32477
AppSec [candidate] (32.127 ms) : 0, 32127
Debugger [baseline] (68.268 ms) : 0, 68268
Debugger [candidate] (68.217 ms) : 0, 68217
Remote Config [baseline] (627.942 µs) : 0, 628
Remote Config [candidate] (640.27 µs) : 0, 640
Telemetry [baseline] (9.108 ms) : 0, 9108
Telemetry [candidate] (8.992 ms) : 0, 8992
Flare Poller [baseline] (3.76 ms) : 0, 3760
Flare Poller [candidate] (3.775 ms) : 0, 3775
section appsec
crashtracking [baseline] (1.201 ms) : 0, 1201
crashtracking [candidate] (1.215 ms) : 0, 1215
BytebuddyAgent [baseline] (688.488 ms) : 0, 688488
BytebuddyAgent [candidate] (688.887 ms) : 0, 688887
GlobalTracer [baseline] (259.503 ms) : 0, 259503
GlobalTracer [candidate] (259.09 ms) : 0, 259090
AppSec [baseline] (175.444 ms) : 0, 175444
AppSec [candidate] (173.471 ms) : 0, 173471
Debugger [baseline] (66.196 ms) : 0, 66196
Debugger [candidate] (68.061 ms) : 0, 68061
Remote Config [baseline] (680.401 µs) : 0, 680
Remote Config [candidate] (708.373 µs) : 0, 708
Telemetry [baseline] (8.98 ms) : 0, 8980
Telemetry [candidate] (9.026 ms) : 0, 9026
Flare Poller [baseline] (3.818 ms) : 0, 3818
Flare Poller [candidate] (4.02 ms) : 0, 4020
IAST [baseline] (24.85 ms) : 0, 24850
IAST [candidate] (24.456 ms) : 0, 24456
section iast
crashtracking [baseline] (1.201 ms) : 0, 1201
crashtracking [candidate] (1.192 ms) : 0, 1192
BytebuddyAgent [baseline] (798.952 ms) : 0, 798952
BytebuddyAgent [candidate] (790.449 ms) : 0, 790449
GlobalTracer [baseline] (258.398 ms) : 0, 258398
GlobalTracer [candidate] (255.549 ms) : 0, 255549
AppSec [baseline] (35.623 ms) : 0, 35623
AppSec [candidate] (34.421 ms) : 0, 34421
Debugger [baseline] (66.231 ms) : 0, 66231
Debugger [candidate] (66.184 ms) : 0, 66184
Remote Config [baseline] (559.928 µs) : 0, 560
Remote Config [candidate] (544.705 µs) : 0, 545
Telemetry [baseline] (8.503 ms) : 0, 8503
Telemetry [candidate] (8.377 ms) : 0, 8377
Flare Poller [baseline] (3.55 ms) : 0, 3550
Flare Poller [candidate] (3.46 ms) : 0, 3460
IAST [baseline] (27.154 ms) : 0, 27154
IAST [candidate] (26.813 ms) : 0, 26813
section profiling
ProfilingAgent [baseline] (97.109 ms) : 0, 97109
ProfilingAgent [candidate] (97.522 ms) : 0, 97522
crashtracking [baseline] (1.175 ms) : 0, 1175
crashtracking [candidate] (1.186 ms) : 0, 1186
BytebuddyAgent [baseline] (701.539 ms) : 0, 701539
BytebuddyAgent [candidate] (700.31 ms) : 0, 700310
GlobalTracer [baseline] (220.138 ms) : 0, 220138
GlobalTracer [candidate] (220.185 ms) : 0, 220185
AppSec [baseline] (32.304 ms) : 0, 32304
AppSec [candidate] (32.135 ms) : 0, 32135
Debugger [baseline] (67.866 ms) : 0, 67866
Debugger [candidate] (67.736 ms) : 0, 67736
Remote Config [baseline] (648.246 µs) : 0, 648
Remote Config [candidate] (597.266 µs) : 0, 597
Telemetry [baseline] (9.008 ms) : 0, 9008
Telemetry [candidate] (8.916 ms) : 0, 8916
Flare Poller [baseline] (3.739 ms) : 0, 3739
Flare Poller [candidate] (3.731 ms) : 0, 3731
Profiling [baseline] (97.69 ms) : 0, 97690
Profiling [candidate] (98.09 ms) : 0, 98090
Startup time reports for insecure-bankgantt
title insecure-bank - global startup overhead: candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.085 s) : 0, 1084993
Total [baseline] (8.787 s) : 0, 8787060
Agent [candidate] (1.081 s) : 0, 1081104
Total [candidate] (8.741 s) : 0, 8741069
section iast
Agent [baseline] (1.22 s) : 0, 1219631
Total [baseline] (9.421 s) : 0, 9421472
Agent [candidate] (1.228 s) : 0, 1227900
Total [candidate] (9.497 s) : 0, 9496861
gantt
title insecure-bank - break down per module: candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.191 ms) : 0, 1191
crashtracking [candidate] (1.19 ms) : 0, 1190
BytebuddyAgent [baseline] (651.916 ms) : 0, 651916
BytebuddyAgent [candidate] (648.189 ms) : 0, 648189
GlobalTracer [baseline] (282.406 ms) : 0, 282406
GlobalTracer [candidate] (283.25 ms) : 0, 283250
AppSec [baseline] (32.699 ms) : 0, 32699
AppSec [candidate] (32.428 ms) : 0, 32428
Debugger [baseline] (67.831 ms) : 0, 67831
Debugger [candidate] (67.214 ms) : 0, 67214
Remote Config [baseline] (659.31 µs) : 0, 659
Remote Config [candidate] (635.775 µs) : 0, 636
Telemetry [baseline] (9.052 ms) : 0, 9052
Telemetry [candidate] (8.965 ms) : 0, 8965
Flare Poller [baseline] (3.786 ms) : 0, 3786
Flare Poller [candidate] (3.733 ms) : 0, 3733
section iast
crashtracking [baseline] (1.18 ms) : 0, 1180
crashtracking [candidate] (1.192 ms) : 0, 1192
BytebuddyAgent [baseline] (788.613 ms) : 0, 788613
BytebuddyAgent [candidate] (794.659 ms) : 0, 794659
GlobalTracer [baseline] (255.246 ms) : 0, 255246
GlobalTracer [candidate] (256.781 ms) : 0, 256781
IAST [baseline] (26.721 ms) : 0, 26721
IAST [candidate] (26.847 ms) : 0, 26847
AppSec [baseline] (34.492 ms) : 0, 34492
AppSec [candidate] (35.724 ms) : 0, 35724
Debugger [baseline] (65.717 ms) : 0, 65717
Debugger [candidate] (64.665 ms) : 0, 64665
Remote Config [baseline] (536.645 µs) : 0, 537
Remote Config [candidate] (549.087 µs) : 0, 549
Telemetry [baseline] (8.385 ms) : 0, 8385
Telemetry [candidate] (8.484 ms) : 0, 8484
Flare Poller [baseline] (3.446 ms) : 0, 3446
Flare Poller [candidate] (3.499 ms) : 0, 3499
LoadParameters
See matching parameters
SummaryFound 3 performance improvements and 1 performance regressions! Performance is the same for 14 metrics, 18 unstable metrics.
Request duration reports for insecure-bankgantt
title insecure-bank - request duration [CI 0.99] : candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section baseline
no_agent (1.226 ms) : 1214, 1238
. : milestone, 1226,
iast (3.44 ms) : 3389, 3491
. : milestone, 3440,
iast_FULL (5.869 ms) : 5811, 5927
. : milestone, 5869,
iast_GLOBAL (3.683 ms) : 3628, 3737
. : milestone, 3683,
profiling (2.109 ms) : 2089, 2128
. : milestone, 2109,
tracing (1.827 ms) : 1813, 1842
. : milestone, 1827,
section candidate
no_agent (1.192 ms) : 1180, 1203
. : milestone, 1192,
iast (3.191 ms) : 3148, 3234
. : milestone, 3191,
iast_FULL (5.952 ms) : 5892, 6012
. : milestone, 5952,
iast_GLOBAL (3.54 ms) : 3485, 3595
. : milestone, 3540,
profiling (2.042 ms) : 2023, 2061
. : milestone, 2042,
tracing (1.908 ms) : 1891, 1924
. : milestone, 1908,
Request duration reports for petclinicgantt
title petclinic - request duration [CI 0.99] : candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section baseline
no_agent (18.92 ms) : 18732, 19107
. : milestone, 18920,
appsec (20.209 ms) : 19997, 20421
. : milestone, 20209,
code_origins (18.122 ms) : 17941, 18303
. : milestone, 18122,
iast (17.924 ms) : 17745, 18104
. : milestone, 17924,
profiling (18.509 ms) : 18321, 18697
. : milestone, 18509,
tracing (18.07 ms) : 17889, 18251
. : milestone, 18070,
section candidate
no_agent (17.037 ms) : 16864, 17209
. : milestone, 17037,
appsec (19.859 ms) : 19656, 20063
. : milestone, 19859,
code_origins (17.911 ms) : 17730, 18093
. : milestone, 17911,
iast (18.821 ms) : 18630, 19012
. : milestone, 18821,
profiling (18.481 ms) : 18296, 18666
. : milestone, 18481,
tracing (18.2 ms) : 18014, 18385
. : milestone, 18200,
DacapoParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 2 unstable metrics. Execution time for tomcatgantt
title tomcat - execution time [CI 0.99] : candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section baseline
no_agent (1.473 ms) : 1461, 1484
. : milestone, 1473,
appsec (3.733 ms) : 3515, 3952
. : milestone, 3733,
iast (2.215 ms) : 2151, 2280
. : milestone, 2215,
iast_GLOBAL (2.253 ms) : 2189, 2318
. : milestone, 2253,
profiling (2.085 ms) : 2031, 2139
. : milestone, 2085,
tracing (2.051 ms) : 2000, 2102
. : milestone, 2051,
section candidate
no_agent (1.467 ms) : 1455, 1478
. : milestone, 1467,
appsec (3.695 ms) : 3476, 3914
. : milestone, 3695,
iast (2.207 ms) : 2143, 2271
. : milestone, 2207,
iast_GLOBAL (2.269 ms) : 2204, 2334
. : milestone, 2269,
profiling (2.482 ms) : 2321, 2643
. : milestone, 2482,
tracing (2.049 ms) : 1998, 2100
. : milestone, 2049,
Execution time for biojavagantt
title biojava - execution time [CI 0.99] : candidate=1.57.0-SNAPSHOT~99ec0365ca, baseline=1.57.0-SNAPSHOT~2b04c87b81
dateFormat X
axisFormat %s
section baseline
no_agent (15.615 s) : 15615000, 15615000
. : milestone, 15615000,
appsec (14.498 s) : 14498000, 14498000
. : milestone, 14498000,
iast (18.182 s) : 18182000, 18182000
. : milestone, 18182000,
iast_GLOBAL (18.288 s) : 18288000, 18288000
. : milestone, 18288000,
profiling (14.786 s) : 14786000, 14786000
. : milestone, 14786000,
tracing (14.856 s) : 14856000, 14856000
. : milestone, 14856000,
section candidate
no_agent (15.078 s) : 15078000, 15078000
. : milestone, 15078000,
appsec (14.654 s) : 14654000, 14654000
. : milestone, 14654000,
iast (18.36 s) : 18360000, 18360000
. : milestone, 18360000,
iast_GLOBAL (17.875 s) : 17875000, 17875000
. : milestone, 17875000,
profiling (14.651 s) : 14651000, 14651000
. : milestone, 14651000,
tracing (14.836 s) : 14836000, 14836000
. : milestone, 14836000,
|
|
|
||
| static { | ||
| // Load JFR Handlers class early, if present (it has been moved and renamed in JDK23+). | ||
| // This prevents a deadlock. See PROF-13025. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Is there a JDK bug id as well to drop here ?
The deadlock has been reported to upstream, and a fix has been proposed there
|
@jbachorik can you re-review this? I pulled the loading of Handlers class out to Agent.registerSmapEvent(). I think this is a better place, and it uses AGENT_CLASSLOADER directly and sits nicely besides the code that loads the SmapEntryFactory. WDYT? |
jbachorik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - can you add the JBS bug in the comment here, @bric3 requested?
What Does This Do
This is a fix/workaround that addresses a potential deadlock in JFR. The deadlock has been reported to upstream, and a fix has been proposed there, but it would be good to also fix it in our own code for when the code is running on an unpatched JDK.
Upstream bug: JDK-8371889
Motivation
This change fixes a deadlock in the profiling agent.
Additional Notes
We are loading the JFR handlers class early. This will prevent the deadlock that happens when we call EventType.getEventType() from the SmapEventFactory static initializer. That JFR code would first lock the Utils class lock, and then later try to acquire the Handlers' class initializer lock. If another thread is currently in the static initializer of the Handlers class, then that thread would attempt to acquire the Utils class lock, and thus deadlock. Calling the Handlers static initializer early avoids that scenario by ensuring that the Handlers class is fully initialized before calling into the JFR code that would acquire the Utils lock.
The problem is only present in JDKs < 22, and the Handlers class has been renamed and moved starting in JDK22. That is why the class is loaded reflectively and any errors while loading the class are silently ignored.
Contributor Checklist
type:and (comp:orinst:) labels in addition to any useful labelsclose,fixor any linking keywords when referencing an issue.Use
solvesinstead, and assign the PR milestone to the issueJira ticket: PROF-13025