Impossible to fully delete agent via a schedule #616

@PlasticLizard

Describe the bug
I'm having difficulty deleting an agent after a period of inactivity. The root cause appears to be that await this.ctx.storage.deleteAlarm(); is not always (perhaps not even usually) respected when the object is evicted immediately afterwards with an untrappable exception (due to .abort()).

In my schedule handler I call .destroy(), which, as advertised, deletes the Durable Object (DO) and evicts it. However, 2 seconds later (as per the retry docs for DOs), the DO runtime retries the alarm, even though .destroy() clearly calls deleteAlarm().

This causes the agent to come back into existence and re-create its schema, which then consumes a small amount of storage. Although each instance is small, across many agents and over time these zombies will add up.

To work around this, rather than calling .destroy() I wrote my own version of the destroy method that omits the abort() call, so that the alarm handler won't throw and won't be retried. However, this doesn't work either: the alarm() handler uses the database to remove the executed schedule, but the database schema is gone once my handler deletes storage, so it throws anyway.

Possible solutions, all of which I think require that the alarm handler not throw when the object is destroyed (because .deleteAlarm() appears unreliable), include:

  • Making the alarm handler in the base class not readonly, so we can wrap it in a try/catch in our implementation classes and silently ignore the "table not found" errors raised when trying to remove the stored schedule from a table that no longer exists
  • Swallowing this particular error in the base class and returning early from the alarm handler (so that _scheduleNextAlarm isn't called, which would raise further errors)
  • Setting a member variable _destroyed in the destroy call, then using that to exit the alarm handler early after the user code callback. This would also require omitting the .abort() call or making it optional via a parameter to .destroy().
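
To make the third option concrete, here is a minimal sketch of the flag idea. It does not use the agents package at all: FakeStorage, SketchAgent, useGuard, and runAlarm are hypothetical stand-ins invented for illustration, the handlers are synchronous for brevity, and the real base class may be structured quite differently.

```typescript
// Hypothetical in-memory stand-in for the Durable Object's SQL storage.
class FakeStorage {
  private tables = new Map<string, Set<string>>();

  createTable(name: string): void {
    if (!this.tables.has(name)) this.tables.set(name, new Set());
  }

  insert(table: string, id: string): void {
    this.rows(table).add(id);
  }

  remove(table: string, id: string): void {
    this.rows(table).delete(id);
  }

  deleteAll(): void {
    this.tables.clear();
  }

  // Throws when the table is gone, mimicking a "table not found" error.
  private rows(table: string): Set<string> {
    const rows = this.tables.get(table);
    if (!rows) throw new Error(`no such table: ${table}`);
    return rows;
  }
}

// Hypothetical simplified agent (assumption: real handlers are async).
class SketchAgent {
  private destroyed = false;
  private storage: FakeStorage;
  private useGuard: boolean;

  constructor(storage: FakeStorage, useGuard: boolean) {
    this.storage = storage;
    this.useGuard = useGuard;
    storage.createTable("schedules");
    storage.insert("schedules", "job-1");
  }

  // Custom destroy: wipes storage and sets the flag, but never aborts.
  destroy(): void {
    this.storage.deleteAll();
    this.destroyed = true;
  }

  // Runs the scheduled callback, then removes the executed schedule row.
  alarm(callback: (agent: SketchAgent) => void): string {
    callback(this);
    if (this.useGuard && this.destroyed) {
      return "exited early"; // nothing thrown, so nothing to retry
    }
    this.storage.remove("schedules", "job-1"); // throws if storage was wiped
    return "schedule removed";
  }
}

// Runs one alarm whose callback destroys the agent and reports the outcome.
function runAlarm(useGuard: boolean): string {
  const agent = new SketchAgent(new FakeStorage(), useGuard);
  try {
    return agent.alarm((a) => a.destroy());
  } catch (err) {
    return `threw: ${(err as Error).message}`;
  }
}
```

Without the guard, the cleanup step hits the missing table and the handler throws, which is the situation that leads to a retry; with the guard, the handler returns before touching storage.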

To Reproduce

  • Create an agent and use the API to create a schedule whose handler calls and awaits destroy()
  • Observe in the Cloudflare logs that the alarm is being retried, and thus re-constructs the DO and the schema

Expected behavior
Schedule callbacks that call destroy() are not retried by the DO infrastructure.

Screenshots

Image

Version:
0.2.17

Additional context
none
