Re-queue failed pageview batch for retry on next interval #710
base: master
Conversation
@rockinrimmer is attempting to deploy a commit to goldflag's projects Team on Vercel. A member of the Team first needs to authorize it.
Walkthrough

The error handling in the pageview queue's bulk insert operation has been enhanced to re-queue failed batches. When an insertion fails, the batch is now prepended to the front of the queue for retry on the next interval, with updated logging reflecting the number of re-queued items.
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor Scheduler
participant Queue
participant DB as Bulk Insert
Scheduler->>Queue: Poll batch
Queue->>DB: Insert batch
alt Success
DB-->>Queue: Complete
Note over Queue: Batch removed from queue
else Failure
DB-->>Queue: Error
Note over Queue: Log re-queue message<br/>(X items re-queued)
Queue->>Queue: Unshift batch to front
Note over Queue: Batch ready for next interval
    end
```
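A minimal sketch of the pattern the diagram describes, assuming an in-memory array queue flushed on a fixed interval (every 10 seconds per the review below); `TotalPayload` appears in the review, but the other names and shapes here are illustrative stand-ins, not the project's actual API:

```ts
// Sketch only: field names and insertPageviews are assumptions, not the real code.
type TotalPayload = { site: string; url: string; timestamp: number };

class PageviewQueue {
  private queue: TotalPayload[] = [];

  add(pageview: TotalPayload): void {
    this.queue.push(pageview);
  }

  // Runs on a fixed interval; insertPageviews stands in for the ClickHouse bulk insert
  async processQueue(insertPageviews: (batch: TotalPayload[]) => Promise<void>): Promise<void> {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length); // drain the queue into one batch
    try {
      await insertPageviews(batch);
    } catch (error) {
      console.error(error, `Error processing pageview queue, re-adding ${batch.length} items to the start of the queue`);
      // Prepend the failed batch so it is retried on the next interval,
      // ahead of any pageviews that arrived in the meantime
      this.queue.unshift(...batch);
    }
  }
}
```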
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
server/src/services/tracker/pageviewQueue.ts (1)
45-116: Consider preserving processed pageviews to avoid redundant geo-data fetching.

Currently, when a batch fails and is re-queued, the original unprocessed payloads are added back, which means the geo-location lookup (line 45) and all transformations (lines 48-116) are repeated on retry. If the ClickHouse insert failed but geo-data was successfully fetched, this re-processing is redundant.
You could optimize by:
- Caching the `processedPageviews` alongside the retry count when re-queuing
- Relying on the geo-data service's caching layer, if one exists
However, this adds complexity and may not be worth the effort if:
- Geo-data lookups are fast/cached
- Retry failures are rare
- Fresh data on retry is preferred
Only pursue this if profiling shows geo-data fetching is a bottleneck during retries.
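A rough sketch of the first option, keeping the processed form and a retry count on each queue entry so a retry can skip the geo lookup; every type and function name here is hypothetical, not taken from the codebase:

```ts
// Hypothetical shapes; the real pipeline's types will differ.
type RawPageview = { url: string; ip: string };
type ProcessedPageview = { url: string; country?: string };

type QueueEntry = {
  raw: RawPageview;
  processed?: ProcessedPageview; // cached after the first successful processing
  retryCount: number;
};

async function flushOnce(
  queue: QueueEntry[],
  process: (raw: RawPageview) => Promise<ProcessedPageview>, // geo lookup + transforms
  insert: (rows: ProcessedPageview[]) => Promise<void> // ClickHouse bulk insert
): Promise<void> {
  const batch = queue.splice(0, queue.length);
  // Only process entries that don't already carry a cached result
  for (const entry of batch) {
    if (!entry.processed) {
      entry.processed = await process(entry.raw);
    }
  }
  try {
    await insert(batch.map(e => e.processed!));
  } catch {
    // Re-queue with processed payloads preserved so the retry skips geo lookup
    queue.unshift(...batch.map(e => ({ ...e, retryCount: e.retryCount + 1 })));
  }
}
```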
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/services/tracker/pageviewQueue.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
- GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)
```ts
this.logger.error(error, `Error processing pageview queue, re-adding ${batch.length} items to the start of the queue`);
// Re-queue the failed batch so it can be retried on the next interval
this.queue.unshift(...batch);
```
Infinite retry loop and potential memory exhaustion.
The retry mechanism lacks safeguards for persistent failures:
- Unbounded retries: If ClickHouse insertion fails repeatedly (e.g., network issues, schema mismatch, or corrupt data in the batch), the same batch will retry indefinitely every 10 seconds.
- Memory leak: During prolonged ClickHouse outages, new pageviews continuously arrive while failed batches are re-queued, causing the queue to grow without bounds.
- No failure isolation: A single corrupt item in the batch causes the entire batch to fail and retry forever.
Consider implementing:
- Retry limit: Track retry attempts per batch (e.g., max 3 retries) and move permanently failed batches to a dead-letter queue or log them for investigation.
- Exponential backoff: Delay subsequent retries to reduce pressure on ClickHouse (e.g., 10s, 30s, 60s).
- Queue size limit: Implement a maximum queue size with alerting when approaching capacity.
- Item-level error handling: Consider catching per-item validation errors if ClickHouse supports partial batch inserts.
Example approach for retry tracking:

```ts
type TotalPayloadWithRetry = TotalPayload & {
  retryCount?: number;
};

// In the error handler:
const batchWithRetry = batch.map(item => ({
  ...item,
  retryCount: (item.retryCount || 0) + 1
}));

const itemsToRequeue = batchWithRetry.filter(item => (item.retryCount || 0) < 3);
const failedItems = batchWithRetry.filter(item => (item.retryCount || 0) >= 3);

if (itemsToRequeue.length > 0) {
  this.queue.unshift(...itemsToRequeue);
}
if (failedItems.length > 0) {
  this.logger.error({ count: failedItems.length }, "Dropping items after max retries");
  // Optionally: write to dead-letter queue/file for investigation
}
```
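And a sketch of how the backoff and queue-cap suggestions above could combine with that retry tracking; the delay schedule, capacity limit, and all names here are illustrative assumptions, not the project's code:

```ts
// Assumed constants: 10s/30s/60s backoff (as suggested above) and an arbitrary cap.
const BACKOFF_MS = [10_000, 30_000, 60_000];
const MAX_QUEUE_SIZE = 100_000;

let consecutiveFailures = 0;

// Grow the delay while inserts keep failing; reset to the base interval on success.
function nextFlushDelay(): number {
  return BACKOFF_MS[Math.min(consecutiveFailures, BACKOFF_MS.length - 1)];
}

function enqueueCapped<T>(queue: T[], items: T[], warn: (msg: string) => void): void {
  queue.push(...items);
  if (queue.length > MAX_QUEUE_SIZE) {
    // Shed the oldest entries and alert instead of growing without bound
    const dropped = queue.splice(0, queue.length - MAX_QUEUE_SIZE);
    warn(`Pageview queue over capacity; dropped ${dropped.length} oldest items`);
  }
}

async function scheduleFlush(flush: () => Promise<void>): Promise<void> {
  try {
    await flush(); // throws if the batch insert failed
    consecutiveFailures = 0;
  } catch {
    consecutiveFailures++;
  }
  setTimeout(() => void scheduleFlush(flush), nextFlushDelay());
}
```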
@goldflag I can add handling so we don't retry forever if you think it's worth it?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
This re-adds pageviews to the queue if there is an error inserting into ClickHouse.