
Conversation

@rockinrimmer rockinrimmer commented Nov 18, 2025

This re-adds pageviews to the queue if there is an error inserting into ClickHouse.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error recovery for pageview tracking to automatically re-attempt failed submissions on the next processing interval, improving overall data reliability and preventing data loss.


vercel bot commented Nov 18, 2025

@rockinrimmer is attempting to deploy a commit to the goldflag's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai bot commented Nov 18, 2025

Walkthrough

The error handling in the pageview queue's bulk insert operation has been enhanced to re-queue failed batches. When an insertion fails, the batch is now prepended to the front of the queue for retry on the next interval, with updated logging reflecting the number of re-queued items.

Changes

Cohort: Error recovery for bulk insert
File(s): server/src/services/tracker/pageviewQueue.ts
Summary: Modified the error handler to re-queue failed batches at the front of the queue and updated logging to report the number of re-queued items. Only the error path is affected; successful processing is unchanged.
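The re-queue behavior can be sketched as a minimal, self-contained queue. The names here (`Pageview`, `PageviewQueue`, `bulkInsert`) are illustrative stand-ins, not the actual types in pageviewQueue.ts:

```typescript
type Pageview = { path: string; timestamp: number };

class PageviewQueue {
  private queue: Pageview[] = [];

  add(pv: Pageview): void {
    this.queue.push(pv);
  }

  size(): number {
    return this.queue.length;
  }

  async flush(bulkInsert: (batch: Pageview[]) => Promise<void>): Promise<void> {
    if (this.queue.length === 0) return;
    // Drain the current contents as one batch.
    const batch = this.queue.splice(0, this.queue.length);
    try {
      await bulkInsert(batch);
    } catch {
      // On failure, prepend the batch so its items sit ahead of any
      // pageviews that arrived while the insert was in flight.
      this.queue.unshift(...batch);
    }
  }
}
```

Using `unshift` rather than `push` preserves the original ordering of the failed items relative to newly arriving pageviews.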

Sequence Diagram(s)

sequenceDiagram
    actor Scheduler
    participant Queue
    participant DB as Bulk Insert
    
    Scheduler->>Queue: Poll batch
    Queue->>DB: Insert batch
    
    alt Success
        DB-->>Queue: Complete
        Note over Queue: Batch removed from queue
    else Failure
        DB-->>Queue: Error
        Note over Queue: Log re-queue message<br/>(X items re-queued)
        Queue->>Queue: Unshift batch to front
        Note over Queue: Batch ready for next interval
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single file modification with localized error handling logic
  • Straightforward re-queue mechanism using unshift operation
  • Enhanced logging is additive and non-breaking

Poem

🐰 A batch that stumbles, lost in the night,
Now bounces back to the queue's front—polite!
Retry, persist, till success takes the lead,
unshift makes persistent dreams guaranteed! 🔄

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and directly summarizes the main change (re-queuing failed pageview batches for retry), matching the PR's primary objective and the file modifications.
  • Docstring Coverage — ✅ Passed: no functions found in the changed files to evaluate, so the docstring coverage check was skipped.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
server/src/services/tracker/pageviewQueue.ts (1)

45-116: Consider preserving processed pageviews to avoid redundant geo-data fetching.

Currently, when a batch fails and is re-queued, the original unprocessed payloads are added back, which means the geo-location lookup (line 45) and all transformations (lines 48-116) are repeated on retry. If the ClickHouse insert failed but geo-data was successfully fetched, this re-processing is redundant.

You could optimize by:

  • Caching the processedPageviews alongside the retry count when re-queuing
  • Or, rely on the geo-data service's caching layer if one exists

However, this adds complexity and may not be worth the effort if:

  • Geo-data lookups are fast/cached
  • Retry failures are rare
  • Fresh data on retry is preferred

Only pursue this if profiling shows geo-data fetching is a bottleneck during retries.
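The first suggestion could be sketched as follows. `ProcessedPageview`, `RetryEntry`, `requeueProcessed`, and `MAX_RETRIES` are hypothetical names for illustration, not part of the current code:

```typescript
// Re-queue already-transformed rows (with geo data attached) instead of
// the raw payloads, so the geo lookup is not repeated on retry.
type ProcessedPageview = { sessionId: string; country: string };
type RetryEntry = { rows: ProcessedPageview[]; retryCount: number };

const MAX_RETRIES = 3;
const retryQueue: RetryEntry[] = [];

function requeueProcessed(rows: ProcessedPageview[], retryCount: number): boolean {
  // Keep the processed rows alongside a retry counter; drop after the cap.
  if (retryCount >= MAX_RETRIES) return false;
  retryQueue.unshift({ rows, retryCount: retryCount + 1 });
  return true;
}
```

The trade-off is holding enriched rows in memory between attempts, which only pays off if the geo lookup is expensive relative to the retry rate.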

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba9a3a7 and f79a29f.

📒 Files selected for processing (1)
  • server/src/services/tracker/pageviewQueue.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build Client Images (ubuntu-24.04-arm, linux/arm64)
  • GitHub Check: Build Client Images (ubuntu-latest, linux/amd64)

Comment on lines +127 to +129
this.logger.error(error, `Error processing pageview queue, re-adding ${batch.length} items to the start of the queue`);
// Re-queue the failed batch so it can be retried on the next interval
this.queue.unshift(...batch);
@coderabbitai coderabbitai bot Nov 18, 2025

⚠️ Potential issue | 🟠 Major

Infinite retry loop and potential memory exhaustion.

The retry mechanism lacks safeguards for persistent failures:

  1. Unbounded retries: If ClickHouse insertion fails repeatedly (e.g., network issues, schema mismatch, or corrupt data in the batch), the same batch will retry indefinitely every 10 seconds.
  2. Memory leak: During prolonged ClickHouse outages, new pageviews continuously arrive while failed batches are re-queued, causing the queue to grow without bounds.
  3. No failure isolation: A single corrupt item in the batch causes the entire batch to fail and retry forever.

Consider implementing:

  • Retry limit: Track retry attempts per batch (e.g., max 3 retries) and move permanently failed batches to a dead-letter queue or log them for investigation.
  • Exponential backoff: Delay subsequent retries to reduce pressure on ClickHouse (e.g., 10s, 30s, 60s).
  • Queue size limit: Implement a maximum queue size with alerting when approaching capacity.
  • Item-level error handling: Consider catching per-item validation errors if ClickHouse supports partial batch inserts.

Example approach for retry tracking:

type TotalPayloadWithRetry = TotalPayload & {
  retryCount?: number;
};

// In the error handler:
const batchWithRetry = (batch as TotalPayloadWithRetry[]).map(item => ({
  ...item,
  retryCount: (item.retryCount ?? 0) + 1,
}));

const itemsToRequeue = batchWithRetry.filter(item => item.retryCount < 3);
const failedItems = batchWithRetry.filter(item => item.retryCount >= 3);

if (itemsToRequeue.length > 0) {
  this.queue.unshift(...itemsToRequeue);
}
if (failedItems.length > 0) {
  this.logger.error({ count: failedItems.length }, "Dropping items after max retries");
  // Optionally: write to a dead-letter queue/file for investigation
}
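The exponential-backoff suggestion above could be sketched separately. The delay schedule (10s, 30s, 60s) and the helper names are illustrative only:

```typescript
// Skip the insert attempt until a per-queue retry deadline has passed,
// growing the delay on each consecutive failure.
const BACKOFF_MS = [10_000, 30_000, 60_000];

function nextRetryDelay(retryCount: number): number {
  // Clamp to the last step so repeated failures stay at the max delay.
  const idx = Math.min(retryCount, BACKOFF_MS.length - 1);
  return BACKOFF_MS[idx];
}

function shouldAttempt(nowMs: number, nextRetryAtMs: number): boolean {
  return nowMs >= nextRetryAtMs;
}
```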

@rockinrimmer (Contributor, Author)

@goldflag I can add handling so we don't retry forever if you think it's worth it?

@coderabbitai coderabbitai bot
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
