docs: update jobs.md with socket mode architecture and speed improvements #69763
base: docs-architecture-for-speed
…ents

- Add documentation for Socket Mode (Bookkeeper) vs Legacy Mode (Orchestrator)
- Explain architecture selection logic and conditions
- Document 4-10x performance improvements from parallel processing
- Add state management details for socket mode (partition_id, state ordering)
- Clarify dual state emission to destination and bookkeeper
- Update middleware container descriptions for accuracy

Co-Authored-By: [email protected] <[email protected]>
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
markdownlint
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| * Handles miscellaneous side effects (logging, auth token refresh flows, etc.) |
| There are two types of middleware containers: | ||
| * The Container Orchestrator | ||
| * The Connector Sidecar | ||
| * The Container Orchestrator (legacy mode) |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| There are two types of middleware containers: | ||
| * The Container Orchestrator | ||
| * The Connector Sidecar | ||
| * The Container Orchestrator (legacy mode) |
[markdownlint] reported by reviewdog 🐶
MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "* The Container Orchestrator (..."]
| * The Container Orchestrator | ||
| * The Connector Sidecar | ||
| * The Container Orchestrator (legacy mode) | ||
| * The Bookkeeper (socket mode) |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| * The Connector Sidecar | ||
| * The Container Orchestrator (legacy mode) | ||
| * The Bookkeeper (socket mode) | ||
| * The Connector Sidecar (for CHECK, DISCOVER, SPEC operations) |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| Socket mode is Airbyte's high-performance architecture that enables 4-10x faster data movement compared to legacy mode. In this mode, data flows directly from source to destination via Unix domain sockets, while control messages (logs, state, statistics) flow through the Bookkeeper via standard I/O. | ||
|
|
||
| **Architecture:** | ||
| ``` |
[markdownlint] reported by reviewdog 🐶
MD031/blanks-around-fences Fenced code blocks should be surrounded by blank lines [Context: "```"]
|
|
||
| **Legacy mode is used when:** | ||
| * Any of the above conditions are not met | ||
| * The `ForceRunStdioMode` feature flag is enabled |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| **Legacy mode is used when:** | ||
| * Any of the above conditions are not met | ||
| * The `ForceRunStdioMode` feature flag is enabled | ||
| * IPC options are missing or incompatible |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| **State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails. | ||
|
|
||
| **Dual State Emission:** In socket mode, state messages are sent to both: | ||
| * The destination via socket (for record count verification and ordering) |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| **State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails. | ||
|
|
||
| **Dual State Emission:** In socket mode, state messages are sent to both: | ||
| * The destination via socket (for record count verification and ordering) |
[markdownlint] reported by reviewdog 🐶
MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "* The destination via socket (..."]
|
|
||
| **Dual State Emission:** In socket mode, state messages are sent to both: | ||
| * The destination via socket (for record count verification and ordering) | ||
| * The Bookkeeper via STDIO (for persistence and platform tracking) |
[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]
| ## Airbyte Middleware and Bookkeeping Containers | ||
|
|
||
| Inside any connector operation pod, a special airbyte controlled container will run alongside the connector container(s) to process and interpret the results as well as perform necessary side effects. | ||
| Inside any connector operation pod, a special Airbyte-controlled container runs alongside the connector container(s) to process and interpret results and perform necessary side effects. |
🚫 [vale] reported by reviewdog 🐶
[Google.OptionalPlurals] Don't use plurals in parentheses such as in 'container(s)'.
| **Bookkeeper Responsibilities:** | ||
| * Processes control messages from source and destination via STDIO | ||
| * Persists state messages and statistics | ||
| * Handles heartbeating and job lifecycle management |
🚫 [vale] reported by reviewdog 🐶
[Vale.Spelling] Did you really mean 'heartbeating'?
| * **Lower Latency**: Eliminates STDIO buffering delays | ||
| * **Higher Throughput**: Direct socket communication reduces overhead | ||
|
|
||
| **Socket Count:** The number of sockets is determined by `min(source_cpu_limit, destination_cpu_limit) * 2`, allowing parallel data transfer. For example, connectors with 4 CPU limits will use 8 sockets. |
[Google.Colons] ': T' should be in lowercase.
| * **Lower Latency**: Eliminates STDIO buffering delays | ||
| * **Higher Throughput**: Direct socket communication reduces overhead | ||
|
|
||
| **Socket Count:** The number of sockets is determined by `min(source_cpu_limit, destination_cpu_limit) * 2`, allowing parallel data transfer. For example, connectors with 4 CPU limits will use 8 sockets. |
[Google.Will] Avoid using 'will'.
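The socket count formula quoted in this comment can be checked with a few lines of Python. This is only an illustrative sketch: the function name `socket_count` and the use of whole-number CPU limits are assumptions made for the example, not the platform's actual implementation.

```python
def socket_count(source_cpu_limit: int, destination_cpu_limit: int) -> int:
    """Return min(source_cpu_limit, destination_cpu_limit) * 2.

    Hypothetical helper: the formula comes from the documentation under
    review; treating CPU limits as whole numbers is an assumption.
    """
    if source_cpu_limit < 1 or destination_cpu_limit < 1:
        raise ValueError("CPU limits must be at least 1 in this sketch")
    # The smaller of the two limits bounds the parallelism; each CPU gets two sockets.
    return min(source_cpu_limit, destination_cpu_limit) * 2


if __name__ == "__main__":
    print(socket_count(4, 4))  # both connectors limited to 4 CPUs
    print(socket_count(2, 4))  # the smaller limit decides
```

With both limits set to 4, the function returns 8, matching the example in the documentation text.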
|
|
||
| **Container Orchestrator Responsibilities:** | ||
| * Sits between source and destination connector containers | ||
| * Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress |
🚫 [vale] reported by reviewdog 🐶
[Vale.Spelling] Did you really mean 'middleware'?
| 7. Compatible serialization format exists (PROTOBUF preferred, JSONL fallback) | ||
|
|
||
| **Legacy mode is used when:** | ||
| * Any of the above conditions are not met |
[Google.WordList] Use 'preceding' instead of 'above'.
|
|
||
| Socket mode introduces enhanced state management to support parallel processing and ensure data consistency: | ||
|
|
||
| **Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state. |
[Google.Colons] ': E' should be in lowercase.
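To make the partition identifier paragraph concrete, here is a minimal sketch of how a destination might check that all records for a partition have arrived before committing its state. The class `PartitionTracker` and the idea that a state message carries an `expected_records` count are assumptions for illustration; the documentation only says that `partition_id` links records to their checkpoint state.

```python
from collections import Counter


class PartitionTracker:
    """Counts records per partition_id (hypothetical helper, not Airbyte code)."""

    def __init__(self) -> None:
        self._received = Counter()

    def on_record(self, partition_id: str) -> None:
        # Records may arrive on any socket; only the partition_id matters here.
        self._received[partition_id] += 1

    def is_complete(self, partition_id: str, expected_records: int) -> bool:
        # ASSUMPTION: the state message tells us how many records to expect.
        return self._received[partition_id] == expected_records


if __name__ == "__main__":
    tracker = PartitionTracker()
    tracker.on_record("a1b2c3")
    tracker.on_record("a1b2c3")
    print(tracker.is_complete("a1b2c3", expected_records=3))
    tracker.on_record("a1b2c3")
    print(tracker.is_complete("a1b2c3", expected_records=3))
```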
|
|
||
| **Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state. | ||
|
|
||
| **State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails. |
[Google.Colons] ': S' should be in lowercase.
|
|
||
| **Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state. | ||
|
|
||
| **State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails. |
🚫 [vale] reported by reviewdog 🐶
[Vale.Spelling] Did you really mean 'resumability'?
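The state ordering behavior described in the quoted paragraph can be sketched as a small buffer that holds out-of-order states until every earlier `id` has been seen. The class name `OrderedStateCommitter`, the starting id of 1, and the shape of the payload are assumptions for this example, not details confirmed by the documentation.

```python
from typing import Dict, List


class OrderedStateCommitter:
    """Commits states strictly by incrementing id (hypothetical sketch)."""

    def __init__(self, first_id: int = 1) -> None:
        # ASSUMPTION: ids start at 1 and increase by 1 with no gaps.
        self._next_id = first_id
        self._pending: Dict[int, dict] = {}
        self.committed: List[int] = []

    def on_state(self, state_id: int, payload: dict) -> List[int]:
        """Buffer a state and return the ids that became committable."""
        self._pending[state_id] = payload
        newly_committed: List[int] = []
        # Drain the buffer only while the next expected id is available.
        while self._next_id in self._pending:
            self._pending.pop(self._next_id)
            self.committed.append(self._next_id)
            newly_committed.append(self._next_id)
            self._next_id += 1
        return newly_committed


if __name__ == "__main__":
    committer = OrderedStateCommitter()
    print(committer.on_state(2, {"cursor": "b"}))  # held back until id 1 arrives
    print(committer.on_state(1, {"cursor": "a"}))  # releases ids 1 and 2 together
```

Because a state is never committed ahead of an earlier one, a failed sync can resume from the last committed id without skipping data.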
|
|
||
| **State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails. | ||
|
|
||
| **Dual State Emission:** In socket mode, state messages are sent to both: |
[Google.Colons] ': I' should be in lowercase.
| * The Container Orchestrator (legacy mode) | ||
| * The Bookkeeper (socket mode) | ||
| * The Connector Sidecar (for CHECK, DISCOVER, SPEC operations) |
[markdownlint-fix] reported by reviewdog 🐶
| * The Container Orchestrator (legacy mode) | |
| * The Bookkeeper (socket mode) | |
| * The Connector Sidecar (for CHECK, DISCOVER, SPEC operations) | |
| - The Container Orchestrator (legacy mode) | |
| - The Bookkeeper (socket mode) | |
| - The Connector Sidecar (for CHECK, DISCOVER, SPEC operations) |
| Socket mode is Airbyte's high-performance architecture that enables 4-10x faster data movement compared to legacy mode. In this mode, data flows directly from source to destination via Unix domain sockets, while control messages (logs, state, statistics) flow through the Bookkeeper via standard I/O. | ||
|
|
||
| **Architecture:** | ||
| ``` |
[markdownlint-fix] reported by reviewdog 🐶
| ``` | |
| ``` |
| * Processes control messages from source and destination via STDIO | ||
| * Persists state messages and statistics | ||
| * Handles heartbeating and job lifecycle management | ||
| * Lightweight resource footprint (1 CPU, 1024Mi memory) |
[markdownlint-fix] reported by reviewdog 🐶
| * Processes control messages from source and destination via STDIO | |
| * Persists state messages and statistics | |
| * Handles heartbeating and job lifecycle management | |
| * Lightweight resource footprint (1 CPU, 1024Mi memory) | |
| - Processes control messages from source and destination via STDIO | |
| - Persists state messages and statistics | |
| - Handles heartbeating and job lifecycle management | |
| - Lightweight resource footprint (1 CPU, 1024Mi memory) |
| * **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams | ||
| * **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety | ||
| * **Lower Latency**: Eliminates STDIO buffering delays | ||
| * **Higher Throughput**: Direct socket communication reduces overhead |
[markdownlint-fix] reported by reviewdog 🐶
| * **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams | |
| * **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety | |
| * **Lower Latency**: Eliminates STDIO buffering delays | |
| * **Higher Throughput**: Direct socket communication reduces overhead | |
| - **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams | |
| - **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety | |
| - **Lower Latency**: Eliminates STDIO buffering delays | |
| - **Higher Throughput**: Direct socket communication reduces overhead |
| Legacy mode uses the traditional STDIO-based architecture where all data flows through the Container Orchestrator. | ||
|
|
||
| **Architecture:** | ||
| ``` |
[markdownlint-fix] reported by reviewdog 🐶
| ``` | |
| ``` |
| * Sits between source and destination connector containers | ||
| * Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress | ||
| * Interprets and records connector operation results | ||
| * Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. ) | ||
| * Handles miscellaneous side effects (logging, auth token refresh flows, etc.) |
[markdownlint-fix] reported by reviewdog 🐶
| * Sits between source and destination connector containers | |
| * Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress | |
| * Interprets and records connector operation results | |
| * Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. ) | |
| * Handles miscellaneous side effects (logging, auth token refresh flows, etc.) | |
| - Sits between source and destination connector containers | |
| - Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress | |
| - Interprets and records connector operation results | |
| - Handles miscellaneous side effects (logging, auth token refresh flows, etc.) |
| The platform automatically determines which mode to use based on several factors: | ||
|
|
||
| **Socket mode is used when ALL conditions are met:** | ||
| 1. Not a file transfer operation |
[markdownlint-fix] reported by reviewdog 🐶
| 1. Not a file transfer operation | |
| 1. Not a file transfer operation |
| * Any of the above conditions are not met | ||
| * The `ForceRunStdioMode` feature flag is enabled | ||
| * IPC options are missing or incompatible |
[markdownlint-fix] reported by reviewdog 🐶
| * Any of the above conditions are not met | |
| * The `ForceRunStdioMode` feature flag is enabled | |
| * IPC options are missing or incompatible | |
| - Any of the above conditions are not met | |
| - The `ForceRunStdioMode` feature flag is enabled | |
| - IPC options are missing or incompatible |
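The selection rules quoted in these comments can be sketched as a small decision function. Only the conditions visible in this thread are modeled (file transfer, the `ForceRunStdioMode` flag, IPC options, and the PROTOBUF-then-JSONL serialization preference); the remaining numbered conditions are not shown here and are deliberately left out. All names in the snippet (`SyncContext`, `pick_serialization`, `select_mode`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SyncContext:
    """Hypothetical inputs to the mode decision; not the platform's real types."""
    is_file_transfer: bool
    force_run_stdio_mode: bool
    # None models "IPC options are missing" for that connector.
    source_formats: Optional[Sequence[str]]
    destination_formats: Optional[Sequence[str]]


def pick_serialization(ctx: SyncContext) -> Optional[str]:
    """Return the shared format, preferring PROTOBUF over JSONL, or None."""
    if not ctx.source_formats or not ctx.destination_formats:
        return None
    common = set(ctx.source_formats) & set(ctx.destination_formats)
    for fmt in ("PROTOBUF", "JSONL"):
        if fmt in common:
            return fmt
    return None


def select_mode(ctx: SyncContext) -> str:
    """Return "socket" only if every modeled condition holds, else "legacy"."""
    if ctx.force_run_stdio_mode or ctx.is_file_transfer:
        return "legacy"
    if pick_serialization(ctx) is None:
        return "legacy"  # IPC options missing or incompatible
    return "socket"


if __name__ == "__main__":
    ctx = SyncContext(False, False, ["PROTOBUF", "JSONL"], ["JSONL"])
    print(select_mode(ctx), pick_serialization(ctx))
```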
| * The destination via socket (for record count verification and ordering) | ||
| * The Bookkeeper via STDIO (for persistence and platform tracking) |
[markdownlint-fix] reported by reviewdog 🐶
| * The destination via socket (for record count verification and ordering) | |
| * The Bookkeeper via STDIO (for persistence and platform tracking) | |
| - The destination via socket (for record count verification and ordering) | |
| - The Bookkeeper via STDIO (for persistence and platform tracking) |
| * Interprets and records connector operation results | ||
| * Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. ) | ||
| * Handles miscellaneous side effects (logging, auth token refresh flows, etc.) | ||
|
|
[markdownlint-fix] reported by reviewdog 🐶
| * Interprets and records connector operation results | |
| * Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. ) | |
| * Handles miscellaneous side effects (logging, auth token refresh flows, etc.) | |
| - Interprets and records connector operation results | |
| - Handles miscellaneous side effects (logging, auth token refresh flows, etc.) |
|
Deploy preview for airbyte-docs ready! ✅ Preview Built with commit 8bb9b52. |
This PR targets the following PR:
What
Updates the jobs.md documentation to accurately reflect Airbyte's recent speed improvements and the introduction of socket mode architecture. The previous documentation described the Container Orchestrator as sitting "between" source and destination connectors, which is inaccurate for socket mode where data flows directly between connectors.
This work was requested by [email protected] and can be reviewed in the Devin session.
How
Review guide
Technical Accuracy - Verify the architecture descriptions match the actual implementation:
Scope Appropriateness - Confirm the level of detail is appropriate for a jobs/workloads documentation page:
Terminology Consistency - Check that terms align with usage elsewhere:
Completeness - Verify nothing critical is missing about how speed improvements affect job execution
User Impact
Users reading the jobs.md documentation will now have accurate information about:
This is documentation-only; no functional changes to the platform.
Can this PR be safely reverted and rolled back?
This is a documentation-only change with no code modifications.
Note: The documentation linters (Vale and markdownlint) were not available in the development environment, so this PR may contain style issues that CI or manual review should catch.