@devin-ai-integration
Contributor

This PR targets the following PR:

  • airbytehq/airbyte#[docs-architecture-for-speed branch]

What

Updates the jobs.md documentation to accurately reflect Airbyte's recent speed improvements and the introduction of socket mode architecture. The previous documentation described the Container Orchestrator as sitting "between" source and destination connectors, which is inaccurate for socket mode where data flows directly between connectors.

This work was requested by [email protected] and can be reviewed in the Devin session.

How

  • Restructured the "Airbyte Middleware and Bookkeeping Containers" section to distinguish between Socket Mode (Bookkeeper) and Legacy Mode (Container Orchestrator)
  • Added detailed architecture diagrams showing data flow in both modes
  • Documented the architecture selection logic and conditions that determine which mode is used
  • Added a new section on state management in socket mode, explaining partition identifiers, state ordering, and dual state emission
  • Updated terminology throughout to be more precise (e.g., "Airbyte-controlled" instead of "airbyte controlled")
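
As a rough illustration of the state ordering item in the list above: the new section describes state messages as carrying an incrementing `id`, which the destination uses to commit states in sequence even when they arrive out of order across sockets. A minimal Python sketch of that idea follows. The class name `OrderedStateCommitter`, the starting id of 1, and the step of 1 are assumptions made for illustration and are not taken from the Airbyte codebase.

```python
from typing import Any, Dict, List


class OrderedStateCommitter:
    """Buffers out-of-order states and commits them strictly by id (hypothetical helper)."""

    def __init__(self, first_id: int = 1) -> None:
        # Assumption: ids start at `first_id` and increase by exactly 1.
        self._next_id = first_id
        self._pending: Dict[int, Any] = {}
        self.committed: List[Any] = []

    def receive(self, state_id: int, state: Any) -> None:
        """Accept a state from any socket, then commit every state that is now in sequence."""
        self._pending[state_id] = state
        while self._next_id in self._pending:
            self.committed.append(self._pending.pop(self._next_id))
            self._next_id += 1


committer = OrderedStateCommitter()
committer.receive(2, {"cursor": "b"})  # arrives early, so it is held back
committer.receive(1, {"cursor": "a"})  # fills the gap, so both commit in order
print(committer.committed)
```

Holding back a state until all lower ids have been committed is what keeps the last committed state a safe point to resume from if the sync fails.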

Review guide

  1. Technical Accuracy - Verify the architecture descriptions match the actual implementation:

    • Socket mode architecture diagram and data flow
    • Architecture selection conditions (lines 74-86)
    • State management details (lines 88-100)
    • Socket count calculation formula
  2. Scope Appropriateness - Confirm the level of detail is appropriate for a jobs/workloads documentation page:

    • Does it focus on job execution architecture rather than connector implementation details?
    • Is any information better suited for other documentation pages?
  3. Terminology Consistency - Check that terms align with usage elsewhere:

    • "Socket Mode" vs "Bookkeeper" vs "BOOKKEEPER"
    • "Legacy Mode" vs "Container Orchestrator" vs "ORCHESTRATOR"
    • partition_id, state id, dual state emission
  4. Completeness - Verify nothing critical is missing about how speed improvements affect job execution

User Impact

Users reading the jobs.md documentation will now have accurate information about:

  • How Airbyte achieves 4-10x speed improvements through socket mode
  • Why some connections use socket mode while others use legacy mode
  • How state management works in parallel processing scenarios

This is documentation-only; no functional changes to the platform.
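
To make the second point above concrete, the selection rules quoted in this conversation can be sketched roughly as follows: socket mode requires that the sync is not a file transfer and that a compatible serialization format exists (PROTOBUF preferred, JSONL as fallback), while the `ForceRunStdioMode` feature flag forces legacy mode. Only the conditions visible in this thread are modeled. The function and parameter names are hypothetical, and the platform checks more conditions than shown here.

```python
from typing import Iterable, Optional, Tuple

# Preference order as described in the PR: PROTOBUF first, JSONL as fallback.
PREFERRED_FORMATS = ("PROTOBUF", "JSONL")


def pick_format(source_formats: Iterable[str], destination_formats: Iterable[str]) -> Optional[str]:
    """Return the most preferred format both connectors support, or None."""
    common = set(source_formats) & set(destination_formats)
    for fmt in PREFERRED_FORMATS:
        if fmt in common:
            return fmt
    return None


def choose_mode(
    is_file_transfer: bool,
    force_stdio: bool,
    source_formats: Iterable[str],
    destination_formats: Iterable[str],
) -> Tuple[str, Optional[str]]:
    """Partial sketch: returns ("socket", format) or ("legacy", None)."""
    if is_file_transfer or force_stdio:
        return ("legacy", None)
    fmt = pick_format(source_formats, destination_formats)
    if fmt is None:
        # No shared serialization format, so fall back to legacy mode.
        return ("legacy", None)
    return ("socket", fmt)


print(choose_mode(False, False, ["PROTOBUF", "JSONL"], ["JSONL"]))
```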

Can this PR be safely reverted and rolled back?

  • YES 💚

This is a documentation-only change with no code modifications.


Note: Documentation linters (Vale and markdownlint) were not available in the development environment, so this PR may have style issues that CI or manual review should catch.

…ents

- Add documentation for Socket Mode (Bookkeeper) vs Legacy Mode (Orchestrator)
- Explain architecture selection logic and conditions
- Document 4-10x performance improvements from parallel processing
- Add state management details for socket mode (partition_id, state ordering)
- Clarify dual state emission to destination and bookkeeper
- Update middleware container descriptions for accuracy

Co-Authored-By: [email protected] <[email protected]>
@devin-ai-integration
Contributor Author

Original prompt from [email protected]
@Devin Let's continue working on this branch <https://github.com/airbytehq/airbyte/tree/docs-architecture-for-speed> to improve the accuracy of our public documentation in regards to Airbyte's recent speed improvements. The section I'd like to target is /platform/next/understanding-airbyte/jobs.md.

I'm going to share some resources with you that will provide context on what has changed since this documentation was originally written.

Your task:

1. Evaluate what parts of this documentation are inaccurate or insufficient. Use the context I provide, airbyte platform source code, and airbyte connector source code to develop your understanding.
2. Propose a documentation update for this topic by branching off my branch and creating a PR back into it.
3. Try to remain on topic and limit your discussion only to things that affect workloads and jobs. Assume other updates will happen on other pages separately.
Thread URL: https://airbytehq-team.slack.com/archives/D08FX8EC9L0/p1763602804805489

@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Helpful Resources

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
    • You can specify a custom changelog by passing changelog. Example: /bump-version changelog="My cool update"
    • Leaving the changelog arg blank will auto-populate the changelog from the PR title.
  • /run-cat-tests - Runs legacy CAT tests (Connector Acceptance Tests)
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
    • /bump-bulk-cdk-version bump=patch changelog='foo' - Bump the Bulk CDK's version. bump can be major/minor/patch.
  • Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.


@github-actions github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

markdownlint

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)

There are two types of middleware containers:
* The Container Orchestrator
* The Connector Sidecar
* The Container Orchestrator (legacy mode)

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

There are two types of middleware containers:
* The Container Orchestrator
* The Connector Sidecar
* The Container Orchestrator (legacy mode)

[markdownlint] reported by reviewdog 🐶
MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "* The Container Orchestrator (..."]

* The Container Orchestrator
* The Connector Sidecar
* The Container Orchestrator (legacy mode)
* The Bookkeeper (socket mode)

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

* The Connector Sidecar
* The Container Orchestrator (legacy mode)
* The Bookkeeper (socket mode)
* The Connector Sidecar (for CHECK, DISCOVER, SPEC operations)

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

Socket mode is Airbyte's high-performance architecture that enables 4-10x faster data movement compared to legacy mode. In this mode, data flows directly from source to destination via Unix domain sockets, while control messages (logs, state, statistics) flow through the Bookkeeper via standard I/O.

**Architecture:**
```

[markdownlint] reported by reviewdog 🐶
MD031/blanks-around-fences Fenced code blocks should be surrounded by blank lines [Context: "```"]


**Legacy mode is used when:**
* Any of the above conditions are not met
* The `ForceRunStdioMode` feature flag is enabled

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

**Legacy mode is used when:**
* Any of the above conditions are not met
* The `ForceRunStdioMode` feature flag is enabled
* IPC options are missing or incompatible

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

**State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails.

**Dual State Emission:** In socket mode, state messages are sent to both:
* The destination via socket (for record count verification and ordering)

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

**State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails.

**Dual State Emission:** In socket mode, state messages are sent to both:
* The destination via socket (for record count verification and ordering)

[markdownlint] reported by reviewdog 🐶
MD032/blanks-around-lists Lists should be surrounded by blank lines [Context: "* The destination via socket (..."]


**Dual State Emission:** In socket mode, state messages are sent to both:
* The destination via socket (for record count verification and ordering)
* The Bookkeeper via STDIO (for persistence and platform tracking)

[markdownlint] reported by reviewdog 🐶
MD004/ul-style Unordered list style [Expected: dash; Actual: asterisk]

## Airbyte Middleware and Bookkeeping Containers

Inside any connector operation pod, a special airbyte controlled container will run alongside the connector container(s) to process and interpret the results as well as perform necessary side effects.
Inside any connector operation pod, a special Airbyte-controlled container runs alongside the connector container(s) to process and interpret results and perform necessary side effects.

🚫 [vale] reported by reviewdog 🐶
[Google.OptionalPlurals] Don't use plurals in parentheses such as in 'container(s)'.

**Bookkeeper Responsibilities:**
* Processes control messages from source and destination via STDIO
* Persists state messages and statistics
* Handles heartbeating and job lifecycle management

🚫 [vale] reported by reviewdog 🐶
[Vale.Spelling] Did you really mean 'heartbeating'?

* **Lower Latency**: Eliminates STDIO buffering delays
* **Higher Throughput**: Direct socket communication reduces overhead

**Socket Count:** The number of sockets is determined by `min(source_cpu_limit, destination_cpu_limit) * 2`, allowing parallel data transfer. For example, connectors with 4 CPU limits will use 8 sockets.

⚠️ [vale] reported by reviewdog 🐶
[Google.Colons] ': T' should be in lowercase.

* **Lower Latency**: Eliminates STDIO buffering delays
* **Higher Throughput**: Direct socket communication reduces overhead

**Socket Count:** The number of sockets is determined by `min(source_cpu_limit, destination_cpu_limit) * 2`, allowing parallel data transfer. For example, connectors with 4 CPU limits will use 8 sockets.

⚠️ [vale] reported by reviewdog 🐶
[Google.Will] Avoid using 'will'.


**Container Orchestrator Responsibilities:**
* Sits between source and destination connector containers
* Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress

🚫 [vale] reported by reviewdog 🐶
[Vale.Spelling] Did you really mean 'middleware'?

7. Compatible serialization format exists (PROTOBUF preferred, JSONL fallback)

**Legacy mode is used when:**
* Any of the above conditions are not met

⚠️ [vale] reported by reviewdog 🐶
[Google.WordList] Use 'preceding' instead of 'above'.


Socket mode introduces enhanced state management to support parallel processing and ensure data consistency:

**Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state.

⚠️ [vale] reported by reviewdog 🐶
[Google.Colons] ': E' should be in lowercase.


**Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state.

**State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails.

⚠️ [vale] reported by reviewdog 🐶
[Google.Colons] ': S' should be in lowercase.


**Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state.

**State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails.

🚫 [vale] reported by reviewdog 🐶
[Vale.Spelling] Did you really mean 'resumability'?


**State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails.

**Dual State Emission:** In socket mode, state messages are sent to both:

⚠️ [vale] reported by reviewdog 🐶
[Google.Colons] ': I' should be in lowercase.

Comment on lines +21 to +23
* The Container Orchestrator (legacy mode)
* The Bookkeeper (socket mode)
* The Connector Sidecar (for CHECK, DISCOVER, SPEC operations)

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* The Container Orchestrator (legacy mode)
* The Bookkeeper (socket mode)
* The Connector Sidecar (for CHECK, DISCOVER, SPEC operations)
- The Container Orchestrator (legacy mode)
- The Bookkeeper (socket mode)
- The Connector Sidecar (for CHECK, DISCOVER, SPEC operations)

Socket mode is Airbyte's high-performance architecture that enables 4-10x faster data movement compared to legacy mode. In this mode, data flows directly from source to destination via Unix domain sockets, while control messages (logs, state, statistics) flow through the Bookkeeper via standard I/O.

**Architecture:**
```

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
```
```

Comment on lines +42 to +45
* Processes control messages from source and destination via STDIO
* Persists state messages and statistics
* Handles heartbeating and job lifecycle management
* Lightweight resource footprint (1 CPU, 1024Mi memory)

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* Processes control messages from source and destination via STDIO
* Persists state messages and statistics
* Handles heartbeating and job lifecycle management
* Lightweight resource footprint (1 CPU, 1024Mi memory)
- Processes control messages from source and destination via STDIO
- Persists state messages and statistics
- Handles heartbeating and job lifecycle management
- Lightweight resource footprint (1 CPU, 1024Mi memory)

Comment on lines +48 to +51
* **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams
* **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety
* **Lower Latency**: Eliminates STDIO buffering delays
* **Higher Throughput**: Direct socket communication reduces overhead

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams
* **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety
* **Lower Latency**: Eliminates STDIO buffering delays
* **Higher Throughput**: Direct socket communication reduces overhead
- **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams
- **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety
- **Lower Latency**: Eliminates STDIO buffering delays
- **Higher Throughput**: Direct socket communication reduces overhead
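
For a sense of scale behind the parallel processing bullet, the documentation in this PR gives the socket count as `min(source_cpu_limit, destination_cpu_limit) * 2`. A minimal sketch of that arithmetic follows; the function name `socket_count` is hypothetical, and CPU limits are assumed to be whole numbers.

```python
def socket_count(source_cpu_limit: int, destination_cpu_limit: int) -> int:
    """Number of sockets per the formula quoted in this PR (illustrative only)."""
    return min(source_cpu_limit, destination_cpu_limit) * 2


# Both connectors limited to 4 CPUs, matching the example in the documentation.
print(socket_count(4, 4))  # 8
# The smaller limit wins when the two connectors differ.
print(socket_count(2, 4))  # 4
```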

Legacy mode uses the traditional STDIO-based architecture where all data flows through the Container Orchestrator.

**Architecture:**
```

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
```
```

Comment on lines +65 to +68
* Sits between source and destination connector containers
* Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress
* Interprets and records connector operation results
* Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. )
* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* Sits between source and destination connector containers
* Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress
* Interprets and records connector operation results
* Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. )
* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)
- Sits between source and destination connector containers
- Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress
- Interprets and records connector operation results
- Handles miscellaneous side effects (logging, auth token refresh flows, etc.)

The platform automatically determines which mode to use based on several factors:

**Socket mode is used when ALL conditions are met:**
1. Not a file transfer operation

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
1. Not a file transfer operation
1. Not a file transfer operation

Comment on lines +84 to +86
* Any of the above conditions are not met
* The `ForceRunStdioMode` feature flag is enabled
* IPC options are missing or incompatible

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* Any of the above conditions are not met
* The `ForceRunStdioMode` feature flag is enabled
* IPC options are missing or incompatible
- Any of the above conditions are not met
- The `ForceRunStdioMode` feature flag is enabled
- IPC options are missing or incompatible

Comment on lines +97 to +98
* The destination via socket (for record count verification and ordering)
* The Bookkeeper via STDIO (for persistence and platform tracking)

[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* The destination via socket (for record count verification and ordering)
* The Bookkeeper via STDIO (for persistence and platform tracking)
- The destination via socket (for record count verification and ordering)
- The Bookkeeper via STDIO (for persistence and platform tracking)

Comment on lines 107 to 109
* Interprets and records connector operation results
* Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. )
* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)


[markdownlint-fix] reported by reviewdog 🐶

Suggested change
* Interprets and records connector operation results
* Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. )
* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)
- Interprets and records connector operation results
- Handles miscellaneous side effects (logging, auth token refresh flows, etc.)

@github-actions
Contributor

Deploy preview for airbyte-docs ready!

✅ Preview
https://airbyte-docs-bh86gqhq7-airbyte-growth.vercel.app

Built with commit 8bb9b52.
This pull request is being automatically deployed with vercel-action
