Commit 8bb9b52

docs: update jobs.md with socket mode architecture and speed improvements
- Add documentation for Socket Mode (Bookkeeper) vs Legacy Mode (Orchestrator)
- Explain architecture selection logic and conditions
- Document 4-10x performance improvements from parallel processing
- Add state management details for socket mode (partition_id, state ordering)
- Clarify dual state emission to destination and bookkeeper
- Update middleware container descriptions for accuracy

Co-Authored-By: [email protected] <[email protected]>
1 parent 2ff8a77 commit 8bb9b52

1 file changed

+81
-12
lines changed
  • docs/platform/understanding-airbyte

docs/platform/understanding-airbyte/jobs.md

Lines changed: 81 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -15,28 +15,97 @@ Generally, there are 2 types of workload pods:
1515

1616
## Airbyte Middleware and Bookkeeping Containers
1717

18-
Inside any connector operation pod, a special airbyte controlled container will run alongside the connector container(s) to process and interpret the results as well as perform necessary side effects.
18+
Inside any connector operation pod, a special Airbyte-controlled container runs alongside the connector container(s) to process and interpret results and perform necessary side effects.
1919

2020
There are two types of middleware containers:
21-
* The Container Orchestrator
22-
* The Connector Sidecar
21+
* The Container Orchestrator (legacy mode)
22+
* The Bookkeeper (socket mode)
23+
* The Connector Sidecar (for CHECK, DISCOVER, SPEC operations)
2324

24-
#### Container Orchestrator
25+
### Replication Architecture Modes
2526

26-
An airbyte controlled container that sits between the source and destination connector containers inside a Replication Pod.
27+
Airbyte supports two architecture modes for replication (sync) jobs, with the platform automatically selecting the optimal mode based on connector capabilities and connection configuration.
2728

28-
Responsibilities:
29-
* Hosts middleware capabilities such as scrubbing PPI, aggregating stats, transforming data, and checkpointing progress.
29+
#### Socket Mode (Bookkeeper)
30+
31+
Socket mode is Airbyte's high-performance architecture that enables 4-10x faster data movement compared to legacy mode. In this mode, data flows directly from source to destination via Unix domain sockets, while control messages (logs, state, statistics) flow through the Bookkeeper via standard I/O.
32+
33+
**Architecture:**
34+
```
35+
Source ────→ Unix Socket Files → Destination (direct data transfer)
36+
  │                                    │
37+
  └─────────→ Bookkeeper ←───────────┘
38+
              (control messages, state, logs via STDIO)
39+
```
40+
41+
**Bookkeeper Responsibilities:**
42+
* Processes control messages from source and destination via STDIO
43+
* Persists state messages and statistics
44+
* Handles heartbeating and job lifecycle management
45+
* Runs with a lightweight resource footprint (1 CPU, 1024Mi memory)
46+
47+
**Performance Benefits:**
48+
* **Parallel Processing**: Multiple Unix domain sockets enable concurrent data streams
49+
* **Binary Serialization**: Protocol Buffers provide efficient data encoding and strong type safety
50+
* **Lower Latency**: Eliminates STDIO buffering delays
51+
* **Higher Throughput**: Direct socket communication reduces overhead
52+
53+
**Socket Count:** The number of sockets is determined by `min(source_cpu_limit, destination_cpu_limit) * 2`, allowing parallel data transfer. For example, when the source and destination each have a CPU limit of 4, the sync uses 8 sockets.
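
The socket count rule can be sketched as below. This is a minimal illustration, not the platform's code: the function name `socket_count` is hypothetical, and it assumes CPU limits are given as whole numbers.

```python
def socket_count(source_cpu_limit: int, destination_cpu_limit: int) -> int:
    """Hypothetical helper: min(source_cpu_limit, destination_cpu_limit) * 2.

    Assumes whole-number CPU limits; how the platform treats fractional
    limits is not covered here.
    """
    return min(source_cpu_limit, destination_cpu_limit) * 2


# Both connectors limited to 4 CPUs, as in the example above.
print(socket_count(4, 4))
# The smaller limit determines the count when the limits differ.
print(socket_count(2, 4))
```

Because the smaller of the two limits is used, raising the CPU limit on only one connector does not increase the number of sockets.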
54+
55+
#### Legacy Mode (Container Orchestrator)
56+
57+
Legacy mode uses the traditional STDIO-based architecture where all data flows through the Container Orchestrator.
58+
59+
**Architecture:**
60+
```
61+
Source → STDIO → Container Orchestrator → STDIO → Destination
62+
```
63+
64+
**Container Orchestrator Responsibilities:**
65+
* Sits between source and destination connector containers
66+
* Hosts middleware capabilities such as scrubbing PII, aggregating stats, transforming data, and checkpointing progress
3067
* Interprets and records connector operation results
31-
* Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. )
68+
* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)
69+
70+
#### Architecture Selection
71+
72+
The platform automatically determines which mode to use based on several factors:
73+
74+
**Socket mode is used when ALL conditions are met:**
75+
1. Not a file transfer operation
76+
2. Not a reset operation
77+
3. Both source and destination declare IPC capabilities in their metadata
78+
4. No hashed fields or mappers configured in the connection
79+
5. Matching data channel versions between source and destination
80+
6. Both connectors support socket transport
81+
7. Compatible serialization format exists (PROTOBUF preferred, JSONL fallback)
82+
83+
**Legacy mode is used when:**
84+
* Any of the above conditions is not met
85+
* The `ForceRunStdioMode` feature flag is enabled
86+
* IPC options are missing or incompatible
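
The selection rules above can be read as a single all-or-nothing check. The sketch below is only an illustration under stated assumptions: the class names, field names, and the `"socket"` / `"legacy"` return values are hypothetical and do not reflect the platform's actual types. Only the conditions themselves come from the lists above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConnectorIpc:
    """Hypothetical view of the IPC options a connector declares."""
    declares_ipc: bool
    supports_socket: bool
    data_channel_version: str
    formats: FrozenSet[str]


@dataclass(frozen=True)
class SyncContext:
    """Hypothetical per-sync inputs to the decision."""
    is_file_transfer: bool = False
    is_reset: bool = False
    has_hashed_fields_or_mappers: bool = False
    force_run_stdio_mode: bool = False  # stands in for the ForceRunStdioMode flag


def pick_format(source: ConnectorIpc, destination: ConnectorIpc) -> Optional[str]:
    """Prefer PROTOBUF, fall back to JSONL, else no compatible format."""
    common = source.formats & destination.formats
    for candidate in ("PROTOBUF", "JSONL"):
        if candidate in common:
            return candidate
    return None


def select_mode(ctx: SyncContext, source: ConnectorIpc, destination: ConnectorIpc) -> str:
    """Return "socket" only when every condition holds, otherwise "legacy"."""
    if ctx.force_run_stdio_mode:
        return "legacy"
    conditions = [
        not ctx.is_file_transfer,
        not ctx.is_reset,
        source.declares_ipc and destination.declares_ipc,
        not ctx.has_hashed_fields_or_mappers,
        source.data_channel_version == destination.data_channel_version,
        source.supports_socket and destination.supports_socket,
        pick_format(source, destination) is not None,
    ]
    return "socket" if all(conditions) else "legacy"


src = ConnectorIpc(True, True, "1", frozenset({"PROTOBUF", "JSONL"}))
dst = ConnectorIpc(True, True, "1", frozenset({"JSONL"}))
print(select_mode(SyncContext(), src, dst), pick_format(src, dst))
print(select_mode(SyncContext(is_reset=True), src, dst))
```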
87+
88+
#### State Management in Socket Mode
89+
90+
Socket mode introduces enhanced state management to support parallel processing and ensure data consistency:
91+
92+
**Partition Identifiers:** Each record and state message includes a `partition_id` (a random alphanumeric string) that links records to their corresponding checkpoint state. This enables the destination to verify that all records from a partition have been received before committing the state.
93+
94+
**State Ordering:** State messages include an incrementing `id` field to maintain proper ordering. Since states can arrive on any socket in any order due to parallel processing, the destination uses these IDs to commit states in the correct sequence, ensuring resumability if a sync fails.
95+
96+
**Dual State Emission:** In socket mode, state messages are sent to both:
97+
* The destination via socket (for record count verification and ordering)
98+
* The Bookkeeper via STDIO (for persistence and platform tracking)
99+
100+
This dual emission ensures both the destination and platform maintain consistent state information throughout the sync.
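
To make the ordering and partition checks concrete, here is a minimal sketch of how a destination might buffer states that arrive out of order and commit them in `id` order. It is an assumption-laden illustration, not Airbyte's implementation: the `StateCommitter` class, the `expected_records` field, and the choice to start ids at 1 with a step of 1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StateMessage:
    id: int                # incrementing ordering id
    partition_id: str      # links the state to its records
    expected_records: int  # hypothetical: record count the state covers


class StateCommitter:
    """Commits states strictly in id order, once their records have arrived."""

    def __init__(self, first_id: int = 1) -> None:
        self._next_id = first_id                  # assumed starting id
        self._pending: Dict[int, StateMessage] = {}
        self._received: Dict[str, int] = {}       # partition_id -> records seen
        self.committed: List[int] = []

    def on_record(self, partition_id: str) -> None:
        self._received[partition_id] = self._received.get(partition_id, 0) + 1
        self._drain()

    def on_state(self, state: StateMessage) -> None:
        self._pending[state.id] = state
        self._drain()

    def _drain(self) -> None:
        # Commit consecutive ids only; stop at the first gap or incomplete partition.
        while self._next_id in self._pending:
            state = self._pending[self._next_id]
            if self._received.get(state.partition_id, 0) < state.expected_records:
                break
            del self._pending[self._next_id]
            self.committed.append(state.id)
            self._next_id += 1


committer = StateCommitter()
committer.on_record("a")
committer.on_state(StateMessage(id=2, partition_id="b", expected_records=1))  # arrives early
committer.on_state(StateMessage(id=1, partition_id="a", expected_records=1))
committer.on_record("b")
print(committer.committed)
```

State 2 is held back twice in this run: first because state 1 has not arrived yet, and then because the record for partition `b` is still missing.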
32101

33-
#### Connector Sidecar
102+
### Connector Sidecar
34103

35-
An airbyte controlled container that reads the output of a connector container inside a Connector Pod (CHECK, DISCOVER, SPEC).
104+
An Airbyte-controlled container that reads the output of a connector container inside a Connector Pod for non-replication operations (CHECK, DISCOVER, SPEC).
36105

37-
Responsibilities:
106+
**Responsibilities:**
38107
* Interprets and records connector operation results
39-
* Handles miscellaneous side effects (e.g. logging, auth token refresh flows, etc. )
108+
* Handles miscellaneous side effects (logging, auth token refresh flows, etc.)
40109

41110

42111
## Workload launching architecture
