1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
**/.DS_Store
pinot/volumes/controller/data/segments/
## Overview

Spark, Storm, Flink, and Samza all have the following three common parts:
- A component that your streaming application is submitted to; this is similar to how Hadoop MapReduce works: your application is sent to a node in the cluster that executes it.
- Separate nodes in the cluster execute your streaming algorithms.
- Data sources are the input to the streaming algorithms.
### A Generalized Architecture

![](../../images/Generalized_Streaming_architecture.jpg)
#### Application driver
- With some streaming systems, this will be the client code that defines your streaming program and communicates with the streaming manager.
- For example, with Spark Streaming your client code is broken into two logical pieces: the driver and the streaming algorithm(s) or job. The driver submits the job to the streaming manager, may collect results at the end, and controls the lifetime of your job.
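To make the driver/job split concrete, here is a minimal sketch using Spark Structured Streaming (the socket source, host, and port are illustrative assumptions): the driver code defines the streaming computation, starts it, and controls its lifetime, while the cluster executes it.

```python
# Minimal driver sketch using Spark Structured Streaming.
# The socket source, host, and port are assumptions for the example.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# The driver: defines the streaming job and controls its lifetime.
spark = SparkSession.builder.appName("word-count-driver").getOrCreate()

# The streaming algorithm: read lines from a socket and count words.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Starting the query submits the job; the cluster's stream processors run it.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()  # the driver controls the lifetime of the job
```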
#### Streaming manager
- The streaming manager has the general responsibility of getting your streaming job to the stream processor(s); in some cases it will control or request the resources required by the stream processors.
#### Stream processor
- This is where your job runs. The purpose is to execute the job that was submitted.
#### Data source(s)
- This represents the input and potentially the output data from your streaming job. With some platforms your job may be able to ingest data from multiple sources in a single job, whereas others may only allow ingestion from a single source. 
- One thing that may not be obvious from the architectures is where the output of the jobs goes. In some cases you may want to collect the data in your driver; in others you may want to write it out to a different data source to be used by another system or as input for another job.
## Processing Semantics

There are three categories of processing guarantees, each with its own set of tradeoffs:
- At-most-once—A message may get lost, but it will never be processed a second time.
- At-least-once—A message will never be lost, but it may be processed more than once.
- Exactly-once—A message is never lost and will be processed only once.

#### At-most-once
- At-most-once is the simplest delivery guarantee a system can offer; no special logic is required anywhere.
- In essence, if a message gets dropped, a stream processor crashes, or the machine that a stream processor is running on fails, the message is lost. This is essentially a ***fire-and-forget*** model.
#### At-least-once
- At-least-once increases the complexity because the streaming system must keep track of every message that was sent to the stream processor and an acknowledgment that it was received. 
- If the streaming manager determines that the message wasn’t processed (perhaps it was lost or the stream processor didn’t respond within a given time boundary), then it will be re-sent. 
- At this level of messaging guarantee, ***your streaming job may be sent the same message multiple times***. ***Therefore your streaming job must be idempotent,*** meaning that every time it receives the same message, it produces the same result. If you keep this in mind when designing your streaming jobs, you'll be able to handle duplicate messages, as sketched below.
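As a minimal sketch of such idempotent handling (plain Python; the message shape with a unique `id` field and the in-memory stores are assumptions), the job remembers which message IDs it has already applied, so redelivery leaves the result unchanged:

```python
# Sketch: idempotent processing under at-least-once delivery.
# Assumes each message carries a unique "id"; the dict and set stand in
# for a durable key-value store.

processed_ids = set()   # IDs of messages we have already applied
totals = {}             # e.g., running totals per user

def handle(message: dict) -> None:
    """Apply a message's effect exactly once, even if it is redelivered."""
    if message["id"] in processed_ids:
        return  # duplicate delivery: ignore, the result is unchanged
    totals[message["user"]] = totals.get(message["user"], 0) + message["amount"]
    processed_ids.add(message["id"])

# Redelivering the same message leaves the result unchanged.
handle({"id": "m-1", "user": "alice", "amount": 10})
handle({"id": "m-1", "user": "alice", "amount": 10})  # duplicate
assert totals == {"alice": 10}
```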

#### Exactly-once
- Exactly-once semantics ratchets up the complexity a little more for the stream processing framework. Besides the bookkeeping that it must keep for all messages that have been sent, now it must also detect and ignore duplicates.
-  With this level of guarantee your streaming job no longer has to worry about dealing with duplicate messages—it only has to make sure it responds with a success or failure after a message is processed.

### Which guarantee to choose

You may be wondering which of these guarantees you need; it depends on the business problem you’re trying to solve.

#### Case 1: Turbine engine monitoring system

Say we want to constantly analyze how our turbine engine is performing so we can predict when a failure may occur and preemptively perform maintenance. Turbines produce approximately 1 TB of data every hour—keep in mind that’s one turbine, and we’re monitoring thousands to be able to predict when a failure may occur. 

Do we need to ensure we don’t lose a single message? We may, but it would be worth investigating whether our prediction algorithm needs all the data. If it can perform adequately with data missing, then I’d choose the least complex guarantee first and work from there.

#### Case 2: Financial Transactions

What if instead your business problem involved making a financial transaction based on a streaming query? Perhaps you operate an ad network and you provide real-time billing to your clients. In this case you'd want to ensure that the streaming system you choose provides exactly-once semantics.

## State Management

Once your streaming analysis algorithm becomes more complicated than operating on the current message alone, with no dependencies on previous messages or external data, you'll need to maintain state and will likely need the state management services provided by your framework of choice.

![](../../images/state_management_1.jpg)


From the figure, it should be obvious where you need to keep state: right in the stream processor where your job performs the counting by user ID.
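In its simplest form, that state is nothing more than a counter per user ID, as in this minimal in-memory sketch (plain Python; the event field name is an assumption):

```python
# Sketch: the state the counting job must keep, in its simplest
# (in-memory, non-fault-tolerant) form. The event field name is assumed.
from collections import defaultdict

counts_by_user = defaultdict(int)   # the job's state

def on_event(event: dict) -> None:
    counts_by_user[event["user_id"]] += 1

for event in [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]:
    on_event(event)

print(dict(counts_by_user))   # {'u1': 2, 'u2': 1}
```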

The state management facilities provided by various systems naturally fall along a complexity continuum:

![](../../images/state_management_2.jpg)

The continuum starts on the left with a naïve in-memory-only choice, and progresses to the other end of the spectrum with systems that provide ***a queryable persistent state*** that's replicated.

For example, Spark Structured Streaming can use ***RocksDB*** as its state store.
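For instance, in Spark Structured Streaming (3.2 and later) the RocksDB-backed state store provider can be enabled through configuration; the session below is only an illustrative sketch.

```python
from pyspark.sql import SparkSession

# Enable the RocksDB-backed state store provider (available in Spark 3.2+).
spark = (
    SparkSession.builder
    .appName("stateful-job")
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)
```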

With frameworks at the latter end of the continuum, you can join together different streams of data. Imagine you were running an ad-serving business and you wanted to track two things: the ad impression and the ad click. It's reasonable that collecting this data would result in two streams, one for ad impressions and one for ad clicks. The figure below shows how these streams and your streaming job would be set up to handle this.

![](../../images/state_management_3.jpg)
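To see what this stateful join amounts to, here is a minimal sketch in plain Python (in-memory state, assumed field names such as `ad_id`); a real framework would keep the pending impressions in its managed, persistent state store.

```python
# Sketch: joining an impression stream with a click stream on ad_id.
# The state (pending_impressions) lives in the stream processor; field
# names and event shapes are assumptions for the example.

pending_impressions = {}   # ad_id -> impression event (the job's state)
joined = []                # output: (impression, click) pairs

def on_impression(event: dict) -> None:
    pending_impressions[event["ad_id"]] = event

def on_click(event: dict) -> None:
    impression = pending_impressions.pop(event["ad_id"], None)
    if impression is not None:
        joined.append((impression, event))   # matched: emit joined record
    # else: click with no known impression; a real job might buffer or drop it

on_impression({"ad_id": "ad-42", "ts": 1})
on_click({"ad_id": "ad-42", "ts": 5})
print(joined)
```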


## Fault Tolerance

A stream-processing framework’s ability to keep going in the face of failures is a direct result of its fault-tolerance capabilities. When you consider all the pieces involved in stream processing, there are quite a few places where it can fail.

![](../../images/fault_tolerance.jpg)

If we take the list and consolidate it down to the common elements, we end up with the following:
- **Data loss**—This covers data lost on the network as well as the stream processor or your job crashing and losing data that was in memory at the time of the crash.
- **Loss of resource management**—This covers the streaming manager and your application driver, in the event you have one.

### Fault Tolerance mechanisms

When it comes to stream-processing frameworks, all the common techniques for dealing with failures involve some variant of replication and coordination. 

In general, there are two common approaches a streaming system may take toward replication and coordination:
1. State-machine
2. Rollback recovery

#### State Machine 
- The first approach used by stream-processing systems is known as state-machine.
- With this approach the stream manager replicates the streaming job on independent nodes and coordinates the replicas by sending the same input in the same order to all.
- In Spark, for example, a related mechanism is ***speculative execution***, which launches redundant copies of slow-running tasks.

#### Rollback recovery ( Checkpointing )
- The second approach is known as rollback recovery.
- In this approach the stream processor periodically packages the state of the computation into what is called a checkpoint, which it copies to a different stream processor node or to a nonvolatile location such as a disk.
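A minimal sketch of the idea (plain Python, checkpointing to a local file with pickle; the path and checkpoint interval are assumptions): the job periodically snapshots its state, and on restart it rolls back to the most recent snapshot.

```python
# Sketch: rollback recovery via periodic checkpoints of the job's state.
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"   # assumed local path; real systems copy
                                     # to another node or durable storage

def save_checkpoint(state: dict) -> None:
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)    # roll back to the last snapshot
    return {}                        # no checkpoint yet: start fresh

state = restore_checkpoint()
for i, event in enumerate([{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]):
    state[event["user"]] = state.get(event["user"], 0) + 1
    if i % 2 == 1:                   # checkpoint every 2 events (illustrative)
        save_checkpoint(state)
```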
## Overview

* The idea behind stream processing is that rather than saving data to storage and processing the data in batches afterward, ***events are processed in flight***
* Unlike batch processing in which there is a defined beginning and end of a batch, in stream processing, the processor is long-lived and operates constantly on small batches of events as they arrive
* Stream processing is often used to look at data and make immediate decisions on how to transform, aggregate, filter, or alert on data
* Real-time processing can be useful in:
- model scoring
- complex event processing
- data enrichment, or many other types of processing.
* These types of patterns apply in a variety of domains for which the ***time-value*** of the data is high, meaning that the data is most valuable soon after creation and then diminishes thereafter. Many types of fraud detection, healthcare analytics, and IoT use cases fit this pattern.
* A major factor limiting adoption of real-time processing is the challenging demands it places on data storage
### The Continuous Query Model

- In non-streaming systems (a traditional RDBMS, Hadoop, HBase, Cassandra, and so on), ***the data is at rest, and we query it for answers***. In a traditional DBMS the ***query is sent to the data*** and executed, and the result is returned to the application.
- In a streaming system, this model is completely changed—***the data is sent “through” the query,*** and the result is sent to an application.
- In a streaming system a user (or application) registers a query that’s executed every time data arrives or at a predetermined time interval
- The result of the query is then pushed to the client. This model is called the ***continuous query model***, meaning the query is constantly being evaluated as new data arrives
- The ***data is being pulled or pushed through our system in a never-ending stream.***
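A minimal sketch of this model in plain Python (the predicate/callback shape is an illustrative assumption): a client registers a standing query once, and the system re-evaluates it for every event that flows through, pushing results back to the client.

```python
# Sketch: the continuous query model. A client registers a query once,
# and the system re-evaluates it as each new event flows through.

registered_queries = []    # (predicate, callback) pairs

def register_query(predicate, callback):
    """The client registers a standing query; results are pushed to the callback."""
    registered_queries.append((predicate, callback))

def on_new_event(event: dict) -> None:
    """Called for every event in the never-ending stream."""
    for predicate, callback in registered_queries:
        if predicate(event):
            callback(event)            # push the result to the client

# Client: "notify me of every temperature reading above 30 degrees".
register_query(lambda e: e.get("temp", 0) > 30,
               lambda e: print("alert:", e))

for event in [{"temp": 21}, {"temp": 35}, {"temp": 42}]:
    on_new_event(event)
```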

>***Streaming systems invert the traditional data management model by assuming users to be passive and the data management system to be active.***


![](../../images/dbms_vs_streaming_systems.jpg)
### Stream Table Duality

![](https://cdn.confluent.io/wp-content/uploads/kafka-streams-and-tables.gif)
#### Stream
- **An _event stream_ records the history of what has happened in the world as a sequence of events**, so we know which event happened before another and can infer causality.
- A **stream** provides immutable data. It supports only inserting (appending) new events, whereas existing events cannot be changed.
- You could consider a stream to be like a table in a relational database (RDBMS) that has no unique key constraint and that is append only.
#### Table
- Compared to an event stream, **a _table_ represents the state of the world** at a particular point in time, typically “now.”
- A **table** provides mutable data. New events—rows—can be inserted, and existing rows can be updated and deleted.

![](../../images/stream_table_duality_1.jpg)
### Convergence of Stream and Table

![](../../images/stream_table_duality_2.jpg)

There is a close relationship between a stream and a table. We call this the [stream-table duality](https://www.confluent.io/blog/streams-tables-two-sides-same-coin/). What this means is that we can convert one into the other:
1. **Streams → tables**: We can turn a stream into a table by aggregating the stream with operations such as COUNT() or SUM(), for example.
2. **Tables → streams**:
	- We can turn a table into a stream by capturing the changes made to the table—inserts, updates, and deletes—into a “change stream.” This process is often called [change data capture](https://en.wikipedia.org/wiki/Change_data_capture), or CDC for short.
	- In fact, a table is fully defined by its underlying change stream. In major relational databases such as Oracle or MySQL, these change streams already exist under names such as the ***redo log*** or ***binlog***.
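As a minimal sketch of both directions (plain Python, with assumed event shapes): a stream of clicks is aggregated into a table of counts, and a table's change stream is replayed to reconstruct the table.

```python
# Sketch of the stream-table duality with plain Python data structures.

# Stream -> table: aggregate an event stream into a table of counts per key.
clicks = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
table = {}
for event in clicks:
    table[event["user"]] = table.get(event["user"], 0) + 1
# table == {"alice": 2, "bob": 1}

# Table -> stream: every mutation of the table is captured as a change event,
# and replaying that change stream reconstructs the table exactly (CDC).
change_stream = [
    ("upsert", "alice", 1),
    ("upsert", "bob", 1),
    ("upsert", "alice", 2),
    ("delete", "bob", None),
]
rebuilt = {}
for op, key, value in change_stream:
    if op == "upsert":
        rebuilt[key] = value
    else:
        rebuilt.pop(key, None)
# rebuilt == {"alice": 2}
```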

![](https://cdn.confluent.io/wp-content/uploads/event-stream-1.gif)
## References

https://www.confluent.io/blog/kafka-streams-tables-part-1-event-streaming/
## Overview
#### Classification of Real-time systems

What constitutes a real-time system? Real-time systems are classified as 
- Hard
- Soft
- Near

For near real-time: “Denoting or relating to a data-processing system that is slightly slower than real-time.”

![](../../images/Classification_of_real-time_systems.jpg)

You can identify hard real-time systems fairly easily. They are almost always found in embedded systems and have very strict time requirements that, if missed, may result in total system failure. The distinction between soft and near real-time is blurry, at times disappears entirely, is very subjective, and often depends on the consumer of the data.

![](../../images/generic_rt_system_with_consumer.jpg)

It can be stated as follows:

***Is the process of receiving data all the way to the point where it is ready for consumption a soft or near real-time process?***
#### Differences between real-time and streaming systems

***A system may be labeled soft or near real-time based on the perceived delay experienced by consumers.*** 

In many scenarios, the computation part of the system is operating in a non-hard real-time fashion, but the clients may not be consuming the data in real time due to network delays, application design, or a client application that isn’t even running. 

Put another way, what we have is a non-hard real-time service with clients that consume data when they need it. This is called a ***streaming data system—a non-hard real-time system that makes its data available at the moment a client application needs it.*** It’s neither soft nor near—it is streaming

![](../../images/generic_streaming_system.jpg)
## Overview

In a streaming system, the data is constantly flowing. It may be out of order or delayed by the time we see it, and we can’t query all the data at once, because the stream never ends.
### Domains of time

To speak cogently about unbounded data processing requires a clear understanding of the domains of time involved.  Within any data processing system, there are typically two domains of time that we care about:
#### Event time
- This is the time at which events actually occurred. 
- Not all use cases care about event times, but many do.
- Examples of use cases that do include characterizing user behavior over time, most billing applications, and many types of anomaly detection, to name a few.
#### Processing time
- This is the time at which events are observed in the system

Imagine we are collecting data from a fitness-tracking device such as an Apple Watch, and the data is flowing into our streaming system.
- Processing time (stream time) would be when the fitness event enters the analysis engine.
- Event time would be when it takes place on the device. 
### Time Skew
In an ideal world, event time and processing time would always be equal, with events being processed immediately as they occur.  Reality is not so kind, however, and the skew between event time and processing time is not only non-zero, but often a highly variable function of the characteristics of the underlying input sources, execution engine, and hardware. 

Things that can affect the level of skew include the following:
- Shared resource limitations, like network congestion, network partitions, or shared CPU in a non-dedicated environment
- Software causes such as distributed system logic, contention, and so on
- Features of the data themselves, like key distribution, variance in throughput, or variance in disorder (i.e., a plane full of people taking their phones out of airplane mode after having used them offline for the entire flight)

![](../../images/time_skew.jpg)
### Windowing

- The important takeaway regarding lag/skew is this:  Because the ***overall mapping between event time and processing time is not static*** (i.e., the lag/skew can vary arbitrarily over time), this means that you cannot analyze your data solely within the context of when they are observed by your pipeline if you care about their event times (i.e., when the events actually occurred).
- To cope with the infinite nature of unbounded datasets, these systems typically provide some notion of windowing the incoming data.  It essentially means chopping up a dataset into finite pieces along temporal boundaries.
- If you care about correctness and are interested in analyzing your data in the context of their event times, you cannot define those temporal boundaries using processing time (i.e., processing-time windowing)
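As a minimal sketch (plain Python; the field names, fixed 60-second tumbling windows, and timestamps are assumptions), events are bucketed by their event time, so even a late arrival lands in the window in which it actually occurred.

```python
# Sketch: event-time tumbling windows. Events are bucketed by the time they
# occurred (event_time), not by when the pipeline observed them.
from collections import defaultdict

WINDOW_SECONDS = 60        # assumed tumbling-window size

def window_start(event_time: float) -> float:
    return event_time - (event_time % WINDOW_SECONDS)

windows = defaultdict(int)   # window start -> event count

# Note the late event: observed long after it occurred, yet it still lands
# in the window of its event time.
events = [
    {"event_time": 10, "processing_time": 12},
    {"event_time": 70, "processing_time": 75},
    {"event_time": 55, "processing_time": 130},   # late arrival (skew)
]
for e in events:
    windows[window_start(e["event_time"])] += 1

print(dict(windows))   # {0: 2, 60: 1}
```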