Data Streaming — Understanding Tumbling, Sliding, and Session Windows in Apache Flink

Dorian Machado
3 min readJun 25, 2024

--

Apache Flink is a powerful stream processing framework that enables real-time data processing at scale. One of the critical aspects of stream processing is windowing, which allows you to group and process data streams in fixed-size chunks, overlapping intervals, or based on session activity.

In this article, we’ll delve into three primary types of windows in Flink: Tumbling, Sliding, and Session windows, and provide examples of how to implement them using PyFlink.

Tumbling Windows

Definition

Tumbling windows are fixed-size, non-overlapping windows that cover the data stream without any gaps. Each event belongs to exactly one tumbling window.

Use Case

Tumbling windows are suitable for scenarios where you need to aggregate data over a fixed period, such as calculating the sum of sales every minute.

PyFlink Example

from pyflink.table import StreamTableEnvironment, EnvironmentSettings
from pyflink.table.window import Tumble
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table.descriptors import Schema, Rowtime, FileSystem, OldCsv

# Set up the environment
env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define source schema
table_env.connect(FileSystem().path('/path/to/input'))
.with_format(OldCsv().field_delimiter(',').derive_schema())
.with_schema(Schema()
.field('user_id', 'BIGINT')
.field('amount', 'DOUBLE')
.field('timestamp', 'TIMESTAMP')
.rowtime(Rowtime().timestamps_from_field('timestamp').watermarks_period('1 SECOND')))
.create_temporary_table('source')

# Define the table
table = table_env.from_path('source')

# Define a tumbling window
tumbling_window = table.window(Tumble.over('10.minutes').on('timestamp').alias('w'))

# Perform aggregation
result = table.window(tumbling_window) \
.group_by('w, user_id') \
.select('user_id, w.end as window_end, amount.sum as total_amount')

# Write the result to sink
table_env.connect(FileSystem().path('/path/to/output'))
.with_format(OldCsv().field_delimiter(',').derive_schema())
.with_schema(Schema()
.field('user_id', 'BIGINT')
.field('window_end', 'TIMESTAMP')
.field('total_amount', 'DOUBLE'))
.create_temporary_table('sink')

result.execute_insert('sink')

Sliding Windows

Definition

Sliding windows overlap and allow each event to belong to multiple windows. They are defined by a window size and a slide interval.

Use Case

Sliding windows are useful when you need overlapping aggregations, such as computing the average temperature every 5 minutes with a 1-minute slide interval.

PyFlink Example

from pyflink.table.window import Slide

# Define a sliding window
sliding_window = table.window(Slide.over('10.minutes').every('1.minute').on('timestamp').alias('w'))

# Perform aggregation
result = table.window(sliding_window) \
.group_by('w, user_id') \
.select('user_id, w.end as window_end, amount.sum as total_amount')

# Write the result to sink
result.execute_insert('sink')

Session Windows

Definition

Session windows group events that are separated by a specified gap. If no events occur for a specified duration, the window is closed.

Use Case

Session windows are ideal for tracking user sessions on a website, where the session ends if there is inactivity for a certain period.

PyFlink Example

from pyflink.table.window import Session

# Define a session window with a gap of 5 minutes
session_window = table.window(Session.with_gap('5.minutes').on('timestamp').alias('w'))

# Perform aggregation
result = table.window(session_window) \
.group_by('w, user_id') \
.select('user_id, w.end as window_end, amount.sum as total_amount')

# Write the result to sink
result.execute_insert('sink')

Conclusion

Understanding and correctly implementing windowing strategies is crucial for effective stream processing. Tumbling windows offer fixed, non-overlapping intervals; sliding windows provide overlapping intervals; and session windows dynamically adjust based on activity.

By leveraging PyFlink, you can efficiently implement these windowing strategies to meet various real-time data processing requirements.

--

--