Real-Time Data Pipeline Selection Guide
Singdata Lakehouse provides three types of real-time data processing objects: Pipe (continuous ingestion), Table Stream (change data capture), and Dynamic Table (incremental computation). They solve different problems and are often used together, but choosing the wrong one will lead you down unproductive paths.
This article answers two questions: which object suits which scenario, and how to choose for common business requirements.
Positioning of the Three Object Types
| Object | Problem It Solves | Data Flow Direction | Typical Usage |
|---|---|---|---|
| Pipe | Continuous flow of external data into Lakehouse | External -> Table | Kafka message ingestion, OSS file loading |
| Table Stream | Perceiving data changes within a table (CDC) | Table -> Downstream processing | Change sync, incremental ETL driver |
| Dynamic Table | Automatically maintaining real-time processing results | Table -> Processed result table | ODS->DWD->DWS incremental processing chain |
These three are not in competition; they are different stages of a pipeline:
Selection Decision Tree
First question: Where does the data come from?
- From Kafka or object storage (OSS/S3/COS) → Use Pipe
- Already in a Lakehouse table → See next question
Second question: What do you want to do?
- Process and transform table data, keeping results up-to-date in real time → Use Dynamic Table
- Need to perceive row-level changes (INSERT/UPDATE/DELETE) in a table to drive downstream processing → Use Table Stream
- Just a one-time or low-frequency import → Use COPY INTO, no Pipe needed
Third question (for Pipe scenarios): Trigger method?
- Real-time trigger when files are uploaded (minute-level latency) → EVENT_NOTIFICATION mode (requires message queue configuration)
- No real-time requirement; periodic scanning with original file deletion is acceptable → LIST_PURGE mode (simple configuration)
Pipe: Suitable for Continuous Ingestion, Not Batch Import
Scenarios Suitable for Pipe
- Kafka Topic data continuously writing to Lakehouse tables
- OSS/S3 files being continuously uploaded with automatic loading needed
- Data sources produce data continuously, requiring the system to automatically maintain read positions
Scenarios NOT Suitable for Pipe
- One-time historical data import → Use
COPY INTOdirectly - Low-frequency (once daily) batch sync → Scheduling
COPY INTOtasks is simpler - Strictly ordered data processing is required → Pipe does not guarantee load order
Pipe vs COPY INTO Comparison
| Pipe | COPY INTO (Scheduled) | |
|---|---|---|
| Suitable Scenario | Data flows continuously; new files auto-processed | Low-frequency batch import, one-time migration |
| Trigger Method | File event-driven or periodic scanning | Manual or external scheduling system trigger |
| Read Position Management | System auto-maintains (load_history) | Caller must manage on their own |
| File Deduplication | Auto dedup (same path + same filename = imported once) | No built-in dedup; repeated execution = repeated import |
| Ops Complexity | Auto-runs after creation, no intervention needed | Requires schedule config, execution monitoring |
| Compute Resources | Requires VCluster online | Requires VCluster online |
File Deduplication Behavior: Pipe records imported files via load_history (retained for 7 days). Files with the same path and filename are imported only once. If you need to re-import a file, you must manually execute COPY INTO.
File Size Recommendations:
- gzip-compressed files: recommended within 50MB
- CSV / Parquet uncompressed: recommended 128MB to 256MB
Files that are too small increase scheduling overhead; files that are too large affect single-batch loading time.
Pipe: Two Mode Choices
| EVENT_NOTIFICATION | LIST_PURGE | |
|---|---|---|
| Latency | Minute-level (event-triggered) | Depends on scan interval |
| Configuration Complexity | Requires message queue config (Alibaba Cloud MNS / AWS SQS) | Simple, no extra config needed |
| Original File Handling | Preserves original files | Deletes original files after import |
| Suitable Scenario | Latency-sensitive, files need to be preserved | Low latency requirement, file deletion acceptable |
| Cloud Provider Support | Alibaba Cloud OSS, AWS S3 | All object storage |
Table Stream: Suitable for Perceiving Changes, Not Direct Queries
Scenarios Suitable for Table Stream
- Need to capture INSERT / UPDATE / DELETE changes on a source table and sync to a target system
- Incremental ETL: only process data that is "new/changed since last processing," avoiding full scans
- Driving SCD (Slowly Changing Dimension) maintenance
- Listening for OSS file arrival events (Volume Stream)
Scenarios NOT Suitable for Table Stream
- Only need to query current data → Query the table directly, no Stream needed
- Only need to capture new data, not updates or deletes → Use APPEND_ONLY mode, better performance
- Need real-time data processing with maintained result updates → Use Dynamic Table
STANDARD vs APPEND_ONLY?
| STANDARD | APPEND_ONLY | |
|---|---|---|
| Captured Operations | INSERT, UPDATE, DELETE | INSERT only |
| UPDATE Representation | UPDATE_BEFORE + UPDATE_AFTER (two rows) | Not recorded (only original insert value) |
| Suitable Scenario | Need to handle updates and deletes (e.g., MERGE sync) | Append-only log data, pursue performance |
| Performance | Higher overhead (tracking all changes) | Lighter weight |
Offset Advancement Rules (Important)
Table Stream offsets advance only after a DML transaction containing the Stream commits successfully.
SELECT * FROM stream→ Does not advance offset; data remains available next queryINSERT INTO target SELECT ... FROM stream→ Offset advances after transaction commit; data is consumed- Transaction rollback → No advancement; data remains
This means: if you only SELECT to view Stream data, the offset does not change; you must consume data through DML operations that include the Stream to advance the offset.
Multi-Consumer Pattern
One Stream can only be fully consumed by one consumer. When Task A consumes the Stream via DML, the offset advances, and Task B querying the same Stream will no longer see that batch of changes.
If multiple downstream tasks all need to consume changes from the same table, create a separate Stream for each consumer:
Each Stream independently maintains its own offset, without interfering with others. Streams only store offsets, not a copy of the data, so the additional cost of creating multiple Streams is very low.
Data Retention Period and Stream Invalidation
Table Stream relies on the source table's historical version data (Time Travel) to return change records. If a Stream is not consumed for a long time and the source table's historical versions are cleaned up (exceeding DATA_RETENTION_DAYS, default 7 days), the Stream will become unavailable.
Daily recommendation: the Stream consumption frequency should be much shorter than the source table's DATA_RETENTION_DAYS to avoid Stream invalidation due to historical version loss.
Dynamic Table: Suitable for Declarative Incremental Processing, Not High-Real-Time Requirements
Scenarios Suitable for Dynamic Table
- Multi-layer ETL processing chains: ODS -> DWD -> DWS
- Query performance optimization: materialize complex computation results, avoiding recalculation on each query
- Near-real-time analysis with fixed dimensions, tolerating minute-level latency
Scenarios NOT Suitable for Dynamic Table
- Sub-second real-time requirements → Current minimum refresh interval is 1 minute
- SQL with extensive ORDER BY → Incremental refresh is limited; degrades to full refresh
- Extensive Outer Joins with frequently changing right-side tables → Low incremental computation efficiency
- Need to directly modify data (UPDATE/DELETE/TRUNCATE) → Dynamic Table does not support this
Refresh Mode Selection
Singdata Dynamic Table supports three refresh scheduling methods:
| Scheduling Method | Suitable Scenario | Characteristics |
|---|---|---|
DDL-defined refresh interval (REFRESH INTERVAL) | Simple scenarios, quick launch | No upstream/downstream dependency support; min interval 1 minute |
| Lakehouse Studio scheduling | Multi-layer DT chains, dependency control needed | Supports task dependencies (A completes then triggers B); monitoring and alerting |
| Third-party scheduling engine | Existing scheduling system, flexible control needed | Time interval not limited, but introduces external dependency |
Key Constraint for Multi-Layer Chains: The upstream DT's refresh frequency determines the minimum latency achievable by the downstream DT. If upstream refreshes every 5 minutes, downstream latency will be at least 5 minutes even if set to 1 minute.
Freshness vs Cost Trade-off
More frequent refreshing means fresher data but higher compute costs. Method for determining a reasonable refresh interval:
- Assess business tolerance for data latency ("Is 5-minute-old data acceptable?")
- Check single refresh duration (
SHOW DYNAMIC TABLE REFRESH HISTORY) - Refresh interval > single refresh duration to avoid task backlog
- If refresh duration approaches the interval, incremental computation may have degraded to full; optimize SQL or extend the interval
Incremental Refresh vs Full Refresh
Dynamic Table automatically selects incremental or full mode:
- Incremental Refresh (INCREMENTAL): Only computes data changed since the last refresh; high efficiency
- Full Refresh (FULL): Recomputes all data; suitable for initial refresh or when incremental conditions are not met
The following situations trigger full refresh or prevent incremental computation:
- SQL contains operators that cannot be incrementally processed (e.g., window functions with ORDER BY)
- First-time
REFRESH DYNAMIC TABLE - Source table changes are too large (approaching full data volume)
Use SHOW DYNAMIC TABLE REFRESH HISTORY to view the refresh_mode field for each refresh, confirming whether incremental mode was successfully used.
Typical Architecture Patterns
Pattern 1: Kafka Real-Time Ingestion + Incremental Processing
Suitable for: log analysis, real-time dashboards, user behavior analysis
End-to-End SQL Example (replace table names and connection names with actual values):
The Pipe auto-runs after creation; no manual trigger is needed. The Dynamic Table auto-refreshes incrementally every minute, processing only events newly added since the last refresh.
Pattern 2: OSS File Auto-Loading + Incremental Processing
Suitable for: IoT device data, log archiving, batch file processing
Pattern 3: Table Stream-Driven Cross-System Sync
Suitable for: syncing Lakehouse internal table changes to external systems, SCD maintenance
End-to-End SQL Example (replace table names with actual values):
The Stream's offset advances automatically after MERGE commits successfully. If MERGE fails (transaction rollback), the offset does not advance, and the next execution will re-consume the same batch of changes, ensuring no data loss.
Pattern 4: Volume Stream Monitoring File Changes
Suitable for: image/document AI processing, incremental file analysis
Related Documents
- Pipe Continuous Ingestion Overview
- Using Pipe for Continuous Object Storage Data Ingestion
- Table Stream Introduction
- Table Stream Best Practices
- Dynamic Table Introduction
- Dynamic Table Refresh Scheduling
- Incremental Computing Overview
- Developing Dynamic Tables for Near-Real-Time Incremental Processing
- Implementing SCD with Streams and Tasks
