This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support 15 billion daily log additions from over 30 business lines.

Such a gigantic log analysis system is part of their cybersecurity management. For real-time monitoring, threat tracing, and alerting, they require a log analytics system that can automatically collect, store, analyze, and visualize logs and event records. From an architectural perspective, the system should be able to perform real-time analysis of logs in various formats and, of course, scale to support the huge and ever-growing data size. The rest of this post is about what their log processing architecture looks like, and how they achieve stable data ingestion, low-cost storage, and fast queries with it.

This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing.

ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them is stored in HDFS for data verification or replay.

DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka.
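To make the ODS-to-DWD step concrete, here is a minimal Flink job sketch that reads raw logs from one Kafka topic, applies a cleaning/de-identification function, and writes the result back to another Kafka topic. The broker address, the topic names ods_raw_logs and dwd_fact_logs, and the masking rule are all hypothetical placeholders; the post does not describe China Unicom's actual cleaning rules or topic layout.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OdsToDwdLogCleaning {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read raw logs from the ODS Kafka topic (names are illustrative).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")           // hypothetical broker address
                .setTopics("ods_raw_logs")                   // hypothetical ODS topic
                .setGroupId("ods-to-dwd-cleaning")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Write cleaned records back to a DWD Kafka topic for the fact tables.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("dwd_fact_logs")           // hypothetical DWD topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "ods-source")
                .map(OdsToDwdLogCleaning::cleanAndDeidentify)
                .sinkTo(sink);

        env.execute("ods-to-dwd-log-cleaning");
    }

    // Stand-in for the real cleaning/standardization/de-identification logic:
    // trims whitespace and masks 11-digit runs that look like phone numbers.
    private static String cleanAndDeidentify(String rawLog) {
        return rawLog.trim().replaceAll("\\d{11}", "***********");
    }
}
```

In a production job this map function would be replaced by the actual standardization, backfilling, and PII-masking rules, but the read-transform-write shape against Kafka is the same.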