This is a whole-journey guide for Apache Doris users, especially those from the financial sector, which requires a high level of data security and availability.
If you don't yet know how to build a real-time data pipeline or how to make the most of Apache Doris's functionalities, start with this post; you will come away with plenty of inspiration.
The user's data sources include MySQL, Oracle, and MongoDB. They had been using Apache Hive as an offline data warehouse but felt the need to add a real-time data processing pipeline.
After introducing Apache Doris, they increased their data ingestion speed by 2~5 times, their ETL performance by 3~12 times, and their query execution speed by 10~15 times.
In this post, you will learn how to integrate Apache Doris into your data architecture, including how to arrange data inside Doris, how to ingest data into it, and how to enable efficient data updates.
Plus, you will learn about the enterprise features that Apache Doris provides to guarantee data security, system stability, and service availability.
Apache Doris provides three data models: Duplicate Key, Aggregate Key, and Unique Key. The main difference between these models lies in whether and how they aggregate data.
ODS - Duplicate Key model: As a payment service provider, the user receives around a million settlement records every day.
Since the settlement cycle can span a whole year, the relevant data needs to be kept intact for a year.
An exception is that some data is prone to constant changes, like order status from retailers.
Such data should be put into the Unique Key model so that a newly updated record with the same retailer ID or order ID always replaces the old one.
DWD & DWS - Unique Key model: Data in the DWD and DWS layers is further abstracted, and it is all put in the Unique Key model so that the settlement data can be updated automatically.
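The split between the two models can be sketched as table DDL. The Python below assembles Doris CREATE TABLE statements for both layers; the table names, columns, and bucket count are hypothetical illustrations, not the user's actual schema.

```python
def build_create_table(name, columns, model, key_cols, bucket_col, buckets):
    """Assemble a Doris CREATE TABLE statement for a given data model.

    model: "DUPLICATE" keeps every ingested row intact (ODS layer);
           "UNIQUE" makes a new row replace the old one sharing the same
           key columns (DWD/DWS layers).
    """
    col_defs = ",\n    ".join(f"{c} {t}" for c, t in columns)
    return (
        f"CREATE TABLE {name} (\n    {col_defs}\n) "
        f"{model} KEY({', '.join(key_cols)})\n"
        f"DISTRIBUTED BY HASH({bucket_col}) BUCKETS {buckets};"
    )

# ODS: settlement records must be kept intact for a year -> Duplicate Key.
ods = build_create_table(
    "ods_settlement",
    [("settle_date", "DATE"), ("order_id", "BIGINT"), ("amount", "DECIMAL(18,2)")],
    "DUPLICATE", ["settle_date", "order_id"], "order_id", 16,
)

# DWD: order status changes constantly -> Unique Key so updates overwrite.
dwd = build_create_table(
    "dwd_order_status",
    [("order_id", "BIGINT"), ("status", "VARCHAR(32)")],
    "UNIQUE", ["order_id"], "order_id", 16,
)

print(ods)
print(dwd)
```

The key design choice is the key column set: in the Unique Key table, `order_id` alone identifies a row, which is what lets a late-arriving status change overwrite the earlier record.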
The key is to set an appropriate number of data partitions and buckets.
They often need to query the dimensional data of different retailers from the retailer flat table, so they specify the retailer ID column as the bucketing field and set the bucket count based on recommended values for various data sizes.
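Such a size-based recommendation can be captured as a simple heuristic. The sketch below assumes the commonly cited Doris guideline that a single tablet (bucket) should hold roughly 1~10 GB of data; the 5 GB target is an illustrative assumption, not the user's actual table or an official figure.

```python
import math

def recommend_buckets(table_size_gb: float) -> int:
    """Rough bucket-count heuristic for a Doris table.

    Assumption: aim for ~5 GB of data per bucket, within the general
    guideline that a tablet should hold roughly 1-10 GB. The exact
    target is a placeholder for this sketch.
    """
    target_gb_per_bucket = 5
    return max(1, math.ceil(table_size_gb / target_gb_per_bucket))

print(recommend_buckets(2))    # small table -> single bucket
print(recommend_buckets(100))  # ~100 GB table -> 20 buckets
```

Note that the bucket count is chosen per partition, so a table partitioned by day only needs enough buckets for one day's data.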
During the adoption of Apache Doris, the user had to migrate all local data from their branches into Doris. That was when they found out that their branches were using different databases and had data files in very different formats, so the migration could have been a mess.
Luckily, Apache Doris supports a rich collection of data integration methods for both real-time data streaming and offline data import.
Real-time data streaming: Apache Doris fetches MySQL binlogs in real time.
Offline data import: This includes more diversified data sources and data formats.
Historical and incremental data from S3 and HDFS is ingested into Doris via the Broker Load method. Data from Hive or JDBC sources is synchronized to Doris via the Insert Into method. Files are loaded into Doris via the Flink-Doris-Connector and the Flink FTP Connector.
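The first two offline paths boil down to SQL statements submitted to Doris. The Python below assembles simplified examples of both; the label, paths, broker name, and catalog names are hypothetical placeholders, and real Broker Load statements usually carry extra properties (credentials, column mappings) omitted here.

```python
def broker_load_sql(label, table, file_path, broker="hdfs_broker"):
    """Sketch of a Broker Load statement for pulling files from HDFS/S3.

    Assumption: a broker named `broker` is registered in the cluster and
    the files are in Parquet format.
    """
    return (
        f"LOAD LABEL example_db.{label} (\n"
        f"    DATA INFILE('{file_path}')\n"
        f"    INTO TABLE {table}\n"
        f"    FORMAT AS 'parquet'\n"
        f") WITH BROKER '{broker}';"
    )

def insert_into_sql(target, source_table):
    """Sketch of syncing a Hive/JDBC source into Doris via INSERT INTO SELECT,
    assuming an external catalog for the source has already been created."""
    return f"INSERT INTO {target} SELECT * FROM {source_table};"

print(broker_load_sql("settle_batch_01", "ods_settlement",
                      "hdfs://namenode:8020/warehouse/settlement/*"))
print(insert_into_sql("ods_settlement", "hive_catalog.example_db.settlement"))
```

In practice these statements would be submitted over the MySQL protocol to a Doris frontend node, with the load label used to deduplicate retried jobs.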
The primary cluster undertakes all the queries, while the major business data is also synchronized into the backup cluster and updated in real time. In the case of service downtime in the primary cluster, the backup cluster takes over swiftly and ensures business continuity.
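From the client's side, the failover described above reduces to routing queries to whichever cluster is healthy. The sketch below illustrates that routing decision only; the host names are placeholders, and the actual health check (e.g., a MySQL-protocol ping against the frontend node) is assumed to happen elsewhere.

```python
# Illustrative endpoint selection between a primary and a backup Doris
# cluster. Host names are hypothetical; real deployments typically hide
# this behind a proxy or VIP rather than client-side logic.
PRIMARY_FE = "doris-primary-fe:9030"
BACKUP_FE = "doris-backup-fe:9030"

def choose_endpoint(primary_healthy: bool) -> str:
    """Route queries to the primary cluster while it is healthy; fall back
    to the real-time-synchronized backup cluster otherwise."""
    return PRIMARY_FE if primary_healthy else BACKUP_FE

print(choose_endpoint(True))   # normal operation -> primary
print(choose_endpoint(False))  # primary downtime -> backup takes over
```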
This Cyber News was published on feeds.dzone.com. Publication date: Thu, 11 Jan 2024 17:43:04 +0000