This article is shared from the Huawei Cloud Community post "Introduction to overwrite data import methods in databases", by along_2020.
As we all know, the database INSERT INTO syntax appends data. Recently, while handling customer data import scenarios, we have often run into cases that require an overwrite import. There are two common kinds of overwrite import:
1. Partial overwrite: new and old data are matched on key column values; rows that match are overwritten with the new data, and rows that do not match are inserted directly.
2. Full overwrite: all old data is deleted and the new data is inserted.
This article introduces how to perform overwrite data imports in a database.
Consider a business that imports a large batch of data into its business table for analysis every day. The business table has a primary key column, userid. When an inserted row conflicts with existing data on the primary key, that row should overwrite (update) the old data; when there is no conflict on userid, the new row should simply be inserted. Take overwriting the data of table src into the business table des as an example:
Option 1: Use a DELETE + INSERT combination (an UPDATE-based variant is also possible; consider how)
-- Start the transaction
START TRANSACTION;
-- Remove the rows whose primary key conflicts with the incoming data
DELETE FROM des USING src WHERE des.userid = src.userid;
-- Import the new data
INSERT INTO des SELECT * FROM src;
-- Commit the transaction
COMMIT;
Advantages of the scheme: it is implemented with nothing more than the familiar DELETE and INSERT statements.
Disadvantages of the scheme: 1. it takes two steps, a DELETE and an INSERT, which is less convenient; 2. both statements must match rows against the other table, so DELETE/INSERT performance is bounded by query performance.
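The DELETE + INSERT combination can be sketched end to end in Python with SQLite as a stand-in engine. SQLite's DELETE has no USING clause, so the sketch expresses the same match with an IN subquery; the table and column names (des, src, userid, b) follow the article's example.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode so we can
# control the transaction explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE des (userid INTEGER PRIMARY KEY, b TEXT)")
cur.execute("CREATE TABLE src (userid INTEGER PRIMARY KEY, b TEXT)")
cur.executemany("INSERT INTO des VALUES (?, ?)", [(1, "old1"), (2, "old2")])
cur.executemany("INSERT INTO src VALUES (?, ?)", [(2, "new2"), (3, "new3")])

# Both statements run inside one transaction so readers never observe a
# half-overwritten table.
cur.execute("BEGIN")
# Remove rows whose primary key conflicts with the incoming data ...
cur.execute("DELETE FROM des WHERE userid IN (SELECT userid FROM src)")
# ... then insert everything from src.
cur.execute("INSERT INTO des SELECT * FROM src")
cur.execute("COMMIT")

print(sorted(cur.execute("SELECT * FROM des").fetchall()))
# [(1, 'old1'), (2, 'new2'), (3, 'new3')]
```

Row 2 is overwritten with the new value, row 3 is inserted, and row 1 is left untouched, which is exactly the partial-overwrite behavior described above.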
Option 2: Use MERGE INTO
MERGE INTO des
USING src ON (des.userid = src.userid)
WHEN MATCHED THEN
    UPDATE SET des.b = src.b
WHEN NOT MATCHED THEN
    INSERT VALUES (src.userid, src.b);
Advantages of the solution: MERGE INTO completes the task in a single SQL statement; it is easy to use, and the engine handles the deduplication internally with high efficiency.
Disadvantages of the solution: the database product must support MERGE INTO. Databases such as Oracle and GaussDB(DWS) already support it; MySQL's INSERT ... ON DUPLICATE KEY UPDATE offers similar functionality.
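As an illustration of the single-statement approach, SQLite (which lacks MERGE INTO) provides the analogous UPSERT clause, ON CONFLICT ... DO UPDATE, available in SQLite 3.24+. The sketch below shows the same match-then-update-or-insert behavior; the `WHERE true` is required by SQLite's parser when an upsert clause follows a SELECT.

```python
import sqlite3

# Autocommit mode; each statement here commits on its own.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE des (userid INTEGER PRIMARY KEY, b TEXT)")
cur.execute("CREATE TABLE src (userid INTEGER PRIMARY KEY, b TEXT)")
cur.executemany("INSERT INTO des VALUES (?, ?)", [(1, "old1"), (2, "old2")])
cur.executemany("INSERT INTO src VALUES (?, ?)", [(2, "new2"), (3, "new3")])

# One statement: rows that conflict on userid are updated in place,
# rows with a new userid are inserted (SQLite's counterpart to MERGE INTO).
cur.execute("""
    INSERT INTO des SELECT * FROM src WHERE true
    ON CONFLICT(userid) DO UPDATE SET b = excluded.b
""")

print(sorted(cur.execute("SELECT * FROM des").fetchall()))
# [(1, 'old1'), (2, 'new2'), (3, 'new3')]
```

The `excluded` pseudo-table names the row that would have been inserted, playing the same role as `src.b` in the MERGE INTO statement above.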
Now consider a business that imports the data of a certain time interval into the business table every day for analysis. The analysis needs only the newly imported interval, not the historical data. This case calls for a full overwrite import.
Option 1: Use a TRUNCATE + INSERT combination
-- Start the transaction
START TRANSACTION;
-- Clear the business table
TRUNCATE des;
-- Insert the January data
INSERT INTO des SELECT * FROM src
    WHERE time >= '2020-01-01 00:00:00' AND time < '2020-02-01 00:00:00';
-- Commit the transaction
COMMIT;
Advantages of the scheme: simple and direct; clearing the table first and then inserting achieves the overwrite effect.
Disadvantages of the scheme: TRUNCATE takes the highest-level (level-8) lock on the business table des and holds it until the transaction ends. When the INSERT runs for a long time because the data volume is huge, des is inaccessible for that whole period, and the business that depends on des is effectively interrupted.
Option 2: Use a temporary table as a transition
-- Start the transaction
START TRANSACTION;
-- Create a temporary table with the same structure as des
CREATE TABLE temp(LIKE des INCLUDING ALL);
-- Load the data into the temporary table first
INSERT INTO temp SELECT * FROM src
    WHERE time >= '2020-01-01 00:00:00' AND time < '2020-02-01 00:00:00';
-- Drop the business table once the load completes
DROP TABLE des;
-- Rename the temporary table: temp -> des
ALTER TABLE temp RENAME TO des;
-- Commit the transaction
COMMIT;
Advantages of the scheme: unlike option 1, the business table des remains accessible during the INSERT; analysis queries keep reading the old data until the transaction commits, and see the newly imported data afterwards.
Disadvantages of the scheme: 1. it involves several steps, which is less convenient; 2. DROP TABLE also removes objects that depend on the table, such as views, and restoring those dependent objects afterwards can be complicated.
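The temporary-table transition can likewise be sketched with SQLite as a stand-in engine. SQLite has no CREATE TABLE ... LIKE, so the sketch repeats the column list, and the staging table is named `staging` instead of `temp` (TEMP is a keyword in SQLite); DDL statements participate in the transaction just as in the steps above.

```python
import sqlite3

# Autocommit mode so BEGIN/COMMIT delimit the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE des (userid INTEGER, time TEXT)")
cur.execute("CREATE TABLE src (userid INTEGER, time TEXT)")
cur.execute("INSERT INTO des VALUES (1, '2019-12-15 08:00:00')")
cur.executemany("INSERT INTO src VALUES (?, ?)",
                [(2, "2020-01-10 09:30:00"), (3, "2020-02-05 10:00:00")])

cur.execute("BEGIN")
# Stage the January data in a table with the same structure as des.
cur.execute("CREATE TABLE staging (userid INTEGER, time TEXT)")
cur.execute("""INSERT INTO staging SELECT * FROM src
               WHERE time >= '2020-01-01 00:00:00'
                 AND time <  '2020-02-01 00:00:00'""")
# Swap the staging table in for the business table.
cur.execute("DROP TABLE des")
cur.execute("ALTER TABLE staging RENAME TO des")
cur.execute("COMMIT")

print(cur.execute("SELECT * FROM des").fetchall())
# [(2, '2020-01-10 09:30:00')]
```

Only the January row survives the overwrite; the December row is gone with the dropped table, and the February row never passed the interval filter.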
Option 3: Use INSERT OVERWRITE
INSERT OVERWRITE INTO des
SELECT * FROM src
    WHERE time >= '2020-01-01 00:00:00' AND time < '2020-02-01 00:00:00';
Advantages of the solution: a single SQL statement does the job and is convenient to run; it switches business queries from old to new data in one step, so the business is not interrupted.
Disadvantages of the solution: the product must support INSERT OVERWRITE. Databases such as Impala and GaussDB(DWS) already support it.
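On engines without INSERT OVERWRITE, its semantics (atomically replace the table contents with a query result) can be emulated with a delete-and-refill inside one transaction. A rough sketch with SQLite follows, using DELETE in place of TRUNCATE, which SQLite also lacks; note that on engines with table-level locks this emulation inherits the locking caveat of option 1, so it only illustrates the semantics.

```python
import sqlite3

# Autocommit mode; the overwrite itself is wrapped in BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE des (userid INTEGER, time TEXT)")
cur.execute("CREATE TABLE src (userid INTEGER, time TEXT)")
cur.execute("INSERT INTO des VALUES (1, '2019-12-15 08:00:00')")
cur.executemany("INSERT INTO src VALUES (?, ?)",
                [(2, "2020-01-10 09:30:00"), (3, "2020-02-05 10:00:00")])

# Emulated INSERT OVERWRITE: empty the table, then refill it from the
# query, inside one transaction so the swap appears atomic to readers.
cur.execute("BEGIN")
cur.execute("DELETE FROM des")  # SQLite's substitute for TRUNCATE
cur.execute("""INSERT INTO des SELECT * FROM src
               WHERE time >= ? AND time < ?""",
            ("2020-01-01 00:00:00", "2020-02-01 00:00:00"))
cur.execute("COMMIT")

print(cur.execute("SELECT * FROM des").fetchall())
# [(2, '2020-01-10 09:30:00')]
```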
As big data scenarios keep multiplying, data import scenarios are becoming richer. Besides the overwrite imports introduced in this article, there are other methods such as INSERT IGNORE, which skips conflicting rows. All of these scenarios can be implemented with the basic INSERT, UPDATE, DELETE, and TRUNCATE statements, but there is also clear demand for the advanced single-statement SQL features; we will introduce those when the opportunity arises.