Introduction to two kinds of database overwrite data import methods

Abstract: This article mainly shares methods for overwrite data import into a database in big data scenarios.

This article is shared from the Huawei Cloud Community article "Introduction to database overwrite data import methods", author: along_2020.

Foreword

As we all know, the INSERT INTO syntax in databases appends data. Recently, when handling some customer data import scenarios, we often encounter situations that require an overwrite import. Common overwrite imports mainly fall into the following two types:

1. Partial overwrite: new and old data are matched by key column values; rows that match are overwritten with the new data, and rows that do not match are inserted directly.

2. Full overwrite: all old data is deleted and the new data is inserted.

This article mainly introduces how to perform overwrite data imports in the database.

Partial overwrite

Business scenario

A business imports big data into a business table for analysis every day. One column of the business table is the primary key (userid). When inserted data conflicts with existing data on the primary key, the existing row should be overwritten (updated) with the new data; if the userid does not conflict with the old data, the new row is inserted directly. Take the overwrite import of data from the table src into the business table des as an example:
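The original table definitions are not shown; a minimal sketch of what src and des might look like, assuming PostgreSQL/GaussDB(DWS)-style syntax and inferring the columns userid, b, and time from the statements below (everything else is an assumption), is:

--Hypothetical business table; userid is the primary key used for conflict matching
CREATE TABLE des (
    userid INT PRIMARY KEY,
    b      VARCHAR(64),
    time   TIMESTAMP
);

--Hypothetical staging table with the same columns, holding the newly imported data
CREATE TABLE src (
    userid INT,
    b      VARCHAR(64),
    time   TIMESTAMP
);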

Application solutions

Option 1: Use a DELETE + INSERT combination (an UPDATE-based variant is also possible; see the sketch after the code below)

--open transaction
START TRANSACTION;

--Remove primary key conflict data from des
DELETE FROM des
WHERE EXISTS (SELECT 1 FROM src WHERE src.userid = des.userid);

--Import new data
INSERT INTO des
SELECT *
FROM src
WHERE NOT EXISTS (SELECT 1 FROM des WHERE des.userid = src.userid);

--transaction commit
COMMIT;
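For reference, the UPDATE-based variant mentioned in Option 1 can be sketched roughly as follows, assuming PostgreSQL/GaussDB(DWS)-style UPDATE ... FROM syntax: matched rows are updated in place, and only non-conflicting rows are inserted.

--open transaction
START TRANSACTION;

--Overwrite primary key conflict data with the new values
UPDATE des
SET b = src.b
FROM src
WHERE des.userid = src.userid;

--Insert the data that has no primary key conflict
INSERT INTO des
SELECT *
FROM src
WHERE NOT EXISTS (SELECT 1 FROM des WHERE des.userid = src.userid);

--transaction commit
COMMIT;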

Advantages of the scheme: it can be implemented with the most commonly used statements, DELETE and INSERT.

Disadvantages of the scheme: 1. It takes two steps, DELETE and INSERT, which is less convenient; 2. It relies on subqueries, so the DELETE/INSERT performance is constrained by the query performance.

Option 2: Use MERGE INTO

MERGE INTO des USING src ON (des.userid = src.userid)
WHEN MATCHED THEN UPDATE SET des.b = src.b
WHEN NOT MATCHED THEN INSERT VALUES (src.userid,src.b);

Advantages of the solution: MERGE INTO can be done in a single SQL statement, is easy to use, and its internal deduplication is efficient.

Disadvantages of the solution: the database product must support MERGE INTO. Currently, databases such as Oracle and GaussDB(DWS) support this feature. MySQL's INSERT ... ON DUPLICATE KEY UPDATE provides similar functionality.
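For comparison, a rough MySQL sketch using INSERT ... ON DUPLICATE KEY UPDATE (with the same hypothetical des/src tables as above) could look like this:

--MySQL: rows whose userid conflicts with an existing primary key are updated instead of inserted
INSERT INTO des (userid, b)
SELECT userid, b FROM src
ON DUPLICATE KEY UPDATE b = VALUES(b);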

Full overwrite

Business scenario

A business imports the data of a certain time interval into a business table every day for analysis. The analysis only needs the data of the newly imported time interval and does not need the past historical data. In this case an overwrite import is required.

Application solutions

Option 1: Use a TRUNCATE + INSERT combination

--open transaction
START TRANSACTION;

--Clear business table data
TRUNCATE des;

--Insert January data
INSERT INTO des SELECT * FROM src WHERE time > '2020-01-01 00:00:00' AND time < '2020-02-01 00:00:00';

--commit transaction
COMMIT;

Advantages of the scheme: simple and brute-force; clearing the table first and then inserting directly achieves an overwrite-like effect.

Disadvantages of the scheme: when TRUNCATE clears the data of the business table des, it takes a level-8 (ACCESS EXCLUSIVE) lock on the table until the end of the transaction. When the INSERT takes a long time because of a huge amount of data, the des table is inaccessible for a long time and the business that depends on des is interrupted.
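If needed, the lock held by the TRUNCATE can be observed from another session while the transaction is still open; a rough sketch for PostgreSQL-compatible databases (assuming the pg_locks view is available) is:

--Check locks held on the business table des; the TRUNCATE holds an ACCESS EXCLUSIVE lock until COMMIT
SELECT relation::regclass AS table_name, mode, granted
FROM pg_locks
WHERE relation = 'des'::regclass;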

Option 2: Use a temporary transition table

--open transaction
START TRANSACTION;

--Create a temporary table
CREATE TABLE temp (LIKE des INCLUDING ALL);

--The data is first imported into a temporary table
INSERT INTO temp SELECT * FROM src WHERE TIME > '2020-01-01 00:00:00' AND TIME < '2020-02-01 00:00:00';

--Drop the old business table des after the import is complete
DROP TABLE des;

--Rename the temporary table temp -> des
ALTER TABLE temp RENAME TO des;

--commit transaction
COMMIT;

Advantages of the scheme: compared with Option 1, the business table des remains accessible (serving the old data) while the INSERT runs; that is, the analysis business can keep reading the old data before the transaction commits and sees the newly imported data after the transaction commits.

Disadvantages of the scheme: 1. It involves many steps and is not convenient; 2. The DROP TABLE operation also removes objects that depend on the table, such as views, and restoring those dependent objects afterwards can be complicated.
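For example, if a view had been defined on des, it would disappear along with the dropped table and would have to be recreated after the rename; a sketch with a hypothetical view name:

--Hypothetical dependent view that must be rebuilt after des is dropped and temp is renamed
CREATE VIEW des_daily_view AS
SELECT userid, b, time FROM des;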

Option 3: Use INSERT OVERWRITE

INSERT OVERWRITE INTO des SELECT * FROM src WHERE time > '2020-01-01 00:00:00' AND time < '2020-02-01 00:00:00';

Advantages of the solution: it can be completed in a single SQL statement and is convenient to execute. It supports one-click switching between old and new data for business queries, so the business is not interrupted.

Disadvantages of the solution: the database product must support INSERT OVERWRITE. Currently, databases such as Impala and GaussDB(DWS) support this feature.

Summary

As big data scenarios become more common, data import scenarios are becoming more and more varied. In addition to the overwrite data imports introduced in this article, there are other import methods, such as the conflict-ignoring INSERT IGNORE import. These import scenarios can all be implemented with the basic INSERT, UPDATE, DELETE, and TRUNCATE statements, but there is also a clear demand for higher-level one-statement SQL features, which will be introduced later when there is an opportunity.
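As a taste of such one-statement features, a conflict-ignoring import can be sketched roughly as follows, in MySQL with INSERT IGNORE and in PostgreSQL-style databases with ON CONFLICT DO NOTHING (using the same hypothetical tables as above):

--MySQL: rows whose primary key already exists in des are silently skipped
INSERT IGNORE INTO des SELECT * FROM src;

--PostgreSQL style: equivalent conflict-ignoring insert
INSERT INTO des SELECT * FROM src ON CONFLICT (userid) DO NOTHING;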

 
