Upsert operation: ETL vs pure SQL
In this post I would like to compare efficiency of upsert operation in ETL tools and SQL. This is one of the most commonly used operations when building ETLs and integrating systems. Developers should always ask themselves whether not to use custom written MERGE instead of creating jobs in selected ELT tool.
As an ETL example I will use Talend Open Studio 6.1. An SQL example will be MERGE operation in SQL Server. We will start with building two tables, which will be used for upsert. Please note, that I will populate sourceTable with 1 000 000 rows and destinationTable with 500 000 rows using modified script from http://blogbi.pl/generating-milions-of-rows-in-sql-server/
CREATE TABLE sourceTable (
id int PRIMARY KEY,
CREATE TABLE destinationTable (
id int PRIMARY KEY,
-- mentioned script here
SELECT COUNT(*) as [#sourceCount]
SELECT COUNT(*) as [#destinationCount]
Talend Open Studio upsert operation implementation
First, we will build job in Talend. Normally you can use three components: tMSSqlInput (for sourceTable), tMSSqlOutput (for destinationTable) and obviously tMap. Then you would switch component action for data for (Insert or Update) for tMSSqlOutput. However this approach creates a flow, which transfers data with speed of dozens rows per second, so it is extremely poor. So, let me show you how to build a well-performing Talend upsert job.
I have used sourceTable and destinationTable as inputs for tMap. Then I haved selected inner join between them to catch which rows are new and which are already in destination table. Based on that I will filter them and insert/update them.
With such approach the performance is better than build-in insert/update in Talend. It works in ~ 20 seconds.
SQL Server upsert operation implementation
Now let’s check out how MERGE works in comparison to Talend.
The SQL code for MERGE:
CREATE PROCEDURE dbo.MergeExample AS
MERGE dbo.destinationTable AS target
USING dbo.sourceTable AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET number = source.number
WHEN NOT MATCHED THEN
INSERT (id, number)
VALUES (source.id, source.number);
In Talend you can just use TMSSqlRow with SQL Query EXEC dbo.MergeExample.
Performance: ~2 s.
So SQL is 10 times faster than Talend in this case.
This simple theoretical test shows the same what practise learned me. Pure SQL is much faster than any ETL tool in case of many operations. With more data and more complicated transformations the advantage of SQL will be even bigger. I have seen cases where pure well written SQL is even 500 times faster than ETL tools.
Also, please note that when you have a lot of transformations ,it is nearly always easier to write Table -> View – > Merge flow in SQL, then create complex jobs in ETL tool.
Therefore I would like to suggest you to think about this option while developing your ETLs.