5 best ETL practices
There is no single list of instructions for building the ideal, perfect ETL process, not only because of the complexity of the problem, but also because of its diversity. The scope of ETL can vary depending on project requirements. Every part of building the solution, from creating the architecture, defining sources and mappings, to building validation and applying business rules, is important and has its own specifics.
However, there are some practices that might help you, and I would like to share them with you.
1. Start with visualization
When you start a new ETL project, you typically meet with the client to determine the list of sources and their types, and then build the mappings. This can be the role of an analyst or an ETL consultant, but usually both are involved.
I would suggest creating a visualization of the ETL process before moving on to the actual build. Even at a very early stage of the project you can take a piece of paper and draw the process (or use a professional architecture tool).
Ask yourself these questions:
- Do I know everything (from an analysis perspective) that I should know? If not, make sure the business is aware of the gap.
- Do the sources provide all the information and data required for the target? Do you have a list of sources, dictionaries and mappings?
- Do you know the dependencies between systems? Do you know all the relations?
2. Always analyze the data
Even if you have an analyst on your team, you should always analyze the data yourself. Finding bugs in mappings, relations or data at the beginning can save you tons of hours of work later. For example, when you get the first extract of source data, check it for NULLs and for primary and foreign key occurrence. The data might not match what the business thinks it looks like. This happens very often in medium-sized systems such as SAP, where you can find a lot of financial exceptions.
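Those first-extract checks can be scripted in a few lines. Below is a minimal sketch in plain Python; the column names (`order_id`, `customer_id`) and the extract-as-list-of-dicts shape are hypothetical examples, not anything prescribed by a particular tool.

```python
def check_extract(rows, pk, not_null, fk_values=None, fk_col=None):
    """Run basic sanity checks on a first extract (a list of dicts).

    Reports NULLs in required columns, duplicate primary keys, and
    foreign keys that do not appear in the reference value set.
    """
    issues = {"null_violations": [], "duplicate_pks": [], "orphan_fks": []}
    seen_pks = set()
    for i, row in enumerate(rows):
        # NULL check on required columns
        for col in not_null:
            if row.get(col) is None:
                issues["null_violations"].append((i, col))
        # Primary key uniqueness
        key = row.get(pk)
        if key in seen_pks:
            issues["duplicate_pks"].append((i, key))
        seen_pks.add(key)
        # Foreign key integrity against a reference set
        if fk_col is not None and fk_values is not None:
            if row.get(fk_col) not in fk_values:
                issues["orphan_fks"].append((i, row.get(fk_col)))
    return issues
```

Running this against an orders extract with a known set of customer IDs immediately surfaces the rows the business was not aware of, before any mapping work starts.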
3. Choose a proper ETL tool
This is a crucial step from a technical point of view. I suggest analyzing your case, and if you have the comfort of choosing the ETL tool, make use of it!
For example, if the client uses SAP as a source and wants a free ETL tool, then Talend Open Studio might be a very good choice. On the other hand, if both the source and the destination are an Oracle Database with an OBIEE reporting system built on top of it, it might be a good idea to propose ODI as the tool. And lastly, if you have a lot of demands that will require heavy customisation, you might think about combining Python and SQL and building your own tool. These are just example cases; every case is different and should be diagnosed carefully.
Nevertheless, I encourage you to analyze your case and possible options. It might save you a lot of time later.
4. Think about scalability
Scalability is not only a very trendy word, but also one of the key features a good ETL process should possess. Always think about it and test your solution for it. Even if your ETL now operates on 10 000 rows daily, it might later grow to 1 000 000 rows or more. Asking the business about the expected growth of data is also a very good idea.
For example, when using web services as a source system, always check how they behave with bigger volumes; a lack of performance at scale is a common problem.
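One cheap way to probe that is to pull the source page by page and record the time each page takes, so you can see whether response times degrade as the offset grows. The sketch below assumes a paged API wrapped in a `fetch_page(offset, limit)` callable; that callable and its signature are hypothetical stand-ins for whatever client your web service actually provides.

```python
import time

def fetch_all(fetch_page, page_size=1000, max_pages=100):
    """Pull pages from a paged source until it is exhausted.

    Returns all rows plus the wall-clock time of each page call,
    which makes volume-related slowdowns easy to spot.
    """
    rows, timings = [], []
    for page in range(max_pages):
        start = time.perf_counter()
        batch = fetch_page(offset=page * page_size, limit=page_size)
        timings.append(time.perf_counter() - start)
        if not batch:
            break  # source exhausted
        rows.extend(batch)
    return rows, timings
```

Plotting or simply printing `timings` after a test run against a realistic data volume tells you early whether the web service will survive the growth the business expects.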
5. ETL should serve people, not the other way around
Therefore, build proper logging, but think of it as helpful, useful information for maintenance, not as a painful must-do. Logging that is too big and too complex can make future maintenance harsh.
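In practice, a few well-chosen log lines per step (row counts in and out, duration) are usually worth more than a full dump of every record. A minimal sketch using Python's standard `logging` module; the logger name and step function are hypothetical examples.

```python
import logging
import time

# One plain, readable format for the whole pipeline
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("etl.orders")  # hypothetical pipeline name

def run_step(name, func, rows):
    """Run one ETL step and log what maintenance actually needs:
    row counts in/out and how long the step took."""
    start = time.perf_counter()
    result = func(rows)
    log.info("%s: %d rows in, %d rows out, %.2fs",
             name, len(rows), len(result), time.perf_counter() - start)
    return result
```

Whoever is on support six months later can then answer "did the load run, and how big was it?" from a single grep, without wading through per-row noise.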
Secondly, when building alerts, always check who should receive them and how they will be delivered. An automatic e-mail sent to 1000 people every day might not be the best option.
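A small routing table often solves this: each alert category goes to the people who can act on it, and duplicate messages are collapsed instead of being re-sent. The categories and addresses below are hypothetical examples, as is the overall shape; a real setup would plug in whatever delivery channel you use.

```python
from collections import defaultdict

# Hypothetical routing table: who can actually act on each category
ROUTES = {
    "load_failure": ["etl-oncall@example.com"],
    "data_quality": ["data-steward@example.com"],
}

def route_alerts(alerts):
    """Group (category, message) alerts by recipient and drop duplicates,
    so each person gets one relevant digest instead of a daily flood."""
    outbox = defaultdict(list)
    for category, message in alerts:
        for recipient in ROUTES.get(category, []):
            if message not in outbox[recipient]:
                outbox[recipient].append(message)
    return dict(outbox)
```

The point of the sketch is the shape, not the code: deciding up front who owns each kind of failure is what keeps alerts useful.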
And lastly, write proper documentation, and start this task during the ETL build (even just an outline), not two weeks after you finish. Try to keep it as simple and as easy to understand as possible.
Of course this list is not exhaustive; these are just some of my ideas. I would be very happy to discuss them with you in the comments.