SSIS Row Comparison By Checksum
I recently ran across an SSIS package that needed to be rewritten that included the Checksum Transformation from SQLIS.com . The main reason was that they wanted the removal of the component because they were using it to compare rows instead of using something like a Slowly Changing Dimension Transformation (SCD). While their process seemed to be solid on the surface they were having problems with it not accurately identifying changes in certain columns. Right away I knew that the issue was related to the columns in question being text data types. I knew this specifically because I had written A LOOOONG time ago a database comparison tool that used checksums as part of the data comparison algorithm.
The problem stems from the fact that SQL Server and really a lot of Checksum algorithms cannot accurately compare certain types of fields. Specifically from MSDN…
Syntax
CHECKSUM ( * | expression [ ,...n ] )
Arguments
*
Specifies that computation is over all the columns of the table. CHECKSUM returns an error if any column is of noncomparable data type. Noncomparable data types are text, ntext, image, and cursor, as well as sql_variant with any of the above types as its base type.
Now while the component doesn’t necessarily use the CHECKSUM function within SQL Server, it still suffers from the same limitation. You do however have some options. Let’s start with how the Checksum is normally used. You have two inputs, the new data stream and the old data stream. Both of these are ordered and then merged on the key values. Then the old data is run through the checksum process and is appended as a new column. The new data is then checksumed and appended. Finally the two checksum columns are compared in order to tell what to do with that row of data by looking to see whether it matches and the comparison of some of the key values(NULL or NOT NULL). The most straight-forward option is to keep all of the process up until the checksum values and delete the checksums. Now replace the checksums with a simple Derived Column Transformation. With the Derived Column you can simply create new column and populate it with a boolean column created by a simple function (oldcolumn==newcolumn && oldcolumn1==newcolumn1 ). That’s it. the comparison of the column values in SSIS overcome the limitations of the checksum method. A little more labor involved especially on very wide tables but it works like a charm.
Hopefully this will help you out a little if you are in the same jam.
Cheers!
AJ