Minnu's Blog on Informatica & Data warehouse concepts


Monday, June 9, 2008

Handling Duplicate Source Records using Mapping

Overview
Primary key violation errors on the target table caused by duplicate rows in the source are a common issue when running PowerCenter sessions. You can use a Lookup transformation to find duplicate data in the target based on a key field (or a composite key). This works when comparing source rows to rows that already exist in the target table, but the technique does not work when the source itself contains duplicate rows that do not yet exist in the target.

You may attempt to use an uncached Lookup against the target and set the target commit interval to one so that the target table is queried for each row. Because of the way in which the PowerCenter Server processes rows (as described at the end of this article), this approach does not work.
Even if this method worked, it would be very inefficient and performance would be severely affected.
The following examples demonstrate some of the most efficient ways to deal with duplicate source data.

Aggregator Mapping

In this case study, duplicate rows read from the source are removed: only one row from each group of duplicates is sent to the target and the rest are discarded. This example specifically chooses the item with the greatest quantity from the ORDER_ITEMS table.


Source Qualifier
Relational Source
To choose a specific row depending on a value in a field other than the key field, you can specify a SQL override in the Source Qualifier to order the input rows. With a relational source, the following SQL override can be used to order the rows. The ORDER BY clause makes the row with the largest quantity the last row in each group of items read from the source table.
SELECT ORDER_ITEMS.ORDER_ID, ORDER_ITEMS.LINE_NO, ORDER_ITEMS.ITEM_ID, ORDER_ITEMS.DESCRIPTION, ORDER_ITEMS.QUANTITY, ORDER_ITEMS.PRICE, ORDER_ITEMS.DISCOUNT
FROM ORDER_ITEMS
ORDER BY ORDER_ITEMS.ITEM_ID, ORDER_ITEMS.QUANTITY
Non-Relational Source
The Source Qualifier is left as is; no SQL override can be specified for a flat file source. Data is processed in the order it appears in the source file, which means the rows are not guaranteed to be ordered by quantity.
Sorter transformation
Relational Source
A Sorter is optional when using a relational source. The data can be sorted in the source qualifier as above, and this is recommended.
Non-Relational Source
To presort the data when using a flat file source, a Sorter transformation can be used.
In this example the ITEM_ID and QUANTITY ports are selected as key columns, mirroring the ORDER BY clause used for the relational source.
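A sketch of the Sorter key ports implied by this (the exact settings are an assumption; they simply mirror the SQL override above):

ITEM_ID    Key, Direction = Ascending
QUANTITY   Key, Direction = Ascending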

Aggregator Transformation
The Aggregator groups the rows based on ITEM_ID so that only one row per ITEM_ID is sent to the target. The Aggregator sends the last row in each group to the target. If the source is relational, the rows have been ordered by QUANTITY in ascending order and the last row in each group is the row with the greatest value in the QUANTITY port. If the source is a flat file and the rows have not been sorted, you cannot control which row of each group is sent to the target.
Note: To improve speed, sorted ports can be used for presorted data. Make sure the first column listed in the ORDER BY clause is the same as the Group By port in the Aggregator. In this case, it is ITEM_ID.
If your source is not relational and you want to write a specific row to the target for each group (in this case the row with the greatest quantity), add the Sorter transformation as above or use one of the following options.
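For readers who want to sanity-check the result against a relational source, the following query is a rough SQL equivalent of the sort-plus-Aggregator logic (illustrative only, not part of the mapping; if two rows share the maximum quantity for an ITEM_ID it returns both, whereas the Aggregator keeps exactly one):

SELECT oi.*
FROM ORDER_ITEMS oi
WHERE oi.QUANTITY = (SELECT MAX(oi2.QUANTITY)
                     FROM ORDER_ITEMS oi2
                     WHERE oi2.ITEM_ID = oi.ITEM_ID)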

Rank Mapping

This example does basically the same thing as the Aggregator example, but without the need for a SQL override to obtain a specific row. This is most useful if the source is not relational.
Source Qualifier
No modifications are needed for the Source Qualifier.
Rank
Set the properties for the Rank Transformation as follows:

[Screenshot: Rank transformation properties]

This groups the rows by ITEM_ID and ranks them according to QUANTITY. The Top/Bottom property set to Top returns the rows with the greatest values, and setting Number of Ranks to 1 returns only the row with the highest QUANTITY in each group.
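A sketch of the relevant settings, as described above (property names follow the standard Rank transformation; the exact dialog layout is not reproduced here):

Group By port   : ITEM_ID
Rank port       : QUANTITY
Top/Bottom      : Top
Number of Ranks : 1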
Variable Port and Filter Mapping


This approach requires the following transformations: an Expression transformation to track the ITEM_ID values with variable ports, and a Filter transformation to filter out all subsequent (duplicate) items.


Source Qualifier
As in the Aggregator approach, the Source Qualifier can be used to sort the rows by ITEM_ID. This is done so that all items enter the mapping pipeline sequentially and the values are stored in the variables in the desired order. Note that here QUANTITY is sorted in descending order, so the row with the greatest quantity is the first row in each group.
SELECT ORDER_ITEMS.ORDER_ID, ORDER_ITEMS.LINE_NO, ORDER_ITEMS.ITEM_ID, ORDER_ITEMS.DESCRIPTION, ORDER_ITEMS.QUANTITY, ORDER_ITEMS.PRICE, ORDER_ITEMS.DISCOUNT
FROM ORDER_ITEMS
ORDER BY ORDER_ITEMS.ITEM_ID, ORDER_ITEMS.QUANTITY DESC
Sorter transformation (Optional)
Relational Source
A Sorter is optional when using a relational source. The data can be sorted in the source qualifier as above, and this is recommended.
Non-Relational Source
To choose a specific row depending on a specific value in a column in the source file, a Sorter transformation can be used as shown in the Aggregator example above. For this mapping, sort by ITEM_ID and then by QUANTITY in descending order so that the row with the greatest quantity comes first in each group.
Expression Transformation

[Screenshot: Expression transformation port definitions]

The first port, ITEM_ID_VAR, compares ITEM_ID with PREV_ITEM_ID to see if they are the same. If they are, ITEM_ID_VAR is set to 1; if they are not, it evaluates to 0.
The output port ITEM_ID_OUT is set to the value of ITEM_ID_VAR and passes it on to the Filter.

The order in which the variables are placed into the Expression transformation determines the order in which they are evaluated and set. The first variable in the list of ports is the first variable set and the last one in the list is the last one set. After the variables are evaluated the output ports are evaluated.
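A sketch of the port definitions implied by this description (port names follow the text; the exact expressions are an assumption):

ITEM_ID_VAR   (variable) = IIF(ITEM_ID = PREV_ITEM_ID, 1, 0)
PREV_ITEM_ID  (variable) = ITEM_ID
ITEM_ID_OUT   (output)   = ITEM_ID_VAR

Listing ITEM_ID_VAR before PREV_ITEM_ID matters: ITEM_ID_VAR must be evaluated while PREV_ITEM_ID still holds the value from the previous row.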

In this example: for the very first source row, when evaluating the variable ITEM_ID_VAR, PREV_ITEM_ID has not yet been set, so it is 0. Since ITEM_ID is always a nonzero value, ITEM_ID and PREV_ITEM_ID are not equal and ITEM_ID_VAR is set to 0. The last variable, PREV_ITEM_ID, is then set to the value of ITEM_ID in the current row. The output port ITEM_ID_OUT sends the value 0 to the Filter transformation.

For the second row, ITEM_ID_VAR once again compares ITEM_ID with PREV_ITEM_ID. This time, however (based on the sample data below), the two values are the same, since PREV_ITEM_ID was set to the ITEM_ID value from the previous row. So ITEM_ID_VAR and ITEM_ID_OUT are set to 1 and passed to the Filter transformation. The last variable, PREV_ITEM_ID, is again set to the ITEM_ID value of the current row.

ITEM_ID_VAR and ITEM_ID_OUT remain 1 until ITEM_ID changes to a new value. This is shown with the sample data below.

[Sample data showing the ITEM_ID_VAR / ITEM_ID_OUT value for each source row]

All rows marked 0 are the rows to keep; the ones marked 1 in the ITEM_ID_VAR port are the duplicates, the ones to be discarded. This can be accomplished by using a Filter.
Filter Transformation
The Filter is then used to remove all duplicate item rows. The filter condition is:
ITEM_ID_OUT != 1
This only allows rows where ITEM_ID_VAR and ITEM_ID_OUT equal zero to proceed to the target. All other rows are ignored, so only the unique rows are sent to the target. In the sample data above, the only rows sent to the target are rows 1 and 100, the rows containing the highest quantity for each ITEM_ID. Remember that this only works if QUANTITY is ordered in descending order within each ITEM_ID in the source.
Router Transformation (Optional)
Instead of simply filtering out the duplicate records, this approach allows for special handling of the duplicates. After the Expression transformation, a Router transformation can be used instead of a Filter.
This Router will contain the following groups:
[Screenshot: Router transformation groups]

Connect the DEFAULT group to the target and the REJECT_GROUP to an audit table.
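A sketch of the groups (the REJECT_GROUP condition is an assumption based on the filter logic above):

REJECT_GROUP : ITEM_ID_OUT = 1   (duplicate rows, routed to the audit table)
DEFAULT      : all remaining rows (ITEM_ID_OUT = 0), routed to the target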


This can be used to pass only the duplicate rows on for further processing and into a separate target table, such as an audit table.

Lookup Strategies
If you have duplicate rows in your source data, using a Lookup transformation will not reliably detect them. A cached Lookup transformation caches the target rows at the beginning of the session, so it can detect duplicates that already exist in the target, but not new duplicates being loaded in the same session. Turning off caching will not reliably catch duplicates either: it ensures the target is re-queried for each row, but uncommitted rows temporarily stored in buffers are not visible to the Lookup. Setting the commit interval to one also will not work reliably. A database commit operation takes some milliseconds to complete, and the transformation engine does not pause; it keeps processing and transforming rows and buffering them in memory. So by the time the commit is complete, 100 rows might already have been processed. If the 101st row is a duplicate the Lookup will find it, but not if the duplicate was among the first 100 rows.

A dynamic Lookup cache can be used to overcome this limitation.
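A rough sketch of how that would look (the property and port names below are the standard dynamic Lookup settings; the surrounding mapping details are an assumption, not part of the original example):

Lookup transformation on the target table:
  Dynamic Lookup Cache = enabled
Output port NewLookupRow:
  1 = row was inserted into the cache (first occurrence)
  0 = row already exists in the cache (duplicate)
Downstream Filter or Router condition:
  NewLookupRow = 1   (pass only the first occurrence of each key to the target)

Because the cache itself is updated as rows pass through, duplicates within the same session are caught even before anything is committed to the database.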
