The Link King
Record Linkage and Consolidation Software
Performance Statistics: Creating a Master Client Listing
When creating a master listing that will be periodically updated, the user can choose to process
the data in one of two ways.  Suppose you are given a dataset containing 50,000 rows of
identifiers and will receive 50,000 additional rows every quarter.  Both the initial dataset and
the quarterly updates require unduplication, and the quarterly updates may contain records
already in the master listing.

Performance will vary depending on the characteristics of your data and processor.
Option #1 (Complete Unduplication): Unduplicate the initial dataset to create the master listing.  When a quarterly update
arrives, append it to the existing data and unduplicate the aggregate dataset.  Using this option, the user would
unduplicate 50,000 records initially, 100,000 after the 1st quarterly update, 150,000 after the 2nd quarterly update, etc.  The
aggregate dataset to be unduplicated grows by 50,000 records each quarter.
Option #2 (Sequential Processing): Unduplicate the initial dataset to create the master listing.  When a quarterly update
arrives, import the current master listing as the ‘sample’ dataset and the quarterly update as the ‘matching’ dataset.  Instruct
The Link King to append unmatched records from the ‘matching’ dataset to the master listing.  Save the results of the linkage to
import as the ‘sample’ dataset at the next quarterly update.
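For readers who think in code, the two workflows can be sketched as follows.  This is a minimal Python illustration only, not The Link King itself: it assumes each record carries a single ‘client_id’ identifier and uses exact matching on that field as a stand-in for The Link King's probabilistic linkage.

    import pandas as pd

    def option1_full_unduplication(initial: pd.DataFrame, updates: list[pd.DataFrame]) -> pd.DataFrame:
        # Option #1: re-unduplicate the ever-growing aggregate at every quarter.
        aggregate = initial
        for update in updates:
            aggregate = pd.concat([aggregate, update], ignore_index=True)
            aggregate = aggregate.drop_duplicates(subset="client_id")
        return aggregate

    def option2_sequential(initial: pd.DataFrame, updates: list[pd.DataFrame]) -> pd.DataFrame:
        # Option #2: unduplicate once, then link each update against the master only.
        master = initial.drop_duplicates(subset="client_id")
        for update in updates:
            update = update.drop_duplicates(subset="client_id")
            # Append only the update records whose identifier is not already in the master.
            new_clients = update[~update["client_id"].isin(master["client_id"])]
            master = pd.concat([master, new_clients], ignore_index=True)
        return master

Option #2's loop touches only the new quarterly rows plus a lookup into the master listing, which is the source of the time and disk savings reported below.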

Comparison Results
As detailed below, sequential processing requires less processing time and disk space for each quarterly update. Ultimately, the
processing time differential appears to stabilize at the point where full unduplication requires 40-50% more processing time than
sequential processing.  
Compared to sequential processing, full unduplication requires considerably more disk space for the blocked dataset(s).  The
proportional reduction in the size of the blocked dataset(s) appears to grow as the size of the dataset(s) being processed increases; full unduplication required:

    1.4 times more disk space with 100,000 records,
    2.3 times more disk space with 200,000 records,
    3.0 times more disk space with 300,000 records, and
    3.3 times more disk space with 400,000 records.  

Sequential processing produces a smaller blocked dataset because a) the records in the master listing are not blocked against the
master listing and b) The Link King's unique “alias compression” protocol optimizes the number of records from the master
listing submitted for blocking (see the user manual for a full explanation of “alias compression”).
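For intuition only, the savings can be approximated with candidate-pair arithmetic before any blocking keys or alias compression are applied; the measured disk-space ratios above come from actual runs, not from this formula.  Assuming a 50,000-record master listing and a 50,000-record quarterly update:

    def pairs_within(n: int) -> int:
        # Unordered record pairs within a single dataset of n records.
        return n * (n - 1) // 2

    def pairs_full_unduplication(master: int, update: int) -> int:
        # Option #1: every record in the aggregate is a candidate against every other.
        return pairs_within(master + update)

    def pairs_sequential(master: int, update: int) -> int:
        # Option #2: update records are compared with the master and with each other,
        # but master records are never re-compared against the master listing.
        return master * update + pairs_within(update)

    print(pairs_full_unduplication(50_000, 50_000))  # 4,999,950,000 candidate pairs
    print(pairs_sequential(50_000, 50_000))          # 3,749,975,000 candidate pairs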
One might wonder how the resulting linkage matrix differs between the two data-processing options.  The chart below suggests
minimal difference between the two processes (the lines overlap almost perfectly).  The small differences that do exist suggest
sequential processing is very slightly less likely to link records.
Copyright Camelot Consulting 2004