Database designers seldom consider nuances of application servers, while assigning default values to dates. Design flaws often go unnoticed until issues prop up like an open can of worms, typically during data migration or integration projects involving mainframe data. Data profiling, inventory and quality analysis are usually part of the assessment phase of a data migration / integration project, and require to be meticulous to avoid any issues in future.
Profiling categoric fields is a straight forward option that detects noise with relatively simple and standard statistics (box plot, frequncy tabulation, etc.,). The same may not hold true with dates, as their frequency tabulation may be too long due to the large number of distinct values. They often require additional metrics for quality checks, else flaws may go unnoticed. Added to this, lack of apppropriate documentation and the complex nature of legacy systems in
large organisations only multiply the challenges of information management. One such scenario with data quality issues in date(s) is discussed below.
Background: An automated data profiling SAS macro was used to print statistics for a custom list of fields. SAS Data marts were created to validate the design of the new DB, as part of UAT, that included all the integrity constraints. Space constraints and speed issues due to the massive size of the database were worked around by choosing to use 2-3 days' worth data load for one iteration of UAT. One integrity constraint defined on Date of Birth field was to be less
than that of customer profile creation / customer acquisition date. The GUI application form was not designed to check for violation of integrity constraints.
Problem: During an Enterprise Data Integration project, missing
values in two fields with date values went unnoticed. Large difference
in number of reject records between UAT and ETL. A large number of
records were rejected during the ETL loads and the total number of
customer records reduced by upto 10%, a total deviation form the results
of all four iterations of UAT.
Reason analysed: Missing values on dates had been set up as "01-01/0001" and "31/12/9999" in two different fields of the source database, at its design time. These were part of a data integration project. While extracting data from the DB using PROC SQL, SAS converted 01-01-0001 into 01-01-2001 while storing it in SAS Date formats. The random sample records chosen for profiling, rejected only a small percentage (1/1000 records) which was within the accepted thresholds. On verification, it showed up to be the record of a minor customer, and passed through the Data Quality module as well. All the records with missing date values passed the UAT, as none of the records were detected by the SAS Data Mart load rule defined to "Reject record if date value in ("01-01/0001","31/12/9999")", as all thre cords had a valid date by now.
Business Impact: Go-live delay by 15 business days due to additional iterations of UAT, ETL for customer data.
Opportunities: UAT Team must try and use the same ETL tool as the Technology Team or use more than one ETL tools along various iterations. Final results may vary due to the DBMS Data mart used. Hence maintaining uniformity while managing for efficiency is a balance to be striked by the IM group. On the other hand, it may be a good option for Data Quality Assessment Software Companies to include the nature of Source, UAT and End User Database as parameters to model or valiudat Data Quality rules by internal translation or interconversion of rules.
Suggested Methods for Data Profiling and Quality Analysis on Fields with Date values: In such cases, you may have to be even more careful while assigning default values. If you wish to stick to SAS, the best option would be to extract the date with such default values as day, month and year parts, separately. If these dates are to be used to calculate age / time period, then I would advice use of SQL functions to calculate date differences at data selection, than attempting conversions in SAS (or) a front end application.
Solution implemented: Inclusion of Data Quality checks to count the number of records that match the individual date, month and with year values in SAS, was included in the UAT module to determine the expected number of reject records at ETL, in order to avoid repeated data extractions, and to minimise delay in Go-live from 15 days to 3 days.
No comments:
Post a Comment