Sunday, October 6, 2013

Integrating data from multiple files in R


Problem Statement: To perform a spatio temporal analysis on rainfall data available in .csv format, with one file each for one month and year, mounting to 300 files for 25 years. It is manually difficult to merge this data, though not impossible. 

SolutionR was used for analysis and hence it only made sense to perform the data integration also in R. However, this can be solved in Excel.

System: R was run on Windows 7

Assumption / Note of Warning: The temporal attribute was not included in the data, but file names had the temporal attribute (i.e., each file was named uniformly with the month year as the suffix). For file names with other formats, there may have to be a work around. If the temporal unit is already provided in the data, the solution may be an easier one. In such cases you can skip or eliminate the steps used to Create a new variable to record the temporal label of the file (may be date or month or year) extracted from the file name.

Method: The Files were read one at a time for the first two files, and they were merged to create the initial data mart. After which, every new file read was merged with this initial data mart and over written subsequently. There may be better ways of doing it, but this made it easy for Data Quality Moinitoring as well. The code used is given below:

#Script for Integrating data from multiple .csv files in R

# Initialise a variable mainintain the file count (set it to zero).
i<-0;

# Provide a variable to mainintain the path to source data files (set it to a string variable). 
file_path<-c("C:\\R\\Data");

# Createa a Variable to store the list of files in the given path 
# I am not sure on the impact on performance and limitation of the size of this list
# Note that this automatically includes subfolders
Files_list<-list.files(file_path);

# Validation - Print the number of files in the folder
cat("\n","Number of files present are",length(Files_list),"\n")
cat("\n","LIST OF FILES IS BELOW :","\n",Files_list)

# Read first file in the list
file<-paste(file_path,Files_list[2],sep="/");
Year_1<-read.csv(file,header=FALSE,na.strings="NULL",stringsAsFactors=FALSE);

# Create a new variable to record the temporal label of the file (may be date or month or year)
# This can be derived or extracted from the file names
Year_1$YEAR<-substr(Files_list[1],4,7);

# Create the Header for the Data Mart in R
names(Year_1)<-c("D1","D2","D3","D4","D5","YEAR") ;

# Read the second  File
file<-paste(file_path,Files_list[2],sep="/")
Year_2<-read.csv(file,header=FALSE,na.strings="NULL",nrows=5,stringsAsFactors=FALSE);
Year_2$YEAR<-substr(Files_list[2],4,7)
names(Year_2)<-c("D1","D2","D3","D4","D5","YEAR") ;

# Merge the two years' files read so far
Init_Mart<-merge(Year_1, Year_2, by=c("D1","D2","D3","D4","D5","YEAR"),all=T)


# Print Header to validate
head(Init_Mart)

# Read all other consecutive years' files
for(i in 3:length(Files_list)){
                file<-paste(file_path,Files_list[i],sep="/")
                Yearly_Data<-read.csv(file,header=FALSE,na.strings="NULL",stringsAsFactors=FALSE)
                Yearly_Data$YEAR<-substr(Files_list[i],4,7)
                names(Yearly_Data)<-c("D1","D2","D3","D4","D5","YEAR")
                Init_Mart<-merge(Init_Mart, Yearly_Data,by=c("D1","D2","D3","D4","D5","YEAR"),all=T)
}

#Test print rows
Init_Mart[421565:421575,]

## Change data types

               ## Convert variable types from character to Numeric wherever applicable
               ## Convert Date fields to month, year and day if required
Init_Mart[c(2,3,4,5,6,7,8)]<-sapply(Init_Mart[c(2,3,4,5,6,7,8)], as.numeric)

## Upload / save it as a CSV File for further processing
write.table(Init_Mart,"C:\\R\\Data\\Data_Mart_Total.csv",sep=",")

Friday, August 22, 2008

SAS issues with default date values on Database

Database designers seldom consider nuances of application servers, while assigning default values to dates. Design flaws often go unnoticed until issues prop up like an open can of worms, typically during data migration or integration projects involving mainframe data. Data profiling, inventory and quality analysis are usually part of the assessment phase of a data migration / integration project, and require to be meticulous to avoid any issues in future.

Profiling categoric fields is a straight forward option that detects noise with relatively simple and standard statistics (box plot, frequncy tabulation, etc.,). The same may not hold true with dates, as their frequency tabulation may be too long due to the large number of distinct values. They often require additional metrics for quality checks, else flaws may go unnoticed. Added to this, lack of apppropriate documentation and the complex nature of legacy systems in large organisations only multiply the challenges of information management. One such scenario with data quality issues in date(s) is discussed below. 

Background: An automated data profiling SAS macro was used to print statistics for a custom list of fields. SAS Data marts were created to validate the design of the new DB, as part of UAT, that included all the integrity constraints. Space constraints and speed issues due to the massive size of the database were worked around by choosing to use 2-3 days' worth data load for one iteration of UAT. One integrity constraint defined on Date of Birth field was to be less than that of customer profile creation / customer acquisition date. The GUI application form was not designed to check for violation of integrity constraints.
Problem: During an Enterprise Data Integration project, missing values in two fields with date values went unnoticed. Large difference in number of reject records between UAT and ETL. A large number of records were rejected during the ETL loads and the total number of customer records reduced by upto 10%, a total deviation form the results of all four iterations of UAT.
Reason analysed: Missing values on dates had been set up as "01-01/0001" and "31/12/9999" in two different fields of the source database, at its design time. These were part of a data integration project. While extracting data from the DB using PROC SQL, SAS converted 01-01-0001 into 01-01-2001 while storing it in SAS Date formats. The random sample records chosen for profiling, rejected only a small percentage (1/1000 records) which was within the accepted thresholds. On verification, it showed up to be the record of a minor customer, and passed through the Data Quality module as well. All the records with missing date values passed the UAT, as none of the records were detected by the SAS Data Mart load rule defined to "Reject record if date value in ("01-01/0001","31/12/9999")", as all thre cords had a valid date by now.
Business Impact: Go-live delay by 15 business days due to additional iterations of UAT, ETL for customer data.
Opportunities: UAT Team must try and use the same ETL tool as the Technology Team or use more than one ETL tools along various iterations. Final results may vary due to the DBMS Data mart used. Hence maintaining uniformity while managing for efficiency is a balance to be striked by the IM group. On the other hand, it may be a good option for Data Quality Assessment Software Companies to include the nature of Source, UAT and End User Database as parameters to  model or valiudat Data Quality rules by internal translation or interconversion of rules.
Suggested Methods for Data Profiling and Quality Analysis on Fields with Date values: In such cases, you may have to be even more careful while assigning default values. If you wish to stick to SAS, the best option would be to extract the date with such default values as day, month and year parts, separately. If these dates are to be used to calculate age / time period, then I would advice use of SQL functions to calculate date differences at data selection, than attempting conversions in SAS (or) a front end application.
Solution implemented: Inclusion of Data Quality checks to count the number of records that match the individual date, month and with year values in SAS, was included in the UAT module to determine the expected number of reject records at ETL, in order to avoid repeated data extractions, and to minimise delay in Go-live from 15 days to 3 days.

Friday, August 15, 2008

SAS on Citrix - SPOOL Issues - Microsoft Document Image Writer Damages



One of the most important issues people face while using a remote SAS server is that of PC SAS on Citrix. While I strongly advocate the use of VI & Basic UNIX commands to refrain from creating a dependency on GUI / PC SAS, one cannot ignore the less technical /pure analysts' community, and their plight of wading through the beauty of black.

But we all know that the support offered by SAS to issues on Citrix is not all that great. Am still unsure why organizations don't choose a thin client.
SOLARIS and SUN Blade is an excellent solution for high volume Data Mining & Analytics. This is much more viable than PC SAS on Metaframe. It would have been a different issue if PUTTY is all required to do a batch submit.

A Very Common Issue faced is below -

An accidental print command on PC SAS, opened via citrix, leads to the hazard of multi-user outage, since the default printer settings point to Microsoft Document Image Writer which upsets the spool by writing the content to be printed, on to the worng device.


How to tackle this ?

  • Temporary workaround (End User Responsibility) -
  1. One can set SPOOL=NO in the options of your PROC, to ensure lesser damage while programming. SAS by default has the option set to YES. (OR)
  2. Use the Options on the Menu Bar to change the settings permanently. I will try and add the screen shots on how to do this.
  • The disadvantages of the SPOOL=NO option
    1. Reduced efficiency in a multi-pass DB2/SQL Query.
    2. It really doesn't solve the core issue of accidental clicks / Print commands issued on the PC SAS Main Menu while trying to keep the SAS session active, under terribly slow refresh rates of GUI on CITRIX.

Check this page on setting SPOOL options for SAS ACCESS / PROC SQL if you still want a temporary solution.

  • Server side liability / Admin's piece of work -
    While it is easier to do the temporary fix at the start of every session by executing options mentioned above, it is the admin who would be able to prevent this completely. Hence, here is how to -
  1. set the default printer options on Citrix
  2. I strongly advice on checking this - How to Create a Network Printer for All Users on CITRIX to avoid writing to the Microsoft Document Image Writer.

I dislike windows primarily, for all promos it carries, linking its own irrelevantly useless applications, for, I am just another of the DARK OPEN SOURCE World. It is high time they heard the user community calling for "Don't give me what I did not ask for". But ask them, and they'd point at the System Admin, who did not take necessary precautions while configuring for the Metaframe venture !!