Sunday, October 6, 2013

Integrating data from multiple files in R


Problem Statement: To perform a spatio temporal analysis on rainfall data available in .csv format, with one file each for one month and year, mounting to 300 files for 25 years. It is manually difficult to merge this data, though not impossible. 

SolutionR was used for analysis and hence it only made sense to perform the data integration also in R. However, this can be solved in Excel.

System: R was run on Windows 7

Assumption / Note of Warning: The temporal attribute was not included in the data, but file names had the temporal attribute (i.e., each file was named uniformly with the month year as the suffix). For file names with other formats, there may have to be a work around. If the temporal unit is already provided in the data, the solution may be an easier one. In such cases you can skip or eliminate the steps used to Create a new variable to record the temporal label of the file (may be date or month or year) extracted from the file name.

Method: The Files were read one at a time for the first two files, and they were merged to create the initial data mart. After which, every new file read was merged with this initial data mart and over written subsequently. There may be better ways of doing it, but this made it easy for Data Quality Moinitoring as well. The code used is given below:

#Script for Integrating data from multiple .csv files in R

# Initialise a variable mainintain the file count (set it to zero).
i<-0;

# Provide a variable to mainintain the path to source data files (set it to a string variable). 
file_path<-c("C:\\R\\Data");

# Createa a Variable to store the list of files in the given path 
# I am not sure on the impact on performance and limitation of the size of this list
# Note that this automatically includes subfolders
Files_list<-list.files(file_path);

# Validation - Print the number of files in the folder
cat("\n","Number of files present are",length(Files_list),"\n")
cat("\n","LIST OF FILES IS BELOW :","\n",Files_list)

# Read first file in the list
file<-paste(file_path,Files_list[2],sep="/");
Year_1<-read.csv(file,header=FALSE,na.strings="NULL",stringsAsFactors=FALSE);

# Create a new variable to record the temporal label of the file (may be date or month or year)
# This can be derived or extracted from the file names
Year_1$YEAR<-substr(Files_list[1],4,7);

# Create the Header for the Data Mart in R
names(Year_1)<-c("D1","D2","D3","D4","D5","YEAR") ;

# Read the second  File
file<-paste(file_path,Files_list[2],sep="/")
Year_2<-read.csv(file,header=FALSE,na.strings="NULL",nrows=5,stringsAsFactors=FALSE);
Year_2$YEAR<-substr(Files_list[2],4,7)
names(Year_2)<-c("D1","D2","D3","D4","D5","YEAR") ;

# Merge the two years' files read so far
Init_Mart<-merge(Year_1, Year_2, by=c("D1","D2","D3","D4","D5","YEAR"),all=T)


# Print Header to validate
head(Init_Mart)

# Read all other consecutive years' files
for(i in 3:length(Files_list)){
                file<-paste(file_path,Files_list[i],sep="/")
                Yearly_Data<-read.csv(file,header=FALSE,na.strings="NULL",stringsAsFactors=FALSE)
                Yearly_Data$YEAR<-substr(Files_list[i],4,7)
                names(Yearly_Data)<-c("D1","D2","D3","D4","D5","YEAR")
                Init_Mart<-merge(Init_Mart, Yearly_Data,by=c("D1","D2","D3","D4","D5","YEAR"),all=T)
}

#Test print rows
Init_Mart[421565:421575,]

## Change data types

               ## Convert variable types from character to Numeric wherever applicable
               ## Convert Date fields to month, year and day if required
Init_Mart[c(2,3,4,5,6,7,8)]<-sapply(Init_Mart[c(2,3,4,5,6,7,8)], as.numeric)

## Upload / save it as a CSV File for further processing
write.table(Init_Mart,"C:\\R\\Data\\Data_Mart_Total.csv",sep=",")

No comments: