Yi Tang Data Scientist with Emacs

Import Irregular Data Files Into R With Regular Expression - a BODC Example

The first step in data analysis is to get the data into the modelling platform. That may not be as straightforward as it used to be: statisticians nowadays often face data files that are not in CSV or another format that can be fed directly to the read.table() function in R. In those cases, we need to understand the structure of the files and pre-process them first. My general strategy is to discard the unnecessary information and hopefully be left with a regular data file.

In last week's post, Why I Should Explore Regular Expression and Why I Haven't, I expressed my interest in regular expressions, and luckily I soon got a chance to use them for getting data into R. They offer a different strategy: pick only what I am interested in.

The Irregular Data Files

The task is simple: I have about 1,800 .txt data files downloaded from the British Oceanographic Data Centre (BODC). They contain historical tidal data and are separated by year and by port. I need to combine all the data into one giant table in R, and save it later for modelling.

One sample data file looks like this:

Port:              P035
Site:              Wick
Latitude:          58.44097
Longitude:         -3.08631
Start Date:        01JAN1985-00.00.00
End Date:          03OCT1985-19.00.00
Contributor:       National Oceanography Centre, Liverpool
Datum information: The data refer to Admiralty Chart Datum (ACD)
Parameter code:    ASLVZZ01 = Surface elevation (unspecified datum) of the water body                      
  Cycle    Date      Time      ASLVZZ01     Residual  
 Number yyyy mm dd hh mi ssf           f            f 
     1) 1985/01/01 00:00:00      1.0300      -0.3845  
     2) 1985/01/01 01:00:00      1.0400      -0.3884  
     3) 1985/01/01 02:00:00      1.2000      -0.3666

The first 9 lines are the metadata, which describe the port ID, the name and location of the port, and other information about the data. Lines 10 and 11 are the headers of the data matrix.

First Attempt - Skip Lines

After a glimpse of the data sample, my first thought was to skip the first 11 lines and treat the rest as a regular data file with space as the separator. That is easily done by using read.table() with the skip = 11 option.

read.table(data.file, skip = 11) ## error

It turned out this approach won't work for some files: when the way of measuring the tide changed, the date and port were highlighted, leaving a second chunk of data matrix with its own metadata and a few other characters. It looks like this:

;; end of first chunk 

########################################
 Difference in instrument
########################################

Port: P035
;; other metadata 
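The failure can be reproduced on a small synthetic file; the abbreviated contents below are my own reconstruction of the two-chunk layout:

```r
## Reproduce the failure: the second-chunk metadata lines break the
## rectangular layout that read.table() expects. (The #### lines are even
## skipped as comments, since comment.char = "#" by default, but the
## remaining metadata lines still have the wrong number of fields.)
lines <- c("     1) 1985/01/01 00:00:00      1.0300      -0.3845",
           "########################################",
           " Difference in instrument",
           "########################################",
           "Port: P035",
           "     2) 1985/01/01 01:00:00      1.0400      -0.3884")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)
ok <- tryCatch({ read.table(tmp); TRUE }, error = function(e) FALSE)
## ok is FALSE: " Difference in instrument" has 3 fields, not 5
```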

Second Attempt - Remove Lines

Although the first attempt wasn't a success, I learnt a bit about the structure of the data files, and based on that I came up with a second approach: read each data file into R as a vector of strings, one element per line, and then remove all the lines that are metadata. They start with Port:, Site: or Longitude: etc., or belong to the ### chunk. This can be done with the grep function, which tells me exactly which elements of the vector contain metadata.

s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:", "End Date:", "Contributor:", "Datum information:", "Parameter code:")
## unlist() is needed: with multiple chunks the patterns match different
## numbers of lines, so sapply() returns a list, which can't be used for
## negative indexing
meta.line.num <- unlist(sapply(metainfo.list, function(i) {
    grep(pattern = i, s)
}))
res.2 <- s[-meta.line.num]

This approach works well as long as metainfo.list covers all the lines I'd like to remove. The downside is that I won't know whether I've included all of them until the whole process has finished. So while I was waiting for the program to finish, I came up with a third approach, a better one.

Third Attempt - Capture Lines (RegExp)

The above two approaches discard the unnecessary information, but there may be other lines that should be discarded which I haven't encountered yet, in which case the process becomes a tedious trial-and-error that takes quite a long time.

Equally, another approach is to select exactly what I am interested in by using a regular expression. But first, I have to identify the pattern. Each data point was recorded at a certain time, and is therefore associated with a timestamp; for example, the first data point was recorded at 1985-01-01 00:00:00. Each also has an ID value followed by a closing parenthesis, for example 1).

     1) 1985/01/01 00:00:00      1.0300      -0.3845  

So the lines of interest share a common pattern: they start with a number of spaces, and then have

an observation ID
a few digits followed by a closing parenthesis,
an observation date
a few digits separated by forward slashes, meaning year, month and day, then a space,
an observation time
a few digits separated by colons, meaning hour, minutes and seconds.

The patterns in RegExp can be formulated as the roi.pattern variable and the whole process can be implemented as:

roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]
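From the captured lines it is one more step to the giant table: read.table() can parse a character vector via its text argument. A minimal sketch, in which two sample lines stand in for the captured res.3 vector and the column names are my own choice:

```r
## Two sample lines stand in for the captured res.3 vector
roi.lines <- c("     1) 1985/01/01 00:00:00      1.0300      -0.3845",
               "     2) 1985/01/01 01:00:00      1.0400      -0.3884")
## Parse the strings into a five-column data frame
tide <- read.table(text = roi.lines,
                   col.names = c("cycle", "date", "time", "elevation", "residual"))
## Strip the closing parenthesis from the cycle number
tide$cycle <- as.integer(sub(")", "", tide$cycle, fixed = TRUE))
## Combine date and time into a proper timestamp
tide$datetime <- as.POSIXct(paste(tide$date, tide$time),
                            format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
```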

To me, there isn't an absolute winner between the second and third approaches, but I prefer the regular expression because it is more fun; I am a statistician and like to spot patterns.

Also, it is a direct approach and more flexible. Note that I can keep adding components to the regular expression to increase the confidence of selecting the right data matrix; for example, after the date there are more digits, separated by colons, for the time. But that will presumably increase the run-time.
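As a sketch, extending roi.pattern with the time component could look like this; the stricter pattern still accepts data lines and still rejects metadata lines:

```r
## A stricter pattern that also requires the hh:mm:ss time component
roi.pattern.strict <- paste0("^[[:space:]]+[[:digit:]]+\\) ",
                             "[[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2} ",
                             "[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}")
grepl(roi.pattern.strict, "     1) 1985/01/01 00:00:00      1.0300")  # TRUE
grepl(roi.pattern.strict, "Port:              P035")                  # FALSE
```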

Code and Sample Data

You can download the example data and run the scripts listed below in R to reproduce all the results.

#### * Path
data.file <- "~/Downloads/1985WIC.txt" ## to the downloaded data file

#### * Approach 1
read.table(data.file, skip = 11) ## error

#### * Approach 2
s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:", "End Date:", "Contributor:", "Datum information:", "Parameter code:")
meta.line.num <- unlist(sapply(metainfo.list, function(i) {
    grep(pattern = i, s)
}))
res.2 <- s[-meta.line.num]

#### * Approach 3
roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]
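#### * Scaling up (sketch)

To go from one file to all ~1,800, the third approach can be wrapped in a function and mapped over the downloaded files. This is only a sketch: the directory path and the column names are my own assumptions.

```r
## Read one BODC file: keep only the data lines, then parse them
read.one <- function(f) {
    s <- readLines(f)
    roi <- grep(pattern = roi.pattern, s, value = TRUE)
    read.table(text = roi,
               col.names = c("cycle", "date", "time", "elevation", "residual"))
}
## Hypothetical directory holding the downloaded files
files <- list.files("~/Downloads/bodc", pattern = "\\.txt$", full.names = TRUE)
## Stack everything into one giant table
all.tides <- do.call(rbind, lapply(files, read.one))
```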
If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!