Yi Tang: Data Scientist with Emacs

Migrate to Ubuntu

My MacBook Pro's hard drive stopped working last week, and I managed to recover most of the data from a Time Machine back-up from 6 months ago. But I couldn't get mu and mu4e working again. I was fed up with googling and trying, and decided to migrate to Ubuntu. It will save me a lot of frustration and time in making my Mac and my office PC work the same way.

Ideally, I would build an Ubuntu system on the Mac that is exactly the same as the one on my office PC, just by copying everything over. But as a minimalist, I decided to build the system from scratch and install the software one by one, so that I get a better understanding of what I actually need.

Over the last few days, I have become extra mindful about what I use the Ubuntu system in the office for and how, and realised that the things I need can be grouped into three categories:

  1. Configuration:
    1. the .ssh folder for the ssh-agent,
    2. the .fonts folder for new fonts,
    3. the .mbsyncrc file for syncing emails,
    4. the .ledgerrc file.
  2. Software for:
    1. development: git, gcc, Emacs, and R,
    2. writing: org-mode and LaTeX,
    3. email: mu, mu4e, and mbsync,
    4. finance: ledger.
  3. Personal git repositories:
    1. public repositories on GitHub,
    2. private repositories on Bitbucket.

For 1), since the files are small, I can zip them up and copy them over, or even better, put them in a git repository so that keeping the two machines in sync becomes easier.

For 2), I need to find each package's name in Ubuntu's software repository and then install all of them with a script. The dependencies should be resolved automatically.

For 3), I need to create a shared folder between the host system and the Ubuntu system, and then copy over the ~/git/ folder.

It really sounds like a plan! I am going to download the Ubuntu installation file now and hopefully the transition will be very smooth.

My Experience with Repetitive Strain Injury (RSI)

One day I typed more than 80 thousand keystrokes just in Emacs. That seems pretty awesome at first sight, but it can cause serious health problems.

Last month, I felt a burning pain in my forearms. It is a symptom of repetitive strain injury (RSI). I realised that if I continue typing like that, one day I will no longer be able to program at all, like the Emacs celebrities in Xah Lee's article about RSI.

Since then I've deliberately tried to avoid aimless and unproductive typing, take more typing breaks, think things through before trying them, and write more on paper.

My condition is getting better: I don't feel severe pain any more, only occasional discomfort.

But I need to find a better way to improve things, because sometimes I have an idea but can't touch the keyboard. That feeling really sucks.

So I investigated the Hydra package and used it to group related commands together, so that only two keys are needed to perform frequent tasks.

For example, to search for something in the current project, instead of typing M-x helm proj grep, which is 16 keystrokes, I only need F5 G with Hydra. The implementation is listed in this post.

But calling functions/commands in Emacs accounts for only a small proportion of my typing; most of the time, I am writing code and reports.

This is where Yasnippet kicks in: it enables me to type less without losing quality. For example, I use this snippet quite often when writing R code:

res <- sapply(seq_len(n), function(i) {
    ## 
})

That's more than 40 keystrokes. Yasnippet can shorten it to only 6, plus a TAB: after I type sapply and hit TAB, it expands to the region above.

I will investigate the Yasnippet package further next week. If you know any good tutorials for Yasnippet, or snippets for writing R code, please share them.

Start Enjoying Regular Expression In Emacs

The search-forward-regexp, replace-match, and match-string functions work together nicely, and make my job much easier and more enjoyable!

I am writing release notes for a software update. Part of the process is to reference the SVN revision numbers that relate to important changes, so that others can backtrack, review the code, and see what exactly has been implemented.

In Phabricator, revision numbers are rendered automatically: clicking one takes me to the exact revision, showing the difference from the previous version. But the documentation will eventually be built by Sphinx and hosted on a remote server, so I have to manually add the URL to every SVN revision number. For example, rS1234 has to be replaced with

[[http://phabricator.domain.co.uk/rS1234][rS1234]]

There are 31 revision numbers in the whole document. I could do it manually, but for the long-term benefit it is more efficient to write a function that processes them automatically; maybe others can use it as well.

Implementation

The first thing I noticed is that each SVN revision number consists of two letters (rS) and a few digits. Because I don't know the digits beforehand, I have to use a regular expression to do the pattern search.

The tricky bit is to retrieve the value that matched the pattern, because it is needed to construct the URL that points to the commit, and I also need to replace it with a different value.

The procedure can be summarised as:

  1. Find the revision numbers that match the pattern described above. I use search-forward-regexp() to search for the pattern "rS[0-9]+", which means a string that starts with rS followed by one or more digits.
  2. Retrieve the value that matched the pattern. This is done by match-string().
  3. Replace the revision number with the constructed URL. This is done by replace-match(), and I use concat() to combine the IP address with the revision number.

The following is a workable implementation:

(defvar revision-pattern "rS[0-9]+"
  "The RegExp pattern of an SVN revision number.")

(defvar repo-url "http://10.0.0.11/"
  "The IP address of the SVN repository.")

(defun yt/add-link-to-SVN-revision-number ()
  "Add an org link to every SVN commit identifier in the buffer."
  (interactive)
  ;; The extra NIL and T arguments make the search stop quietly at the
  ;; end of the buffer instead of signalling an error.
  (while (search-forward-regexp revision-pattern nil t)
    (let* ((commit (match-string 0))         ; the matched text, e.g. "rS1234"
           (link (concat repo-url commit)))  ; the URL pointing to that commit
      (replace-match "")                     ; remove the plain revision number
      (org-insert-link nil link commit))))   ; insert an org link in its place

Note that the last two lines of the function can be simplified to

(replace-match (concat "[[" link
                       "][" commit "]]"))                       

You can easily adapt the code to your own case: just modify the revision-pattern and repo-url variables. But beware that you should not apply the function to the same buffer more than once; otherwise you will get something crazy like this:

[[http://10.0.0.11/[[http://10.0.0.11/rS1234][rS1234]]][[[http://10.0.0.11/rS1234][rS1234]]]]

One way to make it better is to have a test before replacing: if the revision number is already associated with a URL, do nothing. If you have figured out how to do it, please let me know and I'll be happy to update this post.

My posts published last year showed my frustration with regular expressions in Emacs. But now I am looking forward to doing more text processing with them, because it will be fun!

The search-forward-regexp, replace-match, and match-string functions work together nicely and make my job much easier and more enjoyable!

What are your favourite regular-expression functions? Do you have anything to recommend?

Import Irregular Data Files Into R With Regular Expression - A BODC Example

The first step in data analysis is to get the data into the modelling platform. But that may not be as straightforward as it used to be: nowadays statisticians are more likely to face data files that are not in CSV or another format that can be fed directly to the read.table() function in R. In those cases, we need to understand the structure of the data files and pre-process them first. My general strategy is to discard the unnecessary information in the data files and hopefully be left with a regular data file.

In last week's post, Why I Should Explore Regular Expression and Why I Haven't, I expressed my interest in regular expressions, and luckily I got a chance to use them for getting data into R. They provide a different strategy: pick only what I am interested in.

The Irregular Data Files

The task is simple: I have about 1,800 .text data files downloaded from the British Oceanographic Data Centre (BODC). They contain historical tidal data and are separated by year and by port. I need to combine all the data into one giant table in R and save it for modelling later.

One sample data file looks like this:

Port:              P035
Site:              Wick
Latitude:          58.44097
Longitude:         -3.08631
Start Date:        01JAN1985-00.00.00
End Date:          03OCT1985-19.00.00
Contributor:       National Oceanography Centre, Liverpool
Datum information: The data refer to Admiralty Chart Datum (ACD)
Parameter code:    ASLVZZ01 = Surface elevation (unspecified datum) of the water body                      
  Cycle    Date      Time      ASLVZZ01     Residual  
 Number yyyy mm dd hh mi ssf           f            f 
     1) 1985/01/01 00:00:00      1.0300      -0.3845  
     2) 1985/01/01 01:00:00      1.0400      -0.3884  
     3) 1985/01/01 02:00:00      1.2000      -0.3666

The first 9 lines are the metadata, which describe the port ID, the name and location of the port, and other information about the data. Lines 10 and 11 are the headers of the data matrix.

First Attempt - Skip Lines

After a glimpse of the data sample, my first thought was to skip the first 11 lines and treat the rest as a regular data file with spaces as separators. This can easily be done using read.table() with the skip = 11 option.

read.table(data.file, skip = 11) ## error

It turned out this approach doesn't work for some files: when the way of measuring the tide changed, the date and port were highlighted, leaving a second chunk of data matrix, again preceded by metadata and a few other characters. It looks like this:

;; end of first chunk 

########################################
 Difference in instrument
########################################

Port: P035
;; other metadata 

Second Attempt - Remove Lines

Although the first attempt wasn't a success, I learnt a bit about the structure of the data files. Based on that, I came up with a second approach: read the data file into R as a vector of strings, one element per line, and then remove all the lines that are metadata. They start with Port:, Site:, Longitude:, etc., or belong to the ### chunk. This can be done with the grep() function, which tells me exactly which elements of the vector contain the metadata.

s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:", "End Date:", "Contributor:", "Datum information:", "Parameter code:")
## Find the line numbers of all metadata lines; unlist() flattens the
## result in case a pattern matches more than one line.
meta.line.num <- unlist(sapply(metainfo.list, function(i) {
    grep(pattern = i, s)
}))
res.2 <- s[-meta.line.num]

This approach works well as long as metainfo.list covers all the lines I'd like to remove. The downside is that I won't know whether I've included all of them until the whole process is finished. So while I was waiting for the program to finish, I came up with a third, better approach.

Third Attempt - Capture Lines (RegExp)

The above two approaches discard the unnecessary information, but I may run into lines that should be discarded but that I haven't encountered yet, and then the process becomes tedious trial-and-error and takes quite a long time.

Alternatively, I can select exactly what I am interested in by using a regular expression. But first, I have to identify the pattern. Each data point was recorded at a certain time and is therefore associated with a timestamp; for example, the first data point was recorded at 1985/01/01 00:00:00. Each data point also has an ID value followed by a closing parenthesis, for example 1):

     1) 1985/01/01 00:00:00      1.0300      -0.3845  

So the content I am interested in has a common pattern that can be summarised as: lines that start with a number of spaces and also have

  1. an observation ID: a few digits followed by a closing parenthesis,
  2. an observation date: digits separated by forward slashes, meaning year, month and day, followed by a space,
  3. an observation time: digits separated by colons, meaning hour, minute and second.

This pattern can be formulated in RegExp as the roi.pattern variable below, and the whole process can be implemented as:

roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]
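Once the data lines are captured, they still need to become the table mentioned at the start. Here is a minimal sketch, assuming the whitespace-separated columns shown in the sample above; the object and column names are my own:

## Parse the captured lines into a data frame; the column names are made up
## to mirror the header of the sample file.
tidal <- read.table(text = paste(res.3, collapse = "\n"),
                    col.names = c("cycle", "date", "time", "elevation", "residual"))
tidal$cycle <- as.integer(sub("\\)", "", tidal$cycle))  ## drop the trailing ")"
head(tidal)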

To me, there isn't an absolute winner between the second and third approaches, but I prefer the regular expression approach because it is more fun; I am a statistician and like to spot patterns.

Also, it is a direct approach and more flexible. Note that I can keep adding components to the regular expression to increase my confidence in selecting the right data rows; for example, the time-of-day field that follows the date (more spaces and digits) could also be matched, as sketched below. But this will presumably increase the run-time.
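A minimal sketch of such an extension, assuming the time field always has the hh:mm:ss form shown in the sample (roi.pattern.strict and res.3b are my own names):

## Stricter pattern: also require the hh:mm:ss time field after the date.
roi.pattern.strict <- paste0("[[:space:]]+[[:digit:]]+\\) ",
                             "[[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2} ",
                             "[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}")
res.3b <- s[grep(pattern = roi.pattern.strict, s)]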

Code and Sample Data

You can download the example data and run the scripts listed below in R to reproduce all the results.

#### * Path
data.file <- "~/Downloads/1985WIC.txt" ## to the downloaded data file

#### * Approach 1
read.table(data.file, skip = 11) ## error

#### * Approach 2
s <- readLines(data.file)
metainfo.list <- c("Port:", "Site:", "Latitude:", "Longitude:", "Start Date:", "End Date:", "Contributor:", "Datum information:", "Parameter code:")
## unlist() flattens the result in case a pattern matches more than one line.
meta.line.num <- unlist(sapply(metainfo.list, function(i) {
    grep(pattern = i, s)
}))
res.2 <- s[-meta.line.num]

#### * Approach 3
roi.pattern <- "[[:space:]]+[[:digit:]]+\\) [[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}"
roi.line.num <- grep(pattern = roi.pattern, s)
res.3 <- s[roi.line.num]

Why I Should Explore Regular Expression and Why I Haven't

Like many R users who are not actually programmers, I am afraid of regular expressions (RegExp). Whenever I saw something like the regexec() pattern further down this post, I told myself I wouldn't be able to understand it and gave up on sight.

But I have collected a few RegExp patterns that do magical jobs. My favourites are the dot (.) and dollar ($) signs, and I usually use them with list.files() to filter the file names in a directory. For example,

list.files(pattern = ".RData$")
list.files(pattern = ".text$")

The first line returns all the R image files, whose file names end with RData, and the second all the text files, whose file names end with text. Basically, in a regular expression the dot sign (.) means any character, and the dollar sign ($) means the end of a string. By combining these two, I am able to select multiple files matching a certain pattern, without picking them one by one.

How powerful is that! It is an inspirational example that motivates me from time to time to look deeper and get my head around the topic of regular expressions. But I just couldn't get a clear picture of how to use them.

I think the main problems for me in understanding RegExp in R are:

The syntax is context-sensitive

A subtle change can lead to surprising results. For example, the above pattern can also be written as \\.RData$, which means file names ending with .RData; the dot (.) here literally means ".". Adding the two backslashes \\ changes the meaning of the pattern completely, yet both versions usually give the same results. This gave me a lot of frustration when extrapolating a pattern that works in one case to a similar case, only to get unexpected results.
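As a quick sketch of the difference (the file names here are made up for illustration), grepl() shows where the two patterns disagree:

## Hypothetical file names, purely for illustration.
files <- c("results.RData", "resultsXRData", "notes.text")
grepl(".RData$", files)    ## TRUE  TRUE FALSE -- "." matches any character
grepl("\\.RData$", files)  ## TRUE FALSE FALSE -- "\\." matches a literal dot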

The syntax is hard to read

The RegExp patterns above are reasonably easy to understand if one spends 10 minutes reading the manual, but the following is just crazy:

m <- regexec(pattern = "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
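For what it is worth, a small sketch (with a made-up URL) shows what those capture groups pull out, namely the protocol, host, port and path of a URL:

## A made-up URL, purely for illustration.
x <- "http://www.example.com:8080/path/to/page"
m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
## regmatches() returns the whole match plus each capture group:
## "http://", "http", "www.example.com", ":8080", "8080" and "/path/to/page".
regmatches(x, m)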

There are 12 parentheses, 6 square brackets and many other symbols. Even the same symbol can have different meanings, and it is hard to find out exactly what they mean, because

There aren't enough learning materials

I've never seen an R book that mentions regular expressions. The topic is certainly not taught in university courses or training workshops.

Even Google fails to find any meaningful resource except the Text Processing wiki page, which is the best I could find.

Although there are related questions on StackOverflow, most of the answers address a very specific situation. It's hard to apply them to other situations or to learn the topic from discrete Q&As.

This has created a mental barrier, at least for me, that statisticians need not teach nor learn RegExp at all. But my limited experience suggests it is such a powerful feature that I have been missing a lot.

But

I believe there will be more chances to process text files, for example parsing the log files of this blog, and RegExp can improve my efficiency to a great extent. So I am considering investing the time to learn it properly.

Are you an R user? What's your experience with regular expressions? Do you have good learning materials to recommend? If so, please share your experience of this less-talked-about area.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!