My MacBook Pro's hard drive stopped working last week, and I managed to
recover most of the data from a Time Machine backup made six months ago. But
I couldn't get mu and mu4e working. I got fed up with googling and
trying, and decided to migrate to Ubuntu. It would save me a
lot of frustration and time in making my Mac and office PC work the same.
Ideally, I would build an Ubuntu on the Mac which is exactly the same as the
one on my office PC, by just copying everything over. As a minimalist, I
decided instead to build the system from scratch and install software one by
one, so that I can have a better understanding of what the
necessities are for me.
Over the last few days, I have become extra mindful about what and how I
use the Ubuntu system in the office, and realised that the things I need
can be grouped into three categories:
1) Configuration files:
   - the .ssh folder for the ssh-agent,
   - the .fonts folder for new fonts,
   - the .mbsyncrc file for syncing emails.
2) Software:
   - Development: git, gcc, Emacs, and R;
   - Writing: org-mode and LaTeX;
   - Email: mu, mu4e, and mbsync.
3) Personal git repositories:
   - public repositories on GitHub,
   - private repositories on Bitbucket.
For 1), since they are small, I can zip them up and copy them over, or even
better, create a git repository so that keeping the two machines in sync
becomes easier.
For 2), I need to find each package's name in the Ubuntu
software repository, and then install them all with a script. The
dependencies should be resolved automatically.
For 3), I need to create a shared folder between the host system and the
Ubuntu system, and then copy over the ~/git/ folder.
It really sounds like a plan! I am going to download the Ubuntu
installation file now and hopefully the transition will be very smooth.
Some days I typed more than 80 thousand keystrokes in Emacs alone. That
sounds pretty awesome at first sight, but it can cause serious health
problems. Last month, I felt a burning pain in my forearms. It is a symptom
of repetitive strain injury (RSI). I realised that if I kept typing
like that, one day I would no longer be able to program at all, like the Emacs
celebrities in Xah Lee's article about RSI.
Since then I've deliberately tried to avoid aimless and unproductive
typing: take more typing breaks, think things through before trying, and
write more on paper.
Conditions are getting better: I don't feel severe pain any more, only
mild discomfort occasionally.
But I need to find a better way to improve things, because sometimes I have
an idea but can't touch the keyboard. That feeling really sucks.
So I investigated the Hydra package and used it to group related
commands together, so that only two keystrokes are needed to perform a
command. For example, to search for something in the current project,
instead of typing M-x helm proj grep, which is 16 keystrokes, I only need
F5 G with Hydra. The implementation is listed in this post.
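A minimal sketch of such a hydra; the full command name
helm-projectile-grep and the exact bindings here are my assumptions:

    ;; Group project commands under F5; each then takes one more key.
    (require 'hydra)

    (defhydra hydra-project (:color blue)
      "project"
      ("G" helm-projectile-grep "grep in project")   ; assumed command name
      ("q" nil "quit"))

    (global-set-key (kbd "<f5>") 'hydra-project/body)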
But calling functions/commands in Emacs accounts for only a small proportion
of my typing; most of the time, I write code and reports.
This is where Yasnippet kicks in: it enables me to type less without
losing quality. For example, I use one snippet quite often when
writing R code.
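The expansion is roughly like this (an illustrative skeleton, not the
original snippet body):

    ## illustrative skeleton inserted by the snippet
    sapply(seq_along(x), function(i) {

    })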
That's more than 40 keystrokes. Yasnippet can shorten that to only six!
After I type sapply and then hit TAB, it expands to the full skeleton,
with the cursor left at the first placeholder.
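For reference, a minimal snippet definition with this behaviour could look
like the following; the field defaults are illustrative:

    # -*- mode: snippet -*-
    # name: sapply
    # key: sapply
    # --
    sapply(${1:seq_along(x)}, function(${2:i}) {
        $0
    })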
I will investigate the Yasnippet package next week. If you know any
good tutorials for Yasnippet or snippets for writing R code, please
share your resources.
I am writing release notes for a software update. Part of the
process is to associate important changes with their SVN revision numbers,
so that others can backtrack and review the code and
see exactly what has been implemented.
In Phabricator, the revision numbers are rendered automatically;
clicking one takes me to the exact revision, showing the difference
from the previous version. But the documentation will eventually be built
by Sphinx and hosted on a remote server, so I have to manually add the
URL to every SVN revision number; for example, the plain rS1234 has to
become a link that points to the corresponding commit.
There are 31 revision numbers in the whole document. I could do it
manually, but for the long-term benefit it would be more efficient to
write a function that processes them automatically; maybe others can use it
too.
The first thing I noticed is that each SVN revision number consists of two
letters (rS) and a few digits. Because I don't know the digits
beforehand, I have to use a regular expression to do the pattern search.
The tricky bit is to retrieve the value that matched the
pattern, because it is needed to construct the URL that points to
the commit, and I also need to replace it with a different value.
The procedure can be summarised as:
1. Find the revision numbers that match the pattern described above. I
   use search-forward-regexp() to search for the pattern "rS[0-9]+", which
   means a string that starts with rS followed by one or more digits.
2. Retrieve the value that matched the pattern. This is done by match-string().
3. Replace the revision number with the constructed URL. This is done by
   replace-match(), and I use concat() to combine the IP address with the
   revision number.
The following is a workable implementation:
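Here is a sketch; the function name and the server address are
placeholders, and the link markup assumes Sphinx-style external links:

    (defun link-svn-revisions ()
      "Replace SVN revision numbers (e.g. rS1234) with links to the commits."
      (interactive)
      (let ((revision-pattern "rS[0-9]+")
            (repo-url "http://10.0.0.1/svn/"))   ; placeholder server address
        (goto-char (point-min))
        (while (search-forward-regexp revision-pattern nil t)
          ;; match-string retrieves the revision number that was matched
          (let ((revision (match-string 0)))
            ;; build the link and swap it in for the match
            (replace-match
             (concat "`" revision " <" repo-url revision ">`_") t t)))))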
Note that the last two lines of the function can be simplified.
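Since replace-match understands \& in the replacement text as the whole
matched string, the inner let and match-string are no longer needed; a
sketch:

    ;; \& expands to the entire matched revision number
    (replace-match (concat "`\\& <" repo-url "\\&>`_") t)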
You can easily adapt the code to your own case:
just modify the revision-pattern and repo-url variables. But
beware that you should not apply the function to the same buffer more than
once, otherwise the revision numbers inside the freshly inserted URLs get
matched and wrapped again, producing crazily nested links.
One way to make it better is to add a test before replacing: if the
revision number is already associated with a URL, then do nothing. If
you have figured out how to do it, please let me know and I'll be happy to
update this post.
My posts published last year showed my frustration with regular expressions in
Emacs. But now I am looking forward to doing more text processing with
them, because it is fun!
The search-forward-regexp, replace-match, and match-string
functions work together nicely and make my job much easier and more
enjoyable!
What are your favourite regular-expression functions? Do you have
anything to recommend?
The first step in data analysis is to get the data into the modelling
platform. But it may not be as straightforward as it used to be, since
nowadays statisticians are more likely to face data files that are not
in CSV or other formats that can be fed directly to the read.table()
function in R. In those cases, we need to understand the structure of the
data files and pre-process them first. My general
strategy is to discard the unnecessary information in the data files
and hopefully be left with a regular data file.
The task is simple: I have about 1,800 .text data files downloaded
from the British Oceanographic Data Centre (BODC). They are historical
tidal data, separated by year and by port. I need to combine all
the data into one giant table in R and save it for later modelling.
One sample data file looks like this:
Start Date: 01JAN1985-00.00.00
End Date: 03OCT1985-19.00.00
Contributor: National Oceanography Centre, Liverpool
Datum information: The data refer to Admiralty Chart Datum (ACD)
Parameter code: ASLVZZ01 = Surface elevation (unspecified datum) of the water body
Cycle Date Time ASLVZZ01 Residual
Number yyyy mm dd hh mi ssf f f
1) 1985/01/01 00:00:00 1.0300 -0.3845
2) 1985/01/01 01:00:00 1.0400 -0.3884
3) 1985/01/01 02:00:00 1.2000 -0.3666
The first 9 lines are the metadata, which describe the port ID, the name
and location of the port, and other information about the data. Lines
10 and 11 are the headers of the data matrix.
First Attempt - Skip Lines
After a glimpse of the data sample, my first thought was to
skip the first 12 lines and treat the rest as a regular data file
with space as the separator. This can easily be done using
read.table() with the skip = 12 option.
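In code, the first attempt is roughly the following; the file name and
column names are illustrative:

    ## First attempt: skip the metadata and header lines
    tide <- read.table("1985port.text", skip = 12, header = FALSE,
                       col.names = c("cycle", "date", "time",
                                     "elevation", "residual"))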
It turned out this approach won't work for some files, because
when the way of measuring the tide was changed, the date and port were
highlighted, leaving a second chunk of data matrix, again preceded by
metadata and a few other characters. It looks like this:
;; end of first chunk
Difference in instrument
;; other metadata
Second Attempt - Remove Lines
Although the first attempt wasn't a success, I learnt a bit about the
structure of the data files. Based on that, I came up with a
second approach: read each data file into R as a vector of strings, one
element per line, and then remove all the lines that are metadata.
They start with Port:, Site: or Longitude: etc., or the ###
chunk. This can be done with the grep() function, which tells me exactly
which elements of the vector contain the metadata.
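A sketch of this attempt; metainfo.list below holds only the patterns
mentioned so far, and more entries may be needed for other files:

    ## Second attempt: drop every line that looks like metadata
    lines <- readLines("1985port.text")
    metainfo.list <- c("^Port:", "^Site:", "^Longitude:", "^Start Date:",
                       "^End Date:", "^Contributor:", "^Datum information:",
                       "^Parameter code:", "###", "^ *Cycle", "^ *Number")
    meta.idx <- grep(paste(metainfo.list, collapse = "|"), lines)
    clean <- if (length(meta.idx) > 0) lines[-meta.idx] else lines
    tide <- read.table(text = clean, header = FALSE)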
This approach works well as long as metainfo.list contains all
the lines I'd like to remove. The downside is that I won't know
whether I've included all of them until the whole process has finished. So
while I was waiting for the program to finish, I came up with a third,
better approach.
Third Attempt - Capture Lines (RegExp)
The above two approaches discard the unnecessary information,
but there may be other lines that should be discarded which I haven't
encountered yet, in which case the process becomes tedious trial-and-error
and takes quite long.
Equally, another approach is to select exactly what I am interested in
by using a regular expression. But first, I have to identify the pattern.
Each data point was recorded at a certain time, and is therefore
associated with a timestamp; for example, the first data point was
recorded at 1985-01-01 00:00:00. Each also has an ID value with a
closing parenthesis, for example 1).
1) 1985/01/01 00:00:00 1.0300 -0.3845
So the contents of interest have a common pattern, which can
be summarised as lines that start with:
- a number of spaces, a few digits, and a closing parenthesis,
- a few digits separated by forward slashes, meaning year, month and day,
  and then a space,
- a few digits separated by colons, meaning hour, minutes and seconds.
The pattern can be formulated in RegExp as the roi.pattern variable,
and the whole process can be implemented as:
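A sketch, with roi.pattern built directly from the description above; the
file name is illustrative:

    ## Third attempt: keep only the lines that match the data pattern
    roi.pattern <- "^ *[0-9]+\\) +[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"
    lines <- readLines("1985port.text")
    tide <- read.table(text = grep(roi.pattern, lines, value = TRUE),
                       header = FALSE)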
To me, there isn't an absolute winner between the second and third
approaches, but I prefer to use regular expressions because it is more
fun; I am a statistician and like to spot patterns.
Also, it is a direct approach and more flexible. Note that I can continue
to add components to the regular expression to increase the confidence
in selecting the right data matrix; for example, there are spaces and
then a few digits after the timestamp. But that will presumably increase
the complexity of the pattern.
Code and Sample Data
You can download the example data and run the scripts listed below
in R to reproduce all the results.
Like many R users who are not actually programmers, I am afraid of
regular expressions (RegExp); whenever I saw a complicated pattern,
I told myself that I wouldn't be able to understand it and gave up on
the spot.
But I've collected a few RegExp patterns that do magical
jobs. My favourites are the dot (.) and dollar ($) signs, and I usually
use them with list.files() to filter the file names in a directory.
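For example, calls along these lines (reconstructed to match the
description below):

    list.files(pattern = ".RData$")   # all R image files in the directory
    list.files(pattern = ".text$")    # all plain-text files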
The first line returns all the R image files, whose file names end
with RData, and the second all the text files, whose file names end
with text. Basically, in regular expressions the dot
sign (.) means anything, and the dollar sign ($) means the end of a
string. By combining these two, I am able to select multiple files
with certain patterns, without manually picking them one by one.
How powerful is that! It is an inspirational example that motivates
me from time to time to look deeper and get my head around the topic
of regular expressions. But I just couldn't form a clear picture of how to
learn it.
I think the main problems for me in understanding RegExp in R are:
The syntax is context-sensitive
A subtle change can lead to wildly different results. For example, the above
pattern can also be written \\.RData$, which means file names ending with
.RData; the dot (.) sign there literally means ".". Adding the two
backslashes \\ changes the meaning of the pattern completely, yet
both give the same results. It gave me so much frustration when
extrapolating a pattern that works in one case to a similar case,
only to get seemingly random results.
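The difference is easy to see with grepl(), which reports whether a
pattern matches a string; the file names here are illustrative:

    ## both patterns agree on a typical file name...
    grepl(".RData$",   "results.RData")   # TRUE
    grepl("\\.RData$", "results.RData")   # TRUE
    ## ...but they are not the same pattern
    grepl(".RData$",   "myRData")         # TRUE: the dot matches the "y"
    grepl("\\.RData$", "myRData")         # FALSE: the dot must be a literal "."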
The syntax is hard to read
The RegExp pattern above is reasonably easy to understand, if one
spends 10 minutes reading the manual, but the following is just crazy.
There are 12 parentheses, 6 square brackets and many other symbols.
Even the same symbol can have different meanings, and it's hard to find out
exactly what they mean, because of the next problem:
There aren't enough learning materials
I've never seen an R book that mentions regular expressions. The topic
is certainly not taught in university courses or training
workshops. Even Google fails to find any meaningful resource except for
the Text Processing wiki page, which is the best I could find.
Although there are related questions on StackOverflow, most of the
answers address a very specific situation. It's hard to make them
applicable to other situations, or to learn the topic from discrete Q&As.
This has created a mental barrier, at least for me: that statisticians
shouldn't teach or learn RegExp at all. But my limited experience
suggests it is such a powerful feature that I've been missing a lot.
I believe there will be more chances to process text files, for example
parsing the log files of this blog, and RegExp can improve the efficiency to
a great extent. So I am considering investing the time to learn it properly.
Are you an R user? What's your experience with regular expressions? Do
you have good learning materials to recommend? If so, please share
your experience of this less-talked-about area.