Yi Tang Data Scientist with Emacs

Why I Should Explore Regular Expression and Why I Haven't

Like many R users who are not actually programmers, I am afraid of regular expressions (RegExp). Whenever I saw something like

grep(pattern = "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})", s)

I told myself I wouldn't be able to understand it and gave up on sight.
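Read piece by piece, the pattern appears to describe two digits, a space, one or more letters, another space and four digits, which looks like a date such as "12 March 2014". Here is a minimal sketch; the vector s is made up for illustration, since the line above does not say what s holds.

```r
# Hypothetical input; the original line does not show what s contains.
s <- c("12 March 2014", "March 2014", "call me on 01 Jan 1999")

pattern <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"

# grep() returns the indices of the elements that contain a match.
hits <- grep(pattern = pattern, s)

# regexec() plus regmatches() gives the full match followed by
# the text captured by each pair of parentheses.
parts <- regmatches(s[1], regexec(pattern, s[1]))[[1]]
```

With this input, hits should point at the first and third elements, and parts should hold the whole date followed by the day, the month and the year.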

But I've collected a few RegExp patterns that do magical jobs. My favourites are the dot (.) and the dollar sign ($), which I usually use with list.files() to filter the file names in a directory. For example,

list.files(pattern = ".RData$")
list.files(pattern = ".text$")

The first line returns all the R image files, whose names end with RData, and the second returns all the text files, whose names end with text. Basically, in a regular expression the dot (.) matches any single character and the dollar sign ($) marks the end of the string. By combining the two, I can select multiple files that follow a certain pattern without picking them one by one.
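The same two patterns can be tried without touching the disk. This is only a sketch: the file names are invented, and grepl() stands in for list.files() so the result does not depend on the working directory.

```r
# Made-up file names, used instead of list.files() so the example
# does not depend on what is in the working directory.
files <- c("model.RData", "notes.text", "report.pdf", "RData.csv")

# grepl() returns TRUE for each element that matches the pattern.
rdata_files <- files[grepl(pattern = ".RData$", files)]
text_files  <- files[grepl(pattern = ".text$", files)]
```

Here rdata_files should contain only "model.RData": "RData.csv" is left out because the dollar sign requires RData to sit at the very end of the name.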

How powerful is that! It is an inspirational example that motivates me from time to time to look deeper and get my head around regular expressions. But I just couldn't get a clear picture of how to use them.

I think the main problems I have in understanding RegExp in R are:

The syntax is context-sensitive

A subtle change can lead to unexpected results. For example, the above pattern can also be written as \\.RData$, which means file names ending with .RData. Here the dot (.) literally means ".", because the backslash escapes it (and R needs the backslash doubled inside a string). Adding the two backslashes \\ changes the meaning of the pattern completely, yet on my files both give the same results; they would only differ for a name like myRData, which has no dot before RData. It has frustrated me many times to extrapolate a pattern that works in one case to a similar case and get unexpected results.
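A small sketch makes the difference between the two patterns visible. The file names are hypothetical and were picked so that the patterns disagree on one of them.

```r
# Hypothetical file names; "myRData" has no dot before RData.
files <- c("model.RData", "myRData", "backup.RData.zip")

# Unescaped dot: any single character may precede RData.
loose <- grepl(".RData$", files)

# Escaped dot: a literal "." must precede RData. The backslash is
# doubled because R itself consumes one inside a string literal.
strict <- grepl("\\.RData$", files)
```

Both patterns accept "model.RData" and reject "backup.RData.zip", but only the unescaped version accepts "myRData", because its dot is happy to match the letter y.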

The syntax is hard to read

The RegExp patterns above are reasonably easy to understand if one spends 10 minutes reading the manual, but the following is just crazy.

m <- regexec(pattern = "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
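As far as I can tell, this pattern splits a URL into protocol, host, port and path. Here is a sketch of how the pieces can be pulled out with regmatches(); the URL assigned to x is an assumption, since the line above does not show what x holds.

```r
# Hypothetical URL, chosen only to exercise every group in the pattern.
x <- "http://example.org:8080/index.html"

pattern <- "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"
m <- regexec(pattern = pattern, x)

# regmatches() returns the full match followed by the six groups,
# one per pair of parentheses, numbered by their opening bracket.
parts <- regmatches(x, m)[[1]]

protocol <- parts[3]  # group 2: text before "://"
host     <- parts[4]  # group 3: text up to the next ":" or "/"
port     <- parts[6]  # group 5: digits after ":"
path     <- parts[7]  # group 6: everything from the first "/"
```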

That pattern contains 12 parentheses, 6 square brackets and many other symbols. Even the same symbol can have different meanings, and it's hard to find out exactly what they mean because

There aren't enough learning materials

I've never seen an R book that mentions regular expressions. The topic is certainly not part of the teaching content in university courses or training workshops.

Even Google fails to find any meaningful resources except for the Text Processing in Wiki, which is the best I could find.

Although there are related questions on StackOverflow, most of the answers are tied to a very specific situation. It's hard to apply them to other situations or to learn the topic from scattered Q&As.

It has created a mental barrier, at least for me, that statisticians should neither teach nor learn RegExp at all. But my limited experience suggests that it is such a powerful feature that I've been missing out on a lot.

But

I believe there will be more occasions to process text files, for example parsing the log files of this blog. RegExp can improve efficiency to a great extent, so I am considering investing the time to learn it properly.

Are you an R user? What's your experience with regular expressions? Do you have good learning materials to recommend? If so, please share your experience in this less-talked-about area.
