Yi Tang Data Scientist with Emacs

Control the Plotting Order in ggplot2

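[Figure: two bar plots of the same data. Left: categories in alphabetical order. Right: categories in increasing order of count.]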

The two plots above show the same data (included below). If you were going to present one to summarise your findings, which would you choose? Very likely you would pick the one on the right, because

  1. the steadily increasing height of the bars is pleasant to see,
  2. it is easier to compare the categories, since the ones on the right have higher values than the ones on the left, and
  3. the categories with the lowest and highest values are clearly shown.

In this article I try to explain how to set the plotting order in ggplot2 to whatever you want, and to encourage R starters to use ggplot2.

Creating a bar plot is dead easy in R. Take this dataset as an example:

mode count
ssh-mode 2361
fundamental-mode 4626
git-commit-mode 4869
mu4e-compose-mode 4964
emacs-lisp-mode 6205
shell-mode 10046
minibuffer-inactive-mode 12624
inferior-ess-mode 25774
ess-mode 47115
org-mode 78195

To get the plot on the right, reorder the table by count (this has already been done), then

with(df, barplot(count, names.arg = mode))

will do the job. It is simple and easy: it plots exactly what you provide. This is completely different from the ggplot() paradigm, which does a lot of computation behind the scenes.

ggplot(df, aes(mode, count)) + geom_bar(stat = "identity")

will give you the first plot; the categories are in alphabetical order. To get a pleasant increasing order that depends on the count or any other variable, or even a manually specified order, you have to explicitly change the levels of the factor.

df$mode.ordered <- factor(df$mode, levels = df$mode)

creates another variable, mode.ordered, which looks the same as mode except that the underlying levels are in a different order: they are set to the order of the counts. Running the same ggplot code with mode.ordered in place of mode gives you the plot on the right. How does it work?

First, every level of a factor in R is mapped to an integer, and the default mapping algorithm is

  1. sort the distinct values alphabetically,
  2. map the first value to 1 and the last to 10 (there are ten categories here).

So emacs-lisp-mode is mapped to 1 and ssh-mode is mapped to 10.

What the reordering line does is sort the levels by count, so that ssh-mode is mapped to 1 and org-mode is mapped to 10, i.e. the factor levels are set to the order of the counts.

How does this affect ggplot? I presume ggplot plots in the order of the levels, or let's say in the integer space, i.e. it plots from 1 to 10 and then adds the label for each.
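The mapping described above can be checked in base R without any plotting. The following is a minimal sketch: it rebuilds the table as a data.frame called df (the same name used in the snippets above) and compares the integer codes of the default factor with those of the reordered one. The helper names default.f, ordered.f, default.codes and ordered.codes are made up for this illustration.

```r
# The ten modes in increasing order of count, as in the table above.
df <- data.frame(
  mode = c("ssh-mode", "fundamental-mode", "git-commit-mode",
           "mu4e-compose-mode", "emacs-lisp-mode", "shell-mode",
           "minibuffer-inactive-mode", "inferior-ess-mode",
           "ess-mode", "org-mode"),
  count = c(2361, 4626, 4869, 4964, 6205, 10046, 12624, 25774, 47115, 78195),
  stringsAsFactors = FALSE
)

# Default: the levels are the sorted distinct values.
default.f <- factor(df$mode)
# Explicit: the levels follow the row order, i.e. increasing count.
ordered.f <- factor(df$mode, levels = df$mode)

# Integer code behind each value, named by mode for easy lookup.
default.codes <- setNames(as.integer(default.f), df$mode)
ordered.codes <- setNames(as.integer(ordered.f), df$mode)

print(levels(default.f))
print(default.codes[c("emacs-lisp-mode", "ssh-mode")])
print(ordered.codes[c("ssh-mode", "org-mode")])
```

The labels are identical in both factors; only the integer codes, and therefore the plotting positions, differ.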

In this example, the default barplot function did the job. With ggplot we usually need to do extra data manipulation to get what we want, in exchange for plots that look better and fit in with the other plots. Time constraints aside, I would encourage people to stick with ggplot because, like many other things in life, once you understand it, it becomes easier to do. For example, it is actually very easy to specify the order manually in only two steps:

  • first, sort the whole data.frame by a variable,
  • then set the levels option in factor() to whatever you want.

To show a decreasing trend, the reverse of the increasing order, just use levels = rev(df$mode). How neat!
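As a rough sketch of those two steps, the snippet below starts from a deliberately unsorted subset of the table, sorts it by count, and then builds one factor with increasing and one with decreasing levels. The column names mode.increasing and mode.decreasing are made up for this illustration, and the final ggplot call is left as a comment because it assumes ggplot2 is installed.

```r
# A small, deliberately unsorted subset of the table above.
df <- data.frame(
  mode = c("org-mode", "ssh-mode", "ess-mode", "shell-mode"),
  count = c(78195, 2361, 47115, 10046),
  stringsAsFactors = FALSE
)

# Step 1: sort the whole data.frame by the variable of interest.
df <- df[order(df$count), ]

# Step 2: set the levels to the sorted order (increasing count) ...
df$mode.increasing <- factor(df$mode, levels = df$mode)
# ... or to its reverse (decreasing count).
df$mode.decreasing <- factor(df$mode, levels = rev(df$mode))

print(levels(df$mode.increasing))
print(levels(df$mode.decreasing))

# With ggplot2 loaded, the decreasing plot would then be something like:
# ggplot(df, aes(mode.decreasing, count)) + geom_bar(stat = "identity")
```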

RExercise - Analyse Your Exercise Data in R

RExercise is a by-product of the ActivityDashboard. It parses your exercise data in .GPX format and for each workout, it returns

location table
a data.frame with longitude, latitude, elevation at a particular recording time,
summary table
a one-row data.frame of summary statistics about the workout, including duration, distance, speed etc.

It comes with a helper function, Parse_GPX_all, that does the batch processing, combines all the data.frames together, and adds city and country to the summary tables. You can then see a summary of all your activities in one table and use it to query both the location and summary tables: for example, how many miles did you run last year? In how many cities have you run? It is meant to make you feel great by showing how much you have achieved.

Currently it parses data from RunKeeper and Strava perfectly. .GPX is a generic data format, so applying RExercise to data from other apps shouldn't be a problem. If you do run into one, please feel free to contact me, I am extremely friendly to people who exercise (:d), or send me a pull request if you have already figured it out.

Demo

Suppose you have these .GPX data files,

20150108-170830-Run.gpx 
20150109-171835-Run.gpx 
20150111-113750-Run.gpx 
20150112-171906-Walk.gpx

RExercise will give you a location table and a summary table as follows:

Table 1: A Summary Table

id               activity  date        start.time  name        duration (h)  distance (km)  speed (km/h)  elevation (m)  climb (m)
20150108-170830  Run       2015-01-08  17:08:14    Afternoon   0.13          0.74           5.4           109.0          11.1
20150109-171835  Run       2015-01-09  17:18:14    after work  0.42          3.33           7.9           110.5          60.1
20150111-113750  Run       2015-01-11  11:37:14    Sunday      0.50          4.25           8.4           130.6          136.6
20150112-171906  Run       2015-01-12  17:19:14    after work  0.51          4.08           7.9           110.4          88.6

Table 2: A Location Table

lon        lat        ele    time
-2.019050  53.961909  108.4  2015-01-11 11:37:50
-2.017989  53.961375  109.8  2015-01-11 11:38:27
-2.018019  53.961427  109.8  2015-01-11 11:38:29
-2.018004  53.961536  109.8  2015-01-11 11:38:30
-2.018189  53.962276  110.4  2015-01-11 11:38:33
-2.018141  53.962277  110.4  2015-01-11 11:38:34
-2.018090  53.962276  110.4  2015-01-11 11:38:35

Usage

1. Install

devtools::install_github("yitang/rexercise")

2. Download GPX data

3. Set working directory and app

all.data <- Parse_GPX_all(data.dir = "~/ExerciseData/Strave/",
                         app = "Strava",
                         add.city = TRUE)

You should now have two tables, as shown in the Demo section.
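Once a summary table is available, questions such as "how many miles did I run last year?" reduce to a filter and a sum. The sketch below does not call RExercise at all: it uses a hand-built stand-in data.frame, summary.tbl, filled with the four rows of Table 1. The column names distance.km and duration.h are assumptions made for this illustration, and the real object returned by Parse_GPX_all may be structured differently.

```r
# Stand-in for the summary table, using the values shown in Table 1.
# Column names are assumed for this sketch, not taken from RExercise.
summary.tbl <- data.frame(
  activity = c("Run", "Run", "Run", "Run"),
  date = as.Date(c("2015-01-08", "2015-01-09", "2015-01-11", "2015-01-12")),
  distance.km = c(0.74, 3.33, 4.25, 4.08),
  duration.h = c(0.13, 0.42, 0.50, 0.51),
  stringsAsFactors = FALSE
)

# Keep only the runs recorded in 2015.
runs.2015 <- subset(summary.tbl,
                    activity == "Run" & format(date, "%Y") == "2015")

# Total distance in kilometres and miles, and total time in hours.
total.km <- sum(runs.2015$distance.km)
total.miles <- total.km * 0.621371
total.hours <- sum(runs.2015$duration.h)

print(c(km = total.km, miles = total.miles, hours = total.hours))
```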

Group Emacs Search Functions using Hydra

I am a search guy: when I want to know something, I use the search functionality to jump to wherever the keyword appears, rather than scanning the page with my eyes, which is too slow and harmful.

Emacs provides powerful search functionality. For example, I use these commands very often (key bindings in parentheses):

  1. isearch (C-s): search for a string and move the cursor there,
  2. helm-swoop (C-F1): find all occurrences of a string and pull the lines containing it into another buffer, where I can edit and save them,
  3. helm-multi-swoop (M-X): apply helm-swoop to multiple buffers, very handy if I want to know where a function is called in different buffers,
  4. projectile-grep or helm-projectile-grep (C-c p s g): find which files in the current project contain a specific string; similar to helm-multi-swoop, but it limits the search to files in the project directory.

I love searching in Emacs, but the problem is having to remember all the key bindings for the different tasks. Also, I sometimes forget what alternatives I have and go with the one I am most familiar with, which usually is not the right one. I sometimes realise I have used isearch multiple times to do what ace-jump-word-mode can achieve in one go.

The post "Org-mode Hydras incoming!" gave me the idea of grouping all these functions together and pressing a single key to perform the different tasks, which frees my mind from remembering all the key bindings. I can also write a few lines of text to remind myself when to use what, which can potentially solve the second problem.

Here is the hydra implementation for searching:

(defhydra hydra-search (:color blue
                               :hint nil)
  "
Current Buffer : _i_search helm-_s_woop _a_ce-jump-word
Multiple Buffers : helm-multi-_S_woop
Project Directory: projectile-_g_rep helm-projectile-_G_rep
"
  ("i" isearch-forward)
  ("s" helm-swoop)
  ("a" ace-jump-word-mode)
  ("S" helm-multi-swoop)
  ("g" projectile-grep)
  ("G" helm-projectile-grep))
(global-set-key [f4] 'hydra-search/body)

So next time I want to search for something, I just press F4, and it brings up all the choices I have. I don't need to worry about the key bindings or which one to use! That's cool!

I am looking forward to simplifying my Emacs workflow using the hydra package. The key challenge is to identify the logical similarities among the tasks and then group them together accordingly. For hydra-search(), it is "search for something somewhere".

A Workflow for Using Git to Track SVN Repository

Version control is a complex topic, and the ideas of branching and the different types of merging are hard to understand. I only understand the basics of Git, and it already makes my life a lot easier: I am managing about 10 repositories at the moment without much effort.

But my colleagues use SVN as the central storage for scripts. Switching to SVN is not a problem, I would just need a few weeks to transfer my knowledge and start using it. However, I am reluctant to learn something basic and end up with duplicated knowledge, and I also use GitHub and Bitbucket, which are Git based. But sticking to Git would make it impossible to work with my colleagues.

Then I found out that the Git developers have already made an effort to bridge Git and other version control systems, like SVN. git svn allows me to use just Git commands for staging, cherry-picking, pulling etc., and then upload to the remote SVN repository with a single command. I really like the idea of transferring skills from one system to another at no cost; it makes me believe Git is great, and I can continue to use Magit in Emacs!

Here are the basic steps of this workflow, with comments:

  1. Create a folder: mkdir ProjRepo
  2. Create an empty Git repository inside it: git init
  3. Add the following to .git/config

    [svn-remote "svn"]
    url = https://your.svn.repo
    fetch = :refs/remotes/git-svn

    and change the URL to the right repository,

  4. Pull from the SVN central repository into this folder: git svn fetch svn
  5. Switch to the SVN remote branch: git checkout -b svn git-svn
  6. Modify or add files.
  7. Use git add and git commit to snapshot local changes.
  8. From time to time, update the local repository: git svn rebase
  9. Finally, upload local changes to the SVN central repository: git svn dcommit

See the official manual, "8.1 Git and Other Systems - Git and Subversion", and the git-svn documentation for more details.

Why Use Emacs 1 - Emacs Speaks Statistics

I am a statistician: coding in R and writing reports is what I do most of the day. I have come a long way in searching for the perfect editor for me; I tried RStudio, SublimeText and TextMate, and settled down happily with ESS/Emacs for both coding and writing.

There are three features that made me decide:

Auto Formatting

Scientists have a reputation for being bad programmers who write code that is unreadable and therefore incomprehensible to others. I intend to become a top-level programmer, so I follow a style guide strictly. This means I have to spend some time adding and removing spaces in the code.

To my surprise, Emacs does this for me automatically, just by hitting TAB, and it also indents smartly, which makes me comfortable writing a long function call and splitting it over multiple lines. Here's an example. Also, if I misplace a ')' or ']', the formatting becomes strange, which reminds me to check.

rainfall.subset london,
rainfall.pairs,
rainfall.dublin)

Search Command History

I frequently search the command history. Imagine I am producing a plot and realise something is missing in the data, so I go back and fix the data first and then run the ggplot command again. I could press the Up/Down keys many times, or just search once or twice. C-r ggplot( gives me the most recent command I typed containing the keyword ggplot(, and I press RET to select it, which might be ggplot(gg.df, aes(lon, lat, col = city)) + geom_line() + ...... If it is not the one I want, I press C-r again to get the second most recent one, and repeat until I find the right one.

Literate Programming

I am a supporter of literate statistical analysis and believe we should put code, results and discoveries together when developing models. RStudio provides an easy-to-use tool for this purpose, but it does not support different R sessions, so if I need to generate a report, I have to re-run all the code from the beginning, which isn't practical for me with large volumes of data because it takes quite long.

ESS and org-mode work really well together via Babel, which is friendlier to use. I can choose to run only part of the code and have the output inserted automatically, with no need to copy and paste. Also, I can choose where to execute the code: on my local machine, on the remote server, or on both at the same time.

These only scratch the surface of ESS; there are a lot more useful features, like spell checking for comments and documentation templates, that make me productive, and I would recommend anyone who uses R to learn ESS/Emacs. The following is my current setting.

;; Adapted with one minor change from Felipe Salazar at
;; http://www.emacswiki.org/emacs/EmacsSpeaksStatistics
(require 'ess-site)
(setq ess-ask-for-ess-directory nil) ;; start R on default folder
(setq ess-local-process-name "R")
(setq ansi-color-for-comint-mode 'filter) ;;
(setq comint-scroll-to-bottom-on-input t)
(setq comint-scroll-to-bottom-on-output t)
(setq comint-move-point-for-output t)
(setq ess-eval-visibly-p 'nowait) ;; no waiting while ESS is evaluating
(defun my-ess-start-R ()
  (interactive)
  (if (not (member "*R*" (mapcar (function buffer-name) (buffer-list))))
      (progn
        (delete-other-windows)
        (setq w1 (selected-window))
        (setq w1name (buffer-name))
        (setq w2 (split-window w1 nil t))
        (R)
        (set-window-buffer w2 "*R*")
        (set-window-buffer w1 w1name))))
(defun my-ess-eval ()
  (interactive)
  (my-ess-start-R)
  (if (and transient-mark-mode mark-active)
      (call-interactively 'ess-eval-region)
    (call-interactively 'ess-eval-line-and-step)))
(add-hook 'ess-mode-hook
          '(lambda()
             (local-set-key [(shift return)] 'my-ess-eval)))
(add-hook 'inferior-ess-mode-hook
          '(lambda()
             (local-set-key [C-up] 'comint-previous-input)
             (local-set-key [C-down] 'comint-next-input)))
(add-hook 'ess-mode-hook
          (lambda ()
            (flyspell-prog-mode)
            (run-hooks 'prog-mode-hook)
            ;; (prog-mode)
            ))

;; REF: http://stackoverflow.com/questions/2901198/useful-keyboard-shortcuts-and-tips-for-ess-r
;; Control and up/down arrow keys to search history with matching what you've already typed:
(define-key comint-mode-map [C-up] 'comint-previous-matching-input-from-input)
(define-key comint-mode-map [C-down] 'comint-next-matching-input-from-input)