StaTEAstics.: February 2013

Monday, February 11, 2013

R package building automation

Title: R package building automation

Inspired by the post at http://giventhedata.blogspot.tw/2013/02/my-r-package-development-cheat-sheet.html. I have decided to publish my cheat script for package development as well.

Building package used to be a nightmare, filling in all those Rd files manually can cause some serious brain damage. Thanks to the roxygen2 team and knitr, building a package with nice documentation is merely running a script like what I have below. Just modify and change pkgName to the name of your package then your package should be ready in 5 minute.


## Build the package
## ---------------------------------------------------------------------

## Remove the folder if it exists
if(file.exists("./pkgName"))
    unlink("pkgName", recursive = TRUE)

## Build the package
package.skeleton("pkgName", code_files = paste("./Codes/",
                           dir("./Codes/", pattern = "\\.R$"), sep = ""),
                 force = TRUE)

## Include the data
dir.create("pkgName/data")
file.copy(from = "./FAOcountryProfile.RData",
          to = "pkgName/data/", overwrite = TRUE)
file.copy(from = "./FAOregionProfile.RData",
          to = "pkgName/data/", overwrite = TRUE)
file.copy(from = "./FAOmetaTable.RData",
          to = "pkgName/data/", overwrite = TRUE)
file.copy(from = "./DESCRIPTION", to = "pkgName/",
          overwrite = TRUE)

## Include Demo
dir.create("pkgName/demo")
file.copy(from = "./pkgNamedemo.R",
          to = "pkgName/demo/", overwrite = TRUE)
cat("pkgNamedemo Demonstration for the pkgName package\n",
    file = "pkgName/demo/00Index")

## Use roxygen to build the documentation
library(roxygen2)
roxygenize("pkgName")

## Include vignette
dir.create("./pkgName/inst/doc/")
file.copy(from = "../Documentation/pkgName/pkgName.pdf",
          to = "./pkgName/inst/doc/", overwrite = TRUE)

## Create the vignette hack from (http://yihui.name/knitr/demo/vignette/)
## This is not required for R 3.0.0
cat("%\\VignetteIndexEntry{General Manual}\n\\documentclass{article}\n\\begin{document}\n\\end{document}", file = "./pkgName/inst/doc/pkgName.Rnw")

## Build and check the package
system("R CMD INSTALL --build pkgName")
system("R CMD build pkgName")
system("Rcmd check pkgName")
system("Rcmd check --as-cran pkgName")

PS: This is my first post using Rmd, the highlight for R is defintely much better than plain text. However, I think some twick is required for my blog theme.

Tuesday, February 5, 2013

Relearn boxplot and label the outliers

Despite the fact that box plot is used almost every where and taught at undergraduate statistic classes, I recently had to re-learn the box plot in order to know how to label the outliers.

This stackoverflow post was where I found how the outliers and whiskers of the Tukey box plots are defined in R and ggplot2:
In ggplot2, what do the end of the boxplot lines represent?

and this post on how to label the outliers using base graphics.
How to label all the outliers in a boxplot

Since the use of ggplot2 is required for this task, I have written some basic hack code to label the outliers for ggplot2.

Here are the codes:


## Install the FAOSTAT package to obtain the data

if(!is.element("FAOSTAT", .packages()))
    install.packages("FAOSTAT")
library(FAOSTAT)


## Download data on Cassava production
cp.lst = getFAOtoSYB(name = "cassava_production", domainCode = "QC",
    itemCode = 125, elementCode = 5510)


## Use the country level data, and take only data for 2011 and remove the NA's
cp.df = cp.lst$entity[!is.na(cp.lst$entity$cassava_production) &
                      cp.lst$entity$Year == 2011, ]

## Merge with the country profile to obtain the country names for labelling
ccp.df = merge(cp.df, FAOcountryProfile[, c("FAOST_CODE", "ABBR_FAO_NAME")],
    all.x = TRUE)


## Merge with the regional pofile to obtain the UNSD M49 macro region
## composition for multiple boxplot.
rcp.df = merge(ccp.df, FAOregionProfile[, c("FAOST_CODE", "UNSD_MACRO_REG")],
    all.x = TRUE)


## Compute the quantile
qrcp.df = ddply(.data = rcp.df, .variables = .(UNSD_MACRO_REG), transform,
    lQntl = quantile(cassava_production, probs = 0.25, na.rm = TRUE),
    uQntl = quantile(cassava_production, probs = 0.75, na.rm = TRUE))


## Compute the lower and upper bound which defines the outlier
brcp.df = ddply(.data = qrcp.df, .variables = .(UNSD_MACRO_REG), transform,
    lBound = lQntl - 1.5 * (uQntl - lQntl),
    uBound = uQntl + 1.5 * (uQntl - lQntl))



## Remove the country names if it is within the bounds
with(brcp.df, {
    brcp.df[cassava_production <= uBound &
            cassava_production >= lBound, "ABBR_FAO_NAME"] <<- NA
})

## Plot the data
set.seed(587)
ggplot(data = brcp.df, aes(x = UNSD_MACRO_REG, y = cassava_production)) +
    geom_boxplot(outlier.colour = NA) +
    geom_text(aes(label = ABBR_FAO_NAME), size = 2,
              position = position_jitter(width = 0.1)) +
    labs(x = NULL, y = NULL, title = "Production of Cassava by region")

Here is the final product, to avoid over-plotting of texts I have used position_jitter. Which is not an elegant solution but I just can not find any algorithm that works well in general.

Sunday, February 3, 2013

A package for agricultural statistic: FAOSTAT

After 8 years of using R, today I finally become a contributor to the community and released my first package, FAOSTAT.

The package is designed to provide user with direct access to the FAOSTAT data base via R and to support the open data and methodology philosophy used in the Statistical Yearbook of the Food and Agricultural Organization.

The package can be downloaded and installed the same way as other package in a single line:

install.packages("FAOSTAT")

Or the latest development version from github:

 library(devtools)
install_github("FAOSTAT","mkao006")

A draft documentation can be found at:


vignette("FAOSTAT", package = "FAOSTAT")

I hope the package will save time for people working with agricultural statistics and more generally official statistics and to promote the use for research. All feedbacks and criticisms are welcome.

Maize trade Part II: Comparison and analysis

Following my last post about the maize network, although interesting but is not very informative. What we are going to do today is to contrast the maize network with the wine trade network.

The choice why we have chose wine will become clear after the network and the analysis. Lets first have a look at the trade network of wine, again only the top 80% of the trade has been shown.

First, we can see that the wine network is much more inter-connected than the maize network. Secondly, the amount of reciprocal trading is much higher. Third, there does not appear to have a single trading of the size of the magnitude such as the export of maize from United state to Japan and Korea

The advantage of the network diagram not only reveals the connection, but the structure of the network, hence the nature of the market. It is clear that the market for maize and wine is different, but how, and why, is what we are interested.

Usage/consumption:

First of all, the usage of the two commodity are on the opposite end of the spectrum. Maize is a staple crop, necessary in many regional cuisine also used for livestock feed and bio-fuel. On the other hand, wine is consumed at a lesser extent and is considered as a luxury or enjoyment (although considered a necessity in Italy and several other countries).

Cost:

The usage of the commodity lead to the difference in cost and elasticity. Staples and biofuel need to be cheap to be viable and thus cost is an important consideration. In contrast, the preference and quality of wine plays a much bigger role than the actual price tag. This is where the trade agreement plays a significant role in promoting the trade of maize between the United States, Japan and Korea but a smaller element for wine trading.

Value added/preferences:

Finally, the variety and the value added of the commodity characterizes the trade. Japan can import maize from any country in the world to use for consumption, feed or to produce bio-fuel without any considerable change or impact because the value add of maize is small; simply put, the product can be easily substituted. On the other hand, wine is dominated by several producers such as France, Italy and a handful of new world wine producers. Which is not easily substituted given the preferences. This is the contributing reason to the high number of trading partners, interconnected clusters and reciprocal trading; since a single trading partner can not satisfy the preferences of the consumer and even the French would like to have Italian wine occasionally.

This example illustrate how a simple network diagram can reveal a lot of interesting insight compare to analyzing just the quantity and value of trade, it also supports my view on how important exploratory analysis is.


## Download the data
wine.df = read.csv("https://dl.dropbox.com/u/18161931/wine_trade.csv", header = TRUE,
    stringsAsFactors = FALSE)

## Take only the trade
wineEx.df = subset(wine.df,subset = type == "Export",
    c("reporting_country", "partner_country", "Wine"))

## Sort the data and take only the top 80% of the trade
wineEx.df = arrange(wineEx.df, desc(Wine))
wineEx.df$sWine = scale(wineEx.df$Wine, center = FALSE)
wineEx.df$cs = cumsum(wineEx.df$Wine)
wineEx.df[wineEx.df[, 1] == "China, Hong Kong SAR ", 1] = "China, Hong Kong SAR"
wineFinal.df = subset(wineEx.df, cs < tail(wineEx.df$cs, 1) * 0.8)

## Set edge and arrow size
wineFinal.df$edgeSize = with(wineFinal.df, Wine/sum(Wine))
wineFinal.df$arrowSize = ifelse(wineFinal.df$edgeSize * 30 < 0.5 , 0.5,
    wineFinal.df$edgeSize * 15)

## Create the network and plot
wine.net = network(wineFinal.df[, 1:2])
plot(wine.net, displaylabels = TRUE, label.col = "steelblue",
     edge.lwd = c(wineFinal.df$edgeSize) * 100,
     arrowhead.cex = c(wineFinal.df$arrowSize),
     label.cex = 0.5, vertex.border = "white",
     vertex.col = "skyblue", edge.col = rgb(0, 0, 0, alpha = 0.5))