R Markdown template for data cleaning.

hlapp · hlapp · commit 14003a84d271 · 2015-05-14T12:49:16.000-04:00
diff --git a/files/prog-mod/oceania-uk-data.Rmd b/files/prog-mod/oceania-uk-data.Rmd
@@ -0,0 +1,102 @@
+---
+title: "Sanitizing and cleaning the Oceania-UK dataset"
+output:
+  html_document:
+    keep_md: yes
+---
+
+## Introduction
+
+Here we apply a sequence of steps to reproducibly sanitize the oceania-uk dataset. We start from `oceania-uk-data.csv`, which is the result of applying the following steps to the original spreadsheet:
+
+1. Insert column _continent_ to Australia data table between _population size_ and _life expectancy_.
+2. Change column A column header from _continent_ to _country_. (This is for the Australia data table.)
+3. Move the other (than Australia) per-country tables (only UK at present) under the Australia table.
+
+No additional manipulation has been done yet.
+
+## Loading libraries and other setup
+
+```{r}
+# the name of the file containing the dataset:
+datafile <- "oceania-uk-data.csv"
+
+# the name of the metadata file:
+metafile <- paste(paste(strsplit(datafile, split = "-")[[1]][c(1,2)],
+                        collapse="-"),
+                  "metadata.txt",
+                  sep = "-")
+metafile
+
+# the name of the metadata file:
+outfile <- paste(paste(strsplit(datafile, split = "-")[[1]][c(1,2)],
+                       collapse="-"),
+                 "sanitized.csv",
+                 sep = "-")
+outfile
+```
+
+## Moving metadata out into a separate file
+
+The first two lines are metadata. Read those in and write out to a metadata file:
+
+```{r}
+file.header <- scan(datafile, 
+                    what = "character", 
+                    sep = ",", 
+                    nlines = 2)
+file.header
+
+writeLines(file.header[1], metafile) # We only want what is in the first cell
+```
+
+## Sanitizing the data
+
+Read in data, standardizing NA values, skipping blank lines, properly setting column header names:
+
+```{r}
+data.in <- read.table(datafile, 
+                      sep = ",", 
+                      skip = 4, 
+                      col.names = c("country", 
+                                    "year", 
+                                    "pop", 
+                                    "continent", 
+                                    "lifeExp", 
+                                    "gdpPercap", 
+                                    "blank", 
+                                    "Notes"), 
+                      blank.lines.skip=TRUE, 
+                      na.strings = c("N/A", "NA", ""))
+```
+
+Remove the empty column:
+```{r}
+data.in <- subset(data.in, select = -c(blank))
+```
+
+Fix the typo in the country column and remove excess factor levels:
+```{r}
+data.in$country[data.in$country == "Australa"] <- "Australia"
+data.in$country <- factor(data.in$country)
+
+# Test: we should be left with 2 factors now in country:
+if (nlevels(data.in$country) > 2) {
+  cat("Data integrity alert: more than 2 factors for country")
+}
+```
+
+Fix the typo in the population column:
+```{r}
+pop.is.typo <- is.na(as.numeric(as.character(data.in$pop)))
+pop.typo <- strsplit(as.character(data.in$pop[pop.is.typo]),"")[[1]]
+pop.typo[pop.typo == "O"] <- "0"
+data.in$pop <- as.numeric(as.character(data.in$pop))
+data.in$pop[pop.is.typo] <- as.numeric(paste(pop.typo,collapse=""))
+```
+
+## Write sanitized data to csv
+
+```{r}
+write.csv(data.in, file = outfile)
+```