Commit 14003a8

R Markdown template for data cleaning.
1 parent ff69a8c commit 14003a8

1 file changed, +102 -0 lines changed

files/prog-mod/oceania-uk-data.Rmd

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
---
title: "Sanitizing and cleaning the Oceania-UK dataset"
output:
  html_document:
    keep_md: yes
---

## Introduction

Here we apply a sequence of steps to reproducibly sanitize the Oceania-UK dataset. We start from `oceania-uk-data.csv`, which is the result of applying the following steps to the original spreadsheet:

1. Insert a _continent_ column into the Australia data table, between _population size_ and _life expectancy_.
2. Change the header of column A from _continent_ to _country_. (This is for the Australia data table.)
3. Move the other per-country tables (only the UK at present) under the Australia table.

No additional manipulation has been done yet.

## Loading libraries and other setup

```{r}
# the name of the file containing the dataset:
datafile <- "oceania-uk-data.csv"

# the name of the metadata file:
metafile <- paste(paste(strsplit(datafile, split = "-")[[1]][c(1,2)],
                        collapse="-"),
                  "metadata.txt",
                  sep = "-")
metafile

# the name of the output file for the sanitized data:
outfile <- paste(paste(strsplit(datafile, split = "-")[[1]][c(1,2)],
                       collapse="-"),
                 "sanitized.csv",
                 sep = "-")
outfile
```

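The two file names in the setup chunk are built the same way: keep the first two hyphen-separated parts of the data file name and append a new suffix. As a minimal sketch of that idea, the helper below wraps the logic in one function. The name `derive_name` is hypothetical and not part of the original template.

```r
# Hypothetical helper (not in the original template): keep the first two
# hyphen-separated parts of a file name and append a new suffix.
derive_name <- function(datafile, suffix) {
  parts <- strsplit(datafile, split = "-", fixed = TRUE)[[1]]
  prefix <- paste(parts[1:2], collapse = "-")
  paste(prefix, suffix, sep = "-")
}

datafile <- "oceania-uk-data.csv"
metafile <- derive_name(datafile, "metadata.txt")   # "oceania-uk-metadata.txt"
outfile  <- derive_name(datafile, "sanitized.csv")  # "oceania-uk-sanitized.csv"
```

Like the original code, this sketch assumes the data file name contains at least two hyphens.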
## Moving metadata out into a separate file

The first two lines are metadata. Read those in and write them out to a metadata file:

```{r}
file.header <- scan(datafile,
                    what = "character",
                    sep = ",",
                    nlines = 2)
file.header

writeLines(file.header[1], metafile) # We only want what is in the first cell
```

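To see why only the first element is written out, it may help to run `scan()` on a small stand-in file. The sketch below uses a temporary file with made-up contents (the real metadata lines are not shown in this commit); `scan()` returns the comma-separated cells of the first two lines as one character vector, so element 1 is the first cell of line 1.

```r
# Made-up stand-in for the top of the data file (contents are assumptions).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Made-up title,note",
             "Made-up source,note",
             "a,b"), tmp)

# Read the cells of the first two lines only; quiet = TRUE suppresses the
# "Read N items" message.
header <- scan(tmp, what = "character", sep = ",", nlines = 2, quiet = TRUE)
first.cell <- header[1]
first.cell
```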
## Sanitizing the data

Read in the data, standardizing NA values, skipping blank lines, and setting proper column names:

```{r}
data.in <- read.table(datafile,
                      sep = ",",
                      skip = 4,
                      col.names = c("country",
                                    "year",
                                    "pop",
                                    "continent",
                                    "lifeExp",
                                    "gdpPercap",
                                    "blank",
                                    "Notes"),
                      blank.lines.skip=TRUE,
                      na.strings = c("N/A", "NA", ""))
```

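The effect of `skip`, `blank.lines.skip`, and `na.strings` can be checked on a toy file. The sketch below assumes a layout of two metadata lines, a blank line, and a header row before the data; that layout and all values are made up for illustration and are not taken from the real dataset.

```r
# Toy file: 4 lines to skip, then data rows with a blank line in between.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Made-up title",
             "Made-up second metadata line",
             "",
             "original,header,row",
             "Australia,2007,1000",
             "",
             "United Kingdom,2007,N/A",
             "United Kingdom,2002,"), tmp)

demo <- read.table(tmp,
                   sep = ",",
                   skip = 4,                       # drop metadata and old header
                   col.names = c("country", "year", "pop"),
                   blank.lines.skip = TRUE,        # ignore the empty row
                   na.strings = c("N/A", "NA", ""))
demo
```

Both `"N/A"` and the empty cell end up as `NA`, so `pop` is read as a numeric column.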
Remove the empty column:
```{r}
data.in <- subset(data.in, select = -c(blank))
```

Fix the typo in the country column and remove excess factor levels:
```{r}
data.in$country[data.in$country == "Australa"] <- "Australia"
data.in$country <- factor(data.in$country)

# Test: we should be left with 2 levels now in country:
if (nlevels(data.in$country) > 2) {
  cat("Data integrity alert: more than 2 levels for country\n")
}
```

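Fix the typo in the population column:
```{r}
# Entries that fail to parse as numbers contain the typo
# (note that this also flags values that are already NA).
pop.is.typo <- is.na(as.numeric(as.character(data.in$pop)))
# Split the first flagged entry into characters; this assumes there is exactly one.
pop.typo <- strsplit(as.character(data.in$pop[pop.is.typo]),"")[[1]]
# Replace the letter "O" with the digit "0".
pop.typo[pop.typo == "O"] <- "0"
data.in$pop <- as.numeric(as.character(data.in$pop))
data.in$pop[pop.is.typo] <- as.numeric(paste(pop.typo,collapse=""))
```
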
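The chunk above handles a single mistyped entry. If several entries could contain the letter "O", or if the column also holds genuine missing values, a vectorized variant may be more robust. The following is a sketch on a made-up vector, not the template's method:

```r
# Made-up population values: one typo ("O" for "0") and one missing value.
pop <- c("1000", "2OOO", NA, "3000")

# A typo is a non-missing entry that does not parse as a number.
is.typo <- !is.na(pop) & is.na(suppressWarnings(as.numeric(pop)))

# Replace every letter "O" with the digit "0" in the flagged entries only.
pop[is.typo] <- gsub("O", "0", pop[is.typo], fixed = TRUE)
pop <- as.numeric(pop)
pop
```

Genuine `NA` values stay `NA`, and each flagged entry is repaired independently.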
## Write sanitized data to csv

```{r}
write.csv(data.in, file = outfile)
```
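By default, `write.csv()` also writes the row names as an unnamed first column, which shows up as an extra column when the file is read back. If that is not wanted, `row.names = FALSE` suppresses it. A sketch with a made-up data frame and a temporary file:

```r
# Made-up data frame standing in for the sanitized data.
demo <- data.frame(country = c("Australia", "United Kingdom"),
                   pop = c(1000, 2000))

tmp <- tempfile(fileext = ".csv")
write.csv(demo, file = tmp, row.names = FALSE)  # no row-name column

# Reading the file back yields only the original columns.
back <- read.csv(tmp)
names(back)
```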
