Like most scripting languages, R can import data from a CSV (comma-separated values) file. A nice feature is that it automatically loads the data into a data.frame object, which means it can easily be manipulated.
Let's walk through a simple example on how to do this.
The data file we are going to load, storesales.csv, has the contents below.
StoreID,City,State,Year,Sales
1,"Providence","RI",2015,1200000.00
2,"Boston","MA",2015,4000000.00
3,"Bangor","ME",2015,2300000.00
4,"Portsmouth","NH",2015,1200000.00
To avoid hard-coding the file path, we can use the Sys.getenv function to build it from environment variables, so the same code works for any user on Windows (HOMEDRIVE and HOMEPATH are Windows environment variables). Note: The # character marks a comment, an ignored line used to document your code. paste is a function used to concatenate strings, but it puts a space between its arguments. paste0 does not add a space between the strings.
# Load a CSV file into a data frame...
# paste() adds a space between each parameter but paste0 does not...
samplepath = paste0(Sys.getenv("HOMEDRIVE"), Sys.getenv("HOMEPATH"), "\\Documents\\")
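To make the difference between the two functions concrete, here is a minimal sketch. The path pieces, including the folder name someuser, are made-up values for illustration, and file.path is an alternative that the code above does not use.

```r
# paste() inserts a space between its arguments by default
with_space <- paste("C:", "Users")
print(with_space)      # "C: Users"

# paste0() concatenates with no separator at all
without_space <- paste0("C:", "Users")
print(without_space)   # "C:Users"

# file.path() joins its pieces with "/" (hypothetical folder names)
joined <- file.path("C:", "Users", "someuser", "Documents")
print(joined)          # "C:/Users/someuser/Documents"
```

Because file.path supplies the separators itself, there are no backslashes to escape.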
Like Bash and PowerShell, R always points to a working folder. The setwd function, which stands for set working directory, points R to the path passed as its parameter, as shown below. The getwd function, which stands for get working directory, displays the current folder path so we can confirm we are pointing to the correct folder.
# Tip: forward slashes also work as path separators and avoid the doubled (escaped) backslashes used above
setwd(samplepath)
# Confirm we are in the right folder...
getwd()
Which displays...
[1] "C:/Users/BryanCafferky/Documents"
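If you want to change folders only temporarily, one possible pattern is sketched below. It relies on setwd returning the previous working directory, and it uses tempdir() purely as a stand-in for a folder that is sure to exist. The variable name old_dir is just for illustration.

```r
# setwd() invisibly returns the directory we were in before the change
old_dir <- setwd(tempdir())   # tempdir() is only a folder guaranteed to exist

# Confirm where we are now
print(getwd())

# Put things back the way they were
setwd(old_dir)
print(getwd())
```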
Now the fun part. Let's load the simple CSV file into a variable named mydata.
# Load the data...
mydata <- read.csv("storesales.csv") # read csv file
After the line above has loaded the file, we can confirm it is indeed a data.frame class by using the class function and passing the variable mydata as the parameter.
# Confirm this was returned as a data.frame...
class(mydata)
Which shows it is a data.frame...
[1] "data.frame"
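If you do not have storesales.csv handy, the sketch below recreates it in a temporary file and loads it, so the steps so far can be run as-is. The temporary file and the variable names csv_lines and csv_file are assumptions made for this example only.

```r
# Recreate the sample data (a stand-in for storesales.csv)
csv_lines <- c(
  'StoreID,City,State,Year,Sales',
  '1,"Providence","RI",2015,1200000.00',
  '2,"Boston","MA",2015,4000000.00',
  '3,"Bangor","ME",2015,2300000.00',
  '4,"Portsmouth","NH",2015,1200000.00'
)
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_file)

# Load the file and check what came back
mydata <- read.csv(csv_file)
print(class(mydata))   # "data.frame"
print(dim(mydata))     # 4 rows, 5 columns
print(names(mydata))   # column names taken from the header line
```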
If a variable name is on a line by itself, R automatically displays its contents, much as PowerShell does. Let's display mydata.
# Display the data...
mydata
We should see the data below...
  StoreID       City State Year   Sales
1       1 Providence    RI 2015 1200000
2       2     Boston    MA 2015 4000000
3       3     Bangor    ME 2015 2300000
4       4 Portsmouth    NH 2015 1200000
Let's play with accessing the data.frame by using its indexes, i.e. the row number and column number. To display the element at row 2, column 3, we would enter the statement below.
# We can access data by using the subscript (row and column)
# Get row 2, column 3...
mydata[2,3]
We see the output below.
[1] MA
Levels: MA ME NH RI
What happened? What are levels? Good question. When strings are loaded into a data.frame, R automatically converts them to factors (this was the default before R 4.0.0; newer versions keep them as plain strings unless you ask for factors). A factor is something R assumes you will want to sort and group by, like a dimension attribute, so it indexes the strings and stores a distinct list of their values, called levels. This enhances performance if R is correct about how you want to use the string. Notice that R displayed the value we asked for but also all the distinct values for State. Internally, R replaces each string with an integer, so a factor is really like an enumeration. OK, so an enumeration is just a fancy word for numeric codes standing for string values, e.g. 1 = Active, 2 = Cancelled, 3 = Not Started. Can you stop R from doing this string conversion? Yes. And I will discuss how in another blog. Hey, I gotta keep you coming back, right?
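To peek at the integer codes behind a factor, here is a small sketch that builds one by hand from the four State values rather than reading it from the file. The variable name states is made up for the example.

```r
# Build a factor from the State values in file order
states <- factor(c("RI", "MA", "ME", "NH"))

# The distinct values (levels) are stored once, in sorted order
print(levels(states))         # "MA" "ME" "NH" "RI"

# Each element is really an integer pointing into the levels
print(as.integer(states))     # 4 1 2 3

# Converting back recovers the original strings
print(as.character(states))   # "RI" "MA" "ME" "NH"
```

So RI is stored as 4 because it is the fourth level, which is exactly the enumeration idea described above.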
If we include the comma separator but omit the second parameter, i.e. the column index, we will get all the columns for the row numbers specified. Below we should see all of the columns in row 2 returned.
# Get row 2...
mydata[2,]
  StoreID   City State Year Sales
2       2 Boston    MA 2015 4e+06
Now let's display the third column for all rows.
# Get column 3...
mydata[,3]
[1] RI MA ME NH
Levels: MA ME NH RI
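The same bracket syntax also accepts several positions at once, or a column name in place of a number. The sketch below is self-contained, so it rebuilds the data from text instead of the file. The csv_text variable is an assumption for this example, and stringsAsFactors = TRUE is passed only so that the strings become factors as in the output shown in this post, whatever your R version.

```r
csv_text <- 'StoreID,City,State,Year,Sales
1,"Providence","RI",2015,1200000.00
2,"Boston","MA",2015,4000000.00
3,"Bangor","ME",2015,2300000.00
4,"Portsmouth","NH",2015,1200000.00'
mydata <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Rows 1 and 3, all columns
print(mydata[c(1, 3), ])

# All rows, columns 2 and 5 (City and Sales)
print(mydata[, c(2, 5)])

# A column name works in place of its position
print(mydata[2, "State"])

# The $ operator extracts a single column by name
print(mydata$Sales)
```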
A quirk of R is that when you extract data from an object such as a matrix or a data.frame, it does not always return the same type back. In this case, R returned a vector, i.e. a one-dimensional array, rather than a data.frame. Let's prove it by using the class function again. Note: This behavior can matter inside a function, so you probably want to test the return types.
class(mydata[,3])
[1] "factor"
We got factor, i.e. this is a vector of class factor, not a data.frame.
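If you would rather get a data.frame back even for a single column, the bracket operator takes a drop argument. The sketch below shows the difference. It rebuilds the data from text so it runs on its own, csv_text is an assumption for this example, and stringsAsFactors = TRUE is there only to reproduce the factor column shown in this post.

```r
csv_text <- 'StoreID,City,State,Year,Sales
1,"Providence","RI",2015,1200000.00
2,"Boston","MA",2015,4000000.00
3,"Bangor","ME",2015,2300000.00
4,"Portsmouth","NH",2015,1200000.00'
mydata <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# By default a single column is simplified to a vector
one_column_vector <- mydata[, 3]
print(class(one_column_vector))   # "factor"

# drop = FALSE keeps the data.frame wrapper
one_column_frame <- mydata[, 3, drop = FALSE]
print(class(one_column_frame))    # "data.frame"

# Selecting a single row stays a data.frame either way
print(class(mydata[2, ]))         # "data.frame"
```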
Since R is a statistical language, let's get some statistics on mydata with the summary function.
# Get stats on the data...
summary(mydata)
    StoreID             City   State       Year          Sales
 Min.   :1.00   Bangor    :1   MA:1   Min.   :2015   Min.   :1200000
 1st Qu.:1.75   Boston    :1   ME:1   1st Qu.:2015   1st Qu.:1200000
 Median :2.50   Portsmouth:1   NH:1   Median :2015   Median :1750000
 Mean   :2.50   Providence:1   RI:1   Mean   :2015   Mean   :2175000
 3rd Qu.:3.25                         3rd Qu.:2015   3rd Qu.:2725000
 Max.   :4.00                         Max.   :2015   Max.   :4000000
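The individual numbers in the Sales column of that summary can be reproduced one at a time, which is a handy way to see what summary is doing. This is a minimal sketch that rebuilds the data from text so it can run by itself; csv_text is an assumption for this example.

```r
csv_text <- 'StoreID,City,State,Year,Sales
1,"Providence","RI",2015,1200000.00
2,"Boston","MA",2015,4000000.00
3,"Bangor","ME",2015,2300000.00
4,"Portsmouth","NH",2015,1200000.00'
mydata <- read.csv(text = csv_text)

# Summary of just one column
print(summary(mydata$Sales))

# The same figures, computed individually
print(mean(mydata$Sales))             # 2175000
print(median(mydata$Sales))           # 1750000
print(quantile(mydata$Sales, 0.75))   # 3rd quartile: 2725000
```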
So we covered a lot of ground here. We learned about the data.frame, accessing data by indexes, factors, vectors, and got a flavor of why R is such a good statistical language.
Oh and sorry, I should have warned you this blog is rated R. :-)