Monday, March 14, 2016

Using Data Frames in R


Like most scripting languages, R can import data from a CSV (comma-separated values) file. A nice feature is that it automatically loads the data into a data.frame object, which means it can easily be manipulated.

Let's walk through a simple example of how to do this.

The data file we are going to load, storesales.csv, has the contents below.

StoreID,City,State,Year,Sales
1,"Providence","RI",2015,1200000.00
2,"Boston","MA",2015,4000000.00
3,"Bangor","ME",2015,2300000.00
4,"Portsmouth","NH",2015,1200000.00


To avoid hard coding the file path, we can use the Sys.getenv function to build it from environment variables, so it works for any user on Windows (HOMEDRIVE and HOMEPATH are Windows environment variables). Note: The # character marks a comment, which R ignores and which you use to document your code. paste is a function used to concatenate strings, but it puts a space between each parameter. paste0 does not add a space between the strings.


#  Load a CSV file into a data frame...

# paste() adds a space between each parameter but paste0 does not...
samplepath <- paste0(Sys.getenv("HOMEDRIVE"), Sys.getenv("HOMEPATH"), "\\Documents\\")
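
To see the difference between paste and paste0 for yourself, here is a minimal sketch. The folder names are made up for illustration, and file.path is a base R function that the code above does not use; it joins path pieces with a forward slash, which may be handy when you do not want to think about separators at all.

```r
# paste() separates its arguments with a space by default...
with_space <- paste("C:", "Users")
print(with_space)                     # "C: Users"

# ...while paste0() joins them with nothing in between.
no_space <- paste0("C:", "/Users")
print(no_space)                       # "C:/Users"

# paste() with sep = "" behaves like paste0().
same_as_paste0 <- paste("C:", "/Users", sep = "")

# file.path() joins the pieces with "/" (hypothetical folder names).
joined <- file.path("C:", "Users", "someuser", "Documents")
print(joined)                         # "C:/Users/someuser/Documents"
```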


Like Bash and PowerShell, R always points to a working folder. The setwd function, which stands for set working directory, points R to the path passed as its parameter, as shown below. The getwd function, which stands for get working directory, displays the current folder path so we can confirm we are pointing to the correct folder.
 
# Tip: forward slashes also work as path separators and avoid the need for double backslashes (escape sequences)
setwd(samplepath)


# Confirm we are in the right folder...
getwd()


Which displays...
 
[1] "C:/Users/BryanCafferky/Documents"
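
One detail worth knowing: setwd hands back the previous working folder (invisibly), so you can switch folders and then switch back. The sketch below is only an illustration; it uses the session's temporary folder from tempdir() as a stand-in target rather than the Documents folder used above.

```r
# setwd() returns the folder we were in before the change, so keep it.
old_dir <- setwd(tempdir())

# We are now pointing at the session's temporary folder.
print(getwd())

# Switch back so the rest of the session is unaffected.
setwd(old_dir)
print(getwd())
```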


Now the fun part. Let's load the simple CSV file into a variable named mydata.

# Load the data...
mydata <- read.csv("storesales.csv")  # read csv file 
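
If you don't have storesales.csv handy, a minimal sketch like the one below writes the same rows to a temporary file and reads them back. tempfile and writeLines are base R functions; the csv_text and csv_file names are just placeholders for this illustration.

```r
# The same contents as storesales.csv, held in a string for this sketch.
csv_text <- 'StoreID,City,State,Year,Sales
1,"Providence","RI",2015,1200000.00
2,"Boston","MA",2015,4000000.00
3,"Bangor","ME",2015,2300000.00
4,"Portsmouth","NH",2015,1200000.00'

# Write the text to a temporary .csv file (hypothetical stand-in for the real file).
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_file)

# Load it exactly as we would load storesales.csv.
mydata <- read.csv(csv_file)

print(nrow(mydata))    # number of data rows (the header is not counted)
print(names(mydata))   # column names taken from the header line
```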




After the line above has loaded the file, we can confirm that mydata is indeed a data.frame by passing it to the class function.

# Confirm this was returned as a data.frame...
class(mydata)



Which shows it is a data.frame...

[1] "data.frame"
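
Besides class, base R has a few other functions for checking what you loaded. The sketch below rebuilds the same data by hand with data.frame() so it runs on its own; that constructor call is my stand-in for the CSV load, not part of the walkthrough above.

```r
# Rebuild the store sales data by hand so this example is self-contained.
mydata <- data.frame(
  StoreID = 1:4,
  City    = c("Providence", "Boston", "Bangor", "Portsmouth"),
  State   = c("RI", "MA", "ME", "NH"),
  Year    = rep(2015L, 4),
  Sales   = c(1200000, 4000000, 2300000, 1200000)
)

print(is.data.frame(mydata))   # TRUE
print(dim(mydata))             # rows, then columns: 4 5
print(names(mydata))           # the column names

# str() prints a compact summary of each column and its type.
str(mydata)
```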



If a variable name is on a line by itself, R automatically displays its contents, much as PowerShell does. Let's display mydata.

# Display the data...
mydata


We should see the data below...

  StoreID       City State Year   Sales
1       1 Providence    RI 2015 1200000
2       2     Boston    MA 2015 4000000
3       3     Bangor    ME 2015 2300000
4       4 Portsmouth    NH 2015 1200000



Let's play with accessing the data.frame by using its indexes, i.e. the row number and column number.  To display the element at row 2, column 3, we would enter the statement below.

# We can access data by using the subscript (row and column)

# Get row 2, column 3...
mydata[2,3]



We see the values below.

[1] MA
Levels: MA ME NH RI

What happened? What are levels? Good question. In versions of R before 4.0.0, when strings are loaded into a data.frame, R automatically converts them to a factor by default (from 4.0.0 on, strings stay as character vectors unless you ask for factors). A factor is something that R assumes you will want to sort and group by, i.e. like a dimension attribute, so it indexes the string and stores a distinct list of values for it. This enhances performance if R is correct about how you want to use the string. Notice that R displayed the value we asked for but also all the distinct values for State. Internally, R replaces the string with an integer, so it is really like the concept of an enumeration. Ok, so an enumeration is just a fancy word for numeric codes standing for string values, i.e. 1 = Active, 2 = Cancelled, 3 = Not Started. Can you stop R from doing this string conversion? Yes. And I will discuss how in another blog post. Hey, I gotta keep you coming back, right?
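
To make the enumeration idea concrete, here is a small sketch that builds a factor directly from the four state codes with the factor function, rather than going through read.csv. The variable name state is just for illustration.

```r
# Build a factor from the State values in the order they appear in the file.
state <- factor(c("RI", "MA", "ME", "NH"))

# The distinct values, stored once and sorted.
print(levels(state))         # "MA" "ME" "NH" "RI"

# The integer codes R keeps internally, one per element.
print(as.integer(state))     # 4 1 2 3

# Converting back gives the original strings.
print(as.character(state))   # "RI" "MA" "ME" "NH"
```

So RI is stored as 4 because it is the fourth level, MA as 1, and so on.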


If we include the comma separator but omit the second parameter, i.e. the column index, we will get all the columns for the row numbers specified.  Below we should see all of the columns in row 2 returned.
 
# Get row 2...
mydata[2,]



  StoreID   City State Year Sales
2       2 Boston    MA 2015 4e+06
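
The row index does not have to be a single number. As a rough sketch, the example below picks several rows at once and then picks rows by a condition; the data frame is rebuilt by hand here so the example runs on its own, and the 2,000,000 threshold is an arbitrary value chosen for illustration.

```r
# Rebuild the store sales data by hand so this example is self-contained.
mydata <- data.frame(
  StoreID = 1:4,
  City    = c("Providence", "Boston", "Bangor", "Portsmouth"),
  State   = c("RI", "MA", "ME", "NH"),
  Year    = rep(2015L, 4),
  Sales   = c(1200000, 4000000, 2300000, 1200000)
)

# Get rows 1 and 3, all columns...
first_and_third <- mydata[c(1, 3), ]
print(first_and_third)

# Get only the rows where Sales is above 2,000,000 (a logical row index)...
big_stores <- mydata[mydata$Sales > 2000000, ]
print(big_stores)
```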
  


Now let's display the third column for all rows.


# Get column 3...
mydata[,3]


[1] RI MA ME NH
Levels: MA ME NH RI
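
Column numbers are easy to get wrong once a file has more than a few columns, so it may help to know that a column can also be picked by name. The sketch below shows a few equivalent spellings; the data frame is rebuilt by hand so the example is self-contained.

```r
# Rebuild the store sales data by hand so this example is self-contained.
mydata <- data.frame(
  StoreID = 1:4,
  City    = c("Providence", "Boston", "Bangor", "Portsmouth"),
  State   = c("RI", "MA", "ME", "NH"),
  Year    = rep(2015L, 4),
  Sales   = c(1200000, 4000000, 2300000, 1200000)
)

by_position <- mydata[, 3]          # column 3 by number
by_name     <- mydata[, "State"]    # the same column by name
by_dollar   <- mydata$State         # shorthand using $
by_double   <- mydata[["State"]]    # double brackets also extract one column

# All four spellings return the same vector.
print(identical(by_position, by_name))
print(identical(by_name, by_dollar))
print(identical(by_dollar, by_double))
```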

A quirk of R is that when you extract data from an object such as a matrix or a data.frame, you do not always get the same type back. In this case, R returned a vector, i.e. a one dimensional array, rather than a data.frame. Let's prove it by using the class function again. Note: This behavior matters inside functions, where the code that follows may expect a data.frame, so you probably want to test the return types.


class(mydata[,3])
 
[1] "factor"
 
We got factor, which is a class built into base R, i.e. this is a vector of type factor, not a data.frame.
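
If you would rather get a data.frame back even for a single column, base R's bracket operator takes a drop argument. The sketch below is one way to see the difference; it rebuilds the data by hand and passes stringsAsFactors = TRUE so the State column is a factor, matching the output shown above.

```r
# Rebuild the store sales data; stringsAsFactors = TRUE makes strings factors.
mydata <- data.frame(
  StoreID = 1:4,
  City    = c("Providence", "Boston", "Bangor", "Portsmouth"),
  State   = c("RI", "MA", "ME", "NH"),
  Year    = rep(2015L, 4),
  Sales   = c(1200000, 4000000, 2300000, 1200000),
  stringsAsFactors = TRUE
)

# Default behavior: a single column is simplified to a vector.
print(class(mydata[, 3]))                 # "factor"

# drop = FALSE keeps the result as a one-column data.frame.
print(class(mydata[, 3, drop = FALSE]))   # "data.frame"

# Indexing with no comma also keeps the data.frame.
print(class(mydata[3]))                   # "data.frame"
```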

Since R is a statistical language, let's get some statistics on mydata with the summary function.

# Get stats on the data...
summary(mydata) 


   StoreID             City   State       Year          Sales        
 Min.   :1.00   Bangor    :1   MA:1   Min.   :2015   Min.   :1200000  
 1st Qu.:1.75   Boston    :1   ME:1   1st Qu.:2015   1st Qu.:1200000  
 Median :2.50   Portsmouth:1   NH:1   Median :2015   Median :1750000  
 Mean   :2.50   Providence:1   RI:1   Mean   :2015   Mean   :2175000  
 3rd Qu.:3.25                         3rd Qu.:2015   3rd Qu.:2725000  
 Max.   :4.00                         Max.   :2015   Max.   :4000000  
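
The same numbers can be pulled out one at a time with base R functions applied to a single column. Here is a small sketch against the Sales column; the data frame is rebuilt by hand so the example runs on its own.

```r
# Rebuild the store sales data by hand so this example is self-contained.
mydata <- data.frame(
  StoreID = 1:4,
  City    = c("Providence", "Boston", "Bangor", "Portsmouth"),
  State   = c("RI", "MA", "ME", "NH"),
  Year    = rep(2015L, 4),
  Sales   = c(1200000, 4000000, 2300000, 1200000)
)

total_sales  <- sum(mydata$Sales)      # 8700000
mean_sales   <- mean(mydata$Sales)     # 2175000, matching the summary above
median_sales <- median(mydata$Sales)   # 1750000, matching the summary above

# which.max() gives the row number of the largest value.
top_city <- as.character(mydata[which.max(mydata$Sales), "City"])

print(total_sales)
print(mean_sales)
print(median_sales)
print(top_city)                        # "Boston"
```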

So we covered a lot of ground here.  We learned about the data.frame, accessing data by indexes, factors, vectors, and got a flavor of why R is such a good statistical language.  

Oh and sorry, I should have warned you this blog is rated R.  :-)