Coursera: Getting and cleaning data (My Course Notes)

Goal of this course is the first 3 of this sequence

Raw Data-> Processing script-> tidy data -> data analysis -> data communication
Raw and processed data(1)
  • Data are values of qualitative or quantitative variables, belonging to a set of items.
    • variables are a measurement or characteristic of an item
  • Raw data
    • The original source of data
    • Has not been subjected to processing or any other manipulation.
    • No software has been run on the data
    • the numbers on the data have not been manipulated
    • Data has not been removed from the data set
    • The data hasn’t been summarized in any way.
  • Processed/Tidy data
    • Data that is ready for analysis
    • processing can include merging, subsetting, transforming
    • All steps should be recorded.
    • Each variable must be in one column
    • each different observation of a variable should be in a different row
    • There should be one table for each kind of variable
    • If you have multiple tables, they should include a column in the table that allows them to be linked.
Components of Tidy data(1)
  • When you go from raw to tidy data, you should have
    • The raw data
    • A tidy data set
    • A code book describing each variable and its values in the tidy data set
    • An explicit and exact recipe used to go from raw data to tidy data
  • Code book
    • Includes information about the variables(including units) in the data set not contained in the tidy data
    • Includes information about the summary choices you made
    • Information about the experimental study design you used
    • Should have a section called "Study Design" that has a thorough description of how you collected the data
    • There should be a section called "Code book" that describes each variable and its units
  • The instruction list
    • A script in R or python
    • The input for the script is the raw data
    • the output for the script is the processed, tidy data
    • there are no parameters in the script.
    • It might be impossible to script every step. Instead provide instructions like this
      • Step 1= Take the raw file, run version @3.1.2 of @summarize software with parameters a=1, b=2, c=3
      • Step 2= run the software separately for each sample
      • Step 3= take column three of outfile.out for each sample and that is the corresponding row in the output data set.
Downloading Files
  • R can be used to download files
  • This allows the downloading process to be put in the R script.
    • Getting/Setting your working directory
      • The 2 main commands are getwd() and setwd()
        • They both work with absolute and relative paths
          • Absolute paths: "C:\\Users\\Andrew\\Downloads"
          • Relative Paths: ".\\data" or "..\\"
            • \\ is used in windows instead of / in other OSs
    • Checking for and creating directories
      • file.exists("@directoryName") checks if the directory exists
      • dir.create("@directoryName") creates a directory if it doesn’t exist.
  • To download from the internet
    • usedownload.file()
      • parameters are url, destfile, method.

                         e.g. download.file(fileurl, destfile="./@Directory/@File.@Format". [method = "@method"])

      • @method is curl for Https websites, not necessary
    • helps with reproducibility
    • Note the time the file was downloaded
      • use date()
Reading local Data Files
  • read.table() is the main function for reading data int R
  • Read the data into RAM
  • parameters are file, header, sep, row.names, nrows
    • quote: Used to tell R whether there are any quoted values. quote="" means no quotes
      • Sometime ‘ or " are placed in data values and cause issues when reading files, setting quote="" resolves this.
    • na.strings: sets the character that represents a missing value
    • nrows: how many rows to read from the files
    • skip: number of lines to skip before starting to read
  • Related functions are read.csv and read.csv1
    • read.csv sets sep=, and header = TRUE by default
Reading Fixed Width files
  • use read.fwf().
    • the widths option specifies an integer vector that is used to specify the width of each column.
Reading Excel files
  • requires the xlsx package
    • Install using install.packages("xlsx")
  • read.xlsx() 
    • parameters are files, sheetIndex, header
      • Sheetindex is used to state which Excel sheet the data is stored on.
    • Allows you read specific rows and columns
      • Use parameters colIndex and rowIndex
        • They accept vectors showing the columns and rows to read.
  • Write.xlsx() writes out an Excel file with similar arguments
  • XLConnect package has more options for writing and manipulating excel files.
    • XLConnect.pdf
Reading XML
  • XML consists of markup and content
    • markup are labels that give the text structure.
    • content are the actual text of the document.
  • Tags, elements and attributes
    • Tags correspond to labels. Applied to particular parts of the text so it’s structured.
      • start tags <section>
      • end tags </section>
      • empty tags <line-break /> 
    • Elements are specific examples of tags
      • <Greeting> Hello, world </Greeting>
    • Attributes are components of the label
      • <img src="jeff.jpg" alt="instructor"/>
        • src and alt are attributes
      • <step number = "3"> connect A to B </step>
  • Requires the xml package
    • xmlTreeParse(@FileURL, useInternal=TRUE/FALSE) to parse out the XML file.
      • Loads the document into memory for parsing
      • If you’re reading an html file, use htmlTreeparse() instead
      • useInternal = TRUE if you want to get all the nodes inside the document.
    • xmlRoot() is used to remove the XML version header of the XML document. output is a parsed xmlObject
    • xmlName() is used to get the tag of the document’s root.
    • use names(@RootTag) to get the tags that are just below @RootTag
  • You can directly access parts of an XML document
    • Accessed the same way a list is accessed
      • using [[@ElementNumber]]
      • To access subelements, 
        • [[@elementNumber]][[@subelementNumber]]
  • Can be done programatically
    • use xmlSApply(@ParsedXMLObject, @FunctionToApply)
      • @FunctionToApply can be xmlValue if you want to return all the contents of each children elements as a single string.
    • Xpath programming can be used to specify further what element to return.
      • XML.pdf
      • Use xpathSApply(@ParsedXMLObject, "@XpathNotationOfNodeToExtract", @FunctionToApply)
        • @FunctionToApply usually is xmlValue
        • @XpathNotationOfNodeToExtract
          • /node: Used to extract top level node
          • //@node: used to extract a node at any level
            • e.g. to extract the node named house use //house.
          • @nodePath[@attr-name]: @Node with an attribute name
          • @nodePath[@attr-name=’@Value’] @Node with tag @attr-name that has a value @Value.
            • @nodePath is the path to the node that contains the tag @attr_name. usually in the format //@OutputOfxmlName()/@ParentTag/@ParentTag
Reading JSON
  • JavaScript Object Notation(JSON) is a lightweight data storage format often used for APIs.
  • It’s similar to XML but has a different syntax/format.
  • requires the jsonlite package
  • use fromJSON(@FileToRead) to read a JSON file.
    • returns a data frame
  • To access a variable
    • @dataFrame$VariableName
  • To write data frames to JSON
    • use toJSON()
      • the pretty parameter indents the JSON document
Using data.table
  • inherits from data.frame.
    • All functions that accept data.frame accept data.table
  • Faster than data.frame at subsetting, grouping and updating
  • requires data.table package.
  • use tables() to see all the data tables in memory
  • data tables allow subsetting rows
    • @DataTable[@RowNum,]
      • To return just the @rowNum row 
      • The same as @DataTable[@RowNum]
    • @DataTable[@DataTable$@columnName=="@value",]
      • To return all rows where the column @columnName has a value @Value
  • To subset columns in Data tables,
    • you cannot subset data tables the same way you subset data frames
      • @DataTable[,c(@columnIndex1, @columnIndex2)] doesn’t work.
    • requires expressions 
      • They can be used to summarize datasets
        • @DataTable[,@OutputDataClass(@function1(@ColumnName),@function2(@columnName2))]
          • @OutputDataClass can be list, table, vector e.t.c
          • as
  • To add a new column to a Data table, use := operator

                    @DataTable[,@NewColumnName:=@ExistingColumnName @Operator @Value]

    • This allows you carry out multi step operations as expressions are involved.

                         @DataTable[, @NewColumnName:= { tmp<- (@ColumnName1+ @ColumnName2); @Function(tmp @Operator @value)}]

    • Allows you carry out plyr like operations

                    @DataTable[,@NewColumnName:=  @ColumnName1 > Value]

      • Creates a new column that has boolean values
      • also allows you use the by keyword

                                   @DataTable[,b:= mean(x+w), by=a]

        • Where a is a column with boolean values.
        • hence rows where a = TRUE will have the same value in b i.e. the calculation for the mean will only considers x and w values from those rows. also rows where a=FALSE will have the same values in b
  • Data tables must be copied explicitly using copy(). 
  • Special variables
    • .N
      • Means count the number of times something occurs
      • E.g.

                                   @dataTable[, .N, by=x]

        • this counts the number of times the distinct elements in the x column occurs.
  • Keys
    • Allows rapid subsetting and sorting.



      • The above automatically subsets the datatable based on the key.
        • returns all rows where the value of x is a
  • Joins
    • Joins are based on keys
    • Data tables can be joined based on a similar element being set as the key.
  • For fast reading
    • done using fread()
Reading from MySQL

                         @ConnectionName<- dbConnect(MySQL(),user="@userName", host="@Hostname")

    • To connect specifically to a database on an instance

                         @DBName<-dbConnect(MySQL(), user="@UserName", db="@DBName", host="@HostName")

    • To run a command

                         dbGetQuery(@ConnectionName, "@DBQuery")

    • To disconnect from an instance


    • use dbListTables() to show the tables in a Database


    • To see the column names of a table, use dbListFields()

                         dbListFields(@DBName, "@TableName")

    • To run a query on a table

                         dbGetQuery(@DBName, "@SQLQuery")

    • To read a whole table to a R object

                         @RObjectToStoreTable<- dbReadTable(@DBName, "@TableName")

      • To read a subset of the Table

                                   @RObjectToSendQuery<- dbSendQuery(@DBName, " select * from @TableName @whereConditions")

                                   @RObjectToContainResult<- fetch(@RObjectToSendQuery[,[n=@NoOfRowsToRead] );

        • To clear the result of a query on the instance’s memory.


Reading data from HDF5
  • Hierarchical Data Format is used for storing large data sets.
    • Data is stored in groups containing zero or more data sets and their metadata.
      • Each group has a group header with a group name and a list of its attributes.
      • Has a group symbol table with a list of objects in the group.
    • Data sets are multidimensional arrays of data elements with metadata
      • They have a header with name, datatype, dataspace and storage layout.
      • Have a data array with the data.
  • To interface with HDF5, you need to install packages from bioconductor


    • Install the bio base packages


  • Use h5createFile() to create a hdf5 file.


    • To create group

                         @RObjectForH5 = h5createGroup("@Filename.h5", "@GroupName[/@SubGroupName]")

  • To list the contents of an hdf5 file use h5ls()


  • to write to a group, use h5write()

               h5write(@Robject, "@FileName.h5", "@GroupName[/@SubGroupName]")

    • If the destination Group(@GroupName[/@SubGroupName]") doesn’t exist, then it is created.
    • You can write different data classes e.g. dataframes, arrays, matrices.
    • You can direct data to be written to specific places using the index option

                         h5write(@ObjectToWrite, "@FileName.h5", "@GroupName[/@SubGroupName], index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo)

  • To read data, use h5read()

               readA=h5read("@FileName.h5", "@GroupName[/@SubGroupName]")

    • You can direct data specific data to be read using the index option

                         h5read("@FileName.h5", "@GroupName[/@SubGroupName]", index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo) )

  • The HDF5 tutorial rhdf5.pdf
Reading data from the Web
Scope is data scraping, authentication and interfacing with APIs
  • Web scraping is programmatically extracting data from the HTML code of websites
    • Read the data into R using
      • readLines()

                                   @ConnectionName<- url("@WebAddress")

                                   @RObject<- readLines(@ConnectionName)
        • Close the connection after you’re done


      • htmlTreeParse()

                                   @HTMLobject <- htmlTreeParse(@url, useInternalNodes=T)

        • htmlTreeParse in in XML package
        • You can specify the nodes to read using xpathSApply()
      • GET() in the httr package

                                   @RObjectToStoreTheHtml <- get(@URL)

                                   @RobjectToStoreTheContent <- (@RObjectToStoreTheHtml , as="text")
                                   @RobjectToStoreParsedHTML<- htmlParse(@RobjectToStoreTheContent ,asText=TRUE)

        • GET() is in httr package
  • Accessing websites with passwords
    • GET allows you authenticate yourself for websites.

                         @DestinationRObject<- GET("@URL",authenticate("@UserName","@Password"))

    • Handles allow you save authentication to a website.


                         @DestinationRObject= GET(handle=@RobjectForURL , authenticate("@UserName","@Password")

      • Handle() is in the httr package
      • To access the website, simple refer to it using the handle and it’ll use the previous authentication via cookies.

                                   GET(handle=@RobjectForURL , path="/")

Reading data from APIs
  • Done using the httr package
  • You need to create an account with the API organization.
  • To access twitter form R

               myapp=oauth_app("Twitter", key= "@yourConsumerKeyHere", secret="@YourConsumerSecretHere")

               sig=sign_oauth1.0(myapp, token="@yourTokenHere", token_secret="@yourTokenSecretHere")

Reading from other sources
  • Interacting with files
    • file(): to open a connection to a text file
    • url(): to open a connection to a url
      • To read a specific line from a url, use readLines()

                                   @connection<- url(@URL)

                                   @Robject<- readLines(@connection)

    • gzfile(): to open a connection to a .gz file
    • see ?connections for information on other connection mechanisms
    • always 
  • The foreign package allows you interact with other statistical programming languages
    • Loads data from minitab(read.mtp), SAS(read.xport), SPSS(read.spss), stata(read.dta)
    • library(foreign)
  • RODBC provides interface amultiple DBMSs
  • You can read images using jpeg(), readbitmap(), png() directly into R and manipulate them
  • Reading Geographic Information Systems(GIS) data using rdgal(), rgeos(), raster()
  • You can read MP3 files using tuneR packages or seewave packages
Subsetting and sorting
  • Subsetting can be done based on rows and columns
    • Based on Columns

                         @Data[, @ColumnOrVectorOfColumns]

      • @columnOrVectorOfColumns specifies the columns to select from the dataset
    • Based on rows


      • @rowOrVectorOfRows specifies the rows to select from the dataset
      • You can also subset on rows based on conditions

                                   @data[@Data$Column @ConditionalOperator @VectorOfValues,]

    • Based on rows and columns

                         @data[(@Column1 @Operator @value @CombinationOperator @column2 @operator @value),@Column1]

      • @CombinationOperator can be & for and or | for or.
    • Dealing with NA’s
      • Using which is the most efficient way of subsetting without including NAs

                                   Data[which(@Column1 @Operator @Value),]

  • Sorting 
    • To sort a column, use sort()


      • To sort in decreasing order, use the decreasing=TRUE option
      • To sort with NA’s at the end, use na.last=TRUE
    • To sort a data frame use sort()
      • This sorts the dataframe based on the values in the column. 


        • You can specify a tie breaker
    • You can also sort based on arrange() in the plyr package
  • Adding columns to a dataframe

                    @Data$@NewColumnName<- @VectorOfValueToInsertIntoColumn

    • you can use cbind

                         cbind(@Data, @VectorOfValuesToInsert)

  • Adding rows to a dataframe
    • Use rbind()
Summarizing Data
  • summary() can be used to summarize dataFrames.
  • str() can also be used for summarization. It shows the structure of the data.
  • quantile() can be used to summarize a column
    • To view the default of 5 quantiles(0,25, 50,75 & 100)

                         quantile(@Data$@Column, na.rm=true)

    • To view specific quantiles

                         quantile(@Data$@Column, probs=@vectorOfQuantiles

      • sample of @vectorOfQualities is c(0.5, 0.75, 0.9)
        • 0.5 is median or 50% percentile
  • Table can be used to make tables to summarize data


      • This returns a count of the distinct elements in @column.
      • Use the useNA="ifany" option if you want to see a count of NA values.
    • To view the relationship between the columns of @Data, input 2 values into table

                         table(@Data$@Column1, @Data$Column2)

      • Column1 becomes the row and @Column2 becomes the column. 
    • To view a count of all elements of a column that meet a particular condition

                         table(@Data$@Column @ConditionalOperator @VectorOfValues)

      • @conditionalOperator can be %in% to check if @vectorOfValues exists in @data
  • To count the number of NAs in a column


  • To check if missing values exist in a column.


    • It returns true is an NA exists
  • To check if every value in a column satisfy a condition, use all()

               all(@Data$@Column @Condition)

  • Row and column sums
    • You can calculate the sum of all the columns in a dataset using colSums()


      • To check if any of the columns in a dataset contains an NA value,
        •  apply colSums to the of the dataset



  • Crosstabs
    • used to identify where the relationships in a dataset exists.

                         xtabs(@VariableToDisplayInResult ~ @Variable1ToGroupBy [+ @Variable2ToGroupBy], data=@Data)

      • To group by all other variables, set @Variable1ToGroupBy to .
  • FlatTables
    • Often used to summarize data output of xtabs() into a compact form.


  • To see the size of a dataset


    • to specify the units of the output, use the unit=@Units command

                         object.size(@Data, unit="Mb")

Creating New variables
  • Sometimes variables are missing in a dataset. 
    • You can transform the dataset to get the values you want
    • Common missing variables are
      • Missingness indicators
      • cutting up quantitative variables
      • applying transforms
  • Creating sequences
    • You use seq()

                         seq(@FirstValue, @LastValue)

      • use the by = @value option to indicate the increment value to use in creating the sequence
      • use the length =@value option to indicate how many values should be created for the sequence.
    • To create an index of the values in a column/vector

                         seq(along = @vector)

      • The output is a sequence from 1 to n(@vector)
  • Subsetting variables

               @Data$@NewColumn= @Data$@Column %in% @vectorOfValues


    • This returns  a count of true and false observations of the conditions compared to create @NewColumn
  • Creating binary variables
    • use ifelse() to apply a value when a condition is true, and another when it is false

                         @Data$@NewColumn= ifelse(@Data$@Column @ComparisonOperator @Value, True, False)

      • To view a count of how many observations do not meet the condition,  use table() to compare @NewColumn to @Data$@Column @ComparisonOperator @Value

                                   table(@Data$NewColumn, @Data$@Column @ComparisonOperator @Value)

  • Creating categorical variables
    • Done using cut()
    • To split a variable based on it’s value, e.g. into quantiles 

                         @Data$NewColumn = cut(@Data$@Column, breaks=@MeasureToSplitData)


      • e.g. @MeasureToSplitData can be quantile(@Data$Column) to split the data into 5 quantiles.
      •  table(@Data$NewColumn) is Used to view the count of the groups @Column is split into.
      • You can compare @NewColumn to @Column using table() to view the relationship between them.
    • To split a variable into group based on quantiles
      • use cut2() in the Hmisc package

                                   @Data$@NewColumn = cut2(@Data$@Column, g=@NoOfQuantilesToSplitInto)

    • The result of a cut/cut2 is a factor variable.
  • Creating Factor variables
    • You can convert a variable into a factor

                         @Data$@ColumnToRepresentFactors<- factor(@Data$@Column)

    • To convert factor variable to numeric, use as.numeric(@Data$Column)
    • Alternatively, use transform to convert the variable type to factor

                         @Data<- transform(@data, @VariableToConvertToFactor= factor(@VariableToConvertToFactor))

      • You can use labels to change the value of the observations to a character that represents the factor

                                   @Data$Variable<- factor(@Data$variable, label=c("LabelForValue1", "LabelForValue2"…))

        • Sometimes, the number of variables are continuous(e.g decimal numbers with a variable of 0.1 and a range of 10.). You can split this into a limited number of factors using cut(). This involves creating tertiles.
          • First split the data into quantiles

                                                       @variableToStoreCutpoints<- quantiles(@Data$variable, seq(0,1, length=@NumberOfFactors), na.rm=TRUE)

          • Cut the data and define a new variable using the quantiles

                                                       data$newVariable <- cut(@Data$variable, @variableToStoreCutpoints)

Reshaping Data
  • melt is used on a data frame to select which columns are id and which are measure.

               melt(@Data, id=@VectorOfColumnsThatAreID, measure.vars= @VectorOfColumnsThatAreMeasures)

    • The output is an object with the @VectorOfColumnsThatAreMeasures where 2 new columns are made named variable and value. For each of the rows, the value of the column is put as the variable and it’s value set to the value of that column. 
  • A melted dataframe can be reshaped to a dataframe using dcast() or an array using acast()

               dcast(@Data, @Column ~variable[,@AggregationFunction])

    • This compares distinct(@column) to all the columns from the melt that are used as measure.vars and a summarized comparison showing just the count for each variable.
    • If @AggregationFunction is set to mean, it returns the mean of the variable for each distinct(@Column).
  • Averaging values
    • You can use tapply()

                         tapply(@Data$@ColumnToApplyFunctionTo, @Data$@ColumnToGroupBy, @functionToApply)

      • The result contains the sum of all ColumnToApplyFunctionTo grouped by ColumnToGroupBy if functionToApply is sum
    • You can use split()

                         split(@Data$@ColumnWithDataToMeasure, @Data$ColumnToGroupBy )

                         %>% lapply(sum)
                         %>% unlist

      • This returns a list with @ColumnWithDataToMeasure  grouped by @ColumnToGroupBy 
      • you can use lapply(sum) to get the sum of the groups
      • You can use unlist to convert the list to a vector.
    • You can use ddply
      • Part of the plyr package

                                   ddply(@Data, .(@columnToGroupBy), summarize, sum= sum(ColumnWithDataToMeasure ))


Managing data frames in R using dplyr
  • dplyr is a package developed to make it easy to work with data frames
  • dplyr expects
    • There is one observation per row
    • Each column represents a variable or measure or characteristic
    • An optimized version of plyr
  • The dplyr functions have a similar format
    • The first argument is a data frame
    • the subsequent arguments describe what to do with the first argument. 
    • You can refer to columns in the data frame directly without using $ operator
    • the result is a new data frame
    • Da
  • The dplyr verbs
    • select()
      • returns a subset of the columns of a data frame
    • filter
      • extracts a subset of rows from a data frame based on logical conditions
    • summarize
      • generates summary statistics of different variables in the data frame.
    • mutate
      • add new variables/columns or transform existing columns
    • arrange 
      • reorders rows of a data frame
      • preserves the order of the columns
    • rename
      • rename variables in a data frame
Merging data in R
  • Similar to linking as done in RDBMSs
  • merge() can be used to merge 2 dataframes x and y.
    • merge() has the following parameters which are used to specify which variable to use for the merge
      • x/y/by/by.x/by.y/all
    • By default it merges by the columns with a common name

                         merge(@DataFrameX, @DataFrameY, [by.x="@ColumnInXThatMatchesY", by.y="@ColumnInYThatMatchesX"]) 

      • if by.x and by.y aren’t specified, it’ll merge all columns that have the same name. If it’s possible for the value’s in both datasets to be different for the same named variable, by.x and by.y should be used to specify the column to use for joining.
      • all=TRUE means if a value appears in one of the datasets but not the other, it should include the value with NA for the variables in the other dataset
  • join()
    • in the plyr package
    • default is a left join.

                         join(@DataFrame1, @DataFrame2)

      • It joins columns with the same name.
    • join_all allows you join multiple dataframes.

                         list(@DataFrame1, @DataFrame2, @DataFrame3) %>% join_all

      • First combine the dataframes into a list
      • Then perform a join_all on the dataframe
      • They all need a column with the same name
Editing text variables
  • Manipulating text programmatically 
  • To see the column names of a dataframe,


    • use tolower() to convert it to small characters.
  • To split a string when a character is encountered, 

               strsplit(@Data, "@CharacterToSplitOn")

    • @CharacterToSplitOn is a special character, precede it with \\ e.g. "\\.".
    • strsplit() returns a list
    • You can use sapply to get the first elements from the list
      • Create a function that gets the first character

                                   firstElement<- function(x){x[1])

      • Use sapply to get the first elements from the list by applying a function to every element of the list.

                                   sapply(@list, firstElement)

  • To substitute out characters in a vector for others
    • To replace the first appearance of the character
      • use sub()

                                   sub("@CharacterToRemove", "@CharacterToPut", @Data)

    • To replace multiple appearances of the character
      • use gsub()

                                   gsub("@CharacterToRemove", "@CharacterToPut", @Data)

  • To search for specific values in variables
    • use grep()


      • It returns the row numbers of the observations with the string
      • to return the actual observation, use the option value=TRUE


      • to see the number of elements that satisfy the function, use length(@grepCommand)
    • use grepl()


      • This returns a logical vector with true for the observations with the string and false for observations without the string
      • You can use this to subset only cols where a string appears.

                                   @Data[grepl("@StringToSearchFor","@Data") ,]

  • The stringr package contains a lot of string manipulation functions
    • nchar(@Data)
      • To get the number of characters in a variable
    • substr(@Data, @firstPosition, @LastPostition)
      • This returns a string of the characters from the first to the last position
    • paste(@firstString, @lastString)
      • This joins 2 strings together and puts a space in between
    • paste0(@firstString, @lastString)
      • Joins the 2 strings together without a separating space
    • str_trim(@Data)
      • Removes the space characters at the end of a string
  • Characteristics of text in data sets
    • Names of variables should be
      • All lower case when possible
      • Descriptive
      • Not duplicated
      • Not have underscores, dots or white spaces
    • variables with character values
      • should be made into factor variables(depends on application)
      • should be descriptive(use TRUE/FALSE instead of 0/1)
Regular Expressions
  • They are often used with grep, grepl, sub,gsub and other string manipulation functions
  • Regular expressions can be viewed as a combination of literals and metacharacters
    • Metacharacters provide a way to represent
      • whitespace word boundaries
      • sets of literals
      • the beginning and end of a line
        • ^ represents the start of a line. 

                                             ^@String checks for observations with @String at the beginning of the line

        • $ represents the end of a line

                                             @String$ checks for observations with @String at the end of the line

      • alternatives("war" or "peace")
  • Character classes with []
    • To match a sequence of characters regardless of their cases, enclose the upper and lower cases in [] for each letter
      • [Bb][Uu][Ss][Hh] matches bush regardless of which of it’s alphabets being capital or small.
    • You can specify a range of characters i.e numbers or letters
      • to specify numbers 0 to 9, use [0-9]
      • to specify small case a to z, use [a-z] 
      • to specify capital case a to z, use [A-Z]
        • to specify either small or caps [a-zA-Z]
    • To match characters that do not satisfy a condition, use [^@C]
      • e.g [^?.]$ matches characters that do not end with ? or .
    • You can combine e.g. 
      • ^[Ii] matches only observations that the first character is I regardless of it’s case.
      • ^[0-9][a-zA-Z] matches only observation that the first character is a number and te second is an alphabet regardless of it’s case.
  • MetaCharacters
    • "."
      • The . means any character/ wild character
        • e.g. 9.11 matches 9A11, 9-11, 9.11, 9911.
    • "|"
      • | means or. It can be used to combine expressions that serve as alternatives to each other
        • e.g. food|water matches either food or water.
        • flood|earthquake|hurricane|wildfire matches either of any of the elements.
      • ^[Gg]ood|[Bbad] matches either Good, good at the beginning of the observation or Bad and bad somewhere in the observation
      • ^([Gg]ood|[Bbad]) matches either Good, good or Bad and bad at the beginning of the observation.
    • "?"
      • This indicates that the expression is optional
        • e.g. [Gg]eorge( [Ww]\.)? [Bb]ush matches "george bush", "george W. Bush"
    • "*" and "+"
      • They’re used to indicate repetition
      • "*"
        • This means the character can repeat any number of times including not appearing at all
          • e.g. (.*) matches both elements that don’t have any characters between () as well as those that have characters there
            • it matches () or (a) or (asdfas)
        • It matches the longest possible string that satisfies the regular expression.
          • e.g. ^s(.*)s returns a string containing the characters between the first and last s in an observation. 
          • To prevent greediness, use *? instead.
      • "+"
        • This means the character can repeat any number of times but must appear once.
          • e.g. [0-9]+(.*)[0-9]+ matches at lease one number,followed by any number of characters followed by at least one number.
    • "{@minimumNumberOfRepetition[, @maximumNumberOfRepetition]}"
      • They are interval qualifiers and are used to specify the minimum and maximum number of matches of an expression.
      • specifying {@minimumNumberOfRepetition} means exactly @minimumNumberOfRepitition time
      • specifying {@minimumNumberOfRepetition,} means at least @minimumNumberOfRepitition times
      • e.g. ( +[^ ]+ +){1,5} means space, not space, space between 1 to five times. 
        • e.g. " Apple Camera food " 
      • e.g. [Bb]( +[^ ]+ +){1,5} debate matches bush with up to 5 words till debate
    • "()"
      • Used to remember text matched by the subexpression enclosed
        • we refer to the matched text as \1,\2
        • E.g. +([a-zA-Z]+) +\1 + to refer to a repeat of the same word matched with the expression.
  • To search for a meta character, escape it with "\"
    • e.g. to search for ".", use "\."
Working with dates
  • date() is used to return the date and time. It’s of class character
  • Sys.Date() returns the date. It’s of date class
  • formatting dates
    • use format(@dateObject , "@FormatString")
      • The Format string values are 
        • %d is day as number(0-31)
        • %a is abbreviated weekday(mon – sun)
        • %A is unabbreviated weekday(Monday to Sunday)
        • %m is month(00-12)
        • %b is abbreviated month(jan-dec)
        • %B is unabbreviated month(January to December)
        • %y is 2 digit year
        • %Y is four digit year
    • e.g. format(@DateObject, "%d %m %y")
  • as.Date()
    • used to convert a character string to a date.

                         as.Date(@dateObject, "@DateFormatOfCharString")

      • @DateFormatOfCharString tells as.Date what format to expect. e.g. for "1Feb1999", @DateFormatOfCharString is "%d%B%Y"
  • You can find details about a day
    • weekdays(@dateObject) shows you what day of the week it is (Monday – Sunday)
    • months(@dateObject) shows you what month it is (January – December)
  • you can find difference between days if they’re of date class

               @DateObject1 – @DateObject2 

  • Lubridate
    • A package for handling date objects
    • You can use it to convert a number to a date. 
      • it reads the number in the format of the function
      • use the format of the number variable as a function.
        • ymd(@Number)
        • mdy(@Number)
        • dmy(@Number)
        • ymd_hms(@Number)
      • You can specify timezone using the tz option

                                   ymd_hms(@Number, tz="@timeZone")

        • @timeZones can be checked using ?Sys.timezone
    • Use wday(@Number) to get weekday
      • This returns the number of the day in the week.
      • use option label = TRUE to return the abbreviated weekday.
      • Date and time are preferred in the Date or POSIXct or POSIXly classes.
  • Data resources
    • Open Government sites



    • Some APIs with R interfaces


  • Manipulating Data with Dplyr
    • You need to install and load the dplyr package
    • In Dplyr, when selecting columns, you don’t need the Data$@Column format to select @column. Just use @column
    • Dplyr allows you work with tabular data from a variety of sources
    • To work with data in dplyr, load the data into a tbl_df(data frame tbl)


      • Data of class tbl_df prints better than data of other classes.
    • dplyr supplies 5 verbs that cover most data manipulation tasks
      • select()
        • Select keeps only the columns you specify
        • To select Columns @Column1, @Column2 from tbl_df @data

                                             select(@Data, @Column1, @Column2…)

          • To select a sequence of columns, use the : operator to separate the first and last columns

                                                       select(@Data, @FirstColumn, @DataColumn)

          • To remove specific columns from @data, use -@ColumnToRemove

                                                       select(@Data, -@ColumnToRemove)

          • To remove a sequence of columns 

                                                       select(@Data, -(@FirstColumn:@LastColumn)

      • filter()
        • filter keeps only rows you specify
        • To filter columns where @column = @value

                                             filter(@Data, @column1 == @value[,@column2 ==@value…])

          • To filter on multiple conditions, separate the columns with ,
          • To filter with an OR condition, separate the columns with |.

                                                       filter(@Data,@column1==@value| @column2==@value)

        • You can filter on the value of a column being NA using the
          • To filter only where the column is not NA, use !
      • arrange()
        • This allows you sort a tbl_df based on the values in a column or multiple columns in ascending order

                                             arrange(@Data, @Column1ToSortBy[,@Column2ToSortBy])

          • To sort in descending order, use desc(@ColumnToSortBy)

                                                       arrange(@Data, desc(@Column1ToSortBy))

      • mutate()
        • Allows you create a computed column based on the value of an existing column in a tbl_df and inputs it as the last column.

                                             mutate(@Data, @NewColumn1 = @ExistingColumn @Operator @value)

        • You can compute multiple columns in a single statements and reference a column computed in the same statement

                                             mutate(@Data, @NewColumn1=@ExistingColumn1+1, @NewColumn2 = @NewColumn1+8)

      • summarize()
        • Collapses the dataset into a single row.
        • Can be used to give you a value based on groups in a dataset.
  • Grouping and chaining data with dplyr
    • Grouping allows the breaking up of datasets based on the values of one or more variables.
      • Done using group_by()
        • Used to group a tbl_df by one or more variables
                                   group_by(@Data, @GroupingColumn)
          • In the result set, all operations are applied on a per group basis.
            • Performing summarize() with mean() gives the mean per group                                                       
        • n() is used to count the number of observations in a group
        • n_distinct() is used to count the number of unique observations in a group
                                   summarize(@GroupedTbl, @count=n(), @Unique=n_distinct(@column))
      • To find the value that can be used to split data into a particular percentage, use quantile().
        • The value is called @percentage quantile.
          • For the 99% quantile
                                             quantile(@Data$@Column, probs= 0.99)
      • To show all the rows in a tbl_df, use View(@Tbl_df)
    • Chaining allows you string multiple function calls in a way that is compact and readable.
      • %>% is the operator used to chain functions.
        • The value before the operator is used as the input for the function on the right. 
      • This allows us work without saving the results of all the operations for each step.
      • E.g. To select columns from a tbl_df and print,

                                   @tbl_df %>% select(@Column1, @Column2, @Column3)%>%print

  • Tidying Data with tidyr
    • Tidy data is formatted in a standard way that facilitates exploration and analysis.
    • Tidy data must satisfy 3 conditions
      • Each variable forms a column
      • Each observation forms a row
      • Each type of observational unit forms a table
    • Messy data has one or more of the following characteristics
      • data-wrangling-cheatsheet.pdfColumn headers are values, not variable names
        • This problem is solved using gather()
          • Gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed
          • Used when you notice you have columns that are not variables.

                                                       gather(@Data, @Key, @Value, -@ColumnToExclude)

            • @key is the name of the output column that will store a variable that can have multiple values. e.g. sex variable with values Male and Female
            • @value is the Value that stores the measure of the key in the observation.
            • @column to exclude is the column in @Data that is already stored properly
            • If their are NA values, use option na.rm=true to remove them while doing gather()
            • to gather a sequence of columns, use : between the first and last columns
      • multiple variables are stored in one column
        • This problem is solved using separate()
          • This uses a regex or a vector of character positions to turn a single char column into multiple columns

                                                       separate(@Data, col=@ColumnToSplit, into=@vectorofColsToSplitInto, sep=@valueToUseToSplit)

            • @vectorofColsToSplitInto is a vector that shows the names of variable to split @ColumnToSplit into.
            • @valueToUseToSplit contains the value in @ColumnToSplit to use to split it into the destination number of variables. By default, it splits by any character that is not alphanumeric.
      • Variables are stored in both rows and columns
        • This requires use of gather() and separate()
        • Also involves the use of spread()
          • Use to spread a key-value pair across multiple columns.


            • distinct(@ColumnWithVariable ) is used to create new columns with each one becoming a column. It is the key in this relationship.
            • @columnWithValue is used to assign a value to each of the column created. It is the value in this relationship.
      • a single observational unit is stored in multiple tables
        • To solve this problem, you need to merge both datasets.
          • First add an additional column to the datasets to signify their difference.

                                                       @Dataset1<- mutate(@Dataset1, @ColumnToDistinguish = @value)

                                                       @Dataset2<- mutate(@Dataset2, @ColumnToDistinguish = @value)

          • Use bind_rows() to join the 2 dataset

                                                       bind_rows(Dataset1, Dataset2)

      • multiple types of observational units are stored in the same table
        • We can select out values that occur multiple times like a relational table and lookup the the other values using a unique identifier.
  • To extract a number from a character string, use extract_numeric()

               extract_numeric(Value5) returns 5

  • To aggregate a data frame

               ag <- aggregate(. ~@groupingColumn, @Data, function(x) c(mean=mean(x),sd=sd(x))

    • this calculate the mean and sd for all variables in a data frame

Coursera: An Introduction to R (My Course Notes)

Setting up your Working Directory(WD)
  • To get your WD, use getwd()
  • To list the contents of your WD, use ls().
  • To change WD, look around your GUI.
  • R scripts can be created in any editor.
    • To import to R
      • Copy the text and paste in the Console
      • Create the RScript and import it to the session
        • Save the script as a .r file in the WD.
        • Use source(@.rFile) to import the file to the session.
    • To see the contents of a session, use ls()
  • To see the contents of the WD, use
R Input and Evaluation
  • Entering Input
    • We type expressions at the R prompt.
    • <- is the assignment operator
    • # indicates a comment. Anything after # is ignored
  • Evaluation
    • This occurs when you press enter after an expression and causes the result of the evaluation to be displayed.
    • To print the contents of a vector
      • Auto-printing: Enter the name of the vector and it’s contents are shown on the screen
      • use print(@vector)
    • When a vector is printed, the value in [@numberOfChar] at the beginning of the line shows the ordinal position of the first value on that line in the vector.
R Data types and operations
  • R has 5 basic classes of objects
    • Character
    • numeric(real numbers)
    • integers
    • complex: 1+4i
    • logical: TRUE/FALSE
  • The most basic object is a vector
    • It contains objects of the same class
    • Created by vector() if it should be empty, else the assignment operator (<- ) can be used.
  • A list is a vector that can have objects of multiple classes.
  • Numbers
    • Numbers are numeric objects(double precision real numbers).
    • If you want to input an integer, you must specify by putting L after the value
      • x <- 1L
    • The number Inf represents infinity. it can be used in calculation
    • The value NaN(undefined value/ not a number).
      • Also used to represent a missing value
  • Attributes
    • Every object has attributes
      • names/ dimnames
      • Dimensions(e.g. matrices, arrays)
      • class
      • length
      • other user-defined attributes
    • The attributes of an object can be accessed using attributes()
R data types( vectors and lists)
  • Creating Vectors
    • c() can be used.
      • c() concatenates
        • e.g. c(@value1, @value2)
      • A shortcut for inputting a series of numbers is @firstnumber:@lastnumber
        • e.g. @vector <- 9:19
    • vector()
      • Used to create an empty vector
      • Has the format vector(@class, length= @LengthOfVector)
      • All the values of such a vector are 0.
  • Creating a vector with mixed objects
    • R coerces so every element of the vector is of the same class.
    • The order is logical > number > character.
      • TRUE is 1 and false is 0.
    • Explicit coercion
      • as.@destinationClass() is used to convert an object to a specific class.
      • Nonsensical coercion results in NAs
        • e.g. coercing a vector of characters to numeric.
        • This results in a warning
  • Elements of a vector have [@Element] format
  • Lists
    • A vector that can contain elements of different classes
      • G <- list(1, "a", TRUE, 1+4i)
    • Elements of a list have [[@Element]] format
    • Created using list()
    • The elements of a list can be named when they’re populated
      • list(@ElementName1 = @Element1, @ElementName2 = @Element2, @ElementName3 = @Element3)
  • They’re vectors with a dimension attribute.
    • The dimension attribute is a vector of length 2 (nrow, ncol)
    • They’re created using
      • matrix()
        • matrix(nrow = @NumberOfRows, ncol = @NumberOfColumns): This creates an empty matrix
        • matrix(@listOfValues, nrow = @NumberOfRows, ncol = @NumberOfColumns)
        • e.g. matrix(1:6, nrow = 2, ncol = 3)
      • Transforming a vector using dim()
        • dim(@vectorToBeConverted) <- c(@NumberOfRows, @NumberOfColumns)
      • Column binding and row binding
        • done using cbind() or rbind()
          • cbind(@VectorToBeCol1, @VectorToBeCol2 ):
            • The vectors are converted to columns
          • rbind(@VectorToBeRow1, @VectorToBeRow2):
            • The vectors are converted to rows
    • dim(@matrix) is used to show the dimensions of a matrix
  • Every Attribute must be of the same class
  • Matrix are constructed column wise
    • Entries start in the upper-left and go down the column. When the column is filled, they start on the next column
  • Matrices can be named using dimnames()
    • dimnames(@matrix) <- list(c(@row1Name, @row2Name), c(@element1Name , @element2Name))
  • They’re vectors used to represent categorical data.
    • Ordered factors: Factors where their values have orders: e.g. small, medium, large
    • Unordered factors: Factors where their values do not have orders e.g. Male and Female
  • They’re treated specially by modeling functions like lm() and glm()
  • Factors are best represented with labels instead of values
    • E.g. use Male and Female instead of 1 and 2
  • Created using
    • factor(c(@value1, @value2, @value3))
  • Functions
    • Levels(@Factor) is used to show the distinct values of the factor
    • table(@Factor) is used to show the count of the distinct values of the factor
    • unclass(@Factor) is used to show the numeric value of the values of the factor. The numeric value is the same for each level.
  • A factor is basically a numeric vector over the levels attributes.
  • The order of the levels can be set using the levels argument to factor()
    • This is important because in linear modeling, the first level is the baseline level. Otherwise, the baseline level is the first level if the levels are ordered.
    • factor(c(@value1, @value2, @value3), levels = c(@value1, @value2, @value3))
Missing values
  • Denoted by either NA or NaN( For undefined mathematical operation)
  • is used to test objects to see if they’re NA or NaN
  • is.nan() is used to test for NaN
  • Na values have a class also
    • They can be integer NA, character NA etc
  • A NaN value is also NA but NA is not always true
Data Frames
  • Used to store tabular data
  • Represented as a special type of list where every element has the same length
  • Each element of the list can be thought of as the column, and the length of an element shows the number of rows
  • Unlike a matrix, data frames can store a different class of objects in each column
  • They have a special attribute row.names
    • It is used to annotate the data.
  • Created using
    • read.table()
    • read.csv()
    • data.frame(@FirstElement = @Values, @SecondElement = @Values)
  • A data frame can be converted to a matrix using data.matrix()
    • Implicit conversion is applied to convert the values to the same class
  • Objects in R can have names
    • Useful for writing self-describing objects and readable code
  • created by populating names(@object)
    • names(@object) <- c(@element1Name, @element2Name)
  • names(@object) prints the names of the elements of the object
Reading and writing Data in R
For reading Data into R
  • read.table: Read text files that contains data in rows and columns and return a data frame
    • For small and moderate data sizes, Usually called with only the file parameter. some of the other parameters assume their default values.
    • For large data sets,
      • The size of the data frame musn’t exceed the size of your system’s memory
      • Some parameters must be specified as they help read operations perform faster and more efficiently.
        • nrows
          • Enable R allocate memory better
        • colClasses
          • This prevents R from scanning every column to determine their data type
          • An easy method to get this is to read in the first 100 rows, let R calculate the classes for those and save it to a vector, and then use the vector as colClasses
                                                       Data<- read.table("@File", nrows =100)
                                                       classes <- sapply(Data, classes)
                                                       DataFrame <- read.table("@File", colClasses = classes)
  • You can calculate the memory requirements for a data frame
    • using the formula nrows *( (@ncolsOfColClassA * @SizeOfColClassA) +  (@ncolsOfColClassB * @SizeOfColClassB))
      • This gives the size in bytes
        • To convert to MB, divide by 220
          • To convert MB to GB, divide by 210
      • You need 2 times the amount of memory as the size of the data frame to import a data set
    • Has the following params
      • file: name of a file or a text-mode connection
      • header: logical param indicating if the file has a header line or not. if true, the first line is used to set variable Names
      • sep: string param indicating how the columns are separated
      • colClasses: character vector, length is the number of elements in the data set. indicates the class of each column in the dataset. default is for R to calculate.
      • nrows: the number of rows in the data set. default is for R to calculate it
      • comment.char: string param indicating the comment character. default is #
      • skip: the number of lines to skip from the beginning
      • stringsAsFactors: logical param stating whether variables should be coded as factors. The default is TRUE.
  • read.csv
    • same notes as read.csv, except the sep value is ","
  • scan: For reading matrices
  • readLines:
    • Reads a file and returns text
    • Can also be used to read a webpage
  • source: For reading R code e.g. functions
  • dget: For reading R codes for R objects that have been deparsed to text files
  • load: For reading in saved workspaces
  • unserialize: For reading single R objects in binary form.
For Writing Data from R
  • write.table
  • writeLines: used to write to a file
    • Each element is written to the file one line at a time
  • Dump and Dput
    • dput
      • see notes for dump.
      • Used to deparse a single R object.
        • The output of this is the code required to reconstruct the R object
          • dput(@vector, file = @destinationFile)
        • dget is used to read the data back.
          • dget(@OutputFileOfDput)
    • Dump
      • Used to deparse multiple R objects
        • dump(c("@vector1","@vector2"), file = @destinationFile)
        • source is used to read the data back
          • source(@OutputFileOfDump)
    • They’re text formats so it’s editable
    • They preserve the metadata of the data set so it doesn’t have to be specified again
    • They’re not very space efficient
  • save
  • serialize
  • Used to interface external objects.
  • Data is read in using connection interfaces.
    • file: standard uncompressed file
      • accepts the following parameters
        • description: The name of the file
        • open: a code indicating what operations are allowed on the file
          • r : read only
          • w: writing
          • a: appending
          • rb, wb, ab: reading, writing or appending in binary mode
            • This work only in windows
    • gzfile: file of gzip compression
    • bzfile: file of bzip2 compression
    • url: webpage
  • A connection is usually inferred most times e.g. when a read.csv is performed
  • Connections have to be closed after they’ve been used
               con<- file(@FIle, @allowedOperations)
  • Connections can be used to read part of a file
                    con <- gzfile(@File)
                    x<- readLines(con, 10)
Subsetting are operations to extract subsets of R objects
  • There are 3 operators for performing subsetting
    • []
      • This returns an object of the same class as the original
      • can be used to select more than one element except when referencing a matrix where it returns a vector
    • [[]]
      • is used to extract elements of a list or a data frame
      • Can only extract a single element
      • the class of the returned object will not necessarily be a list or a data frame
    • $
      • used to extract elements of a list or a data frame by name
      • similar to [[]]
  • There are several ways to perform subsetting
    • Use a numeric index
      • Enter the ordinal position of the element in the object to extract that element
        • @vector[3]
      • can be used to extract a sequence of objects
        • @vector[1:5]
    • Use a logical index
      • Enter a logical proposition to select elements in the object that satisfy the proposition
        • @vector[@vector > "a"]
        • @newLogicalVector <- @vector > "a"
          • The output of this is a logical vector that shows a logical element for all the elements of the object based on the proposition.
          • @vector[@newLogicalVector] returns an object of all the elements that satisfy the proposition
  • Subsetting lists
    • Reference list is x<- list(food=1:4, bar = 0.6, baz = c(3.14, 2.81))
    • [] returns a list
      • x[1] returns list
        • $food 1,2,3,4
      • x["elementName"] returns a list
      • Allows you return more than one element of the list
        • x[c(1,3)] returns food and baz
    • [[]] returns a sequence without the name
      • x[[1]] returns 1,2,3,4
      • x[["bar"]] returns an object of class numeric
      • Allows you work with computed index(where the name of the element in the object is the result of a computation)
        • e.g. if name <- "food"
          • x[[name]] returns the food object.
      • Can be used to subset nested elements of a list
        • x[[c(Element,subelement)]s
          • E.g. x[[c(1,3)]] or x[[1]][[3]] returns 3( the third element of the first element)
    • $@elementName: returns a sequence without the name
      • x$food returns a numeric vector object
  • Subsetting Matrices
    • Reference matrix is x<- matrix(1:6,2,3) a 2 row by 3 column matrix
                         1     3     5
                         2     4     6
    • Format is @matrix[@row, @column]
    • To select the whole row use @matrix[@row,]
      • e.g. x[1,] returns 1,3,5
      • To select the whole column use @matrix [,@column]
        • e.g. x[,1] returns 1,2
    • The Drop parameter
      • The drop parameter drops the matrix orientation when a subset operation is carried out on a matrix.
        • Default is drop = TRUE
        • When a single element of a matrix is retrieved, it returns a vector of length 1 instead of a 1X1 matrix
          • x[1,2 ,drop = FALSE]
        • When a single row or column is retrieved, it returns a vector not a matrix
          • x[1, , drop = FALSE]
  • Subsetting data Frames
    • To subset only observations(rows) 
      • use subset(@vector. @column @operator @value)
        • E.g. subset(iris, Species == ‘virginica’)#This returns all rows where species is virginica
      • use the [] operator
        • @vector[@column @operator @value,]
        • E.g. iris[iris$species ==’viriginica’]
    • To subset only variables(columns)
      • @vector[, @vectorOfColumnsNumbersToReturn]
        • E.g. iris[, 1:3]
        • iris[, c(1,2)]
  • Subsetting with partial matching of names
    • Partial matching of name is allowed with [[]] and $
    • It allows you return objects with a name that start with the characters you supply and saves you typing in the command line
    • Reference object is x<- list(aardvark = 1:5)
    • $ looks for a name in the list that matches the element
      • x$a returns aardvark
    • [[]]
      • using the exact = FALSE parameter on [[]] allows it to carry out partial matching
      • x[["a", exact = FALSE]] returns aardvark
        • Exact = TRUE by defaul t
  • Removing NA values
    • The most common way is to create a logical vectors showing what elements are NA
      • use on the object
      • return the non NA objects
        • e.g.
                                                  bad <- #bad is a logical vector F,F,T,F,
                                                  x[!bad] #returns 1,2,4,5
                                   good <- complete.cases(@DataFrame)
                                   @dataFrame[good,][1:6,] # this does it for only rows
Vectorized operations
  • Operations are vectorized to make coding more efficient and easier to read
    • Carrying out operation on vectors of the same length
      • Elements at the same position in both vectors interact based on the operation applied to the vector
        • E.g.
                                             x<- 1:4; y<-6:9
                                             x+y # returns 7     8       9      10
                                             x*y # returns 6     14     24     36
  • Carrying out comparison operations
    • Compares all the vectors returning a logical vector
      • E.g.
                                             x<- 1:4; y<-6:9
                                             x>2 # returns a logical vector F     F     T
  • Carrying out operations on vectors of different lengths
  • Vectorized matrix operations
    • Reference is x<- matrix(1:4,2,2); y<-matrix(rep(10,4),2,2)
    • x*y or x.y carries out element-wise multiplication
                         x*y #multiplies all elements in the same position together and returns a matrix
  • To carry out matrix multiplication
    • x %*% y
seq_along(): takes a vector as input and creates an integer sequence equal to the length of that vector
seq_len(): takes an integer and creates a sequence from 1 to the value of the integer
Control Structures in R
  • Used to control the flow of execution of a program, depending on runtime conditions
  • The types of control structures are
    • If…else: Allows you to test for a condition
      • If(@Condition) { Do something} else if(@Condition) {Do something} else{ Do something}
        • E.G.
                                   y <- if(x>3){10} else {0}
  • for: execute a loop a fixed number of times
    • Takes an iterator variable and assigns it successive values from a sequence or vector.
      • For loops can be nested.
    • while: Execute a loop while a condition is true
      • Begin by testing a condition. If it is true, then execute the loop body. Once the loop body is executed, test the condition again.
      • Can execute infinitely if the condition never gets met.
      • conditions are evaluated from left to right
    • repeat: execute an infinite loop
      • Initiates an infinite loop. Can only be exited with break
    • break: break the execution of a loop
    • next: skip an iteration of a loop
    • return: exit a function
      • Signals that a function should exit and return a given value.
  • The above are primarily for writing programs, for interactive work, use the *apply functions
Writing functions
  • Functions are usually prefaced with the keyword function
               function(@Argument1, @Argument2)
               { @Argument1 @Operator @Argument2 }
  • Functions are R objects
    • Functions can be passed as arguments to other functions
    • Functions can be nested.
  • Functions have named arguments which potentially have default values.
    • Formal arguments are the arguments included in the function definition
      • Formals() returns a list of all the formal arguments of a function
    • Function arguments can be missing or might have default values
  • Function arguments can be matched positionally or by name
    • You can mix positional matching with matching by name.
      • When an argument is matched by name, it is taken out of the argument list and the remaining unnamed arguments are matched in the order they are listed in the function definition
  • Function arguments can be partially matched( only first few characters are used to match the arguments)
    • The order of operations when given an argument is
      • Check for exact match for a named argument
      • Check for a partial match
      • Check for a positional match
  • Arguments value can be set to NULL
  • Arguments in R are evaluated lazily
    • They are evaluated only as needed. If the argument is never used, it’s presence is never tested.
    • No error is thrown until the compiler notices a problem with an argument.
  • The … Argument
    • Used to indicate a variable number of arguments that are usually passed on to other arguments
    • used when extending another function so you don’t copy the entire function list of the original function
                         myplot<- function(x,y, type = "l",…){
                              plot(x, y, type=type,…)} ##This copies the remaining argument list of plot into myplot
  • Generic functions use … so that extra arguments can be passed to methods.
    • Necessary when the number of arguments passed to the function cannot be known in advance
      • E.g see args(paste), args(cat)
    • Any arguments that appear after … on the argument list must be named explicitly and cannot be positionally or partially matched.
Scoping rules
  • How R binds values to symbols
  • R searches through a series of environments(list of objects/symbols and values) to find the appropriate value.
    • In command line, when attempting to retrieve the value of an R object, the order is
      • Search the global environment for a symbol name matching the one requested
      • Search the namespaces of each of the packages on the search list(see search())
    • The global environment or the user’s workspace is always the first element of the search list and the base package is always the last( with regards to namespaces, see search()).
      • The order of packages on the search list matters
      • User’s can configure which packages get loaded on startup so you cannot assume a set of packages are available
      • When a user loads a package with library, the package is moved to position 2 in search() and everything gets shifted down.
      • R has separate namespaces for functions and non-functions.
        • you can have an object and function named the same.
  • Scoping Rules
    • They’re the main difference between S and R
    • They determine how a value is associated with a free variable in a function
    • R uses lexical/static scoping
      • The scoping rules of a language determine how values are assigned to free variables.
        • Free variables are not formal arguments and are not local variables( assigned inside the function body).
      • E.g.
                                        g <-9 ##g is a local variable
                                        g*(x^2) + y/z ##x and y are formal arguments, z is a free variable. The scoping rules state how we assign a value to z.
    • Lexical scoping states that the values of free variables are searched for in the environment in which the function was defined.
      • An environment is a collection of symbol,value pairs
      • An environment has a parent environment which can have multiple children. The only exception is an empty environment
      • A function + an environment = a closure or function closure.
      • To search for the value of a free variable
        • If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
        • The search continues down the sequence of parent environments until we hit the top-level environment; this is usually the global environment (workspace) or the namespace of a package.
        • After the top-level environment, the search continues down the search list until we hit the empty environment is reached and an error is thrown
      • Typically a function is defined in the global environment, so that the values of free variables are just found in the user’s workspace
        • In R you can have functions inside other functions 
          • In this case, the environment in which a function is defined is the body of another function.
            • E.G.


                                                                   n is a free variable with respect to pow. hence it’s value is found in the parent environment.

          • To get what’s in a function’s environment( formal variables). 
            • run ls(environment(function))
              • e.g. ls(environment(cube))
          • To get the closure environment,
            • run get("@variable", environment(function))
    • Difference between Lexical and dynamic scoping
      • In lexical scoping, the value of a variable is looked up in the environment in which the function was defined
      • In dynamic scoping, the value of a variable is looked up in the environment from which the function was called( calling environment). 
      • E.G.

                                   y<- 10

                                   f<- function(x)
                                        y^2 + g(x)
                                   g<- function(x)
                                   #Using lexical scoping, the value of y in f() is 10. y is looked up in the global environment
                                   # Using dynamic scoping, the value of y in f() is 2. 

    • Consequence of Lexical scoping
      • All object must be stored in memory
      • All functions must carry a pointer to their respective defining environments, which could be anywhere
      • Makes it easy for objective functions to be built
        • Functions that carry around all their data and you’re sure the values used are always the same
Dates and times in R

  • Dates are represented by the date class
    • Dates are stored internally as the number of days since 1970-01-01
    • Format is YY-MM-DD
    • Dates can be coerced from a character string using as.Date()
  • Times are represented by the POSIXct and POSIXlt class
    • Times are stored internally as the number of seconds since 1970-01-01
    • They keep track of leap year, seconds, daylight savings and time zones
    • POSIXct is a very large integer. useful to store times in a data frame
    • POSIXlt is a list underneath and stores a other useful information e.g. day of week , day of year, month, day of month.
      • A component can be extracted using $.
        • @POSIXltVariable$sec extracts the value of seconds in the variable.
    • Times can be coerced form a character string using as.POSIXlt() or as.POSIXct()
    • Sys.time() is used to print the system time.
  • There are generic functions that work on dates and times
    • weekdays(): gives the day of the week
    • months(): give the month name
    • quarters(): gives the quarter number("Q1", "Q2","Q3","Q4")
  • unclass() can be used to show the content of date and time classed variables in R
    • Date class
    • POSIXct
      • the output is an integer showing the number of seconds since 1970-01-01
    • POSIXlt
      • The output is a list showing name() for the contents of the variable.
  • strptime() is used to correct the format of a date/time variable if it’s written in a different format from YY-MM-DD
    • A format string is used to indicate the format the date is stored in.
    • To get the format string, get the format from ?strptime.
  • Operations on dates and times
    • You can perform mathematical operations + and –
    • You can perform comparison operations 
    • For time objects, They both have to be of the same class
Loop Functions
  • They are usually in the *apply format
    • lapply(): loop over a list and evaluate a function on each element of the list.
      • Always returns a list
      • implemented using
        • lapply(@list, @function, …)
          • if the first argument is not a list, the the function applies as.list, else an error occurs.
        • To pass an argument to @function, specify them in the …
          • lapply(x, runif, min=0, max=10) ## this passes arguments min and max to runif.
    • sapply(): same as lapply except it simplifies the result
      • Simplifies the result of lapply if possible.
        • If the result is a list where every element is length 1, then a vector is returned
        • If the result is a list where every element is a vector of the same length(>1), a matrix is returned
        • if it can’t figure things out, a list is returned.
        • implemented using
          • lapply(@list, @function, …)
    • apply(): apply a function over the margins of an array
      • Used to apply a function to the rows or columns of a matrix
      • Can be used with general arrays e.g. taking the average of an array of matrices
      • Implemented using
        •  apply(@array, @marginToBeRetained, @function, …)
          • @marginToBeRetained is a vector that is gotten from the dimensions of the array i.e. @array(@Dimension1, @Dimension2)
            • When @marginToBeRetained = 1, It applies the function over the second dimension of the array and returns a list with length(@Dimension1) number of items
            • When @marginToBeRetained = 2, It applies the function over the first dimension of the array and returns a list with length(@Dimension2) number of items
            • If the array has more than 2 dimensions, @marginToBeRetained can return an array of the dimensions specified. 
              • e..g. apply(x, c(1,2),mean) returns a 2 dimensional array(matrix).
      • Shortcuts already exists for calculating the sum and mean of arrays, and they operate much faster than apply()
        • rowSums, rowMeans, colSums, colMeans.
    • tapply(): apply a function over subsets of a vector
      • implemented using 
        • tapply(@Vector, @listOfFactors, @function, …,[SIMPLIFY= TRUE])
        • len(@Vector) must be == len(@listOfFactors)
    • mapply(): multivariate version of lapply()
      • It applies a function in parallel over a set of arguments.
      • Can be used to apply a function to multiple lists.
      • implemented using 
        • mapply(@function, …, @MoreArgsToFunction, [SIMPLIFY = TRUE])
          • … contains the arguments to apply over
          • the number of arguments that @function takes must be equal to the number of lists passed to mapply()
      • Used to vectorize a function 
        • Allows a function that doesn’t usually take vectors to do so.
  • split(): used to split objects in groups determined by a factor or a list of factors
    • used in conjunction with lapply() and sapply()
    • implemented using
      • split(@vector, @ListOfFactors,[drop = FALSE], …)
      • the split vector splits @vector into len(@ListOfFactors) groups
  • loop functions use anonymous functions
    • Anonymous functions are functions without names
      • implemented in format
        • @LoopFunction(@variable, @FunctionDefinition)
        • e.g. 
          • lapply(x, function(elt) elt[,1]) #This creates a function that accepts the argument elt and returns the first column of the argument
  • For figuring out what’s wrong.
  • Indications that something is not right
    • message: A generic notification/diagnostic message produced by the message function; execution of the function continues
    • warning: An indication that something is wrong but not necessarily fatal; execution of the function continues; generated by the warning function
    • error: An indication that a fatal problem has occurred; execution stops; produced by the stop function
    • condition: Programmer created.
  • Debugging tools
    • traceback(): Prints out the function call stack after an error occurs; does nothing if there’s no error. 
    • debug
      • Flags a function for debug mode which allows you to step through execution of a function one line at a time
    • browser
      • suspends the execution of function wherever it is called and puts the function in debug mode
    • trace: allows you to insert debugging code into a function at specific places
    • recover: allows you to modify the error behavior so that you can browse the function call stack.
  • str()
    • A diagnostic function and an alternative to summary().
    • Answers the question what’s in this object? in a compact format.
    • str means structure.


  • Generating random numbers
    • Functions for probability distributions in R
      • rnorm(): generate random normal variates with a given mean and standard variation
      • dnorm(): evaluate the normal probability density( with a given mean/SD) at a point(or vector of points)
      • pnorm(): evaluate the cumulative distribution function for a normal distribution
      • rpois(): generates random poisson variates with a given rate
    • probability distribution functions usually have four functions associated with them. Their prefix shows their use
      • d for density
      • r for random number generation
      • p for cumulative distribution
      • q for quantile function
    • set.seed(@integer) is used to ensure reproducibility
      • It is best to set this so the randomization is reproducible.
  • Generating random numbers from a linear model.
    • to generate binary number, use rbinom()
  • Generating random numbers from a generalized linear model
  • Random sampling
    • sample() allows you draw randomly from arbitrary distributions
      • implemented using sample(@vector, @NumberOfSamples, [replace = FALSE])
        • the replace argument allows you to simulate simulation with replacement of a factor. 
R profiler
  • Useful for understanding why an operation is taking time to complete
  • A way to examine how much time is spent in different parts of a program
  • Useful when trying to optimize code.
  • system.time()
    • Takes a R expression as input and returns the amount of time taken to evaluate the expression
      • returns an object of class proc_time that includes 
        • user: the time charged to the CPU(s) for this expression
        • elapsed: duration the function ran for.
    • Usually user time and elapsed time are close for straight computing tasks
      • elapsed time may be greater than user time if the CPU spends a lot of time waiting around
      • elapsed time may be smaller than user time if your machine has multiple cores and is capable of using them
        • multi-threaded libraries
        • parallel processing via the parallel processing
    • implemented using system.time(@Expression)
  • Rprof()
    • Start the profiler in R
    • summaryRprof() summarizes the output from Rprof()
      • tabulates the output and calculates how much time is spent in which function
      • There are two methods for normalizing the data
        • divides the time spent in each function by the total run time
        • by.self: subtracts time spent in functions below in the call stack then does after
          • shows the actual time spent per function
    • Do not use system.time() and Rprof() together 
    • Keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function.
      • default sampling interval is 0.02 seconds