Coursera: Getting and cleaning data (My Course Notes)

Goal of this course is the first 3 of this sequence

Raw Data-> Processing script-> tidy data -> data analysis -> data communication
Raw and processed data(1)
  • Data are values of qualitative or quantitative variables, belonging to a set of items.
    • variables are a measurement or characteristic of an item
  • Raw data
    • The original source of data
    • Has not been subjected to processing or any other manipulation.
    • No software has been run on the data
    • the numbers on the data have not been manipulated
    • Data has not been removed from the data set
    • The data hasn’t been summarized in any way.
  • Processed/Tidy data
    • Data that is ready for analysis
    • processing can include merging, subsetting, transforming
    • All steps should be recorded.
    • Each variable must be in one column
    • each different observation of a variable should be in a different row
    • There should be one table for each kind of variable
    • If you have multiple tables, they should include a column in the table that allows them to be linked.
Components of Tidy data(1)
  • When you go from raw to tidy data, you should have
    • The raw data
    • A tidy data set
    • A code book describing each variable and its values in the tidy data set
    • An explicit and exact recipe used to go from raw data to tidy data
  • Code book
    • Includes information about the variables(including units) in the data set not contained in the tidy data
    • Includes information about the summary choices you made
    • Information about the experimental study design you used
    • Should have a section called "Study Design" that has a thorough description of how you collected the data
    • There should be a section called "Code book" that describes each variable and its units
  • The instruction list
    • A script in R or python
    • The input for the script is the raw data
    • the output for the script is the processed, tidy data
    • there are no parameters in the script.
    • It might be impossible to script every step. Instead provide instructions like this
      • Step 1= Take the raw file, run version @3.1.2 of @summarize software with parameters a=1, b=2, c=3
      • Step 2= run the software separately for each sample
      • Step 3= take column three of outfile.out for each sample and that is the corresponding row in the output data set.
Downloading Files
  • R can be used to download files
  • This allows the downloading process to be put in the R script.
    • Getting/Setting your working directory
      • The 2 main commands are getwd() and setwd()
        • They both work with absolute and relative paths
          • Absolute paths: "C:\\Users\\Andrew\\Downloads"
          • Relative Paths: ".\\data" or "..\\"
            • \\ is used in windows instead of / in other OSs
    • Checking for and creating directories
      • file.exists("@directoryName") checks if the directory exists
      • dir.create("@directoryName") creates a directory if it doesn’t exist.
  • To download from the internet
    • usedownload.file()
      • parameters are url, destfile, method.

                         e.g. download.file(fileurl, destfile="./@Directory/@File.@Format". [method = "@method"])

      • @method is curl for Https websites, not necessary
    • helps with reproducibility
    • Note the time the file was downloaded
      • use date()
Reading local Data Files
  • read.table() is the main function for reading data int R
  • Read the data into RAM
  • parameters are file, header, sep, row.names, nrows
    • quote: Used to tell R whether there are any quoted values. quote="" means no quotes
      • Sometime ‘ or " are placed in data values and cause issues when reading files, setting quote="" resolves this.
    • na.strings: sets the character that represents a missing value
    • nrows: how many rows to read from the files
    • skip: number of lines to skip before starting to read
  • Related functions are read.csv and read.csv1
    • read.csv sets sep=, and header = TRUE by default
Reading Fixed Width files
  • use read.fwf().
    • the widths option specifies an integer vector that is used to specify the width of each column.
Reading Excel files
  • requires the xlsx package
    • Install using install.packages("xlsx")
  • read.xlsx() 
    • parameters are files, sheetIndex, header
      • Sheetindex is used to state which Excel sheet the data is stored on.
    • Allows you read specific rows and columns
      • Use parameters colIndex and rowIndex
        • They accept vectors showing the columns and rows to read.
  • Write.xlsx() writes out an Excel file with similar arguments
  • XLConnect package has more options for writing and manipulating excel files.
    • XLConnect.pdf
Reading XML
  • XML consists of markup and content
    • markup are labels that give the text structure.
    • content are the actual text of the document.
  • Tags, elements and attributes
    • Tags correspond to labels. Applied to particular parts of the text so it’s structured.
      • start tags <section>
      • end tags </section>
      • empty tags <line-break /> 
    • Elements are specific examples of tags
      • <Greeting> Hello, world </Greeting>
    • Attributes are components of the label
      • <img src="jeff.jpg" alt="instructor"/>
        • src and alt are attributes
      • <step number = "3"> connect A to B </step>
  • Requires the xml package
    • xmlTreeParse(@FileURL, useInternal=TRUE/FALSE) to parse out the XML file.
      • Loads the document into memory for parsing
      • If you’re reading an html file, use htmlTreeparse() instead
      • useInternal = TRUE if you want to get all the nodes inside the document.
    • xmlRoot() is used to remove the XML version header of the XML document. output is a parsed xmlObject
    • xmlName() is used to get the tag of the document’s root.
    • use names(@RootTag) to get the tags that are just below @RootTag
  • You can directly access parts of an XML document
    • Accessed the same way a list is accessed
      • using [[@ElementNumber]]
      • To access subelements, 
        • [[@elementNumber]][[@subelementNumber]]
  • Can be done programatically
    • use xmlSApply(@ParsedXMLObject, @FunctionToApply)
      • @FunctionToApply can be xmlValue if you want to return all the contents of each children elements as a single string.
    • Xpath programming can be used to specify further what element to return.
      • XML.pdf
      • Use xpathSApply(@ParsedXMLObject, "@XpathNotationOfNodeToExtract", @FunctionToApply)
        • @FunctionToApply usually is xmlValue
        • @XpathNotationOfNodeToExtract
          • /node: Used to extract top level node
          • //@node: used to extract a node at any level
            • e.g. to extract the node named house use //house.
          • @nodePath[@attr-name]: @Node with an attribute name
          • @nodePath[@attr-name=’@Value’] @Node with tag @attr-name that has a value @Value.
            • @nodePath is the path to the node that contains the tag @attr_name. usually in the format //@OutputOfxmlName()/@ParentTag/@ParentTag
Reading JSON
  • JavaScript Object Notation(JSON) is a lightweight data storage format often used for APIs.
  • It’s similar to XML but has a different syntax/format.
  • requires the jsonlite package
  • use fromJSON(@FileToRead) to read a JSON file.
    • returns a data frame
  • To access a variable
    • @dataFrame$VariableName
  • To write data frames to JSON
    • use toJSON()
      • the pretty parameter indents the JSON document
Using data.table
  • inherits from data.frame.
    • All functions that accept data.frame accept data.table
  • Faster than data.frame at subsetting, grouping and updating
  • requires data.table package.
  • use tables() to see all the data tables in memory
  • data tables allow subsetting rows
    • @DataTable[@RowNum,]
      • To return just the @rowNum row 
      • The same as @DataTable[@RowNum]
    • @DataTable[@DataTable$@columnName=="@value",]
      • To return all rows where the column @columnName has a value @Value
  • To subset columns in Data tables,
    • you cannot subset data tables the same way you subset data frames
      • @DataTable[,c(@columnIndex1, @columnIndex2)] doesn’t work.
    • requires expressions 
      • They can be used to summarize datasets
        • @DataTable[,@OutputDataClass(@function1(@ColumnName),@function2(@columnName2))]
          • @OutputDataClass can be list, table, vector e.t.c
          • as
  • To add a new column to a Data table, use := operator

                    @DataTable[,@NewColumnName:=@ExistingColumnName @Operator @Value]

    • This allows you carry out multi step operations as expressions are involved.

                         @DataTable[, @NewColumnName:= { tmp<- (@ColumnName1+ @ColumnName2); @Function(tmp @Operator @value)}]

    • Allows you carry out plyr like operations

                    @DataTable[,@NewColumnName:=  @ColumnName1 > Value]

      • Creates a new column that has boolean values
      • also allows you use the by keyword

                                   @DataTable[,b:= mean(x+w), by=a]

        • Where a is a column with boolean values.
        • hence rows where a = TRUE will have the same value in b i.e. the calculation for the mean will only considers x and w values from those rows. also rows where a=FALSE will have the same values in b
  • Data tables must be copied explicitly using copy(). 
  • Special variables
    • .N
      • Means count the number of times something occurs
      • E.g.

                                   @dataTable[, .N, by=x]

        • this counts the number of times the distinct elements in the x column occurs.
  • Keys
    • Allows rapid subsetting and sorting.



      • The above automatically subsets the datatable based on the key.
        • returns all rows where the value of x is a
  • Joins
    • Joins are based on keys
    • Data tables can be joined based on a similar element being set as the key.
  • For fast reading
    • done using fread()
Reading from MySQL

                         @ConnectionName<- dbConnect(MySQL(),user="@userName", host="@Hostname")

    • To connect specifically to a database on an instance

                         @DBName<-dbConnect(MySQL(), user="@UserName", db="@DBName", host="@HostName")

    • To run a command

                         dbGetQuery(@ConnectionName, "@DBQuery")

    • To disconnect from an instance


    • use dbListTables() to show the tables in a Database


    • To see the column names of a table, use dbListFields()

                         dbListFields(@DBName, "@TableName")

    • To run a query on a table

                         dbGetQuery(@DBName, "@SQLQuery")

    • To read a whole table to a R object

                         @RObjectToStoreTable<- dbReadTable(@DBName, "@TableName")

      • To read a subset of the Table

                                   @RObjectToSendQuery<- dbSendQuery(@DBName, " select * from @TableName @whereConditions")

                                   @RObjectToContainResult<- fetch(@RObjectToSendQuery[,[n=@NoOfRowsToRead] );

        • To clear the result of a query on the instance’s memory.


Reading data from HDF5
  • Hierarchical Data Format is used for storing large data sets.
    • Data is stored in groups containing zero or more data sets and their metadata.
      • Each group has a group header with a group name and a list of its attributes.
      • Has a group symbol table with a list of objects in the group.
    • Data sets are multidimensional arrays of data elements with metadata
      • They have a header with name, datatype, dataspace and storage layout.
      • Have a data array with the data.
  • To interface with HDF5, you need to install packages from bioconductor


    • Install the bio base packages


  • Use h5createFile() to create a hdf5 file.


    • To create group

                         @RObjectForH5 = h5createGroup("@Filename.h5", "@GroupName[/@SubGroupName]")

  • To list the contents of an hdf5 file use h5ls()


  • to write to a group, use h5write()

               h5write(@Robject, "@FileName.h5", "@GroupName[/@SubGroupName]")

    • If the destination Group(@GroupName[/@SubGroupName]") doesn’t exist, then it is created.
    • You can write different data classes e.g. dataframes, arrays, matrices.
    • You can direct data to be written to specific places using the index option

                         h5write(@ObjectToWrite, "@FileName.h5", "@GroupName[/@SubGroupName], index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo)

  • To read data, use h5read()

               readA=h5read("@FileName.h5", "@GroupName[/@SubGroupName]")

    • You can direct data specific data to be read using the index option

                         h5read("@FileName.h5", "@GroupName[/@SubGroupName]", index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo) )

  • The HDF5 tutorial rhdf5.pdf
Reading data from the Web
Scope is data scraping, authentication and interfacing with APIs
  • Web scraping is programmatically extracting data from the HTML code of websites
    • Read the data into R using
      • readLines()

                                   @ConnectionName<- url("@WebAddress")

                                   @RObject<- readLines(@ConnectionName)
        • Close the connection after you’re done


      • htmlTreeParse()

                                   @HTMLobject <- htmlTreeParse(@url, useInternalNodes=T)

        • htmlTreeParse in in XML package
        • You can specify the nodes to read using xpathSApply()
      • GET() in the httr package

                                   @RObjectToStoreTheHtml <- get(@URL)

                                   @RobjectToStoreTheContent <- (@RObjectToStoreTheHtml , as="text")
                                   @RobjectToStoreParsedHTML<- htmlParse(@RobjectToStoreTheContent ,asText=TRUE)

        • GET() is in httr package
  • Accessing websites with passwords
    • GET allows you authenticate yourself for websites.

                         @DestinationRObject<- GET("@URL",authenticate("@UserName","@Password"))

    • Handles allow you save authentication to a website.


                         @DestinationRObject= GET(handle=@RobjectForURL , authenticate("@UserName","@Password")

      • Handle() is in the httr package
      • To access the website, simple refer to it using the handle and it’ll use the previous authentication via cookies.

                                   GET(handle=@RobjectForURL , path="/")

Reading data from APIs
  • Done using the httr package
  • You need to create an account with the API organization.
  • To access twitter form R

               myapp=oauth_app("Twitter", key= "@yourConsumerKeyHere", secret="@YourConsumerSecretHere")

               sig=sign_oauth1.0(myapp, token="@yourTokenHere", token_secret="@yourTokenSecretHere")

Reading from other sources
  • Interacting with files
    • file(): to open a connection to a text file
    • url(): to open a connection to a url
      • To read a specific line from a url, use readLines()

                                   @connection<- url(@URL)

                                   @Robject<- readLines(@connection)

    • gzfile(): to open a connection to a .gz file
    • see ?connections for information on other connection mechanisms
    • always 
  • The foreign package allows you interact with other statistical programming languages
    • Loads data from minitab(read.mtp), SAS(read.xport), SPSS(read.spss), stata(read.dta)
    • library(foreign)
  • RODBC provides interface amultiple DBMSs
  • You can read images using jpeg(), readbitmap(), png() directly into R and manipulate them
  • Reading Geographic Information Systems(GIS) data using rdgal(), rgeos(), raster()
  • You can read MP3 files using tuneR packages or seewave packages
Subsetting and sorting
  • Subsetting can be done based on rows and columns
    • Based on Columns

                         @Data[, @ColumnOrVectorOfColumns]

      • @columnOrVectorOfColumns specifies the columns to select from the dataset
    • Based on rows


      • @rowOrVectorOfRows specifies the rows to select from the dataset
      • You can also subset on rows based on conditions

                                   @data[@Data$Column @ConditionalOperator @VectorOfValues,]

    • Based on rows and columns

                         @data[(@Column1 @Operator @value @CombinationOperator @column2 @operator @value),@Column1]

      • @CombinationOperator can be & for and or | for or.
    • Dealing with NA’s
      • Using which is the most efficient way of subsetting without including NAs

                                   Data[which(@Column1 @Operator @Value),]

  • Sorting 
    • To sort a column, use sort()


      • To sort in decreasing order, use the decreasing=TRUE option
      • To sort with NA’s at the end, use na.last=TRUE
    • To sort a data frame use sort()
      • This sorts the dataframe based on the values in the column. 


        • You can specify a tie breaker
    • You can also sort based on arrange() in the plyr package
  • Adding columns to a dataframe

                    @Data$@NewColumnName<- @VectorOfValueToInsertIntoColumn

    • you can use cbind

                         cbind(@Data, @VectorOfValuesToInsert)

  • Adding rows to a dataframe
    • Use rbind()
Summarizing Data
  • summary() can be used to summarize dataFrames.
  • str() can also be used for summarization. It shows the structure of the data.
  • quantile() can be used to summarize a column
    • To view the default of 5 quantiles(0,25, 50,75 & 100)

                         quantile(@Data$@Column, na.rm=true)

    • To view specific quantiles

                         quantile(@Data$@Column, probs=@vectorOfQuantiles

      • sample of @vectorOfQualities is c(0.5, 0.75, 0.9)
        • 0.5 is median or 50% percentile
  • Table can be used to make tables to summarize data


      • This returns a count of the distinct elements in @column.
      • Use the useNA="ifany" option if you want to see a count of NA values.
    • To view the relationship between the columns of @Data, input 2 values into table

                         table(@Data$@Column1, @Data$Column2)

      • Column1 becomes the row and @Column2 becomes the column. 
    • To view a count of all elements of a column that meet a particular condition

                         table(@Data$@Column @ConditionalOperator @VectorOfValues)

      • @conditionalOperator can be %in% to check if @vectorOfValues exists in @data
  • To count the number of NAs in a column


  • To check if missing values exist in a column.


    • It returns true is an NA exists
  • To check if every value in a column satisfy a condition, use all()

               all(@Data$@Column @Condition)

  • Row and column sums
    • You can calculate the sum of all the columns in a dataset using colSums()


      • To check if any of the columns in a dataset contains an NA value,
        •  apply colSums to the of the dataset



  • Crosstabs
    • used to identify where the relationships in a dataset exists.

                         xtabs(@VariableToDisplayInResult ~ @Variable1ToGroupBy [+ @Variable2ToGroupBy], data=@Data)

      • To group by all other variables, set @Variable1ToGroupBy to .
  • FlatTables
    • Often used to summarize data output of xtabs() into a compact form.


  • To see the size of a dataset


    • to specify the units of the output, use the unit=@Units command

                         object.size(@Data, unit="Mb")

Creating New variables
  • Sometimes variables are missing in a dataset. 
    • You can transform the dataset to get the values you want
    • Common missing variables are
      • Missingness indicators
      • cutting up quantitative variables
      • applying transforms
  • Creating sequences
    • You use seq()

                         seq(@FirstValue, @LastValue)

      • use the by = @value option to indicate the increment value to use in creating the sequence
      • use the length =@value option to indicate how many values should be created for the sequence.
    • To create an index of the values in a column/vector

                         seq(along = @vector)

      • The output is a sequence from 1 to n(@vector)
  • Subsetting variables

               @Data$@NewColumn= @Data$@Column %in% @vectorOfValues


    • This returns  a count of true and false observations of the conditions compared to create @NewColumn
  • Creating binary variables
    • use ifelse() to apply a value when a condition is true, and another when it is false

                         @Data$@NewColumn= ifelse(@Data$@Column @ComparisonOperator @Value, True, False)

      • To view a count of how many observations do not meet the condition,  use table() to compare @NewColumn to @Data$@Column @ComparisonOperator @Value

                                   table(@Data$NewColumn, @Data$@Column @ComparisonOperator @Value)

  • Creating categorical variables
    • Done using cut()
    • To split a variable based on it’s value, e.g. into quantiles 

                         @Data$NewColumn = cut(@Data$@Column, breaks=@MeasureToSplitData)


      • e.g. @MeasureToSplitData can be quantile(@Data$Column) to split the data into 5 quantiles.
      •  table(@Data$NewColumn) is Used to view the count of the groups @Column is split into.
      • You can compare @NewColumn to @Column using table() to view the relationship between them.
    • To split a variable into group based on quantiles
      • use cut2() in the Hmisc package

                                   @Data$@NewColumn = cut2(@Data$@Column, g=@NoOfQuantilesToSplitInto)

    • The result of a cut/cut2 is a factor variable.
  • Creating Factor variables
    • You can convert a variable into a factor

                         @Data$@ColumnToRepresentFactors<- factor(@Data$@Column)

    • To convert factor variable to numeric, use as.numeric(@Data$Column)
    • Alternatively, use transform to convert the variable type to factor

                         @Data<- transform(@data, @VariableToConvertToFactor= factor(@VariableToConvertToFactor))

      • You can use labels to change the value of the observations to a character that represents the factor

                                   @Data$Variable<- factor(@Data$variable, label=c("LabelForValue1", "LabelForValue2"…))

        • Sometimes, the number of variables are continuous(e.g decimal numbers with a variable of 0.1 and a range of 10.). You can split this into a limited number of factors using cut(). This involves creating tertiles.
          • First split the data into quantiles

                                                       @variableToStoreCutpoints<- quantiles(@Data$variable, seq(0,1, length=@NumberOfFactors), na.rm=TRUE)

          • Cut the data and define a new variable using the quantiles

                                                       data$newVariable <- cut(@Data$variable, @variableToStoreCutpoints)

Reshaping Data
  • melt is used on a data frame to select which columns are id and which are measure.

               melt(@Data, id=@VectorOfColumnsThatAreID, measure.vars= @VectorOfColumnsThatAreMeasures)

    • The output is an object with the @VectorOfColumnsThatAreMeasures where 2 new columns are made named variable and value. For each of the rows, the value of the column is put as the variable and it’s value set to the value of that column. 
  • A melted dataframe can be reshaped to a dataframe using dcast() or an array using acast()

               dcast(@Data, @Column ~variable[,@AggregationFunction])

    • This compares distinct(@column) to all the columns from the melt that are used as measure.vars and a summarized comparison showing just the count for each variable.
    • If @AggregationFunction is set to mean, it returns the mean of the variable for each distinct(@Column).
  • Averaging values
    • You can use tapply()

                         tapply(@Data$@ColumnToApplyFunctionTo, @Data$@ColumnToGroupBy, @functionToApply)

      • The result contains the sum of all ColumnToApplyFunctionTo grouped by ColumnToGroupBy if functionToApply is sum
    • You can use split()

                         split(@Data$@ColumnWithDataToMeasure, @Data$ColumnToGroupBy )

                         %>% lapply(sum)
                         %>% unlist

      • This returns a list with @ColumnWithDataToMeasure  grouped by @ColumnToGroupBy 
      • you can use lapply(sum) to get the sum of the groups
      • You can use unlist to convert the list to a vector.
    • You can use ddply
      • Part of the plyr package

                                   ddply(@Data, .(@columnToGroupBy), summarize, sum= sum(ColumnWithDataToMeasure ))


Managing data frames in R using dplyr
  • dplyr is a package developed to make it easy to work with data frames
  • dplyr expects
    • There is one observation per row
    • Each column represents a variable or measure or characteristic
    • An optimized version of plyr
  • The dplyr functions have a similar format
    • The first argument is a data frame
    • the subsequent arguments describe what to do with the first argument. 
    • You can refer to columns in the data frame directly without using $ operator
    • the result is a new data frame
    • Da
  • The dplyr verbs
    • select()
      • returns a subset of the columns of a data frame
    • filter
      • extracts a subset of rows from a data frame based on logical conditions
    • summarize
      • generates summary statistics of different variables in the data frame.
    • mutate
      • add new variables/columns or transform existing columns
    • arrange 
      • reorders rows of a data frame
      • preserves the order of the columns
    • rename
      • rename variables in a data frame
Merging data in R
  • Similar to linking as done in RDBMSs
  • merge() can be used to merge 2 dataframes x and y.
    • merge() has the following parameters which are used to specify which variable to use for the merge
      • x/y/by/by.x/by.y/all
    • By default it merges by the columns with a common name

                         merge(@DataFrameX, @DataFrameY, [by.x="@ColumnInXThatMatchesY", by.y="@ColumnInYThatMatchesX"]) 

      • if by.x and by.y aren’t specified, it’ll merge all columns that have the same name. If it’s possible for the value’s in both datasets to be different for the same named variable, by.x and by.y should be used to specify the column to use for joining.
      • all=TRUE means if a value appears in one of the datasets but not the other, it should include the value with NA for the variables in the other dataset
  • join()
    • in the plyr package
    • default is a left join.

                         join(@DataFrame1, @DataFrame2)

      • It joins columns with the same name.
    • join_all allows you join multiple dataframes.

                         list(@DataFrame1, @DataFrame2, @DataFrame3) %>% join_all

      • First combine the dataframes into a list
      • Then perform a join_all on the dataframe
      • They all need a column with the same name
Editing text variables
  • Manipulating text programmatically 
  • To see the column names of a dataframe,


    • use tolower() to convert it to small characters.
  • To split a string when a character is encountered, 

               strsplit(@Data, "@CharacterToSplitOn")

    • @CharacterToSplitOn is a special character, precede it with \\ e.g. "\\.".
    • strsplit() returns a list
    • You can use sapply to get the first elements from the list
      • Create a function that gets the first character

                                   firstElement<- function(x){x[1])

      • Use sapply to get the first elements from the list by applying a function to every element of the list.

                                   sapply(@list, firstElement)

  • To substitute out characters in a vector for others
    • To replace the first appearance of the character
      • use sub()

                                   sub("@CharacterToRemove", "@CharacterToPut", @Data)

    • To replace multiple appearances of the character
      • use gsub()

                                   gsub("@CharacterToRemove", "@CharacterToPut", @Data)

  • To search for specific values in variables
    • use grep()


      • It returns the row numbers of the observations with the string
      • to return the actual observation, use the option value=TRUE


      • to see the number of elements that satisfy the function, use length(@grepCommand)
    • use grepl()


      • This returns a logical vector with true for the observations with the string and false for observations without the string
      • You can use this to subset only cols where a string appears.

                                   @Data[grepl("@StringToSearchFor","@Data") ,]

  • The stringr package contains a lot of string manipulation functions
    • nchar(@Data)
      • To get the number of characters in a variable
    • substr(@Data, @firstPosition, @LastPostition)
      • This returns a string of the characters from the first to the last position
    • paste(@firstString, @lastString)
      • This joins 2 strings together and puts a space in between
    • paste0(@firstString, @lastString)
      • Joins the 2 strings together without a separating space
    • str_trim(@Data)
      • Removes the space characters at the end of a string
  • Characteristics of text in data sets
    • Names of variables should be
      • All lower case when possible
      • Descriptive
      • Not duplicated
      • Not have underscores, dots or white spaces
    • variables with character values
      • should be made into factor variables(depends on application)
      • should be descriptive(use TRUE/FALSE instead of 0/1)
Regular Expressions
  • They are often used with grep, grepl, sub,gsub and other string manipulation functions
  • Regular expressions can be viewed as a combination of literals and metacharacters
    • Metacharacters provide a way to represent
      • whitespace word boundaries
      • sets of literals
      • the beginning and end of a line
        • ^ represents the start of a line. 

                                             ^@String checks for observations with @String at the beginning of the line

        • $ represents the end of a line

                                             @String$ checks for observations with @String at the end of the line

      • alternatives("war" or "peace")
  • Character classes with []
    • To match a sequence of characters regardless of their cases, enclose the upper and lower cases in [] for each letter
      • [Bb][Uu][Ss][Hh] matches bush regardless of which of it’s alphabets being capital or small.
    • You can specify a range of characters i.e numbers or letters
      • to specify numbers 0 to 9, use [0-9]
      • to specify small case a to z, use [a-z] 
      • to specify capital case a to z, use [A-Z]
        • to specify either small or caps [a-zA-Z]
    • To match characters that do not satisfy a condition, use [^@C]
      • e.g [^?.]$ matches characters that do not end with ? or .
    • You can combine e.g. 
      • ^[Ii] matches only observations that the first character is I regardless of it’s case.
      • ^[0-9][a-zA-Z] matches only observation that the first character is a number and te second is an alphabet regardless of it’s case.
  • MetaCharacters
    • "."
      • The . means any character/ wild character
        • e.g. 9.11 matches 9A11, 9-11, 9.11, 9911.
    • "|"
      • | means or. It can be used to combine expressions that serve as alternatives to each other
        • e.g. food|water matches either food or water.
        • flood|earthquake|hurricane|wildfire matches either of any of the elements.
      • ^[Gg]ood|[Bbad] matches either Good, good at the beginning of the observation or Bad and bad somewhere in the observation
      • ^([Gg]ood|[Bbad]) matches either Good, good or Bad and bad at the beginning of the observation.
    • "?"
      • This indicates that the expression is optional
        • e.g. [Gg]eorge( [Ww]\.)? [Bb]ush matches "george bush", "george W. Bush"
    • "*" and "+"
      • They’re used to indicate repetition
      • "*"
        • This means the character can repeat any number of times including not appearing at all
          • e.g. (.*) matches both elements that don’t have any characters between () as well as those that have characters there
            • it matches () or (a) or (asdfas)
        • It matches the longest possible string that satisfies the regular expression.
          • e.g. ^s(.*)s returns a string containing the characters between the first and last s in an observation. 
          • To prevent greediness, use *? instead.
      • "+"
        • This means the character can repeat any number of times but must appear once.
          • e.g. [0-9]+(.*)[0-9]+ matches at lease one number,followed by any number of characters followed by at least one number.
    • "{@minimumNumberOfRepetition[, @maximumNumberOfRepetition]}"
      • They are interval qualifiers and are used to specify the minimum and maximum number of matches of an expression.
      • specifying {@minimumNumberOfRepetition} means exactly @minimumNumberOfRepitition time
      • specifying {@minimumNumberOfRepetition,} means at least @minimumNumberOfRepitition times
      • e.g. ( +[^ ]+ +){1,5} means space, not space, space between 1 to five times. 
        • e.g. " Apple Camera food " 
      • e.g. [Bb]( +[^ ]+ +){1,5} debate matches bush with up to 5 words till debate
    • "()"
      • Used to remember text matched by the subexpression enclosed
        • we refer to the matched text as \1,\2
        • E.g. +([a-zA-Z]+) +\1 + to refer to a repeat of the same word matched with the expression.
  • To search for a meta character, escape it with "\"
    • e.g. to search for ".", use "\."
Working with dates
  • date() is used to return the date and time. It’s of class character
  • Sys.Date() returns the date. It’s of date class
  • formatting dates
    • use format(@dateObject , "@FormatString")
      • The Format string values are 
        • %d is day as number(0-31)
        • %a is abbreviated weekday(mon – sun)
        • %A is unabbreviated weekday(Monday to Sunday)
        • %m is month(00-12)
        • %b is abbreviated month(jan-dec)
        • %B is unabbreviated month(January to December)
        • %y is 2 digit year
        • %Y is four digit year
    • e.g. format(@DateObject, "%d %m %y")
  • as.Date()
    • used to convert a character string to a date.

                         as.Date(@dateObject, "@DateFormatOfCharString")

      • @DateFormatOfCharString tells as.Date what format to expect. e.g. for "1Feb1999", @DateFormatOfCharString is "%d%B%Y"
  • You can find details about a day
    • weekdays(@dateObject) shows you what day of the week it is (Monday – Sunday)
    • months(@dateObject) shows you what month it is (January – December)
  • you can find difference between days if they’re of date class

               @DateObject1 – @DateObject2 

  • Lubridate
    • A package for handling date objects
    • You can use it to convert a number to a date. 
      • it reads the number in the format of the function
      • use the format of the number variable as a function.
        • ymd(@Number)
        • mdy(@Number)
        • dmy(@Number)
        • ymd_hms(@Number)
      • You can specify timezone using the tz option

                                   ymd_hms(@Number, tz="@timeZone")

        • @timeZones can be checked using ?Sys.timezone
    • Use wday(@Number) to get weekday
      • This returns the number of the day in the week.
      • use option label = TRUE to return the abbreviated weekday.
      • Date and time are preferred in the Date or POSIXct or POSIXly classes.
  • Data resources
    • Open Government sites



    • Some APIs with R interfaces


  • Manipulating Data with Dplyr
    • You need to install and load the dplyr package
    • In Dplyr, when selecting columns, you don’t need the Data$@Column format to select @column. Just use @column
    • Dplyr allows you work with tabular data from a variety of sources
    • To work with data in dplyr, load the data into a tbl_df(data frame tbl)


      • Data of class tbl_df prints better than data of other classes.
    • dplyr supplies 5 verbs that cover most data manipulation tasks
      • select()
        • Select keeps only the columns you specify
        • To select Columns @Column1, @Column2 from tbl_df @data

                                             select(@Data, @Column1, @Column2…)

          • To select a sequence of columns, use the : operator to separate the first and last columns

                                                       select(@Data, @FirstColumn, @DataColumn)

          • To remove specific columns from @data, use -@ColumnToRemove

                                                       select(@Data, -@ColumnToRemove)

          • To remove a sequence of columns 

                                                       select(@Data, -(@FirstColumn:@LastColumn)

      • filter()
        • filter keeps only rows you specify
        • To filter columns where @column = @value

                                             filter(@Data, @column1 == @value[,@column2 ==@value…])

          • To filter on multiple conditions, separate the columns with ,
          • To filter with an OR condition, separate the columns with |.

                                                       filter(@Data,@column1==@value| @column2==@value)

        • You can filter on the value of a column being NA using the
          • To filter only where the column is not NA, use !
      • arrange()
        • This allows you sort a tbl_df based on the values in a column or multiple columns in ascending order

                                             arrange(@Data, @Column1ToSortBy[,@Column2ToSortBy])

          • To sort in descending order, use desc(@ColumnToSortBy)

                                                       arrange(@Data, desc(@Column1ToSortBy))

      • mutate()
        • Allows you create a computed column based on the value of an existing column in a tbl_df and inputs it as the last column.

                                             mutate(@Data, @NewColumn1 = @ExistingColumn @Operator @value)

        • You can compute multiple columns in a single statements and reference a column computed in the same statement

                                             mutate(@Data, @NewColumn1=@ExistingColumn1+1, @NewColumn2 = @NewColumn1+8)

      • summarize()
        • Collapses the dataset into a single row.
        • Can be used to give you a value based on groups in a dataset.
  • Grouping and chaining data with dplyr
    • Grouping allows the breaking up of datasets based on the values of one or more variables.
      • Done using group_by()
        • Used to group a tbl_df by one or more variables
                                   group_by(@Data, @GroupingColumn)
          • In the result set, all operations are applied on a per group basis.
            • Performing summarize() with mean() gives the mean per group                                                       
        • n() is used to count the number of observations in a group
        • n_distinct() is used to count the number of unique observations in a group
                                   summarize(@GroupedTbl, @count=n(), @Unique=n_distinct(@column))
      • To find the value that can be used to split data into a particular percentage, use quantile().
        • The value is called @percentage quantile.
          • For the 99% quantile
                                             quantile(@Data$@Column, probs= 0.99)
      • To show all the rows in a tbl_df, use View(@Tbl_df)
    • Chaining allows you string multiple function calls in a way that is compact and readable.
      • %>% is the operator used to chain functions.
        • The value before the operator is used as the input for the function on the right. 
      • This allows us work without saving the results of all the operations for each step.
      • E.g. To select columns from a tbl_df and print,

                                   @tbl_df %>% select(@Column1, @Column2, @Column3)%>%print

  • Tidying Data with tidyr
    • Tidy data is formatted in a standard way that facilitates exploration and analysis.
    • Tidy data must satisfy 3 conditions
      • Each variable forms a column
      • Each observation forms a row
      • Each type of observational unit forms a table
    • Messy data has one or more of the following characteristics
      • data-wrangling-cheatsheet.pdfColumn headers are values, not variable names
        • This problem is solved using gather()
          • Gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed
          • Used when you notice you have columns that are not variables.

                                                       gather(@Data, @Key, @Value, -@ColumnToExclude)

            • @key is the name of the output column that will store a variable that can have multiple values. e.g. sex variable with values Male and Female
            • @value is the Value that stores the measure of the key in the observation.
            • @column to exclude is the column in @Data that is already stored properly
            • If their are NA values, use option na.rm=true to remove them while doing gather()
            • to gather a sequence of columns, use : between the first and last columns
      • multiple variables are stored in one column
        • This problem is solved using separate()
          • This uses a regex or a vector of character positions to turn a single char column into multiple columns

                                                       separate(@Data, col=@ColumnToSplit, into=@vectorofColsToSplitInto, sep=@valueToUseToSplit)

            • @vectorofColsToSplitInto is a vector that shows the names of variable to split @ColumnToSplit into.
            • @valueToUseToSplit contains the value in @ColumnToSplit to use to split it into the destination number of variables. By default, it splits by any character that is not alphanumeric.
      • Variables are stored in both rows and columns
        • This requires use of gather() and separate()
        • Also involves the use of spread()
          • Use to spread a key-value pair across multiple columns.


            • distinct(@ColumnWithVariable ) is used to create new columns with each one becoming a column. It is the key in this relationship.
            • @columnWithValue is used to assign a value to each of the column created. It is the value in this relationship.
      • a single observational unit is stored in multiple tables
        • To solve this problem, you need to merge both datasets.
          • First add an additional column to the datasets to signify their difference.

                                                       @Dataset1<- mutate(@Dataset1, @ColumnToDistinguish = @value)

                                                       @Dataset2<- mutate(@Dataset2, @ColumnToDistinguish = @value)

          • Use bind_rows() to join the 2 dataset

                                                       bind_rows(Dataset1, Dataset2)

      • multiple types of observational units are stored in the same table
        • We can select out values that occur multiple times like a relational table and lookup the the other values using a unique identifier.
  • To extract a number from a character string, use extract_numeric()

               extract_numeric(Value5) returns 5

  • To aggregate a data frame

               ag <- aggregate(. ~@groupingColumn, @Data, function(x) c(mean=mean(x),sd=sd(x))

    • this calculate the mean and sd for all variables in a data frame


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s