SQL Server Administration Review


Planning and installing SQL Server

  • Planning
    • Operating System Requirements
      • SQL Server 2012 enterprise, business intelligence and web can only be installed on Windows server 2008 and higher OSs
      • SQL Server 2012 standard can be installed only on 2008 and higher Server OSs and enterprise/professional/business versions of client OSs
      • SQL Server 2012 developer tools can be installed on any windows OS (vista/ 2008 and higher)
    • An installation of SQL server on a core OS allows only
      • Database engine services, replication, Full-text search, analysis services, client tools connectivity and integration services server
    • Hardware requirements
      • Processor
        • x64: Processor must support EM64T
        • x86: Pentium 3 or higher
      • Processor speed
        • x64: min 1.4 GHz, recommended 2.0 GHz
        • x86: min 1.0 GHz, recommended 2.0 GHz
      • Minimum RAM
        • min 1GB, recommended 2 GB
    • Database File management
      • Shrinking and growing databases
        • For fixed-size databases, you increase their size using Alter Database [@DBName] modify file (Name = N'@FileAlias', Size = @NewLargerSize)
        • Decreasing Database size
          • A database can be configured to AutoShrink
            • This causes the database to move data pages from the end of the file towards the front of the file and return free space to the OS’s file system
            • Causes index fragmentation.
            • Best practice is to manually shrink and immediately rebuild indexes.
            • DBCC SHRINKDATABASE (N'@DBName')
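            • A minimal sketch of that practice (database and table names are hypothetical):

DBCC SHRINKDATABASE (N'SalesDB');
GO
-- Rebuilding the indexes removes the fragmentation the shrink introduced
ALTER INDEX ALL ON dbo.Orders REBUILD;
GO
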
      • File Types
        • SQL Server has 3 file types
          • Primary:
            • Data Files with .mdf extension
            • Contain startup information for the database and pointers to other files the DB uses.
            • Can store data and objects
          • Secondary:
            • Optional files used to spread data across multiple disks and increase the size or performance of a database.
            • Multiple files can be used.
            • extension is .ndf
            • Added using
              • Alter Database [@DBName] add file (Name = N'@FileAlias', Filename = N'@Directory\@Filename.ndf', Size = 4096KB, FileGrowth = 1024KB) To filegroup [primary]
          • Transaction Log:
            • Stores information that can be used to recover the database
            • extension is .ldf
      • FileGroups
        • Used to increase performance when you use partitioning.
          • If several files of a DB (ideally on separate disks) are in the same filegroup, you can create a table on that filegroup. The table then performs as though it were written to a striped RAID array, which improves performance.
        • Reduce backup time for partial backups.
        • Add Filegroup using
          • Alter database [@DBName] add filegroup [@FileGroupName]
    • SQL Server Capacity constraints
      • A DB cannot exceed 524,272TB
      • A single DB file cannot exceed 16TB
      • Maximum DBs on an instance is 32,767
      • maximum of 16 columns in a primary key
      • maximum of 253 foreign key references per table
      • maximum of 32767 user connections
    • For windows server 2008 R2 and windows 7, when using a Managed service account for a SQL Server component, You need to create the account before commencing installation.
    • Benchmarking a server for SQL Server usage
      • Run SQLIO to identify hardware or I/O configuration issues before deploying SQL Server.
        • Used to calculate the saturation point (capacity) of your I/O subsystem.
      • Run SQLIOSIM to simulate read, write, checkpoint, backup, sort and read ahead activities.
    • Firewall permissions
      • Grant access to the SQL server from the Group Policy management Console.
        • Create a Group policy object and link it to the designated servers organizational Unit
        • Add all the servers to an organizational unit to ease administration
        • Computer Configuration\Policies\Windows Settings\Security Settings\Windows Firewall with Advanced Security node.
        • Create an inbound Program rule and allow connection to the SQL Server path (C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\sqlservr.exe).
        • Limit the scope of the inbound rule to the relevant remote IP addresses from the rule's properties
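        • Alternatively (outside the GPO approach above), a local firewall rule can be added from an elevated command prompt; the path shown is the default-instance path from the previous bullet:

netsh advfirewall firewall add rule name="SQL Server (MSSQLSERVER)" dir=in action=allow program="C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\sqlservr.exe" enable=yes
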
  • Installing
    • From command prompt
      • Setup.exe /qs /IAcceptSQLServerLicenseTerms /action=@Action /InstanceName=@InstanceName /Features=@Feature /SQLSYSADMINACCOUNTS="@Domain\@Username" /updateenabled=0
        • @Action can be Install, uninstall, patch, upgrade, rebuilddatabase(to rebuild all system DBs)
        • @Feature can be SQLEngine, IS(integration services), conn(connectivity tools), AS(Analysis services), RS(Reporting services).
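        • For example, a quiet Database Engine install might look like this (instance name, feature list and sysadmin account are illustrative):

Setup.exe /qs /IAcceptSQLServerLicenseTerms /action=Install /InstanceName=MSSQLSERVER /Features=SQLEngine,Conn /SQLSYSADMINACCOUNTS="CONTOSO\SQLAdmins" /updateenabled=0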

Configuring and managing SQL server instances

  • An SQL server instance is a separate deployment of Analysis services, database engine and reporting services hosted on a computer.
    • There can be several instances hosted on one computer
    • Each instance has unique program files, security configuration and directory
    • Up to 50 database engine instances can exist on a host.
  • Configuring instances
    • Instance level settings
      • Memory: Contains settings for minimum and maximum server memory.
        • Limits and ensures the amount of memory is within a specified range.
        • Configured to ensure that the instances play nice with each other.
        • modified using

Exec sys.sp_configure 'min server memory', @MinSizeInMB;

GO
Exec sys.sp_configure 'max server memory', @MaxSizeInMB;
GO
Reconfigure

          • This requires show advanced options to be on

exec sys.sp_configure 'show advanced options', 1;

Go
Reconfigure

      • Processor and I/O Affinity
        • Processor affinity ties an instance to a specific processor rather than the default of all processors on the host.
          • This eliminates processor reloads and reduces thread migration across processors
          • Modified using

Alter server configuration set process affinity CPU = Auto
-- or bind to specific CPUs, e.g.: Alter server configuration set process affinity CPU = 1,5,8

        • I/O affinity binds an instance’s disk I/O to a specific set of CPUs.
        • You cannot configure processor affinity and I/O affinity for the same processor.
      • Fill factor
        • This determines the percentage of space on each leaf-level page that is filled with data when an index is created or rebuilt. The remaining space is used for future growth.
        • Can be modified

sp_configure 'fill factor', @destinationPercentage;

GO
Reconfigure;
GO

        • Requires advanced options to be on.
    • The model database serves as the template for all new databases that are created on an instance.
      • All the options and contents are copied over to the new database.
    • Distributed Transaction Coordinator(MSDTC)
      • A distributed transaction is a transaction that runs across multiple servers.
        • Not needed if they’re different instances on the same server.
        • Should be configured as a cluster resource before installing the Database Engine on the cluster.
    • Database Mail
      • User must have msdb role DatabaseMailUserRole
      • Can be enabled using

sp_configure 'Database Mail XPs', 1;

GO
Reconfigure;
GO

      • Options can be configured using msdb.dbo.sysmail_configure_sp stored procedure.
      • Database Mail profiles determine which users are able to send messages using Database Mail.
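      • For example, sysmail_configure_sp can cap the attachment size (value in bytes; the limit shown is illustrative):

Exec msdb.dbo.sysmail_configure_sp 'MaxFileSize', '2097152';
GO
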
  • Managing Instances
    • Installing additional instances
      • You can have a maximum of 50 instances on a host or 25 on a failover cluster.
      • All instances have unique files.
      • An instance can contain database engine services, analysis services and reporting services.
    • Deploying patches
      • Shared features must be updated if a single instance is updated
      • Patches can be applied to or removed from instances individually, or to all instances at once.
      • can be done using

@PatchPackage.exe /qs /IAcceptSQLServerLicenseTerms /Action=(Patch or RemovePatch) [(/InstanceName=@Instance1,@Instance2) OR /Allinstances]

        • /InstanceName Or /Allinstances should not be used if you intend to patch only shared components.
  • Resource Governor
    • A feature that enables you to limit CPU and memory resources for the Database Engine in an instance.
    • Resource pools are a subset of the physical resources available to a database engine instance.
      • Hosts one or more workload groups.
        • Workload groups are containers for session requests that share similar classification criteria.
        • There are 2 types
          • Internal workload group
          • Default workload group
        • Created using

Create workload group @WorkloadGroupName

using @ResourcePoolName;
GO
Alter resource governor reconfigure;
GO

      • Each session is assigned to a workload group and runs using the resource in the workload group through the resource pool.
        • There are 2 types of resource pools
          • The internal resource pool represents the resources the instance uses.
            • It cannot be altered and workloads in this pool are critical to the server.
            • It can seize resources from other pools
          • The User defined resource pool.
            • Hosts user sessions
        • A resource pool Consists of 2 components
          • Exclusive: doesn’t overlap with other pools. used to define minimum resource reservation for each resource pool. The sum of min across all pools cannot exceed 100%.
          • shared: used to define maximum resource consumption. The max value is 100- @TheSumOfMinAcrossAllPools
        • Created using

create resource pool @ResourcePoolName

with @Option;
Go
Alter resource governor reconfigure;

          • @option can be (min_cpu_percent = @Value) and any other option
    • Resource governor classification
      • Allocates incoming sessions to a workload group based on session properties.
      • You can classify sessions into workload groups based on classifier functions.
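      • A sketch of a classifier function (function, login, and workload group names are placeholders); it must be created in master and be schema-bound:

Use master;
GO
Create function dbo.@ClassifierFunctionName() returns sysname with schemabinding
as
begin
    -- Route a specific login to its own workload group; everything else falls through to default
    if SUSER_NAME() = N'@LoginToClassify'
        return N'@WorkloadGroupName';
    return N'default';
end;
GO
Alter resource governor with (Classifier_function = dbo.@ClassifierFunctionName);
Alter resource governor reconfigure;
GO
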
    • Windows System Resource Manager
      • Allows you to allocate system resources to specific processes
      • Used to manage resource usage across several instances.
      • Percentage-based resource management allows you to manage hosts with a single processor.
      • Resource allocation policies can be used to manage resource allocation.
    • SQL Server error logs
      • This stores information about processes that have completed successfully or failed.
      • Sysadmins can cycle the error log using

exec sp_cycle_errorlog;

        • For SQL server agent

exec msdb.dbo.sp_cycle_agent_errorlog

 
Configuring SQL server components
  • Deploying and configuring analysis services
    • Analysis services can be configured in 2 modes
      • Multidimensional and data mining mode:
        • Default
        • supports OLAP databases and data mining models
      • Tabular mode
        • Supports tabular modeling features
        • Allows hosting of solutions built in the tabular model designer
        • allows access of tabular data model data over the network.
    • The modes can be set during installation using the /ASServerMode option
      • can be set to Multidimensional, tabular or powerpivot.
    • The mode is specific to the instance. A different mode requires a different instance.
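    • For example, a quiet install of a tabular-mode instance might use (instance and account names are illustrative):

Setup.exe /qs /IAcceptSQLServerLicenseTerms /action=Install /Features=AS /InstanceName=ASTABULAR /ASServerMode=TABULAR /ASSYSADMINACCOUNTS="CONTOSO\SQLAdmins"
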
  • Deploying and configuring reporting services
    • During installation, the install and configure option is only available if you've installed the web server role and are installing, or have already installed, the Database Engine on the instance.
    • You can choose to install the reporting services configuration manager.
  • Deploying and configuring sharepoint integration
    • Deploying analysis services and reporting services as shared services in a SharePoint farm allows you to use PowerPivot for SharePoint and Power View (a Reporting Services report designer)
      • Deploying RS, Power View and PowerPivot requires that the following are installed
        • SharePoint Server 2010 SP1
        • SQL Server 2012 Database Engine
        • SQL Server 2012 RS and the RS add-in
        • SQL Server 2012 PowerPivot for SharePoint
  • Configuring SQL server integration services security.
    • Integration services is a shared component that allows you to run and schedule Integration Services packages in SSMS.
    • Only members of the built-in administrators group can run integration services.
      • Use the DCOM configuration tool to grant users access to SSIS
        • Run dcomcnfg.exe and configure access for the SSIS application.
  • Managing Full-text indexing
    • Full text indexes store information about important words and their locations in the columns of a database table.
      • Only one full-text index can be created per table/indexed view.
      • Created using

create fulltext catalog @FulltextCatalogName;

GO
Create fulltext index on @TableName
( @columnToMonitor1, @columnToMonitor2, @columnToMonitor3)
Key index @UniqueKeyIndexName
on @FulltextCatalogName;
GO

      • Removed using

drop fulltext index on @TableName

  • Configuring Filestream
    • Filestream allows SQL Server-based applications to store unstructured data (e.g. images, documents) on the host's file system.
    • Filestream requires that
      • objects to be stored are greater than 1MB
        • for objects smaller than 1MB, use the varbinary(max) Blob data type
      • The system architecture (x86 or x64) of SQL Server is the same as that of the host OS.
    • To Enable
      • edit the instance in SQL server configuration manager to enable filestream access
      • Set filestream access level to 2

Exec sp_configure 'filestream access level', 2

Reconfigure

        • Setting it to 0 means no access, 1 allows T-SQL access only, and 2 allows T-SQL and Win32 streaming access
      • Create a filestream filegroup for the database.

Alter database @DBname ADD filegroup @FileStreamFilegroupName contains filestream

      • Add filestream files to the filestream filegroup by specifying a folder that doesn’t currently exist

Use master

Alter database @DBName add file(
Name= @FileName, Filename='@NonExistentDirectory')
To filegroup @FileStreamFileGroupName

  • Configuring filetables
    • FileTables are tables that allow you to store files and documents within SQL Server 2012.
      • Provides a hierarchy of files and directories.
    • To enable
      • Set filestream access level to 2
      • Enable Non-transactional access at the database level

Alter database @DBName set filestream (Non_transacted_access = Full, Directory_name = N'@DirectoryName')

        • Directory_name is the alias for the filestream directory.
    • To create a filetable

Create table @TableName as FileTable;

  • Configuring and managing databases
    • Designing and managing filegroups
      • Each database has a primary filegroup.
        • This hosts the primary data file and any secondary files not allocated to other filegroups.
        • The system tables are hosted in the primary filegroup
      • Filegroups are used to group data files together.
      • When a filegroup contains more than one file, the database engine will write data across the files proportionally, depending on how much free space is available in each file
      • You can specify the filegroup a table or index is created in
      • To add a filegroup

alter database @DBName add filegroup @NewFileGroupName

      • You can move indexes from one filegroup to another.
        • You cannot move indexes created using a unique/primary key.
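        • One way to move a nonclustered index (placeholder names) is to re-create it on the target filegroup:

Create nonclustered index @IndexName
on @TableName (@ColumnName)
with (Drop_existing = on)
on [@TargetFileGroup];
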
    • Contained Databases
      • These are databases that include all the settings and metadata required to define the database and have no dependency on the instance.
        • A database can be partially or fully contained.
          • Partially contained databases provide a high degree of isolation from the instance but aren’t fully contained.
      • They allow you to move a database between instances easily.
      • To enable contained databases on an instance,

Exec sp_configure 'contained database authentication', 1;

GO
reconfigure;

      • To convert a database to a contained database

Alter database @DBNAME set containment = @Containmenttype

        • @ContainmentType is either none or partial (full containment is not implemented in SQL Server 2012)
      • To create a contained database

create database @DBName

containment = @ContainmentType

    • Using Data Compression
      • Compression allows you to save storage space by reducing the size of the database
        • It can be configured on the following
          • Clustered tables
          • Heap tables
          • Non-clustered indexes
          • Indexed views
          • Individual partitions of a partitioned table or index
      • It increases the CPU usage because the data must be compressed and decompressed when being accessed.
      • Only enterprise and developer editions support data compression.
      • There are 3 types
        • Row-level:
          • Works by using more efficient storage formats for fixed-length data.
            • Stores fixed-length data types as though they were variable-length
            • Does not store NULL or 0 values
            • Reduces the metadata required to store data
          • Requires less CPU than page-level compression
          • Maximizes the number of rows stored in a page
        • Unicode
          • Enables the engine to compress unicode values stored in page or row compressed objects.
          • used for nchar and nvarchar data types
        • Page-level
          • Compresses data by storing repeating values and common prefixes only once and then making references to those values from other locations within the table.
            • Row compression maximizes the number of rows in the page
            • Column prefix compression is applied by replacing repeating data patterns with references. the data is stored on the page header.
            • Dictionary compression scans for repeating values and then stores this information in the page header.
          • When it is applied, row compression is also applied
      • Done using
        • Tables

Alter table @TableName rebuild with (Data_compression=@compressionType)

        • Index

Alter index @indexname on @tablename rebuild partition = all with (data_compression=@CompressionType)

          • @Compressiontype is either page or row.
      • Estimating compression savings can be done with sp_estimate_data_compression_savings
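        • For example, to estimate page-compression savings for a table (placeholder names):

Exec sp_estimate_data_compression_savings
    @schema_name = '@SchemaName',
    @object_name = '@TableName',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';
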
    • Encrypting databases with Transparent data encryption
      • Transparent Data Encryption enables you to encrypt an entire database; backups of an encrypted database are also encrypted.
        • uses a Database Encryption Key(DEK) which is stored in the database boot record.
          • The DEK is protected by the database master key which is protected by the service master key
      • To encrypt a database
        • Create the master encryption key

Use master;

Create master key encryption by password = '@Password';

        • Create the certificate protected by the master key

Create certificate @CertificateName with Subject = '@CertificateAlias';

          • Backup the certificate in the master database. If you lose the DB server, and don’t have the certificate, you cannot access data in the database
            • You need to create a backup of the certificate and private key both of which are required for certificate recovery
Backup certificate @CertificateName
To file = '@DestinationFile'
with private key
( file = '@privateKeyName',
Encryption by password = '@PrivateKeyPassword'
); Go
            • The files are written to \mssql\data for the instance.
        • Create a DEK and protect it with the certificate

Use @DBToBeEncrypted

Create database encryption key with algorithm = AES_128
Encryption by server certificate @CertificateName

        • Encrypt the database

alter database @DBName

set encryption on;
Go

    • Partitioning indexes and tables
      • Partitioning divides index and table data across more than one filegroup. Data is partitioned so that groups of rows are mapped to individual partitions.
      • All partitions of a table or index must reside in the same database.
      • Available in enterprise and developer editions.
      • x64 versions support up to 15,000 partitions; on x86, no more than 1,000 partitions are recommended.
      • Concepts
        • Partition functions define how the rows of an index or table map to specific partitions based on the values of the partitioning columns
        • Partition scheme maps the partitions of a partition function to a collection of filegroups.
        • Partitioning column is the column of an index or table that a partition function uses to partition the index or table
        • Aligned index is an index that uses the same partition scheme as the table to which it belongs
        • Non aligned index is an index that is partitioned independently from the table to which it belongs
        • Partition elimination is the process by which the query optimizer accesses only the appropriate partitions to satisfy a query’s filter criteria.
      • To create
        • Create the filegroup(s) for the partitions
        • Create the partition function:

Create partition function @PartitionFunctionName(int) as range left for values(@ValueToUseToSplit1, @ValueToUseToSplit2)

          • With two boundary values, this creates three partitions (n boundary values produce n+1 partitions).
          • @ValueToUseToSplit1 and @ValueToUseToSplit2 are the boundary values used to partition the data
        • Create the partition scheme

create partition scheme @PartitionSchemeName

As partition @PartitionFunctionName
To(@Filegroup1, @Filegroup2, @Filegroup3);

Go

        • Create/modify the table/index and specify the partition scheme

create table @TableName (@TableSchema) on @PartitionScheme(@PartitioningColumnDefinedInTableSchema)

      • You can split and merge partition functions using Alter partition function
        • This allows you split one partition into two or merge 2 partitions into 1.
      • You use alter partition scheme to modify an existing partition scheme.
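      • A sketch of a split and a merge (placeholder names; designate the next filegroup before splitting):

Alter partition scheme @PartitionSchemeName next used @Filegroup4;
GO
Alter partition function @PartitionFunctionName() split range (@NewBoundaryValue);
GO
Alter partition function @PartitionFunctionName() merge range (@ExistingBoundaryValue);
GO
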
    • Managing log files
      • Log truncation is the process by which SQL frees space in the logical log for reuse by the transaction log.
        • It occurs if
          • In simple recovery, When a checkpoint occurs.
            • Automatic checkpoints occur when the number of log records reaches the number the Database Engine estimates it can process within the time specified by the recovery interval server configuration option.
            • Automatic checkpoints occur when the virtual log becomes 70% full.
            • Manual checkpoints are triggered using the checkpoint statement.
          • In full/Bulk-logged recovery,
            • After a log backup, as long as a checkpoint has occurred after the previous backup.
      • Use DBCC SQLPERF (LOGSPACE) to monitor the amount of log space used.
    • Using Database Console commands(DBCC)
      • They allow you to perform server administration tasks using queries.
      • They are divided into
        • Maintenance statements
          • DBCC Cleantable reclaims space from dropped var-length columns in indexed tables or views
          • DBCC DBreindex: rebuilds one or more indexes for a table
          • DBCC Dropcleanbuffers: removes all clean buffers from the buffer pool
          • DBCC Freeproccache: removes all elements from the plan cache, a specific plan from the plan cache or all cache entries related to a specific resource pool
          • DBCC Indexdefrag: defragments indexes
          • DBCC Shrinkdatabase: shrinks the size of all data and log files of the specified database.
          • DBCC ShrinkFile: shrinks a specified data or log file
          • DBCC UpdateUsage: updates page and row count data for catalog views to remove inaccuracies.
        • Informational Statements
          • DBCC inputbuffer: shows the last statement forwarded from a client to SQLEngine
          • DBCC Opentran: shows information about the longest running transaction
          • DBCC outputbuffer: displays the current output buffer in hex and ascii for a specific session id
          • DBCC procache:provides information about the procedure cache
          • DBCC show_statistics: provides current query optimization statistics for a table or indexed view.
          • DBCC showconfig: provides fragmentation information for tables, views or indexes
          • DBCC sqlperf: displays transaction log space usage statistics for all databases hosted by an instance.
          • DBCC tracestatus: provides information about trace flags
          • DBCC useroptions: provides information about currently set options on the connection.
        • Validation statements
          • DBCC checkalloc: performs a consistency check of disk space allocation structure
          • DBCC checkcatalog: checks catalog consistency of online databases
          • DBCC Checkconstraints: verifies the integrity of a specific constraint or all constraints on a table within the current databases
          • DBCC checkDB: Checks the physical and logical integrity of all objects in a database.
          • DBCC checkFileGroup: verifies the allocation and structural integrity of indexed views and tables in a specified filegroup
          • DBCC checkident: verifies and if necessary, changes the identity value for a specific table.
          • DBCC checktable: verifies the integrity of all pages and structures that make up a table or indexed view.
        • Miscellaneous Statements
          • DBCC dllname(Free): unloads a specific extended SP DLL from memory
          • DBCC freesessioncache: flushes the distributed query connection cache
          • DBCC Freesystemcache: flushes all unused cache entries from all caches
          • DBCC Help: provides information on a specific DBCC command
          • DBCC traceoff: disables specific trace flags
          • DBCC traceon: enables specific trace flags
Migrating, importing and Exporting
  • Migrating and upgrading to SQL server 2012
    • Upgrading an instance
      • From SQL server 2005: Ensure the host operating system can run SQL Server 2012
        • Apply 2005 SP4
      • From SQL server 2008: Apply SP2
      • From 2008 R2: Apply SP1
    • Upgrade advisor
      • A tool that allows you to identify issues that will potentially block an upgrade
      • It checks the services, but not applications or encrypted SPs.
      • Can be run only on hosts that support SQL server 2012.
    • Distributed replay utility
      • Replays a captured mission-critical workload against a test server to show how it will behave on SQL Server 2012 before you upgrade.
      • Allows you to determine if any incompatibilities might occur between the application and SQL server 2012.
    • Performing an upgrade
      • For X64 instances, you need to upgrade AS before database engine.
    • Migrating a Database to a SQL server 2012 instance
      • Allows you to move to a SQL Server instance of a different architecture
      • Detach a database
        • This involves removing the database from the instance while keeping the DB and log files.
        • Requires that
          • The DB is not replicated and published
          • The DB is not mirrored
          • The DB has no snapshots
          • It is not a system DB
        • Done using

exec sp_detach_db @dbname= @DBName;

      • Attach a database
        • You can attach a database only if all data files are available.
          • Done Using

create database @databaseName on (filename='@Directory\@DatafileName'), (Filename='@Directory2\@datafilename2'), (Filename='@directory3\@logFileName') for attach;

        • attaching a database sets the compatibility level to 90 if it was less, otherwise it leaves it as is.
      • Copying a database to other servers.
        • This allows you retain a copy of the database while you deploy a copy to the target.
        • This ensures the original database is still available
        • 3 ways to copy a database
          • Copy database wizard:
            • The database becomes immediately available on the target instance after the transfer completes and is automatically upgraded.
            • Full text indexes are unavailable during the upgrade.
            • Allows you to move SQL logins, jobs, SPs and custom error messages
            • Allows you to schedule the copy.
            • Cannot copy
              • system DBs
              • replicated, offline, suspect, loading, recovering, inaccessible and emergency mode DBs
            • Requires SQL server agent to be started on destination.
            • There are 2 transfer methods
              • Detach and attach method: Fastest, but requires the DB at the source to be taken offline.
                • The DB is reattached for the copy
              • SQL management object method
                • Slower, but allows the DB at the source to remain online
            • Create a credential and an integration Service proxy account to use for the copy database wizard.
          • Backup and restore
            • Prior to this, create the same backup devices on the destination as you use on the source.
            • When restoring a DB backup from a previous version, use the move option
            • The principal who performs the restore becomes the owner of the restored database.
            • Backups don’t have all metadata for the database
              • You must migrate Logins and jobs
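            • A restore using the move option might look like this (paths and logical file names are placeholders):

Restore database @DBName
from disk = N'@Directory\@BackupFile.bak'
with move N'@DataLogicalName' to N'@Directory\@DataFileName.mdf',
     move N'@LogLogicalName' to N'@Directory\@LogFileName.ldf',
     recovery;
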
          • Publish database using generate and publish scripts wizard
            • Used to publish a DB to a web host provider.
            • Can also publish a script you can use to transfer the DB.
      • Migrating SQL Logins
        • You can use 2 methods besides the copy database wizard
          • Use the generate a script function on the login
            • This allows you to recreate an object in the DB.
            • Allows you transfer windows authenticated SQL Logins between instances
            • For SQL-authenticated logins, the accounts are disabled and will need to have their passwords reset.
          • Use the Microsoft-provided login-transfer script (sp_help_revlogin).
  • Exporting and importing data
    • Large scale data import and export operations are called bulk imports
    • SQL server has several tools
      • The BCP command line utility
        • Allows you to bulk export data from a DB to a file, or import data from a file into tables in a DB.
        • Cannot transfer from DB to DB.
        • The result is unordered because SQL Server uses parallel scans, unless you use queryout with an ORDER BY clause.
        • Before importing data, ensure the destination table has columns whose data types are compatible with the data you want to import.
        • Export done using

bcp "@QueryWithOrderByClause" queryout @DestinationFile -T

          • -T signifies you want to use a trusted connection.
        • Import done using

bcp @Database.@Schema.@TableName in @FileContainingData -T

        • You can use a format file with bcp by specifying the -f option
      • Bulk insert statement
        • Used to load data from a file.
        • Allow you use a format file
        • Done using

bulk insert @Database.@Schema.@TableName from '@Directory\@File'

      • OPENROWSET(BULK)
        • Allows you to connect to an OLE DB data source to retrieve data.
        • Used in an INSERT ... SELECT ... FROM OPENROWSET(BULK …) statement, as sketched below.
          • … stands for the remaining parameters
        • Most suitable for one-off importations
        • Allows you to use a format file
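        • A sketch of such an import with a format file (placeholder names):

Insert into @Database.@Schema.@TableName (@Column1, @Column2)
Select s.@Column1, s.@Column2
from openrowset(Bulk N'@Directory\@DataFile', Formatfile = N'@Directory\@FormatFile.xml') as s;
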
      • Integration services import and export wizard
        • Used to migrate data between SQL Server and OLE DB providers such as Microsoft Access and Excel
        • Creates an Integration Services package and allows you to perform complex data transformations.
        • Allows you to schedule it to run automatically using SQL Server Data Tools.
      • Select into.
        • The into clause allows you to create a new table by using the results of a select statement.
        • The columns of the new table inherit the characteristics of the output columns.
        • Indexes, triggers and constraints aren't transferred and must be manually created at the target.
    • Format files are used to map data file fields to the columns of a table.
      • Can be used with bcp, BULK INSERT or OPENROWSET(BULK…) when importing data from a file that uses fixed-width or character-terminated fields.
      • Format files can be XML or non-XML.
      • Must be used if you are performing bulk import from a data file with fixed width or fixed-length fields.
      • bcp is used to create a format file

bcp @Database.@Schema.@TableName format nul -c -x -f @Filename.xml -t, -T

    • Preparing data for bulk operations
      • To improve performance
        • Switch the database recovery model away from full (to bulk-logged or simple) for the duration of the import (see the sketch after this list).
        • Perform the bulk import operations in parallel
          • If the destination table has no indexes, specify the TABLOCK option.
            • This causes the DB to perform bulk optimization and prevents blocking issues.
        • Disabling triggers and constraints
        • If the destination table has a clustered index, import data from the data file using the same ordering as in the table
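      • A combined sketch of the recovery-model switch and the TABLOCK hint (placeholder names):

Alter database @DBName set recovery bulk_logged;
GO
Bulk insert @Database.@Schema.@TableName from '@Directory\@DataFile' with (Tablock);
GO
Alter database @DBName set recovery full;
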
SQL Server Logins, Roles and Users
  • Logins are credentials that authenticate connections to an instance.
  • SQL Logins
    • A database user must map to an existing SQL Server login unless the instance supports contained databases.
    • There are 4 types
      • Windows-authenticated login
        • Instance logins in which the OS handles authentication.
        • The SQL login is mapped to an OS user account
        • Done Using

Create Login [@DomainOrHostName\@UserName] from windows;

      • SQL server-authenticated login
        • Logins are authenticated by the database engine.
        • Done using

Create Login @UserName with password = '@Password';

        • Can be configured to use the password expiration and complexity policies of the host OS using 2 create login options
          • Check_Expiration: use the password expiration of the host OS. can be set to on or off
          • Check_policy: use the password complexity policy of the host OS. can be set to on or off
      • Certificate
        • You can create a certificate to be associated with a login if a database master key is present.
          • To create a master key

create master key encryption by password = '@password'

        • To do this
          • Create a certificate and configure its expiry

Create certificate @CertificateName

with Subject='@DetailsOfCertificate_E.G._TheUserOfTheCertificate',
expiry_date='@ExpiryDate'

          • Create the login from the certificate

Create login @UserName from Certificate @CertificateName

      • Asymmetric Key
        • Asymmetric keys contain both a public key and a private key.
        • If there is no database master key, you must provide a password when creating an asymmetric key.
        • To do this
          • Create the asymmetric key

create asymmetric key @KeyName with algorithm=RSA_2048;

          • Create the Login from the asymmetric key

create login @Username from asymmetric Key @KeyName

    • Altering existing logins
      • Done using

Alter login @UserName @Option

        • @Option can be disable, unlock and Must_change
    • Viewing logins
      • Query sys.server_principals
      • For sql server authenticated logins, query sys.sql_logins
    • Removing logins
      • Done using

Drop login @UserName

      • The login mustn't be currently connected, and must not own a server-level object, a SQL Server Agent job, or a securable.
      • best practice is to disable the login.
    • Denying server access
      • used to prevent some windows security principals from connecting to the database engine
      • done using

Deny Connect SQL to [@Domain\@User]

  • Server Roles
    • They allow you to simplify the assignment of permissions at the instance level.
    • Permissions are assigned to server roles rather than individual logins
    • To grant a login permissions, you add the login to the role
    • There are 9 fixed server-level roles
      • Sysadmin: Members can perform all activities possible on the database engine instance
      • Serveradmin: Members can perform instance-wide configuration tasks.
      • Securityadmin: Members can carry out instance-level permissions management.
      • Processadmin: Members can terminate processes running on an instance
      • Setupadmin: Members can add and remove linked servers
      • Bulkadmin: members can use bulk insert on databases on the instance.
      • diskadmin: members can manage instance-related files.
      • dbcreator: Members can create, alter, drop and restore databases on the instance
      • public: All logins are members of this role. You can add permissions to this role.
    • You use Alter server role to modify the membership
      • use the add member option

Alter server role @RoleName add member [@Domain\@UserName]

    • To manage server roles
      • sp_helpsrvrolemember
        • Provides fixed server role membership
      • sp_srvrolepermission
        • provides fixed server role permissions
      • is_srvrolemember
        • Enables you to check whether a login is a member of a specific fixed or user-defined server role
      • sys.server_role_members
        • Provides information about role members
    • User defined server roles
      • Used to create custom server roles.
      • To create User defined server roles,
        • Create the server role

Create server role @ServerRoleName [Authorization @Domain\@UserName]

          • You can use the Authorization option to set who owns the server role.
        • Assign permissions to the server role

grant @Permission to @ServerRoleName

          • @Permission is the permission to grant the role e.g. alter any database
        • Add users to the server role
  • Credentials
    • They store authentication information that facilitates a connection to a resource external to the DB engine instance e.g. an FTP server.
    • You can map a single credential to multiple SQL logins.
    • You can map a login to a single credential only.
    • Created using

create credential @CredentialName with identity = '@UserName', secret = '@Password'

 
  • Managing Users and DB roles
    • Database users
      • Database users represent the identity of SQL server logins when connected to a DB.
      • Each DB user maps to a login except for contained databases.
      • You can map a login to only one user per DB, but to different users in different DBs
      • Add a database user using

Use @DBName

Go
create user "[@Domain\]@UserName" for Login "[@domain\]@Username"
Go

      • Drop a database user using

drop user "@domain\@Username"

        • Dropping a user doesn’t remove the SQL login
        • Deleting an SQL Login doesn’t delete the database users associated with the SQL login
        • Orphaned users are database users whose corresponding SQL login has been dropped, or whose database has been moved to a different instance
          • To list orphaned users, use sp_change_users_login @Action='report'
    • Database roles
      • They allow you to assign permissions to DB users on a collective basis.
      • Fixed DB roles: There are 9 of them; their permissions are fixed and cannot be altered
        • db_owner: Assigned to principals who need to perform all DB config and management tasks. Allows members to drop the DB
        • db_securityadmin: Members manage membership of roles.
        • db_accessadmin: Members manage database access for logins
        • db_backupoperator: Members can back up the DB
        • db_ddladmin: Members can run any DDL command in the DB
        • db_datawriter: Members can insert, delete or modify data in a DB user table
        • db_datareader: members can read all data from all user tables in a DB
        • db_denydatawriter: Blocks members from inserting, deleting or modifying data in a DB user table
        • db_denydatareader: Blocks members from reading data stored in a DB’s user table
      • Flexible DB roles
        • These have custom permissions
        • Created using

create role @RoleName [Authorization=[@Domain\]@Username]

          • Authorization is used to set ownership of the role
        • Assign permissions to a role using

grant @Permission to @role

      • To add a DB user to a role,

Use @DBName

Go
Exec sp_addrolemember '@Role', '[@Domain\]@Username'

        • You cannot use alter role @DBroleName add member as used for instance level users(Alter server role @InstanceRoleName add member)
      • msdb roles
        • msdb has a set of special DB roles in addition to the 9
          • They allow you to assign permissions to IS, data collector, server groups, DB mirroring, Server agent and policy-based management
      • User and role catalog views
        • Used to determine information about users, roles and permissions at the DB level
          • sys.database_principals: provides information about DB principals
          • sys.database_role_members: provides information about roles and the security principals who are members of those roles
          • sys.database_permissions: provides information about permissions assigned to security principals at the DB level.
    • Contained Users
      • Contained DBs allow you to move a DB to a different instance or the cloud and deploy it without additional config
      • A contained user is a DB user who does not have a SQL server login.
        • They connect by specifying credentials in the connection string
      • To create contained users
        • Enable contained DB authentication on the instance

sp_configure 'show advanced options', 1;

reconfigure with override;
go
sp_configure 'contained database authentication', 1;
reconfigure with override;
go

        • Create the contained users

create user @Contained_user with password = '@Password'

          • Partially contained DBs can have windows authenticated users

create user [@Domain\]@Username

    • Least privilege
      • The principle of least privilege is a principle of information security that dictates granting a security principal only privileges necessary to perform required tasks.
    • Application roles
      • These enable you to grant permissions to a specific application.
      • Applications access data using the permissions assigned to the application role instead of a DB user's permissions.
        • Happens in this way
          • User interacts with a client application that connects to an instance with the user’s credentials
          • The application executes sp_setapprole, authenticating by using a password configured for the application
          • When authenticated, the application role is enabled, and the connection uses the permissions assigned to the application role.
        • Can be created using

Create application role @ApplicationRoleName with password = '@Password'

        • Can alter the name, password or default schema using alter application role
        • Can be dropped using drop application role
Securing SQL server
  • Database permissions
    • These enable you to control which actions can and cannot be performed by security principals at the database level.
    • It involves understanding permissions and how they can be managed
    • Securables
      • An item for which you can assign permissions e.g. users, schemas,roles, objects
      • Permission keywords
        • Alter: provides the ability to modify the properties of a securable.
        • Backup/dump: permission to back up or dump the securable
        • control: all permissions on the securable; can grant permissions on the securable; grants control permissions on all securables within the scope
        • create: Permission to create the securable
        • delete: Permission to delete the securable
        • execute: permission to execute the securable
        • impersonate: permission to impersonate the securable principal.
        • insert: permission to insert data into the securable
        • receive: permission to receive messages from service broker
        • reference: permission to reference the securable
        • restore/load: permission to restore the securable
        • select: Permission to view data in the securable
        • take ownership: permission to take ownership of the securable
        • update: Permission to modify the data stored in the securable
        • view definition: permission to view the securable definition
        • connect: permission to connect to the securable
      • Assigning permissions on objects
        • Use the following to manage permissions on objects
          • Grant: to allow the permission to be used

grant @PermissionKeyword on @DBObject to @DBUser

          • Deny: to block the permission from being used

Deny @PermissionKeyword on @DBObject to @DBUser

          • Revoke: to remove an existing grant or deny permission

revoke @permissionKeyword on @DBObject to @DBUser

    • Managing permissions by using database roles
      • Assigning fixed DB roles
        • The permissions on the fixed DB roles cannot be changed
        • among the fixed DB roles, you can only add the public role to another database role.
          • Other Fixed DB roles do not support this feature

sp_addrolemember @RoleWithDesiredPermission, @RoleToBeEnhanced

OR
alter role @RoleWithDesiredPermission add member @RoleToBeEnhanced

    • Using Flexible DB roles
      • This allows nesting of roles.
    • Protecting objects from modification
      • This involves configuring permissions so only authorized principals can make modifications.
      • To carry it out on the DB level, add security principals to the db_denydatawriter role.
        • This applies to the whole DB
      • To protect specific database objects, use deny with alter, control, delete, insert, take ownership and update permissions

Deny alter on @DBObject to @SecurityPrincipal

Deny control on @DBObject to @SecurityPrincipal

      • Placing objects in read-only filegroups doesn’t stop all forms of modification
    • Using Schemas
      • Schemas simplify application of permissions by enabling you to collect securables into containers.
      • Objects belong to schemas and schemas belong to security principals.
      • You can create a flexible role, create a schema owned by the role, and then add principals to the role to grant them permissions over objects within the schema (see the sketch below).
        • You can move a securable to a different schema in the same database using

Alter schema @DestinationSchema transfer @SourceSchema.@Object;
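
        • A sketch of the role-owned-schema pattern described above (role, schema, and user names are placeholders):

Create role @RoleName;
GO
Create schema @SchemaName authorization @RoleName;
GO
Exec sp_addrolemember '@RoleName', '@UserName';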

    • Determining effective permissions
      • Use HAS_PERMS_BY_NAME() to determine a specific principal’s effective permission on a certain securable.
        • An effective permission can be one of the following
          • Permission granted directly to the security principal
          • permission is granted to a role of which the security principal is a member
          • Permission is implied by a higher-level permission held by the security principal.
          • Permission is not denied to the principal directly or indirectly through role membership.
        • HAS_PERMS_BY_NAME() returns 1 if the principal has an effective permission and 0 if it doesn’t. It returns Null if neither the permission nor securable class is valid.
        • It runs in the context of the current principal.
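        • For example, to check the current principal's SELECT permission on a table (placeholder names):

Select HAS_PERMS_BY_NAME('@Schema.@TableName', 'OBJECT', 'SELECT') as HasSelectPermission;
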
  • Troubleshooting SQL server security
    • This involves ensuring that the security settings configured for an instance or database match the security requirements for that instance or database.
    • Troubleshooting authentication
      • query sys.server_principals to determine if a login is disabled
      • use alter login to configure login permissions
        • use must_change parameter to force a password change upon next logon
        • use unlock parameter to unlock a user’s account
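        • A sketch of both fixes (placeholder names; MUST_CHANGE requires the policy and expiration checks to be on):

Alter login @UserName with password = '@NewStrongPassword' unlock;
GO
Alter login @UserName with password = '@NewStrongPassword' must_change, check_policy = on, check_expiration = on;
GO
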
      • Authentication modes
        • Windows authentication mode disables SQL Server-authenticated logins. When selected during install, the sa account is disabled
        • Mixed authentication mode allows both Windows-authenticated and SQL Server-authenticated logins. This requires that a strong password be assigned to the sa account
    • Client connection problems
      • The browser service helps clients connect to instances that are not using fixed ports.
      • If users cannot connect to named instances, the Browser service not running is the most likely cause
    • Troubleshooting certificates and keys
      • A common problem is expired certificates.
      • Ensure keys and certificates are regularly backed up because encrypted data is unrecoverable if keys and certificates become corrupt.
      • To see information about them, query
        • sys.asymmetric_keys:
        • sys.certificates:
        • sys.key_encryptions: provides info about each symmetric encryption specified when using create symmetric key with the encryption by parameter
        • sys.symmetric_keys:
    • Troubleshooting endpoints.
      • You need to understand the precise configuration of the endpoint that is causing the issue. To do this, query
        • sys.database_mirroring_endpoints
        • sys.endpoints
        • sys.http_endpoints
        • sys.service_broker_endpoints
        • sys.tcp_endpoints
      • When troubleshooting security endpoints, you might need to alter the authentication or encryption settings.
      • An endpoint can be in a stopped state.
      • You can configure an existing endpoint using alter endpoint
        • To start a stopped endpoint, use state parameter
        • To change the endpoint authentication, use authentication parameter
        • To enable, disable or require encryption, use the encryption parameter
      • use the grant statement to grant permissions on an endpoint to a login.
      • When configuring permission on endpoints, ensure the remote instance can authenticate the local instance using windows authenticated logins or certificates.
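      • A sketch of those adjustments against a mirroring endpoint (placeholder names):

Alter endpoint @EndpointName state = started;
GO
Alter endpoint @EndpointName
for database_mirroring (authentication = certificate @CertificateName, encryption = required algorithm AES);
GO
Grant connect on endpoint::@EndpointName to [@Domain\@RemoteServiceAccount];
GO
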
    • Security catalog views
      • They are used to view the properties of server and database security. They include
        • sys.server_permissions
        • sys.server_principals
        • sys.server_role_members
        • sys.sql_logins
        • sys.system_components_surface_area_configuration
      • To view security config at the individual DB level, use the following
        • sys.database_permissions
        • sys.database_principals
        • sys.database_role_members
        • sys.master_key_passwords
  • Auditing instances
    • Auditing enables you to track activity on a SQL Server instance.
    • SQL Server audit
      • It enables you to configure sophisticated auditing at the instance and database levels
      • It forwards audit results to a target/audit destination. The target can be
        • A flat file
        • Application log
        • security log
      • To create a server audit

Create server audit @AuditName

To @Target
With
( Queue_delay=@DelayDuration, On_Failure=@AuditFailureAction)

        • @Target can be the directory to the file, application_log or security_log
        • @DelayDuration is the delay in ms before audit actions must be processed.
        • @AuditFailureAction determines the action to take when there is an audit log failure. If set to shutdown, the instance shuts down and might require the -f service parameter to startup again.
      • SQL server audit action groups and actions
        • These are used to determine the kinds of actions SQL Server Audit records.
        • You can audit the use of specific actions on DB schemas and schema objects with DB audit specifications
      • To create a server audit specification

create server audit specification @AuditSpecificationName

For server audit @AuditName
Add (@ServerAuditActionGroup)

      • To alter a server audit specification

alter server audit specification @AuditSpecificationName

add(@ServerAuditActionGroup)

        • To turn on the audit specification

alter server audit specification @AuditSpecificationName with (State = on)

      • To create a DB audit specification

create database audit specification @AuditSpecificationName

For server audit @AuditName
ADD ( @Action on Object::@Schema.@Object by @SecurityPrincipal)

        • To turn on the audit,

alter database audit specification @AuditSpecificationName with (State = on)

        • To create, alter or drop a DB audit specification, a security principal must have permission to connect to the db and must be assigned either alter any DB audit permission or alter or control permission on the DB
      • Viewing audit views and functions
        • sys.dm_audit_actions
        • sys.server_audits
        • sys.dm_server_audit_status
        • sys.server_audit_specifications
        • sys.server_audit_specifications_details
        • sys.database_audit_specifications
        • sys.database_audit_specifications_details
        • fn_get_audit_file()
    • C2 Audit Mode
      • This configures the instance to record both successful and failed attempts to access statements and objects as per the C2 security standard
      • Enabled using

sp_configure 'c2 audit mode', 1;

go
reconfigure;

      • C2 audit data is stored in the instance data directory.
    • Common criteria compliance
      • Common criteria compliance supersedes c2 audit mode. This information is viewed by querying sys.dm_exec_sessions.
      • To enable

sp_configure 'common criteria compliance enabled', 1;

go
reconfigure

    • Policy based management
      • This provides you with a method of managing more than one instance across multiple servers.
      • Can be used to manage instances, databases, and other objects like backup devices
      • Objects to configure policy-based management
        • Facet: A collection of properties that can be configured
        • Condition: An expression that specifies the allowed values of a facet's properties
        • Policy: A condition that should be enforced.
        • Category: A group of policies that should be enforced together.
        • Target: Defines the scope of policy application. can include instances, databases or DB objects
      • Policies can be exported and imported between instances.
      • You can create a central management server for applying policies on one instance to all instances.
      • To view information about policy based management
        • syspolicy_conditions
        • syspolicy_policies
        • syspolicy_policy_execution_history
        • syspolicy_policy_execution_history_details
        • syspolicy_policy_group_subscriptions
        • syspolicy_policy_categories
        • syspolicy_system_health_plan
Mirroring and replication
  • Mirroring is the process of creating and maintaining an always up-to-date copy of a database on another SQL Server instance. If a witness instance is configured, automatic failover is possible if the principal instance becomes unavailable
  • DB mirroring
    • A strategy for increasing the availability of a database.
    • DB mirrors are paired copies of a single DB hosted on separate instances of SQL server.
    • All operations performed against the principal DB are redone on the mirror through a stream of active transaction log records that the engine applies in sequence immediately after it receives them.
    • Mirroring can be configured in 2 modes
      • High-safety mode: Enables failover to occur without data loss from committed transactions. Transactions are committed on both partners, which are kept synchronized.
      • High-performance mode: data loss is possible. The primary instance doesn’t wait for the mirror instance to confirm it has received the log record
    • Mirroring can only occur for user databases
    • Only databases with full-recovery model can be mirrored
    • Mirroring supports only a single principal instance and a single mirror instance.
  • Mirroring pre-requisites
    • All instances must run the same version and edition, except the witness which can run any edition.
    • Mirroring restrictions
      • Mirrored databases cannot be renamed
      • Mirroring cannot work for a DB with Filestream filegroups.
      • Filestream filegroups cannot be created on the principal server.
      • Cross database and distributed transactions are not supported for mirrored databases
    • Before mirroring can occur, you must restore a database using the Norecovery option on the mirror instance.
      • Apply transaction logs with Norecovery option before mirroring is enabled.
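      • A sketch of preparing the mirror (placeholder names):

Restore database @DBName from disk = N'@Directory\@FullBackup.bak' with norecovery;
GO
Restore log @DBName from disk = N'@Directory\@LogBackup.trn' with norecovery;
GO
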
    • Firewall rules
      • traffic must be permitted through the endpoint using port permissions for the port configured
    • Configuring mirroring with Windows authentication
      • This is possible if the service accounts on all servers are in the same active directory domain.
        • Otherwise, a login for the service account used by the other instance must be created in the master database on each side.
    • Configuring mirroring with certificate authentication(This sets the database to mirror in high-safety mode)
      • Used when the service accounts aren't in the same domain.
      • Create certificates and the endpoint
        • Create the certificate, back them up and create a user on the partner instance
          • Create the certificate

create certificate @CertificateName with subject = '@CertificateDescription'

          • Backup the certificate on both instances

backup certificate @CertificateName to file = '@directory\@File.cer';

          • Create a user on the partner instance after moving the backed up certificate to @directory on the partner

Create login @SQL_Login with password=@Password;

create user @UserName for login @SQL_Login;
create certificate @PartnerCerficateName
Authorization @UserName
From file = '@directory\@FileName.cer'
go

        • Create the endpoint on both instances

create endpoint @EndpointName

state=started
as tcp( listener_port= @PortNumber, listener_ip=all)
for database_mirroring(authentication= certificate @CertificateName, encryption = required algorithm AES, role=all);

          • Grant connection permissions to the endpoint on partner instances

grant connect on endpoint::@endpointName to @partnerSQL_Login

      • Configure mirroring on the database on partner instances

Alter database @DBName

Set partner = 'TCP://@partnerName.Domain.com:@PortNumber'

    • Changing Operating Modes
      • You can change mirroring operating mode on the mirroring page of the database’s properties. You can change to high safety with automatic failover mode only if a witness server is present.
      • Done using the Alter database statement with the safety option (see the example after this list)
        • Setting safety to full configures the database for high-safety mode. This is the default mode for DB mirroring
        • Setting the safety option to off configures the DB for high-performance mode.
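        • For example, a sketch using the document's placeholder convention (full = high-safety, off = high-performance):

Alter database @DBName set partner safety full

Alter database @DBName set partner safety off
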
      • You can add a witness to a mirroring pair by creating the endpoint on the witness, creating logins for the principal and mirror instance service accounts and granting connect privileges on the endpoint to that login.
        • Also, create a login on the principal and mirror instances for the domain account used for the SQL Server service on the witness.
        • To specify the witness

Alter database @DBName

Set witness = 'TCP://@WitnessServerName:@PortNumber'

    • Role switching and failover
      • There are 3 types of role switching
        • Manual failover
          • Available if the mirroring session is configured in high-safety mode.
          • Can be initiated when the mirroring session is synchronized.
          • Initiated on the principal instance using the command

alter database @DBName set partner failover

        • Automatic failover
          • Possible if a witness is present and the mirroring session is configured for high-safety mode.
          • Requires the mirroring session is in a synchronized state
          • If all instances lose communication, automatic failover will not occur even if the witness and the mirror establish communication.
        • Forced service( with possible data loss)
          • Can be used in high-safety mode when no witness is present or in high-performance mode. The mirror is forced to become the principal when the original principal is not available.
          • Can result in data loss.
          • Initiated on the mirror instance using the command

alter database @DBName set partner force_service_allow_data_loss

    • Monitoring mirrored DBs
      • Query sys.database_mirroring on either principal or mirroring instance.
      • Use the Database mirroring monitor.
        • Allows you to monitor how data is being transmitted between the principal and mirror instances and other details of the mirroring session
      • use SPs
        • sp_dbmmonitorresults
        • sp_dbmmonitorchangemonitoring
        • sp_dbmmonitordropmonitoring
        • sp_dbmmonitoraddmonitoring
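      • For example, a quick status query against sys.database_mirroring that runs on either partner:

select db_name(database_id) as DatabaseName,
       mirroring_role_desc, mirroring_state_desc, mirroring_safety_level_desc,
       mirroring_partner_instance
from sys.database_mirroring
where mirroring_guid is not null;
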
    • Upgrading mirrored DBs
      • Requires the following
        • Change the operating mode to high-safety without automatic failover
        • Perform a full backup of the principal server
        • Run DBCC CheckDB on the principal server
        • Upgrade the mirror
        • Manually failover to the upgraded mirror
        • Upgrade the Witness
        • resume mirroring the DB
        • Switch back to high-safety with automatic failover if that was the original operating mode
        • Upgrade the new mirror instance (the former principal) to SQL server 2012.
Database replication
  • Replication enables SQL Server to distribute and synchronize data and DB objects from one DB to another
  • Replication components
    • Publisher: An instance that makes data available through publication
    • Article: A published object
    • Publication: A collection of articles
    • Distributor: An instance that manages the transmission from publisher to subscriber.
    • Subscriber: An instance that receives publications
    • Agent: A service that enables the publisher, distributor or subscriber to perform replication-related tasks.
  • Replication Types
    • Snapshot replication:
      • Allows complete refreshes of data rather than incremental updates
      • Publisher synchronizes with subscriber by creating and transmitting a DB snapshot.
        • The snapshots are generated using bcp (the bulk copy program)
        • They are stored in a snapshot folder that the subscriber needs access to when pull subscriptions are used (not required for push subscriptions).
      • Suitable for DBs with infrequent use
    • Transactional replication:
      • Allows subscriber to stay up to date with changes on the publisher
      • Changes to the publisher propagate to the subscriber as they occur.
      • Replication occur from the publisher to the subscriber.
      • Update occur at the publisher
      • Allows an Oracle database to publish to SQL Server
      • Changes must be applied to the subscriber in the order they occurred in the publisher
    • Peer to peer replication:
      • Enables peer nodes to read and write changes and have those changes propagate to other nodes in the replication topology
      • You prevent conflicts by partitioning the data so the same row will not be updated at separate locations.
        • if the data isn’t partitioned properly and a conflict occurs, the replication fails or the changes will not replicate.
      • Requirements
        • All instances must run SQL-Server enterprise edition.
        • Row and column filtering aren’t allowed
        • Publication names must be identical on all instances participating in the replication
        • Each participant should use its own distribution DB to ensure there is no single point of failure
        • A table cannot be included in more than one peer-to-peer publication in the same publication DB
        • peer to peer subscriptions cannot be reinitialized. You must restore a backup at a node to force data update
        • If you add new nodes and have to perform a restore, you must only use backups created with the new nodes in the topology
    • Merge replication:
      • For scenarios where data conflicts are possible and any node may issue data changes that eventually become consistent across all nodes.
      • Allows databases hosted on instances at separate locations to be updated and those changes replicated to all other DBs in the replication topology.
      • Publisher and subscribers track changes made using triggers.
      • synchronization involves exchanging all rows that have changed between the publisher and subscriber since the last synchronization.
      • Requirements
        • Subscriber can make offline changes to data and synchronize those changes back when they connect again.
        • You can detect and resolve update conflicts
        • You must support non-SQL server nodes participating in the replication
        • Applications do not require tables to be transactionally consistent
      • It uses SQL server snapshot agent and the merge agent for conflict resolution.
        • If the merge agent detects a conflict, e.g a row being updated with different values at different locations, the conflict resolver determines which data is accepted.
        • When configuring a subscription in merge replication, you specify a subscription type and a conflict resolution priority.
          • Servers assigned a higher priority override servers assigned a lower priority.
          • When you use the client subscription type, the first change written to the publisher wins
          • When a conflict occurs between a publisher and subscriber, the publisher's change is kept and the subscriber's change is dropped.
  • Replication monitor
    • Can be used to monitor replication performance.
    • Can be used to configure alerts based on performance benchmarks.
  • Size limitations for Replication
    • 256 merge publication articles
    • 32767 Snapshot or transactional publication articles
    • 246 Merge publication columns in a table
    • 1000 snapshot or transactional publication columns in a table
    • 1024 merge replication bytes for a column used in a row filter
    • 8000 snapshot or transactional publication bytes for a column used in a row filter
  • Controlling replication of constraints, columns and triggers
    • The not for replication option enables you to configure foreign key constraints, check constraints, identity column values and triggers to behave differently when an insert, update or delete operation is performed by a replication agent than they do during normal operations.
      • They will not be enforced for operations performed by a replication agent when that option is set (see the sketch below).
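    • A minimal sketch of an identity column and a check constraint that a replication agent's operations bypass (the table and column names are hypothetical):

create table dbo.Orders
(
    OrderID int identity(1,1) not for replication primary key,
    Quantity int not null
);

alter table dbo.Orders with nocheck
add constraint CK_Orders_Quantity
check not for replication (Quantity > 0);
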
  • Heterogeneous Data.
    • You can integrate to Oracle and DB2
      • Oracle can publish to DBs on SQL server
        • The data can be published on the SQL server to serve other DBs on Oracle, DB2 or SQL Server.
      • SQL Server can publish to or distribute data for Oracle and DB2 subscribers
    • This doesn’t support Merge or peer to peer replication.
Clustering and Always On
  • Failover clustering is an approach to high availability that ensures a database remains available in the event of a server failure.
    • A failover cluster instance is a deployment of SQL server that stores DB files on a shared storage device. If the server that hosts an instance fails, another Instance in the cluster takes control of the files and seamlessly continues to provide service.
      • The DB files must be on a SAN
      • Connect the server to the SAN using the iSCSI initiator
    • Failover clustering requires the windows failover cluster feature.
    • Installing a SQL failover cluster
      • During installation of SQL server, select advanced then advanced cluster preparation
      • Conclude installation of the instance
    • Adding a node to the cluster
      • During installation of SQL server, select advanced then advanced cluster completion
    • To perform a manual failover,
      • using powershell

move-clustergroup @ClusterResource @DestinationServer

    • Troubleshooting failover clusters
      • If failover occurs and a node is irrecoverable,
        • evict the node from the failover cluster
        • Replace the hardware and add the node back to the cluster
        • run SQL setup to admit the node to the failover cluster instance.
  • AlwaysOn Availability Groups is the technology that supersedes DB mirroring. It allows clients read-only access to secondary replicas
    • An availability group is a collection of user DBs(availability DBs) that can fail over together.
    • Supports a set of read-write primary databases and up to four sets of secondary databases.
    • They allow you configure one or more sets of secondary databases so that they are accessible for read-only operations
    • Failover occurs on a per-replica basis, and all DBs in the replica fail over.
    • Availability group prerequisites.
      • All hosts must be nodes in a failover cluster.
    • Availability modes
      • The availability mode selected depends on data loss and transaction latency requirements.
      • There are the following availability modes
        • Asynchronous-commit: Suitable when you must place availability replicas at different locations.
          • The primary will not wait for the secondaries to write log records to disk. This reduces transaction latency.
        • Synchronous-commit: Increases transaction latency but minimizes the chance of data loss in the event of an automatic failover
          • The primary replica waits for the secondary replica to confirm that it has hardened the transaction's log records to disk before the transaction commits.
      • You can alter availability mode using the alter availability group statement

alter availability group @AvailabilityGroupName modify replica on '@Host\@InstanceName' with (availability_mode = @AvailabilityMode)

    • Selecting failover modes
      • Availability groups fail over at the availability replica level.
      • Failover involves another instance becoming the primary replica, with the original primary replica being demoted to become a secondary replica. There are 3 types
        • Automatic failover:
          • Occurs without administrator intervention.
          • Involves no data loss
          • requires the primary and a secondary to be configured with a failover mode of automatic and be synchronized.
          • Can occur only if the primary and replica are in synchronous-commit mode.
        • Planned Manual failover
          • Triggered by an administrator
          • Involves no data loss
          • Can occur if at least one of the secondary replicas is in a synchronized state.
          • Require primary and replica instances to be in synchronous-commit mode.
          • To carry out manual failover,
            • TSQL(run the following on the proposed primary instance)

alter availability group @AvailabilityGroup failover;

            • Powershell (run on the proposed primary instance)

switch-sqlavailabilitygroup -path SQLSERVER:\SQL\@HostName\@FailoverClusterName\AvailabilityGroups\@AvailabilityGroupName

        • Forced manual failover
          • This form of failover involves the possibility of data loss. Use forced manual failover when no secondary replica is in the synchronized state or when the primary replica is unavailable.
          • Can work if either node is set to asynchronous-commit mode.
          • Done using
            • TSQL(run the following on the proposed primary instance)

alter availability group @AvailabilityGroup force_failover_allow_data_loss;

            • Powershell (run on the proposed primary instance)

switch-sqlavailabilitygroup -path SQLSERVER:\SQL\@HostName\@FailoverClusterName\AvailabilityGroups\@AvailabilityGroupName -AllowDataloss

    • Configuring readable secondary replicas
      • Readable secondary replicas can service read-only requests for DB access
    • Deploying Alwayson availability groups
      • Requires the following tasks
        • create a mirroring endpoint:
          • Not necessary if using the same domain account for all SQL Server services, as the endpoint is created automatically
        • enable alwaysOn on the instance
          • From configuration manager. Enable AlwaysOn on the instance.
          • Powershell:

enable-sqlalwayson -path SQLSERVER:\SQL\@Hostname\@Instance

        • create an availability group
          • From SSMS
            • Use the always on availability node.
            • To add an encrypted DB or a DB with an encryption key, use add a secondary replica.
          • TSQL

create availability group @AvailabilityGroupName
for database @DBName
replica on '@ReplicaHostName1\@InstanceName' with
( Endpoint_URL = 'TCP://@ReplicaHostName1[.@Domain.com]:@Port',
availability_mode = @AvailabilityMode, Failover_mode = @FailoverMode ),
'@ReplicaHostName2\@InstanceName' with
( Endpoint_URL = 'TCP://@ReplicaHostName2[.@Domain.com]:@Port',
availability_mode = @AvailabilityMode, Failover_mode = @FailoverMode );
        • create an availability group listener
          • An availability group listener is a network connectivity endpoint for an availability group.
          • Clients connect to the listener, which connects them to the availability group’s primary instance.
          • To create,

alter availability group [@AvailabilityGroupName] add listener '@ListenerName' ( with ip (('@ListenerIP', '@ListenerSubnet')), port = @PortNumber );

        • add a secondary replica
          • Requires some conditions
            • fewer than four secondary replicas
            • primary replica is online
            • Secondary replica’s instance can connect to the primary replica mirroring endpoint
            • Secondary replica instance has enabled alwaysOn
          • To add (the replica must first be added to the group on the primary, as sketched after this block), run the following on the secondary replica's instance

alter availability group @AvailabilityGroupName join;
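          • A sketch of adding the replica to the group on the primary instance first (placeholders follow the document's convention; the modes shown are example values):

alter availability group @AvailabilityGroupName
add replica on '@ReplicaHostName3\@InstanceName' with
( endpoint_url = 'TCP://@ReplicaHostName3.@Domain.com:@Port',
availability_mode = asynchronous_commit, failover_mode = manual );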

Troubleshooting SQL server
  • Performance monitor
    • A windows utility for capturing statistics about the hardware, OS and apps to expose properties and counters.
    • Collects and displays real-time performance data in the form of counters for server resources e.g. processor and memory use.
    • Data collector sets provide the ability to group data collectors into reusable elements for different performance-monitoring scenarios.
      • They organize data collection points into a single component that can be used to review or log performance.
      • Can be recorded individually, grouped with other into logs, and used to generate alerts.
      • They can contain
        • Performance counters
        • Event trace data
        • System configuration information(registry key values)
    • Counters are organized into a 4 level hierarchy
      • An object is a resource that can be monitored.
        • a component, application or subsystem within an application.
      • an object exposes one or more counters
      • a counter might have several instances if more than one resource of that type exists e.g. multiple processors
    • When you define counters to capture, you can specify criteria at any level within the hierarchy.
    • Counters are used in conjunction with other counters and environmental info to diagnose problems
    • 3 counters indicate a system problem on their own. Their values show how many requests are waiting for the resource to become available. When their values are greater than 0 for a sustained period, it indicates a hardware bottleneck
      • System processor Queue Length
        • Indicates when the processor is overwhelmed with activity
      • Network Interface output queue length
        • Indicates when the network interface is overwhelmed with activity
      • Physical disk avg. disk queue length
        • Indicates when the disk is overwhelmed with activity
  • SQL server profiler
    • Most performance issues can be improved by tuning poor performing queries. SQL server profiler helps identify where to focus query tuning efforts.
    • SQL server profiler and extended events profiler provide the ability to trace the activity that is occurring in SQLEngine.
      • SQL Server Profiler can also trace Analysis Services activity.
      • Traces captured by both can be used to diagnose performance issues and replay workloads.
      • Profiler can be configured to filter events; evaluating filters has a small performance cost
        • However, filtering minimizes the overall overhead incurred during tracing by limiting the events captured
    • Captured events are displayed graphically and can be captured to a file or DB table.
      • creating trace files is an efficient way to store traces using minimal system resources.
    • A category is a group of related event classes. An event class contains all the data columns that can be reported by an event.
      • An event is the occurrence of an action within an instance of the SQLEngine and is defined by its attributes.
    • Trace columns represent data that can be captured when an event occurs. Data columns contain the attributes of events
    • Runs on a machine and communicates to the SQLEngine using SQL Trace procedures.
    • SQL Trace is a feature that runs in SQLEngine and runs traces using sys.SPs.
      • It is lightweight
      • Trace must be defined using system stored procedure calls
      • runs inside the SQLEngine
      • Traces do not automatically restart if the instance is restarted. To auto-start a trace, script it and put it in a startup procedure.
      • SQL has a default trace that starts with the instance.
        • This includes SQL server events
        • objects being created or dropped.
      • Traces can be opened using profiler and imported into a table.
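      • A minimal sketch of defining and starting a server-side trace with the SQL Trace stored procedures (the output path is an example and must exist on the server):

declare @TraceID int, @maxfilesize bigint = 5, @on bit = 1;

-- create a stopped trace that writes to a .trc file
exec sp_trace_create @TraceID output, 0, N'C:\Traces\MyTrace', @maxfilesize, NULL;

-- capture SQL:BatchCompleted (event 12) with TextData (1), SPID (12) and Duration (13)
exec sp_trace_setevent @TraceID, 12, 1, @on;
exec sp_trace_setevent @TraceID, 12, 12, @on;
exec sp_trace_setevent @TraceID, 12, 13, @on;

-- start the trace; status 0 stops it and status 2 closes and deletes the definition
exec sp_trace_setstatus @TraceID, 1;
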
  • Monitoring SQL server
    • Activity monitor is a graphical representation of monitoring information that uses DMVs and DMFs
      • It provides an overview of the system performance and information about processes, waits, I/O resources and expensive queries.
    • Dynamic management views(DMVs) and dynamic management functions(DMFs) provide insight into current and historical SQL server activity.
      • They return server state information that can be used to monitor the health of an instance, diagnose performance and tune performance.
      • There are 2 types of DMVs and DMFs.
        • Server-scoped DMVs and DMFs
          • They require View server state permission on the server
        • Database-scoped DMVs and DMFs
          • they require view Database state permission on the database
      • common DMO categories
        • sys.dm_exec_*: info about connections, sessions, requests and query execution
        • sys.dm_os_*: info about OS related stuff
        • sys.dm_tran_*: access to transaction management
        • sys.dm_io_*: provide information on I/O processes
        • sys.dm_db_*: provide database-scoped information.
    • Querying DMVs
      • DMV/Fs cannot be referenced using one part names.
      • They exist in the sys schema and are named in the convention dm_*
      • They expose in-memory structures, which are cleared if the instance is restarted.
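      • For example, a server-scoped query that pairs sys.dm_exec_requests with the statement text (requires view server state permission):

select r.session_id, r.status, r.wait_type, t.text as sql_text
from sys.dm_exec_requests as r
cross apply sys.dm_exec_sql_text(r.sql_handle) as t;
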
    • There are 2 types of DMOs
      • Objects that provide real-time state information
      • Objects that provide recent historical information
  • Data collector
    • Provides a high-level view into aggregated trends over time and information to identify issues.
    • Allows you capture historical data and access detailed data when investigating issues.
    • Provides a central point for data collection across DB servers and applications.
    • Stores collected data in a relational database known as the management data warehouse(MDW).
      • Data can be collected constantly or on a schedule
    • When you configure the MDW, some system data collection sets are created.
    • Data collection sets
      • These define the data that needs to be collected, how often it is uploaded to the warehouse and how long it is retained in the warehouse.
      • They can query DMOs to retrieve detailed information about the operation of the system
      • Can retrieve performance counters that provide metrics about the performance of both SQL server and the entire server
      • Can capture SQL trace events.
    • Data collector topology: The data collection system consists of 3 components
      • The data collector: a set of jobs that run on the local server and collect data from DMOs, performance counters and SQL trace events and upload to the MDW
      • The MDW which can consolidate data from multiple instances
      • 3 standard reports and several subreports to analyze data in the MDW. You can also write your own reports.
    • You create an MDW to
      • gain access to reports that combine information for all server instances in your enterprise
      • offload the storage of collected data and reporting on it from production servers
      • persist data otherwise held only in DMOs
    • configuring the data collector: this involves 2 actions
      • Configure the MDW
      • configure the instances to upload the data
        • Lower volume data is sent immediately to the warehouse
        • Higher-volume information is cached locally and uploaded to the warehouse using SSIS
    • Monitoring data collector
      • configuration and log information for the data collector is written to msdb and is implemented by SPs and SSIS logging features.
    • Analyzing collected performance data
      • After performance data has been collected from a number of instances and been consolidated into a MDW, the data can be analyzed using reports built with SSRS.
      • There are 3 default reports
        • Server activity history
          • contains CPU, memory, disk and network I/O, SQL Server waits, and SQL Server activity
        • Disk Usage History
          • Trends and details of disk and file usage
        • Query statistics history
          • Most expensive queries ranked by CPU, duration, reads and writes
  • Identifying Bottlenecks
    • Monitoring Disk Usage
      • There are 2 counters for disk activity on the OS level
        • PhysicalDisk:% Disk Time
        • PhysicalDisk:Avg. Disk Queue length
      • There are 2 counters for disk activity by SQL server
        • SQL server: Buffer manager: Page reads/sec
        • SQL server: Buffer manager: page writes/sec
        • If the values for these counters approach the capacity limit of the hardware I/O subsystem, tune your application or DB to reduce I/O operations (better indexing, normalization).
    • Monitoring Memory usage
      • There are 2 counters for memory at the OS level
        • Memory: Available Bytes
        • Memory: Pages/Sec
          • this indicates the number of pages that were either retrieved from disk due to hard page faults or written to disk to free space in the working set due to page faults.
      • Counters for memory usage by SQL server
        • Process: working set
        • SQL Server: Buffer Manager: Buffer Cache Hit Ratio
        • SQL Server: Buffer Manager: Total pages
        • SQL Server: Memory Manager: Total Server Memory(KB)
    • Monitoring CPU Usage
      • CPU usage can be monitored using
        • Single Processor systems
          • processor: %Processor time counter
            • Measures the amount of time the CPU spends executing a thread that isn’t idle.
        • Multi processor systems
          • System: % Total processor time counter
            • This takes the average for all the processors. To monitor a single processor, use the instances of the counter
Indexes and concurrency
  • Implementing and maintaining indexes
    • A balanced tree (B-tree) index structure starts at the root node.
      • Beneath the root node are 1…n intermediate level pages
      • each intermediate-level page can have additional intermediate-level pages.
      • the last level comprises the leaf-level pages.
      • Query optimizer knows at most how many page accesses are needed to reach a leaf page.
      • When a page is full, it is split to maintain this balanced condition.
    • A heap is an unorganized collection of data pages; new data is appended to the last data page when it is presented for storage.
      • This makes it terrible for data retrieval but wonderful for data loading.
    • Clustered and non-clustered indexes can be partitioned to provide better performance.
    • Types of Indexes
      • Clustered
        • A structure in which the actual table data resides in the leaf-level pages of the index. The data is physically sorted in the order of the clustering key.
        • The clustered index key(CIK) is the row identifier for the table data.
          • It is added to any non-clustered indexes created on the table.
          • The key should have the following specifications.
            • Unique
            • Narrow
            • Static: Not updated.
      • Non-clustered
        • It contains only keys and no actual data.
        • In addition to the key columns, SQL server stores the clustering key for that row in the table.
          • This allows the storage engine to use a non-clustered index to reduce the number of rows returned and then use the clustering key to go to the clustered index to retrieve the data requested.
        • They can be applied to heaps or clustered table. The difference is in how rows are identified
          • In clustered tables, the non-clustered index contains the clustered index key to locate the rows in the base table
          • In heaps, The non clustered index uses a row identifier in place of the CIK.
      • Covering indexes
        • A non-clustered index that contains all the information needed to satisfy a query
        • Allow adding information to the non-clustered index data pages and avoid having to look up the row in the clustered index.
      • Filtered indexes
        • Basically an index with a where clause. This allows you limit the amount of data in the B-Tree.
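        • A minimal sketch (the table and column names are hypothetical):

create nonclustered index IX_Orders_ShippedDate
on dbo.Orders (ShippedDate)
where ShippedDate is not null;
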
      • Primary XML and secondary XML indexes
        • To provide reasonable performance when querying XML data.
        • If not created, the entire XML document must be loaded into an XML DOM and searched
      • Spatial Indexes
        • Improve performance when querying data of geography and geometry data types
      • Full-text indexes
        • Enable the efficient searching of text data, often using words that occur near each other in the text
      • Columnstore indexes
        • The index uses a column oriented storage that is very efficient for data warehouse operations.
        • Doesn’t use the B-tree structure.
    • Designing indexes for efficient retrieval
      • When constructing multicolumn indexes, list equality columns (columns used in a where clause with =) before inequality columns. Within each group, list the most selective columns first.
      • When joining, index the column used to join to the other table.
      • Clustered indexes
        • A clustered index is the most efficient index for retrieval with
          • Queries that return a majority of the columns in the table
          • Queries that return a single row based on the CIK
          • Queries that return range-based data
      • Non-clustered indexes
        • A Non-clustered index is the most efficient index for retrieval with
          • Queries that return few rows
          • Queries that can be covered by the index.
        • They have a maximum limitation of 16 key columns and 900 bytes
      • Covering indexes
        • Consider using include columns for columns in the select list that aren’t already part of the key columns
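        • A minimal sketch (hypothetical names): the key column supports the seek and the included columns cover the select list

create nonclustered index IX_Orders_CustomerID
on dbo.Orders (CustomerID)
include (OrderDate, TotalDue);
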
      • Filtered indexes
        • A filtered index is the most efficient index for retrieval
          • When combined with sparse columns to locate specific non-null rows
          • When queries that are a small subset of the rows are selected often
      • Primary XML and secondary XML indexes
        • The first XML index must be a primary XML index type.
          • The primary XML index builds a nodes table.
          • For each element in the XML document, there will be one row in the nodes table. If you have many rows, each of which contain an XML document with many nodes, the size of the XML index can be quite large.
        • There are 3 secondary XML index types
          • Path
            • Consists of a regular non-clustered index.
            • supports operations such as .exist()
          • Property
            • Consists of a regular non-clustered index
            • Useful for searching for nodes that have particular values
          • Value
            • Consists of a regular non-clustered index
            • The reverse of the path index
            • most useful for searching for values when the exact path is not known
      • Full-text indexes and semantic searches
        • Special indexes on char, text, image, xml and varbinary data types.
        • These are token-based functional indexes that store information about significant words and where they occur within the text of a column or table in the DB
        • Only one is allowed per table, and it contains all the tokens for all columns that were indexed for that table.
          • This allows you to efficiently search for significant words across all the indexed columns of the table
        • Allows you to carry out inflectional searches.
      • Columnstore indexes
        • They are a non B-tree type of index.
        • used to provide performance improvements for data warehouse type queries that perform aggregations over large data sets.
          • These queries mostly use a few of a table's columns, so column-oriented index structures serve them efficiently
        • They optimize storage by compressing repeating data values
    • Statistics
      • Statistics are objects in the database that contain statistical information about how the values in a column are distributed. Query optimizer uses this information to determine the cardinality of the rows.
        • cardinality refers to number of rows that exist for a given value.
      • When you create an index, SQL Server creates a histogram of the distribution of values in the leading column of the index.
      • Statistics can be either a sampling of the rows or all rows, based on how they were created or updated and the number of rows in the table.
    • Updating statistics
      • statistics are manually updated using update statistics

update statistics @TableName with @Option

        • @option can be
          • fullscan[[, index][,Columns]]
          • Sample [@NoOfRows Rows]|[@percent percent]
    • Index internals
      • There are DMVs that can be used to retrieve information about indexes. e.g. sys.dm_db_index_physical_stats
    • Creating Indexes

create @Indextype Index @Indexname

On @Table(@ColumnList);
Go

    • Fill factors
      • An option when creating indexes
      • They determine how much free space will be left on the leaf data pages.
      • Default is 0(Leave no free space)
        • When considering change, determine the data access pattern of the table.
          • If the CIK is an identity column, this increases the amount of space used by the system.
          • If the CIK is a random value, it might limit page splits.
      • To specify a fill factor, rebuild the index

alter index all on @TableName @DefragmentationType

with (fillfactor=@PercentageToFillPageTo)

        • @PercentageToFillPageTo can be 80 if you want the pages to fill to 80%
        • @DefragmentationType is either rebuild or reorganize
    • Fragmentation
      • Fragmentation in index pages is when the logical ordering(based on the key value) doesn’t match the physical ordering within the data file.
      • Fragmentation information is available in DMV sys.dm_db_index_physical_stats
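      • For example, fragmentation for all indexes in the current database (LIMITED mode keeps the scan cheap):

select object_name(ips.object_id) as TableName, i.name as IndexName,
       ips.avg_fragmentation_in_percent, ips.page_count
from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') as ips
join sys.indexes as i
  on i.object_id = ips.object_id and i.index_id = ips.index_id
order by ips.avg_fragmentation_in_percent desc;
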
      • Removing index fragmentation.
        • When fragmentation is bad, rebuild or reorganize the index.
          • The common guidelines are
          • Indexes should be rebuilt when fragmentation is above 30%
          • Indexes should be reorganized when fragmentation is between 10 and 30%
          • Reorganize
            • Reorganize doesn’t update the statistics histogram for the object.
            • Reorganize is online and doesn’t hold any blocking locks
            • Requires 8Kb free space in the DB
            • Single threaded
          • Rebuild
            • Rebuild is used when heavier fragmentation levels exist or when out of order extents are limiting read-ahead operations.
            • Rebuild requires long-term locks and can cause blocking unless performed online
            • Builds a second copy of the index and requires space to hold this copy.
            • This can use parallel operations
            • Requires Enterprise edition to perform online Rebuild
              • Online rebuild cannot be performed on indexes with XML and spatial data types.
    • Tracking missing indexes
      • Each time query optimizer compiles a query, SQLEngine tracks the 500 latest missing indexes that the optimizer could have used to satisfy the query more efficiently had they existed.
      • It keeps track of queries run since the last SQLEngine restart
      • They are calculated using 4 DMOs
        • sys.dm_db_missing_index_columns
          • lists the columns that would comprise the missing index
        • sys.dm_db_missing_index_details
          • list detail on the columns used for equality, inequality and included columns
        • sys.dm_db_missing_index_groups
          • A table that joins the above 2
        • sys.dm_db_missing_index_group_stats
          • Provides the metrics on a group of missing indexes
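      • For example, a common way to join these DMOs and rank the suggestions (they are hints, not finished designs):

select d.statement as TableName, d.equality_columns, d.inequality_columns,
       d.included_columns, s.user_seeks, s.avg_user_impact
from sys.dm_db_missing_index_details as d
join sys.dm_db_missing_index_groups as g
  on g.index_handle = d.index_handle
join sys.dm_db_missing_index_group_stats as s
  on s.group_handle = g.index_group_handle
order by s.user_seeks * s.avg_user_impact desc;
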
    • Reviewing unused indexes
      • Each time a table is InsUpDeled, all indexes for the table must be maintained to reflect those changes
      • If the cost of maintaining an index is greater than the benefit it provides to queries, disable or drop the index.
        • See the cost using this script.
SELECT
OBJECT_SCHEMA_NAME(i.OBJECT_ID) AS [SchemaName],
OBJECT_NAME(i.OBJECT_ID) AS [ObjectName],
i.name AS [IndexName],
i.type_desc AS [IndexType],
ius.user_updates AS [UserUpdates],
ius.last_user_update AS [LastUserUpdate]
FROM sys.indexes i
INNER JOIN sys.dm_db_index_usage_stats ius
ON ius.OBJECT_ID = i.OBJECT_ID AND ius.index_id = i.index_id
WHERE OBJECTPROPERTY(i.OBJECT_ID, 'IsUserTable') = 1 -- User Indexes
AND NOT(user_seeks > 0 OR user_scans > 0 or user_lookups > 0)
AND i.is_primary_key = 0
AND i.is_unique = 0
ORDER BY ius.user_updates DESC, SchemaName, ObjectName, IndexName
  • Identifying and resolving concurrency problems
    • Concurrency refers to the ability of SQL Server to service requests from the maximum number of users at the same time
    • Defining transactions and transaction scope
      • A transaction is a way to group, or batch, a series of updates to data so that either all updates succeed and are committed at once or, if one fails, none are committed and the entire transaction is rolled back.
        • To preserve the consistency of the data involved in the batch of updates, locks are held on the rows that are being updated.
        • The length of time that these locks are held is known as the transaction scope.
        • Concurrency can be improved by ensuring transaction scopes are as short as possible.
      • SQLEngine will lock the minimum number of resources needed to accomplish its goal(multi granular locking).
        • To do this, it must announce its intent to take locks to higher levels.
          • To modify a row, an exclusive lock must be obtained on the row, and intent exclusive locks on the page and higher levels.
          • This is done for lock management. If another transaction wants a lock, SQLEngine checks locks from the highest level down to determine if the resource is locked.
      • Lock modes
        • Shared: For read operations that do not change or update data. E.g. select
        • Update: Used on resources that might be updated
        • Exclusive: Used for data modification operations. Ensures multiple updates cannot be made to the same resource at the same time. E.g. insert, update, delete
        • Intent: Used to establish a lock hierarchy. The types are
          • intent shared(IS),
          • intent exclusive(IX)
          • shared with intent exclusive(SIX).
        • Schema: Used when an operation dependent on the schema of a table is executing. The types are schema modification(Sch-M) and schema stability(Sch-S)
        • Bulk update(BU): Used when bulk copying data into a table and the tablock hint is specified
        • Key-range: Protects the range of rows read by a query when using a serializable transaction isolation level. Ensures other transactions cannot insert rows that would qualify for the queries of the serializable transaction if the queries were run again.

      • Lock management
        • SQLEngine uses lock management to increase concurrency.
        • Query optimizer determines the scope of locks to be taken on a database at the beginning of the transaction.
        • SQLEngine can dynamically escalate the lock type during execution.
      • Transaction Isolation levels
        • There are 4 ANSI isolation levels
          • Serializable
          • Repeatable read
          • Read committed(default)
          • Read uncommitted
        • SQL server introduces 2 custom isolation level that use row versioning in tempdb
          • Snapshot Isolation
            • This improves concurrency and has the same characteristics as the serializable isolation level.
            • It specifies that the data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction.
              • Data modification made by other transactions after the start of the current transaction are not visible to statements running in the current transaction.
            • Configured on the database and requires you to set the transaction isolation level at the beginning of the transaction
Alter Database @DBName set allow_snapshot_isolation on

Set transaction isolation level snapshot;

          • Read Committed snapshot isolation
            • Similar to the snapshot isolation level in that it allows readers to read data without being blocked by locks acquired by writers.
            • Doesn't require the user to ask for it explicitly; it is applied automatically whenever the database option read_committed_snapshot is on and the transaction runs at the read committed isolation level.
        • The characteristics of the isolation levels
          • Dirty read
            • When a transaction is allowed to read data from a row that has been modified by another running transaction that has not yet committed
          • Non-repeatable read:
            • Occurs when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads
          • Phantom read:
            • Occurs when in the course of a transaction, two identical queries are executed, and the collection of rows returned by the second query is different from that returned by the first.

          • To query the isolation level of a transaction, use

DBCC useroptions

        • Lock duration is based on the lock owner, which can be
          • A transaction
          • A server-side cursor
          • The session

        • Monitoring locks
          • DMVs provide you information about locks and transactions currently in effect.
          • sys.dm_tran_locks displays the locks being held by open transactions
          • sys.dm_tran_database_transactions displays the open transactions at the DB level
          • sys.dm_tran_session_transactions correlates associated sessions and transactions
      • Blocking
        • Blocking occurs when transactions are waiting for a resource to be freed due to another transaction’s lock on it.
          • Can be identified when
            • request_status in the sys.dm_tran_locks DMV shows a value of WAIT.
            • If the blocking_session_id of sys.dm_exec_requests or sys.dm_tran_locks > 0, it indicates blocking
        • Blocking can also occur due to compiling or recompiling query execution plans.
          • Can be identified when
            • wait_resource in sys.dm_exec_requests contains compile
            • resource_subtype in sys.dm_tran_locks shows a compile-related value
        • Blocks can be avoided by
          • keeping transaction scope as short as possible
          • Not allowing user interactions during transactions.
          • Practicing proper indexing to limit locks acquired and reduce the chance of blocking
          • Elevating the transaction isolation level above the default only for a good reason
          • Examining the T-SQL code to see whether developers have added locking hints, index hints, or join hints
        • Using AlwaysOn Replicas to improve concurrency
          • Reporting queries can reduce concurrency in OLTP systems as they often carry out aggregation which blocks InsUpDel operations.
          • A new feature allows you offload reporting load to a read-only replica created with always on.
          • To do this, add a clause to the connection string ApplicationIntent=ReadOnly.
        • Deadlocks
          • Deadlocks occur when 2 or more tasks permanently block each other because each has a lock on a resource that the other tasks are trying to lock.
          • Resources that could cause a block are
            • Locks.
            • Worker Threads.
            • Memory.
            • Parallel query execution-related resources.
            • Multiple active result sets(MARS) resources.
          • Types of deadlocks
            • conversion deadlocks
              • They occur when two connections both own shared locks on a resource and they both try to convert their locks to exclusive locks.
            • writer-writer deadlocks
              • caused by resource ordering
              • Fixed by ensuring that all code accesses resources in the same order
            • Reader-writer deadlocks
              • Occur when both connections hold exclusive locks on different resources, and each tries to request a shared lock on the other connection’s resource
            • Cascading deadlocks
              • They occur when there are more than 2 connections involved in the deadlock. Even after a deadlock victim has been chosen, another deadlock still exists.
              • Deadlock victims will continually be chosen until all the deadlocks in the chain have been resolved.
          • SQL server has a thread just for looking for deadlocks
          • Deadlocks raise a 1205 error upon which one of the processes in the deadlock is terminated and rolled back.
            • The victim is usually the one with the fewest transaction log records
              • This behavior can be overridden by setting deadlock_priority on the connection
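              • For example, a low-priority session can volunteer itself as the deadlock victim:

set deadlock_priority low;
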
          • Capturing deadlock information
            • Using trace flags
              • Turn on trace flag 1204
                • Writes deadlock information to the SQL Server error log in a text format
              • Turn on trace flag 1222
                • Writes the deadlock information to the SQL Server error log in XML format
              • The trace flags can be turned on using DBCC Traceon

DBCC Traceon(@TraceFlagNo, -1)

                • This allows you to capture deadlock information in the SQL Server error log without the overhead of a Profiler trace
            • For more detailed information, use
              • SQL trace to capture
                • lock:deadlock chain
                • deadlock graph
                • lock:deadlock
              • Extended events
                • xml_deadlock_report events in the system_health session
      • Activity monitor
        • Used to monitor the health of an instance.
        • It has 4 views to show real time metrics
          • %Processor time
          • Waiting tasks
          • Database I/O
          • Batch requests per second
      • Diagnosing bottlenecks
        • Microsoft recommends using the waits and queues methodology to solve bottlenecks

          • First determine what is causing the bottleneck
            • Examine sys.dm_os_wait_stats so you can see the cumulative wait times for every possible wait type since SQL server was restarted.
              • To reset it, run DBCC sqlperf('sys.dm_os_wait_stats', clear);
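              • For example, the top cumulative waits since the last restart or reset:

select top 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
from sys.dm_os_wait_stats
order by wait_time_ms desc;
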
            • Examining the wait type can reflect a bottleneck on the same resource.
            • A task is a unit of work scheduled for execution.
              • It can be in 1 of 3 stages
                • Running: currently executing
                • Runnable: It has the resources it needs and is waiting for a scheduler to execute
                • Suspended: Waiting for a resource
              • You monitor task status using sys.dm_os_waiting_tasks
                • The wait_duration_ms column shows how long the task has been waiting
                • The wait_type shows what the wait type is for that session

SELECT wt.session_id, wt.wait_duration_ms, wt.wait_type, wt.blocking_session_id, s.status
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_sessions AS s ON wt.session_id = s.session_id;
      • Using reports for performance analysis
        • The standard reports provide a lot of information in graphical format.
        • Custom reports can be added for additional reports access.
        • SQL server 2012 performance dashboard report shows a simple method of seeing current and historical reports for a database.
      • The Kill process
        • Used to remove a session that is not responsive or is the head blocker in a blocker chain.
        • Can be done using
          • Activity monitor: Select the process and click kill
          • TSQL:

Kill { @SessionID } [with statusonly]

            • the With statusonly option shows you the status of the kill operation as the DB is moved to a transactionally consistent state.
            • Kill cannot be used to kill your own user processes
SQL server agent, backup and restore
 
Managing SQL server Agent
  • Server Agent enables you to automate execution of scheduled or event-driven administrative jobs.
    • A job is a collection of tasks(Job steps).
    • The Server agent account determines how the system interacts with resources(Both internal and external).
      • If The agent account is on the domain, it needs
        • Logon as a service right
        • membership of the SQL Server Agent access security group, so jobs owned by domain users that aren't in the host's local admin group don't fail
        • membership of the sysadmin fixed server role
      • To support SQL server agent proxies, the service account must have additional permissions
        • This allows users that aren't members of the sysadmin role to create and run job steps under a proxy
      • If you want to support multi-server job processing, you must add the account used for agent to the TargetServersRole in msdb on the master server.
    • Server agent security
      • If a user is not a member of the sysadmin role, he must be a member of one of the following msdb roles to use server agent.
        • SQLAgentUserRole: Users have permissions only on the local jobs and job schedules they own. They cannot run multi-server jobs
        • SQLAgentReaderRole: Members have SQLAgentUserRole permissions and can view the properties of all jobs and job schedules including multi-server jobs
        • SQLAgentOperatorRole: Members can execute, enable, disable, stop or start all jobs and can delete the job history for any local job.
    • SQL server agent mail
      • Database mail is used to transmit alerts as email messages.
    • SQL server agent error log
      • This stores warnings and error messages from server agent.
    • SQL server can be configured to generate alerts when it detects events and trigger conditions
      • Generating an alert involves
        • Creating an alert name
        • Specifying an alert trigger
        • Specifying an alert action
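      • A minimal sketch using the msdb alert procedures (the alert and operator names are hypothetical; assumes the operator already exists):

exec msdb.dbo.sp_add_alert
    @name = N'Severity 17 errors',
    @severity = 17,
    @include_event_description_in = 1;  -- 1 = include the event text in the e-mail

exec msdb.dbo.sp_add_notification
    @alert_name = N'Severity 17 errors',
    @operator_name = N'DBA Team',
    @notification_method = 1;           -- 1 = e-mail
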
    • Jobs are used to automate routine tasks.
      • They can be scheduled to run at particular times
      • They can be monitored with the job activity monitor.
      • Notifications can be sent to operators from SQL agent.
      • In multi server environments
        • Server are either targets or master servers.
          • Target servers report to master servers
        • Server agent’s service account should be on the domain.
        • To link a target to a master, on the target, run sp_msx_enlist, and to delist, run sp_msx_defect
  • Backup strategy
    • SQL server supports the following backup types
      • Full DB backups
        • Includes all DB objects, system tables and data.
        • Transactions that occur during the backup are also recorded
        • Allow you to completely restore a database.

Backup database @DBName to @Destination

      • Differential backups
        • Backs up data altered since the last full backup regardless of whether other differential backups have been taken.
        • Require less time than full DB backups
        • Transactions that occur during the backup are also recorded

Backup database @DBName to @Destination with differential

      • Transaction log backups
        • This records the changes that have occurred since the previous transaction log backup, then truncates the transaction log, removing transactions that have been committed or rolled back.
        • They represent the state of the transaction log at the time the backup starts.
        • They are incremental backups and must be restored in sequence of being taken.

Backup log @DBName to @Destination

      • File and filegroup backups
        • This backs up individual DB files and filegroups rather than performing a full DB backup.
        • Lets you back up very large databases in manageable parts
        • The transaction log must also be backed up
        • The truncate log on checkpoint must be disabled.
        • Useful for very large DBs
      • Copy only backups
        • Functionally the same as full DB or transaction log backups.
        • They do not affect the backup sequence
          • It doesn’t truncate the transaction log and is ignored by differential backups as a reference point.
    • You can use multiple backup types in a backup strategy.
    • Backup compression reduces the amount of space required to store a backup
      • CPU load is increased during the compression.
      • Resource governor can be used to limit CPU usage.
      • Compressed and uncompressed backups cannot be stored in the same media set.
    • Recovery models
      • A recovery model determines how you backup a database and the strategy you implement to meet your recovery point objective(RPO).
        • RPOs enable you to measure the amount of data that might be lost in the event of failure. It is often the figure referenced in SLAs
        • RTOs (Recovery Time Objective) measure the maximum amount of time it can take to restore data to the RPO.
      • There are 3 recovery models
        • Simple: No transaction log backups are taken. It is possible to recover only to the most recent DB backup.
        • Full: This enables you take transaction log backups and full backups.
          • This allows you to recover to the point of the last full or transaction log backup.
          • Allow point-in-time recovery
        • Bulk-logged: a recovery model that minimizes transaction log activity during bulk operations. You cannot perform point-in-time recovery using this model
    • System DB backups
      • The preferred backup strategy for each system DB is
        • master: Frequently
        • model: only after modifications
        • msdb: Only after you make changes to jobs and alerts
        • tempdb: Cannot be backed up
        • Resource DB:
          • Holds copies of all system objects
          • Cannot be backed up directly. Require a File-based backup. Should be backed up after patching or upgrading SQL server
        • distribution (created when you configure replication distribution):
          • Used when the instance is the distributor in a replication topology.
          • Should be backed up when the publisher and subscription DBs are backed up.
    • Backing up replicated DBs
      • When backing up the replicated DBs, you should also back up the system DBs used by the replication process.
      • The following strategies are required
        • At the publisher: master, msdb and publication
        • At the distributor: master, msdb and distribution
        • At the subscriber: master, msdb and subscription
      • When backing up databases that use snapshot and transactional replication, set the sync with the backup option on the publication and distribution databases.
        • This ensures transactions in the publication DB log are not truncated until they have been backed up at the distribution DB
    • For mirrored DBs, you can only back up the principal DB. On it, you cannot use the backup log with norecovery option
    • Backing up alwaysOn replicas
      • This depends on the automated backup preference setting applied at the availability group level
      • There are 4 options for availability group backup
        • Prefer secondary
          • Default
          • Backups occur only on secondary except when only the primary is online in which case it uses the primary
        • Secondary only
          • Backups occur only on secondary and do not occur if the primary is the only replica online
        • Primary
          • Backups should be taken on the primary replica.
          • This allows differential backups to be taken
        • Any Replica
          • Backups can occur on any replica in the availability group.
      • The backup preference can be changed using the alter availability group statement with the automated_backup_preference option

Alter availability group @AvailabilityGroupName set (automated_backup_preference = secondary_only)

      • Log backups and copy only backups are supported on replicas
        • You can only backup when the replica is in synchronized or synchronizing state
    • Database checkpoints
      • Checkpoints write all modified (dirty) pages held in memory to disk, then record that information in the transaction log.
      • checkpoints can be configured on the instance using sp_configure with the recovery interval (min) option.
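      • For example (recovery interval is an advanced option, so advanced options must be shown first):

exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'recovery interval (min)', 5;  -- target recovery interval of 5 minutes
reconfigure;
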
    • Backup devices
      • To add a backup device, use the sp_addumpdevice SP

exec sp_addumpdevice 'disk', '@BackupDeviceName', '@Directory\file.bak'

        • This allows you to back up to the device directly by specifying @BackupDeviceName

backup database @DBName to @BackupDeviceName

    • Backing up media sets
      • A media set can include one or more files or tapes.
      • A media set can contain up to 32 backup devices.
    • Backup events are written to the Application event log.
      • Successful backups are recorded in the msdb.dbo.backupset table, which can be queried as sketched below.
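      • A quick sketch of checking backup history (the columns shown exist in msdb.dbo.backupset):

select database_name, type, backup_start_date, backup_finish_date
from msdb.dbo.backupset
order by backup_finish_date desc;
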
  • Restoring DBs
    • Restore options depend on the backups that exist
      • Full backups only: allow you to restore to the time the last full backup was taken
      • Full and differential backups: allow you to restore to the point when the last differential backup was taken
        • Restore the full backup with the norecovery option
        • restore the differential with the recovery option.
      • Full and transactional log backups.
        • Backup the current transaction log with the no_truncate option.
        • restore the most recent full backup with the Norecovery option
        • restore transaction logs in sequence with the NoRecovery option till you get to the newest
        • Restore the newest transaction log with the recovery option
      • Full, differential and transaction logs
        • Backup the current transaction log with the no_truncate option.
        • restore the most recent full backup with the Norecovery option
        • restore the differential with the Norecovery option.
        • restore transaction logs in sequence with the NoRecovery option till you get to the newest
        • Restore the newest transaction log with the recovery option
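      • A hedged sketch of the full + differential + log sequence above (all file names and paths are placeholders):

backup log [@DBName] to disk = N'@Directory\tail.trn' with no_truncate;          -- tail-log backup
restore database [@DBName] from disk = N'@Directory\full.bak' with norecovery;
restore database [@DBName] from disk = N'@Directory\diff.bak' with norecovery;
restore log [@DBName] from disk = N'@Directory\log1.trn' with norecovery;        -- repeat per log backup
restore log [@DBName] from disk = N'@Directory\tail.trn' with recovery;          -- recover at the end
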
    • File restores
      • If a DB contains multiple files or filegroups, and uses the full or bulk-logged recovery model, you can perform a file restore rather than a full restore.
      • Done if the volume that hosts a particular file fails or a file becomes corrupt while other files used by the DB do not have any problems.
      • You can perform an online restore only on the Enterprise edition.
        • It enables you to restore files in secondary filegroups as long as the primary filegroup is online.
    • Page restores
      • Used to repair a small number of damaged pages in an otherwise healthy database.
      • If a DB file constantly suffers from damaged pages, the volume that hosts it is probably faulty and should be investigated.
      • Possible under the following conditions
        • The Database uses full or bulk-logged recovery
        • Only on read-write filegroups
        • Only database pages (not the transaction log) can be restored.

Restore database @DBName page = '@IDOfFileHostingPage:PageNo1, @IDOfFileHostingPage:PageNo2' from @BackupFile with norecovery

        • Restore any logs taken after the backup of BackupFile with Norecovery option
        • Perform a log backup
        • Restore the log with the recovery option.
      • AlwaysOn Availability Groups and DB mirrors support automatic page repair. If an error occurs on the primary, it broadcasts a request to the secondary and retrieves the page.
        • To verify the status of automatic page repair
          • Always on: Query the sys.dm_hadr_auto_page_repair DMV
          • DB mirrors: Query the sys.dm_db_mirroring_auto_page_repair DMV
    • Restoring a DB protected with transparent Data encryption
      • You can restore a DB With TDE as long as you have the certificate and private keys.
        • Include backing up the certificate and private keys in your backup scheme.
      • Before performing a restore on a system without the certificate, use the CREATE CERTIFICATE statement with the FROM FILE option (sketched below).
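      • A minimal sketch of re-creating the certificate before the restore (the certificate name, paths and password are placeholders; run in master):

create certificate TDECert
from file = N'@Directory\TDECert.cer'
with private key (
     file = N'@Directory\TDECert.pvk',
     decryption by password = N'@StrongPassword');
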
    • Restoring system DBs
      • Model and msdb are restored like user DBs
      • Tempdb doesn’t need to be restored
      • Master:
        • Start the instance in single-user mode (specify -m in the startup options of the service)
        • Run the restore DB command

restore database master from @Backup with replace

        • If it is unsuccessful, rebuild the system databases and restore recent copies of all databases

setup.exe /Q /Action=RebuildDatabase /InstanceName=MSSQLServer

    • Restoring replicated DBs
      • When you restore a publication, distribution or subscription DB, you must also restore the master and msdb Databases.
        • This ensures all the DBs are consistent as regards replication configuration and settings
    • Checking DB status
      • Run the following against master to see the DB status

select databasepropertyex('databasename', 'status');

        • Possible statuses are
          • Online
          • Offline
          • Restoring
          • Recovering
          • Suspect: A recovery operation failed and the DB might be corrupt
          • Emergency: The DB is in a read-only state and accessible only to members of the sysadmin role.
      • You can check a DB's user access mode (multi_user, single_user, or restricted_user) by running the following against master

select databasepropertyex('databasename', 'useraccess')


SQL server querying revision

  • Relational Model is the foundation of SQL. A relation has a head (columns) and a body (rows). The relational model is based on 2 mathematical models.
    • Set Theory
      • A set is a collection M into a whole of definite, distinct objects m( the "elements" of M).
    • Predicate logic
      • A predicate is an expression that, when attributed to some object, makes a proposition that is true, false, or unknown (in T-SQL).
  • Logical Query processing:
    • The phases are From -> Where -> Group By -> Having -> Select -> Order By. The output of each phase is a virtual table which serves as the input to the next phase.
      • From: where the tables you want to query are specified, along with table operators (Apply, Join).
      • Where: Filters rows based on the predicate specified. Only rows for which the predicate evaluates to true are returned.
      • Group By: defines a group for each distinct combination of values in the grouped elements from the input table. All expressions processed in subsequent phases must guarantee a single value per group.
      • Having: Filters based on a predicate but is only evaluated per group and filters groups as a whole. Only groups for which the predicate evaluates to true are returned
      • Select: If a query is a grouped query, each group is represented by a single row in the result. It has two steps
        • Evaluating the expressions in the select list and producing the result attributes. Aliasing is also carried out here. Aliases defined here aren’t visible to phases processed earlier and to items in the same select list.
        • using the Distinct clause to remove duplicates.
      • Order By: returns the result in presentation order according to the expressions that appear in the order by list. The output of this clause isn’t relational
Using the from and select Clauses
  • From Clause
    • Where you indicate the tables that you want to query
    • where you can apply table operators like joins to input tables.
  • Select Clause
    • Evaluates expressions that define the attributes in the query’s result, assigning them aliases if needed
    • Eliminates duplicate rows using the DISTINCT clause(if specified) .
Data Types and Built in functions
  • Data Types
    • They do the following to objects
      • constrain the data that is supported: It has a domain of supported values and will not allow values outside that domain.
      • Encapsulates behavior that operate on it: If data is stored in the wrong data type, it will need to be converted before it can be used. e.g. + is addition for Int, but concatenation for char.
      • Assign the size of elements: The bigger the type used, the more storage it uses and the more I/O is required to process queries that include it.
      • Exposes the data through operators.
    • For fixed types, updates will not require the row to physically expand (no data shifting), as they always use the storage for the indicated size; hence they offer better update performance. Variable types use the storage for what you enter plus offset information; they let you save a lot of storage and offer better read performance.
    • Choosing a data type for keys
      • When choosing keys, we either go with Intelligent keys(those generated by the application we interact with), or surrogate keys(attributes chosen solely for the purpose of being used as keys). To generate surrogate keys, the most common techniques are
        • Identity column property: Generates keys in an attribute(of a numeric type) with a scale of 0.
        • Sequence object: An independent object from which you can obtain sequence values. It is not tied to any column and can be generated before use.
        • Non-sequential GUID(Globally Unique ID): type is UniqueIdentifier. generated with function NEWID(). GUID are globally unique across space and time.
        • Sequential GUID: This is used to generate sequential GUIDs using NEWSEQUENTIALID(), which can only be used in a DEFAULT constraint on a column.
      • When choosing surrogate keys, the size matters for performance(read performance) reasons as larger attribute(data size) consume more storage. If indexes are built on the key, the storage requirement is increased.
    • Date and time functions
      • To get current date and time
        • Getdate(): T-SQL only, returns datetime
        • current_timestamp: standard, returns datetime
        • Sysdatetime(): T-SQL, returns datetime2
        • sysdatetimeoffset(): T-SQL, returns datetimeoffset
      • To get the date and time component of a datetime attribute
        • cast the attribute as either date or time.
        • datepart() is used to extract a part of a datetime attribute and return it as an integer. e.g. datepart(month, current_timestamp). can also calculate the week number e.g. datepart(week, current_timestamp). The day(), month() and year() are abbreviations of datepart().
        • datename() is similar to datepart() but it returns a character string. Mostly used to extract the month from a datetime attribute.e.g. datename(month, current_timestamp)
      • To build a datetime attribute from components.
        • Datefromparts(): build a date from a numeric type.
        • datetime2fromparts()
        • datetimefromparts()
        • datetimeoffsetfromparts()
        • smalldatetimefromparts()
        • timefromparts()
      • To get the last day of a month, use EOMONTH().
      • To add to a datetime attribute use DateAdd(). You must select which of the components you want to add to. format is Dateadd(component, difference, datetime). e.g. dateadd(day, -100, current_timestamp).
      • To calculate the difference between 2 datetime attributes use datediff(). You must select which of the components you want to calculate the difference for e.g. datediff(day, current_timestamp, dateadd(week, -100, current_timestamp))
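      • A small sketch pulling the date functions above together (the literal dates are arbitrary examples):

select
     eomonth(current_timestamp)                   as LastDayOfMonth,
     dateadd(day, -100, current_timestamp)        as HundredDaysAgo,
     datediff(day, '20140101', current_timestamp) as DaysSince20140101,
     datepart(week, current_timestamp)            as WeekNumber,
     datefromparts(2014, 3, 19)                   as BuiltDate;
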
    • Character Functions
      • Concatenation: You can use either + or concat
        • +: If any of the inputs is null, the operator returns a null.
        • Concat(): If any of the inputs is null, it substitutes the null with an empty string.
      • Substring extraction and position
        • Substring(): To extract a substring from a string. format is substring(string, start_position, no_of_characters). If no_of_characters >Len(string), it just returns the chars available till the end of string(Len-Start_position).
        • Left() and Right(): To extract a specified number of characters from the left or right of a string. Format is left(string, No_of_Characters).
        • Charindex(): Returns the position of the first occurrence of a string provided as the first argument in the string provided as the second. format is charindex(string_to_search_for, String_to_be_searched).
        • PatIndex(): returns the position of the first occurrence of a pattern provided as the first argument in the string provided as the second. Use % for any string, _ for a character and [] to represent a range of characters. Format is PatIndex(pattern, String_to_be_searched). e.g. Patindex('%[0-9]%', @String_to_be_searched).
      • String Length
        • LEN(): to measure the number of characters in a string. Ignores trailing spaces.
        • Datalength(): To measure the length of a string in bytes. Doesn’t ignore spaces at the end
      • String Alteration
        • Replace(): Replace in a string provided as the first argument, all occurrences of the string provided as the second argument, with the string provided as the third argument. Format is replace(@String_to_be_used, @string_to_replace, @replacement_string)
        • Replicate(): Replicate an input string a requested number of times. Format is Replicate(@string, @number_of_times).
        • Stuff(): From the string input as the first argument, it starts at the position input as the second argument, removing the number of characters input as the third argument and replacing them with the string input as the 4th argument. Format is Stuff(@String, @Start_position, @Number_of_characters_to_remove, @replacement_string).
      • String Formatting
        • Upper(): returns the input string in upper case.
        • Lower(): returns the input string in lower case.
        • Ltrim(): returns the input string with the leading spaces removed.
        • RTrim(): returns the input string with the ending spaces removed.
        • Format(): Formats an input value based on a format string. Can also format for a culture given as the third parameter. E.g. format(1759, '00000000') returns 00001759.
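      • A small sketch of the character functions above (all literals are arbitrary examples; the comments show the expected results):

select
     concat('SQL', null, 'Server')       as Concatenated,   -- 'SQLServer' (null treated as empty string)
     substring('SQL Server', 1, 3)       as Sub,            -- 'SQL'
     charindex('Server', 'SQL Server')   as Pos,            -- 5
     replace('SQL 2008', '2008', '2012') as Replaced,       -- 'SQL 2012'
     stuff('SQL XXXX', 5, 4, 'Server')   as Stuffed,        -- 'SQL Server'
     format(1759, '00000000')            as Formatted;      -- '00001759'
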
    • Case Expressions and related Function
      • Case returns a specified value depending on the input expression or predicates. It is either simple or searched.
        • Simple when the Case simply compares an input expression to multiple when expressions and returns a result expression. Format is
                                             Case @input_expression
                                             When @value then @result_expression
                                             When @value2 then @result_expression2
                                             Else @result_expression3
                                             End as @Alias.
        • Searched when case uses predicates in the when clauses to determine the result expression to return.
                                             Case
                                             When @input_expression @operator(e.g. >) @value(e.g. 10) then @result_expression
                                             When @input_expression @operator(e.g. >) @value2(e.g. 10) then @result_expression2
                                             When @input_expression @operator(e.g. >) @value3(e.g. 10) then @result_expression3
                                             Else @result_expression4
                                             End as @Alias
      • Coalesce(): Standard. Accepts a list of expressions and returns the first that is not null. The type of the result is determined by the highest-precedence type among the inputs. When used in a select…into statement, the target column is nullable unless all inputs are non-nullable.
      • NullIf(): Standard. Accepts 2 expressions and returns null if they’re equal. Returns the first if they’re not.
      • IsNull(): Accepts 2 expressions and returns the first that is not null. The type of the result is determined by the first input. It sets the attribute as not null when used in a select…into statement.
      • IIF(): returns one value if the input predicate is true and another value otherwise. Format is IIF(@predicate, @true_result, @false_result)
      • Choose(): allows you to provide a position and a list of expressions and returns the expression in the indicated position. Format is Choose(@position, @expression1, @expression2, @expression3, @expression4)
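      • A small sketch of the CASE-related functions above (arbitrary example values; comments show the results):

select
     coalesce(null, null, 'third')  as FirstNonNull,   -- 'third'
     nullif(10, 10)                 as NullWhenEqual,  -- NULL
     isnull(null, 'fallback')       as IsNullResult,   -- 'fallback'
     iif(2 > 1, 'yes', 'no')        as IifResult,      -- 'yes'
     choose(2, 'a', 'b', 'c')       as ChooseResult;   -- 'b'
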
Filtering Data with Predicates
  • T-SQL supports 3 query clauses that enable you filter data based on predicates.
    • Where
    • On
    • Having
  • Predicates, 3-valued logic and search arguments.
    • Predicates follow 2 Valued logic when an attribute is not nullable. An expression can either evaluate to true or false only.
    • Predicates follow 3 valued logic when an attribute is nullable. An expression can either evaluate to true, false or unknown(when comparing with a null mark). for a statement to capture all tuples in 3 valued logic, you must take care of null. You can do that using the following predicates
      • Is Null: returns true when the tested operand is null.
      • Is not null: returns true when the tested operand is not null.
    • Filtering queries have a performance side: with the right filters, the DBMS can choose to use indexes without doing a full table scan.
    • Search arguments
      • This refers to how predicates are arranged to maximize the efficiency of queries by making them use indexes.
      • Predicates that are Search arguments are in the format @attribute @operator @value OR @value @operator @attribute. e.g. col1 = 10.
      • Applying manipulation to the filtered column prevents the predicate from being a search argument, e.g. where coalesce(col1, '3') = coalesce(@value, '3') ensures that indexes defined on col1 aren’t used. Instead, consider where col1 = @value or (col1 is null and @value is null). A sketch of the difference follows.
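      • A hedged illustration of the point above, assuming a hypothetical Orders table with an index on orderdate:

-- Not a search argument: the function applied to the column blocks an index seek
select orderid from Orders where year(orderdate) = 2014;

-- Search argument: the column is left unmanipulated, so an index on orderdate can be used
select orderid from Orders where orderdate >= '20140101' and orderdate < '20150101';
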
    • Manipulating predicates
      • Combining predicates: you can use AND and OR to combine predicates.
      • Negating predicates: negating true = False and vice versa. However negating null still yields null. To include Null in a result, use the is null operator.
      • When combining operators, there’s precedence: highest is NOT, then AND, then OR. You can use parentheses to control the evaluation order.
    • Filtering Character Data
      • When filtering character data, ensure the literal in the filter is of the same data type as the attribute being filtered. Otherwise, SQL will have to carry out implicit conversion to get them to the same data type.
      • Like Predicate: this supports filtering string data based on pattern matching using wildcards. The wildcards are
        • %: Any string including an empty one
        • _: A single character
        • []: A character list. E.G. ‘[AC]%’
        • [-]: A range of characters. E.G. ‘[A-C]%’
        • [^]: A single character that is not in the list or range. E.G. ‘[^A-CG]%’
      • To search for a character that is itself a wildcard, define an escape character using the ESCAPE keyword and place it before that character, e.g. Like '!_%' Escape '!' searches for strings starting with an underscore.
    • Filtering Date and Time Data
      • For date literals,
        • it’s best to express them in the language-neutral pattern YearMonthDay i.e. ‘20140319’. This is neutral for all date and time types. It ensures the date is interpreted the same way even if a user’s language sets the dateformat to something else.
        • You can also use CONVERT or PARSE to specify the format
          • Convert(): this lets you specify the style number as the conversion style
          • Parse(): this lets you specify a culture name
      • If only date is specified during an insert into a Datetime attribute, the time is assumed to be midnight.
      • When specifying a range, use the form where @date >= @firstdate and @date < @lastdate+1 (the day after the last date). If a time value is specified in the end literal, SQL Server can round it up and cause wrong rows to be included in the result.
Sorting Data
  • A table represents a relation and as it is a set, no order is guaranteed for its elements. A query that doesn’t have an explicit instruction to return the rows in a particular order doesn’t guarantee the order of rows in the result.
  • Order BY clause
    • This is the only way to guarantee the rows are returned in an order.
    • The default sorting order is ascending. otherwise, specify desc.
    • Determinism: When ordering isn’t done by a unique attribute, the order of the query’s output isn’t deterministic. The order is deterministic when the order by list is unique. Deterministic queries are the only ones where the result set is repeatable. You can specify ordering by a unique attribute for the ordering to be deterministic. format is
                         Order by @Attribute1 desc, @attribute2
    • You can order by elements you’re not selecting as long as the rows would normally have been allowed there. This doesn’t apply when distinct is specified, as the result rows don’t map to the source rows in a one-to-one manner but in a one-to-many manner; an attribute value can represent a number of tuples in this scenario.
    • It is executed after select so column aliases in the select clause are visible here.
    • Nulls are sorted before non-nulls in ascending order. To sort them last, use COALESCE to replace them with a high ‘cheat’ value (an alternative using a CASE flag is sketched below).
    • Indexes help SQL Server avoid actually sorting the data to satisfy an order by request.
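    • A small sketch of sorting nulls last, assuming a hypothetical Customers table with a nullable region column (a CASE flag avoids inventing a cheat value):

select custid, region
from Customers
order by case when region is null then 1 else 0 end, region;
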
Filtering Data with TOP and OFFSET-FETCH
  • Top: Used to filter a requested number or percent of rows based on indicated ordering. Specified in select clause along with the requested number of rows in parentheses. The ordering is dependent on the order by clause. Format is
                    select top (@No_Of_Rows) @attribute1, @attribute2, @attribute3
                     select top (@percent_of_rows) percent @attribute1, @attribute2, @attribute3
    • You can use a variable as the @no_of_rows.
    • To ensure the result is deterministic, you can either
      • Use the with ties option: This ensures that all ties are included. It usually results in more rows being included than were requested. Format is
                                   Select top(@No_of_rows) with ties @attribute1, @attribute2, @attribute3
      • Guarantee determinism by adding a tiebreaker: add an element to the order by that makes the result deterministic.
  • Offset-Fetch: A standard alternative to TOP. Can be used to skip a certain number of rows.
    • Appears right after the order by clause which must be present.
    • When offset-fetch is used, the order by clause plays an additional role of telling the offset-fetch which rows it needs to filter.
    • Offset
      • You must specify the offset clause, which lets you state the number of rows you want to skip (or 0 if you don’t want to skip any). Format is
                                   Offset @no_of_rows_to_skip rows
      • When used without Fetch, it skips @no_of_rows_to_skip and fetches all the remaining rows.
    • Fetch
      • The Fetch clause is optional and is used to set how many rows you want to filter. You must specify first or next before @number_of_rows_to_retrieve. Format is
                                   Offset @no_of_rows_to_skip rows fetch first @number_of_rows_to_retrieve rows only
    • You can use expressions as input for the @no_of_rows_to_* value.
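    • A hedged paging sketch that uses expressions for the row counts (Orders is a hypothetical table):

declare @PageSize int = 25, @PageNum int = 3;

select orderid, orderdate
from Orders
order by orderdate desc, orderid desc
offset (@PageNum - 1) * @PageSize rows fetch next @PageSize rows only;
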
Using Joins
  • Joins are used to query data from different tables and match the rows that need to be related. There are 3 types.
  • Cross Joins
    • Performs a cartesian product of two tables, multiplying between the tables and yielding a row for each combination of rows from both sides. If @RelationA has @X tuples and @RelationB has @Y tuples, then their cross join will have @X * @Y tuples.
    •  The format is
                         Select A.@Attribute1, B.@Attribute2
                         From @relationA as A
                              cross join @relationB as B
                          where A.@Attribute1 <= @Value
  • Inner Joins
    • Allows matching of rows from two tables based on a predicate(usually comparing a primary key to a foreign key on another table). Rows for which the predicate evaluates to false or unknown are discarded.
    • The join’s matching predicate is specified in the ON clause.
    • If an attribute of either table will be ambiguous, then aliasing must be used.
    • Filtering predicates can be specified in either the Where or On clauses . It is the equivalent of concatenating the predicates with an and operator.
    • Format is
                         Select A.@Attribute1, B.@Attribute2
                         From @relationA as A
                              inner join @relationB as B
                                   On A.@Attribute1 = B.@Attribute2
                          where A.@Attribute1 <= @Value
  • Outer Joins
    • Allows you to preserve all rows from one or both sides of the join, irrespective of whether there are matching rows on the other side based on the ON predicate. e.g. a left outer join preserves the @first_relation, inserting nulls for the attributes that aren’t matched in the @second_relation.
    • The ON and Where clauses play different roles. 
      • ON: This plays a strictly matching role as it purely matches rows on the preserved side with the other side, inserting nulls into the attributes on the other side if it doesn’t find a match. This isn’t final with respect to the preserved side of the join.
      • Where: This carries out filtering. This is final with respect to the preserved side of the join.
    • The full outer join is an outer join that preserves both sides.
    • The Format is 
                         Select A.@Attribute1, B.@Attribute2
                         From @relationA as A
                              left outer join @relationB as B
                                   On A.@Attribute1 = B.@Attribute2
                          where A.@Attribute1 <= @Value
  • Multi Join Queries
    • A join takes place between 2 tables(physical or virtual) at a time from left to right. The result of one is used as the left input of the next.
    • You can use parentheses to separate joins into different logical phases. e.g.
                         Select A.@Attribute1, B.@Attribute2
                         From @relationC as C
                              Left outer join
                                   (
                                         @relationA as A
                                         inner join @relationB as B
                                              on A.@Attribute1 = B.@Attribute2
                                   )
                               On C.@Attribute1 = B.@Attribute2
Using Subqueries, Table Expressions and Apply
  • Nesting queries allows you refer to one query’s result from another, removing the need to store the result of one query in a variable before another query can access it.
  • Exists Predicate
    • This accepts a subquery as input and returns true when the subquery returns at least one row and false otherwise.
    • Usually used in correlated subqueries with WHERE for filtering. Filters out rows where attributes from the outer query do not exist in the result set of the subquery specified in the exists predicate.
    • format is

                         Select A.@Attribute1

                         from @Relation1 as A
                          where EXISTS
                              (    select B.@Attribute1
                                   From @Relation2 as B
                                   where B.@Attribute1 <> @Value 
                                        and A.@Attribute1 = B.@Attribute1)
  • In Predicate
    • This accepts an expression and a subquery and returns true when the expression equals at least one of the values returned by the subquery.
    • Usually used with WHERE for filtering. Filters out rows where the specified attribute from the outer query does not appear in the result set of the subquery specified in the IN predicate.
    • Format is
                Select A.@Attribute1, B.@Attribute2
                from @Relation1 as A
                where A.@Attribute1 IN
                (    select B.@Attribute1
                     From @Relation2 as B
                     where B.@Attribute1 <> @Value 
                          and A.@Attribute1 = B.@Attribute1)
  • Subqueries can be categorized in terms of their containment and result
    • Containment:
      • Self-contained: Subquery that has no dependency on the outer query. They can be run independently of the outer query. E.g.
                                   Select  A.@Attribute1, A.@Attribute2
                                   From @relationA as A
                                   where A.@Attribute3 >= (select Min(@Attribute4) from @relationB);
      • Correlated: Subqueries where the inner query has a reference to a column from the table in the outer query. They cannot be run independently of the outer query.

                                   Select  A.@Attribute1, A.@Attribute2

                                   From @relationA as A
                                   where A.@Attribute3 >= 
                                       (select Min(@Attribute4) 
                                        from @relationB as B
                                        where B.@Attribute2 = A.@Attribute2 );
    • Result:
      • Scalar: Subqueries that return a single value and can be used where a single-valued expression is expected. If the subquery returns more than one value, the query fails; if it returns an empty set, the result is converted to a NULL. e.g.
                                   Select  A.@Attribute1, A.@Attribute2
                                   From @relationA as A
                                   where A.@Attribute3 >= (select Min(@Attribute4) from @relationB);                                   
      • Multi-valued: Subqueries that return multiple values in a single column. Can be used where a multi-valued result is expected(e.g. when using the IN predicate)

                                   Select A.@Attribute1, A.@Attribute2

                                   From @relationA as A
                                   where A.@Attribute1 not IN
                                        (select @Attribute4 from @relationB where @Attribute3=@Literal)

      • Table-valued: See Table Expressions below for details.
  • Table Expressions: These are named queries. You write an inner query that returns a relational result set, name it, and query it from an outer query. Their result must be relational. There are 4 types
    • Derived Tables: These are visible only to the statement that defines them.
      • The derived table’s query is defined in parentheses in the from Clause of the outer query and the inner query is named.
      • Due to the all-at-once property of SQL, the name assigned a derived table isn’t visible to other elements in the same query processing phase as the derived table’s definition. To join multiple instances, you must duplicate the code.
      • Format is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal
                         Select A.@Attribute1, A.@Attribute2
                         From (@Inner_Query) as A                         
    • Common Table Expressions(CTEs): These are visible only to the statement that defines them.
      • As the CTE name is assigned in another query processing phase, the same CTE can be referenced several times.
      • Format is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal

                                   With @CTE_NAME

                                   As
                                   (
                                        @Inner_Query
                                   )
                                   Select @Attribute1, @Attribute2, @Attribute3
                                   From @CTE_NAME
      • To define multiple CTEs, use the format

                                   With @CTE_NAME1

                                   As
                                   (
                                        @Inner_Query
                                   ),
                                    @CTE_NAME2
                                   AS
                                   (
                                        @Inner_Query2
                                   )
      • CTEs have a recursive form. 
        • The body of the recursive query has 2 or more queries separated by a union all operator( for a union all, the attribute type for both sides has to be the same).
        • One of the Queries in the CTE serves as the anchor member and returns a valid relational result. It is invoked only once.
        • The other query (the recursive member) has a reference to the CTE name. It is repeatedly invoked until it returns an empty set.
        • In each iteration, the reference to the CTE name from the recursive member represents the previous result set. 
        • The reference to the CTE name from the outer query represents the unified results of the invocation of the anchor member and all invocations of the recursive member.
        • Format is below, where @Anchor_Member = 'select @Attribute1, @Attribute2 from @relation1 as @Alias1 where @Attribute5 = @literal' and @Recursive_Member = 'select @Attribute1, @Attribute2 from @CTE_NAME1 join @relation1 as @Alias2 on @Alias1.@Attribute2 = @Alias2.@Attribute2'

                                              With @CTE_NAME1
                                              As
                                              (
                                                  @Anchor_Member
                                                  Union ALL
                                                  @recursive_Member
                                             )
                                             @Outer_query
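
        • A concrete hedged sketch of the recursive form, assuming a hypothetical Employees(empid, mgrid, empname) table and walking the hierarchy down from employee 1:

with EmpsCTE as
(
     select empid, mgrid, empname           -- anchor member: invoked once
     from Employees
     where empid = 1
     union all
     select E.empid, E.mgrid, E.empname     -- recursive member: the previous level's direct reports
     from EmpsCTE as M
          inner join Employees as E
               on E.mgrid = M.empid
)
select empid, mgrid, empname
from EmpsCTE;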

    • Views: These are database objects, hence they’re re-usable. They don’t accept parameters.
      • Queries can select from them as if they were tables, but only the view definition is stored; the result set is not persisted in the database
      • Format for creation is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal 

                                   Create view @View_name

                                   As
                                   @inner_Query;
                                   GO 
      • Format for execution is

                                   Select @Attribute1, @Attribute2

                                   From @View_Name                  

    • Inline table-valued functions: These are database objects, hence they’re re-usable. 
      • They support input parameters.
      • Format for creation is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal

                                   Create Function @Function_Name (@Variable as Int) returns table

                                   As
                                   Return
                                   @inner_Query    
                                   GO
      • Format for execution is

                                   Select @Attribute1 

                                   From @Function_name(@variable);
  • Apply
    • The apply operator allows you apply a table expression given to it as the right input to each row from a table expression given to it as the left input.
    • The inner query in the right table expression can have a reference to an element from the left table. The right table expression is evaluated separately for each left row.
    • Sometimes used to replace cursors.
    • There are 2 types
      • Cross apply
        • The right table expression is applied to each row from the left input.
        • If the right table expression returns an empty set for a left row, the left row isn’t returned.
        • The Format is 

                                             Select A.@Attribute1, A.@Attribute2, C.@Attribute2

                                             From @RelationA as A
                                             Cross Apply ( 
                                                  Select B.@Attribute2, B.@Attribute3
                                                  From @RelationB as B
                                                  Where A.@Attribute3 = B.@Attribute4) as C

      • Outer apply
        • The right table expression is applied to each row from the left input
        • if the right table expression returns an empty set for a left row, the left row is still returned, with nulls as placeholders for the right side’s attributes.
        • The format is 
                                             Select A.@Attribute1, A.@Attribute2, C.@Attribute2
                                             From @RelationA as A
                                             outer Apply ( 
                                                  Select B.@Attribute2, B.@Attribute3
                                                  From @RelationB as B
                                                  Where A.@Attribute3 = B.@Attribute4) as C
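    • A common hedged use of cross apply: the three most recent orders per customer, assuming hypothetical Customers and Orders tables:

select C.custid, A.orderid, A.orderdate
from Customers as C
     cross apply (
          select top (3) O.orderid, O.orderdate
          from Orders as O
          where O.custid = C.custid
          order by O.orderdate desc) as A;
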
Set Operators
  • Set Operators operate on two result sets of queries, comparing rows between the results. Depending on the result of the comparison and the set operator used, the operator determines whether to return the row or not.
  • They have the following guidelines
    • As complete rows are matched between the result sets, the number of columns in the queries must be the same and the column types of corresponding columns need to be compatible(implicitly convertible).
    • They consider 2 nulls as equal
    • The individual queries must return a relation.
    • the column names of result columns are determined by the first query
    • They have an implied distinct. They do not return duplicate rows
  • The format for set operators is
               @Query1
               @Set_operator
               @Query2
  • They are
    • Intersect: It returns only rows common to both result sets.
    • Union: It unifies the results of 2 input queries. 
    • Except: It performs a set difference between the first result set and the second returning only rows present in the first but not in the second. The order of the input queries matters
    • Union all: A multiset operator. It unifies the results of the 2 input queries while keeping duplicates (the implied distinct for set operators doesn’t apply).
      • Format is

                                   @Query1

                                   Union all
                                   @Query2

  • Order of precedence: Intersect is evaluated before Union and Except; Union and Except have equal precedence and are evaluated left to right. This can be controlled with parentheses.
Data Analysis Operations
  • Data analysis functions are functions applied to a set of rows and return a single value.
  • They are either Group functions or window functions.
Grouped Queries
  • Grouping allows you to define groups in your data, and then perform data analysis computations(e.g. aggregate functions) per group.
  • Grouped queries include the group function or the group by clause.
    • Aggregate Functions: They arrange all rows in one group if no group by clause is specified and perform computation for the group.
      • Examples include MIN, MAX, SUM, COUNT, AVG
      • You can work with distinct occurrences by specifying the DISTINCT option
        • Format is

                                                       select @Attribute1, count(distinct @Attribute2)

                                                       from @Relation1
                                                        Group by @Attribute1;
    • Group By clause: 
      • Allows grouping the rows based on a specified grouping set of expressions specified in the group by clause.
      • All elements with the same values in the grouping set’s attributes are grouped, and subsequent computations give a value per group.
        • Count(): Specifying count(@Attribute1) instead of count(*) can return a smaller count, because count on a column ignores nulls while count(*) counts all rows.
      • The grouping set can be made of multiple elements(Attributes).
      • Nulls in the grouping attributes are considered equal (grouped together).
      • Having Clause
        • This uses a predicate that evaluates per group unlike WHERE that’s evaluated per row.
      • Format is 

                                    Select @Attribute1, @Attribute2, @Attribute3, @Attribute4

                                   From @Relation1
                                   Group By @Attribute2, @Attribute1

      • All clauses that operate after a group by clause must be guaranteed to return a single result  value per group. 
        • You can refer to elements from the group by list as they return one distinct value per group.
        • If you want to refer to Attributes from the underlying table that aren’t in the group by list, 
          • apply an aggregate function to them to show they’re guaranteed to return one value per group 
          • add the attribute to the group by clause.
          • Carry out the grouping in a CTE and join the CTE back to the base table, adding the elements during the join (sketched below).
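            • A hedged sketch of the CTE approach, assuming hypothetical Orders and Customers tables:

with Grouped as
(
     select custid, count(*) as numorders   -- grouping done once here
     from Orders
     group by custid
)
select C.custid, C.companyname, G.numorders -- companyname comes from the join, not the group by
from Grouped as G
     inner join Customers as C
          on C.custid = G.custid;
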
      • Multiple Grouping Sets
        • They allow you group data in more than one way.
        • They’re specified in the group by clause.
        • There are three ways to define them
          • Grouping Sets: 
            • Used to list all grouping sets you want to define in the query.
            • the empty grouping set, written as ( ), defines one group with all rows for computing total (grand-total) aggregates.
            • Format is 

                                                                 Select @Attribute1, @Attribute2, @AggregateFunction(@Attribute3) as @Attribute4

                                                                 From @Relation1
                                                                 Group by grouping sets
                                                                 (
                                                                  (@Attribute1, @Attribute2), --GroupingSet1
                                                                  (@Attribute2),              --GroupingSet2
                                                                  (     )                     --GroupingSet3
                                                                 );

          • Rollup:
            • Used when there’s a hierarchy formed by the input elements and only a subset of the possible grouping sets is interesting.
            • When the elements form a hierarchy, you use rollup to avoid computing unnecessary aggregates.
            • E.G. rollup(@Attribute1, @Attribute2, @Attribute3) groups by
              • (@Attribute1, @Attribute2, @Attribute3)
              • (@Attribute1, @Attribute2)
              • (@Attribute1)
              • (                 )
          • Cube:
            • Accepts a list of expressions as inputs and defines all possible grouping sets that can be generated from the inputs including the empty grouping set.
            • E.G. Group by Cube(@Attribute1, @Attribute2) groups by
              • (@Attribute1, @Attribute2)
              • (@Attribute1)
              • (@Attribute2)
              • (                 )
        • Nulls are used as placeholders in rows where an attribute isn’t part of the grouping set.
          • If all grouped attributes are defined as not null, you can identify the rows that are associated with a single grouping set, based on a unique combination of nulls and non-nulls in those columns. 
          • However, if the attributes allow nulls, you need to use either  grouping() or grouping_ID()to differentiate.
            • Grouping(): Accepts a single element as input and returns 0 when the element is part of the grouping set and 1 when it is not.
              • You can identify a grouping set by looking for 0s in the elements that are part of the grouping sets and 1 in the others.
              • Format is
Select @Attribute1,grouping(@Attribute1), @Attribute2, grouping(@Attribute2), @Attribute3, grouping(@Attribute3), @AggregateFunction(*)
from @Relation1
Group by rollup(@Attribute1, @Attribute2, @Attribute3);
            • Grouping_ID(): 
              • This accepts the list of grouped columns as inputs and returns an integer representing a bitmap. Calculate the bitmap to determine which of them(having 0) is part of the grouping set and which(having 1) is not part of the grouping set.
              • The result integer is the sum of the values representing elements that are not part of the grouping set because their bits are turned on
              • Format is

select @Attribute1, @Attribute2, @Attribute3, grouping_id(@Attribute1, @Attribute2, @Attribute3), @Aggregate_Function(*)

From @relation1
group by rollup(@Attribute1, @Attribute2, @Attribute3)
Pivoting and Unpivoting data: done using table operators.
  • Pivoting: A technique of grouping and aggregating data, transitioning it from a state of rows to a state of columns. 
    • You use the Pivot operator.
    • In all pivots, you need to specify
      • Grouping element(on rows): what do you want to see on the rows
      • Spreading element( on columns): what do you want to see on the columns
      • Aggregation element( Data): what do you want to see where columns and rows intersect.
    • It is recommended to use a CTE because the grouping element is determined by elimination(i.e. the elements in the table not specified in the pivot statement). If the base table is queried directly, other elements will be included as grouping elements. In the example below, grouping is done by the @grouping_column. 
    • The @spreading_column and @aggregation_column cannot be the results of an expression. They must be column names from the CTE. However, an expression can be used and aliased in the CTE.
    • Count(*) cannot be used as the aggregate function. If you want a count, define a dummy column in the CTE and reference that.
    • Only one aggregate function can be used.
    • The IN clause only accepts a list of static values. you can use dynamic SQL if you don’t know the values during query definition.
    • @select list = @Grouping_column, @distinct_spreading_value1, @Distinct_Spreading_Value2, @Distinct_Spreading_Value3
    • Format is 

                         With Pivotdata as 

                         (
                              Select @grouping_column, @spreading_column, @aggregation_column
                              from @Relation1
                         )
                         Select @Select_list     
                         from Pivotdata
                              Pivot (@Aggregate_function (@aggregation_column)
                                   For @spreading_column in (@distinct_Spreading_Value1, @Distinct_Spreading_Value2, @Distinct_Spreading_Value3))as @Alias;
  • UnPivoting: A technique for rotating the input data from a state of columns to a state of rows.
    • It takes a set of source columns and rotates them to multiple rows, generating 2 target columns: One to hold the source column names, another to hold the values.
    • You need to specify 3 elements.
      • The set of source columns you’re unpivoting
      • The name to assign to the target names column
      • The name to assign to the target values column
    • The unpivot operator filters out rows in the @value_column with NULLS as they represent inapplicable cases.
    • The types of all columns that you’re unpivoting must be the same
    • The Format is
                         Select @column_list_alias, @name_columns_alias, @values_column_alias
                         From @Relation1
                          Unpivot(@values_column for @name_columns in (@source_Column1, @Source_column2, @Source_column3)) as @Alias;
Window Functions
  • Window Functions enable you to perform data analysis computations. You define the set of rows per function using the over clause, and the function returns one result value per underlying row.
  • They use aggregate functions, which are applied to a window of rows defined by the over clause.
  • Over Clause
    • Used to define a window of rows for aggregate functions. The window is defined in respect to the current row. 
    • an empty over clause Over() represents the underlying query’s result set.
    • You can use window partition clause to restrict the window.
    • Format is

                         Select @Attribute1, @Attribute2, @Attribute3, @Aggregate_function(@Attribute1) over(partition by @Attribute2)

                         From @Relation1;
  • Framing: 
    • Used to define ordering within the partition using a window order clause, and based on that order, confining a frame of rows(rows or range) between 2 window frame extents/delimiters.
    • You define the Windows frame extents using a window frame clause, which requires a window order clause to be present.
    • Window frame extents/delimiters define the range of rows within the partition that the aggregate function operates on.
    • Window frame units
      • Rows: 
        • Allows 3 Window Frame Extents/delimiters.
          • Unbounded preceding/following: the beginning or end of the partition
          • Current Row: representing the current row. assumed as default if none is specified.
          • @number Rows preceding/following: @number rows before or after the current respectively
          • Format is

Select @Attribute1, @Attribute2, @Attribute3, @Attribute4, @Aggregate_function(@Attribute2) over( partition by @attribute1 order by @Attribute1, @Attribute3 rows between @delimiter1 and @delimiter2)

      • Range
        • Allows 2 Window Frame Extents/delimiters.
          • Unbounded preceding/following: the beginning or end of the partition
          • Current Row: representing the current row. assumed as default if none is specified.
Window Ranking
  • Allow you to rank rows within a partition based on an ordering specified in the over clause.
  • There are 4 ranking functions
    • Row_number(): computes a unique sequential integer within the window partition based on the window ordering.
      • Isn’t deterministic if the ordering isn’t unique.
    • Rank(): Assigns the same ranking value to all rows with the same ordering value. If N rows share a ranking value, the next distinct ordering value gets the previous rank + N, so gaps are possible.
    • Dense_rank(): Assigns the same ranking value to all rows with the same ordering value. However, it ranks only distinct values, so if N rows share a ranking value, the next distinct ordering value gets the previous rank + 1 (no gaps).
    • Ntile(@NumOfTiles): Allows you to arrange the rows within the partition in a requested number of equally sized tiles, based on the specified ordering.
      • Isn’t deterministic if the ordering isn’t unique.
  • Format is
                Select @Attribute1, @Attribute2, @Window_Ranking_function() over (partition by @Attribute2 order by @Attribute1)
Window offset Functions
  • Return an element from a single row that is in a given offset from the current row in the window partition, or from the first or last row in the window frame
  • There are 4 of them
    • LAG & LEAD : 
      • Rely on an offset with respect to the current row. They support window partition and ordering clauses, but not window frame clauses. LAG returns an element from the row in the current partition that is a requested number of rows before the current row (based on the window ordering); LEAD returns an element from the row that is a requested number of rows after the current row.
      • Both let you specify the offset and a default value to return when there is no row at that offset (instead of null).
      • Format is

                                   Select @Attribute1, @Attribute2, @Attribute3, Lag(@Attribute1, @offset, @value_If_Attribute_is_null)over (partition by @Attribute2 order by @Attribute3)

                                   From @Relation1
                                   

    • FIRST_VALUE & LAST_VALUE:
      • Return the requested value from the first or last row in the window frame respectively.
      • Support window partition, order and frame clauses (format sketched below).
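      • A format sketch in the notes’ placeholder style; the explicit frames matter because the default frame ends at the current row, so Last_Value would otherwise not see the end of the partition:

Select @Attribute1, @Attribute2,
     First_Value(@Attribute2) over(partition by @Attribute1 order by @Attribute3
          rows between unbounded preceding and current row) as @FirstVal,
     Last_Value(@Attribute2) over(partition by @Attribute1 order by @Attribute3
          rows between current row and unbounded following) as @LastVal
From @Relation1
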
Querying Full-Text Data

Full-Text search 
  • Full text search allows approximate searches. It requires full-text indexes inside full-text catalogs. Full-text indexes can be created over character, text, image, XML and varbinary(max) type attributes.
  • Full-Text search components
    • You can store documents in binary or XML columns and use full-text queries on them.
    • Word breakers and stemmers perform linguistic analysis on all full-text data. they’re language specific.
      • Word breakers identify individual words(tokens) which are inserted into full-text indexes in compressed format.
      • Stemmers generate inflectional forms of a word based on the rules of a language.
    • You can specify a different language for each full-text indexed column.
    • Stopwords: used to define words that shouldn’t be indexed because they’re irrelevant.
    • Full-text queries can also search for synonyms as defined in the thesaurus file. The following can be configured in thesaurus files
      • diacritics_sensitive: set to 0 if a language is accent insensitive and 1 if it is accent sensitive.
      • Expansion: to add expansion words( words to search for in addition e.g. student, scholar) for a word.
      • replacement: To define replacement words or terms for a specific word or term e.g. Windows 2008 instead of win2k8.
    • To search a document’s properties, you can specify a search property list and add properties to the list. Document properties have predefined GUIDs and integer IDs.
    • Full Text Catalogs and indexes
      • Full text indexes are stored in full-text catalogs which are virtual objects that contain full-text indexes. 
      • Statistical semantics are an option for full-text indexes
        • They create additional key phrase and document similarity indexes that are part of statistical semantic indexing. Full-text search uses the key phrases to identify and index documents that are similar.
        • They allow you to query the meaning of the documents
        • Allow you to create text-mining solutions.
        • You need to install the semanticsDB
Using the contains and freetext predicates
Contains and freetext are predicates for limiting the result set of a query using full-text indexes
  • Contains 
    • Allows you to search for specific words and phrases.
    • You can add custom weights to words you are searching for (format sketched below).
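    • A format sketch in the notes’ placeholder style; ISABOUT ... WEIGHT attaches the custom weights (the weights mainly influence the rank computed by CONTAINSTABLE):

Select @Attribute1 From @Relation1 Where contains(@FTColumn, N'"data warehouse"');    -- plain word/phrase search

Select @Attribute1 From @Relation1 Where contains(@FTColumn, N'ISABOUT(performance weight(0.8), tuning weight(0.2))');
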
  • FreeText
    • Less specific than contains.
    • Searches for the values that match the meaning of a phrase not just exact words. 
    • The engine performs word breaking of the search phrase, stems and identifies expansions and replacements for the words in the searched term with words from the thesaurus.
    • Format is 

                          Select ... From ... Where freetext(FTcolumn, 'SearchWord1')

Full-text search functions
  • Containstable() and freetexttable() return 2 columns: Key and Rank.
    • The key column is the unique key from the index used in the key index clause of the create fulltext index statement. Rank returns a value between 0 and 1000 which states how well a row matches your search criteria; the higher the rank, the higher the relevance.
    • Freetexttable supports fewer parameters.
    • Format is 
                          CONTAINSTABLE ( table , { column_name | ( column_list ) | * } , ' <contains_search_condition> ' [ , LANGUAGE language_term ] [ , top_n_by_rank ] )
    • Search conditions are the same as for a contains predicate.
  • Semantic search functions
    • There are 3 functions for this
      • SemanticKeyPhraseTable
        • returns a table with key phrases associated with the full-text indexed column from the column list.
        • The source_key parameter specifies the unique key from the index used in the key index clause of the create fulltext index.
        • Format is 

                                                            SEMANTICKEYPHRASETABLE ( table, { column | (column_list) | * } [ , source_key ] )

      • SemanticSimilarityDetailsTable
        • Returns a table with key phrases that are common across 2 documents.
        • Format is

                                         SemanticSimilarityDetailsTable( table, source_column, source_key, matched_column, matched_key )

      • Semanticsimilaritytable
        • Returns a table with documents scored by semantic similarity to the searched document specified with the source_key parameter. 
        • Format is
                                             SEMANTICSIMILARITYTABLE( table, { column | (column_list) | * }, source_key )

Querying and Managing XML Data

Intro to XML
  • XML uses tags to name parts of an XML document. The parts are called elements.
  • XML documents are ordered: The position of elements matters
  • XML is case-sensitive unicode text
  • Some characters in XML have special meanings as they are considered markup. to include them,
    •  you must escape them using ampersand(&) followed by a special code, followed by semicolon(;).
    • you can use the XML CDATA section, written as <![CDATA[@String_with_character_to_insert]]> which prevents the special characters in the string being parsed.
  • processing instructions are written in <?@processing_instruction?>
  • Comments are in the format <!--@Comment-->
  • XML is a document when it has a single root node and a fragment when it doesn’t.
               Document: <Customers><Customer id="1" /></Customers> (a single root element)
               Fragment: <Customer id="1" /><Customer id="2" /> (no single root element)
  • Elements have attributes
    • Attributes have their own names and values. They can be presented in 2 ways
      • attribute-centric: Attribute values are enclosed in quotation marks inside the element’s tag. 
      • element-centric: Attributes are represented as nested elements of the original element. Use the keyword elements to ensure the output is element-centric.
  • Namespaces are used to distinguish between elements from different sources. 
    • they are declared in the root element of an XML document.
    • They can be aliased. then you prefix element names with the alias 
  • XML data should have a fixed schema(metadata) as the actual data is mixed with elements and attribute names. The most common  schema is XSD.
    • Validating an XML document is checking whether the document complies with a schema.
    • XMLSCHEMA directive: 
      • This is used to return the XSD schema of the document. It is returned before the actual XML data. It accepts a parameter that defines a target namespace.
Producing XML from Relational Data.
  • You can convert the result set to XML using the FOR XML clause of the select T-SQL statements. 
  • The FOR XML clause comes after the order by clause.
  • There are several modes
    • For XML raw
      • The output is similar to the tabular expression of the data. Every row from the returned rowsets converts to a single element named row, and the columns become the attributes of this element.
      • You must use the elements keyword for the output to have element-centric attributes
      • You can enhance the RAW mode by renaming the row element, adding a root element, including namespaces and making the XML returned element centric.
      • The XMLSchema directive can be used here to return the XML schema
    • For XML auto
      • Elements are nested automatically based on the tables the selected columns come from; the order of nesting follows the order of the columns in the select list. 
      • You must use the elements keyword for the output to have element-centric attributes
      • Due to the nesting, the order by clause must be selected carefully as well as the select clause so the nesting is properly done.
        • The ordering should follow a one-to-many relationship
      • The XMLSchema directive can be used here to return the XML schema
    • For XML Path
      • Allows you to manually define the XML returned.
      • It uses standard XML XPath expressions to define the elements and attributes you are creating.
      • Column names and aliases serve as Xpath expressions defining the path to the element in the XML generated. 
        • Path is expressed in a hierarchical way; levels are delimited with slash /
        • Every column becomes an element. If you want attribute-centric XML, prefix the alias name with @.
      • To create XML with nested elements for child tables, you need subqueries in the select part of the query in the path mode. The subqueries have to return a scalar value in a select clause.
        •  You return a scalar value by returning XML from the subquery; the result is returned as a single scalar XML value. You format the nested XML from the subquery with the FOR XML clause, just as you format XML in the outer query. Additionally, you have to use the TYPE directive of the FOR XML clause to produce a value of the XML data type, and not XML as text, which cannot be consumed by the outer query.
  • The With Namespaces clause 
    • Used before the select clause to define namespaces and aliases in the returned XML.
    • A colon is used to separate the namespace from the element name.
  • The Root Keyword is used to specify a root element.
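  • Example (a combined sketch of the three modes, assuming hypothetical Sales.Customer and Sales.Orders tables with a one-to-many relationship):

               -- RAW: one <Customer> element per row, wrapped in a root element
               Select CustomerID, CompanyName
               From Sales.Customer
               For XML raw('Customer'), root('Customers'), elements;

               -- AUTO: nesting follows the source tables of the columns in the select list
               Select C.CustomerID, C.CompanyName, O.OrderID
               From Sales.Customer as C
               Inner join Sales.Orders as O on O.CustomerID = C.CustomerID
               Order by C.CustomerID, O.OrderID
               For XML auto, elements, root('Customers');

               -- PATH: column aliases are XPath expressions; @ makes an attribute
               Select C.CustomerID as '@Id',
                      C.CompanyName as 'Name',
                      (Select O.OrderID as '@Id'
                       From Sales.Orders as O
                       Where O.CustomerID = C.CustomerID
                       For XML path('Order'), type) as 'Orders'
               From Sales.Customer as C
               For XML path('Customer'), root('Customers');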
Shredding XML to tables
You can convert XML to tables. There are 2 methods
  • OPENXML rowset function: this provides a rowset over in-memory XML documents by using the Document Object Model(DOM) presentation.
    • Useful when a document will be shredded multiple times.
    • Before parsing the DOM, you prepare the DOM presentation of XML using sys.sp_xml_preparedocument. After shredding the document, you remove the DOM presentation using sys.sp_xml_removedocument.
    • OpenXML uses the following parameters.
      • An XML DOM document handle, returned by sp_xml_preparedocument. 
        • This is an integer.
      • An XPath expression to find the nodes you want to map to rows of a rowset returned. 
        • Specified as rowpattern and defines how XML nodes translate to rows. The path to a node is used as a pattern
        • nodes below the selected node define rows of the returned rowset.
      • A description of the rowset returned
      • Mapping between XML nodes and rowset columns.
        • Done using the WITH clause of the openXML function. 
          • In the With clause, you specify an existing table, which is used as a template for the rowset returned or you define a table syntax.
        • You can use the flag parameter of the openXML function to specify the mapping. 
          • 1 is for attribute-centric mapping
          • 2 is for element-centric mapping
          • 3 (1 + 2) is for both mappings, attribute-centric applied first and then element-centric
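    • Example (a sketch of the prepare/shred/remove pattern with a small inline XML document; the element and column names are illustrative):

               Declare @h as int;
               Declare @x as nvarchar(max) = N'
               <Customers>
                 <Customer Id="1" Name="Contoso">
                   <Order Id="10" Amount="150.00" />
                 </Customer>
               </Customers>';

               Exec sys.sp_xml_preparedocument @h output, @x;

               Select *
               From openxml(@h, '/Customers/Customer/Order', 1)   -- flag 1 = attribute-centric mapping
               With (
                    CustomerId   int            '../@Id',
                    CustomerName nvarchar(100)  '../@Name',
                    OrderId      int            '@Id',
                    Amount       decimal(10,2)  '@Amount'
               );

               Exec sys.sp_xml_removedocument @h;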
Querying XML data with XQuery
XQuery allows you navigate XML, loop over nodes, shape the returned XML instance and much more.
  • Xquery is case-sensitive
  • Variables start with $ e.g. $i
  • Xquery returns sequences which can include atomic or complex values(XML nodes).
    • Any node(element, attribute, text, processing instruction) can be included in the sequence.
    • Sequences can be formatted to get well-formed XML. 
  • Query() is used to query XML.
    • What to return from a query can be specified with the following
      • query(‘*’) returns the complete XML instance.
      • data() is used to specify what to return when you query XML. 
        • Used with query(). e.g. @XML_Variable.query(‘data(@attributes_to_return)’)
        • It returns the data values not the attributes.
        • data(*) returns all the atomic data values in the xml.
        • data(root/@Attribute1/@Attribute2) is used to return the atomic data values in the path(root/@Attribute1/@Attribute2) specified.
  • Every identifier is a qualified name(Qname). A Qname consists of a local name and optionally a namespace prefix.
    • Standard namespaces are
      • xs: namespace for an XML schema
      • xsi: The XML schema instance namespace. Used to associate XML schemas with instance documents
      • xdt: namespace for data types.
      • fn: functions namespace
      • sqltypes: namespace for mapping for SQL data types
      • xml: default XML namespace.
    • You can use standard namespaces in queries without defining them. 
  • You define your own namespaces in the prolog (at the beginning of your XQuery), which is separated from the query body with a semicolon. 
  • XQuery Data Types
    • Xquery data types are either 
      • node types: attribute, comment, element, namespace, text, processing-instruction and document
      • atomic types: xs:string, xs:boolean, xs:QName, xs:date, xs:time, xs:datetime, xs:float, xs:double, xs:decimal, xs:integer.
  • Xquery functions
    • Xquery functions are organized into categories.
  • Navigation
    • With Xquery, you can specify a path absolutely or relatively from the current node. XQuery takes care of the current position in the document.
    • Every path consists of a sequence of steps, listed from left to right.
      • Steps are separated with slashes
      • A step may consist of 3 parts
        • Axis: the direction of travel. there are 6 axes
          • Child::
            • children of the current node. default and can be skipped
          • descendant::
            • all descendants of the node. Direction is down
          • self::
            • the current node. direction is here
          • descendant-or-self::
            • Retrieves the context node and all its descendants. direction is here and then down
          • attribute::
            • Retrieves the specified attribute of the current node. Direction is right
          • parent::
            • Retrieves the parent of the current node. direction is up.
        • Node test: specifies the criterion for selecting nodes.
          • This follows the axis you specify. Can be as simple as a name test.
          • Specifying a name means you want nodes with that name.
          • You can use wildcards.
            • Asterisk(*) means you want any principal node with any name.
            • A principal node is the default node kind for an axis. It’s an attribute if the axis is attribute:: and an element for all others
          • You can also perform node type tests
            • They help you query nodes that are not principal nodes.
            • Types of node type tests 
              • comment(): select comment nodes
              • node(): returns any kind of node
              • processing-instruction(): retrieve a processing instruction node
              • text(): retrieve text nodes or nodes without tags
        • Predicate: narrows down the search
      • The format of a path is

                                   @Node_name/@Axis@Node_test[@predicate]

  • Predicates: There are 2 types
    • Numeric:
      • Select nodes by position. they’re included in brackets.
      • e.g. (/x/y)[1] means the first y child of elements x
    • Boolean:
      • Select all nodes for which the predicate evaluates to true.
      • Can work on sequences and atomic values
        • For sequences, if one atomic value in a sequence leads to a true exit of an expression, the whole expression is evaluated to true.
          • Done using general comparison operators(=, <, > ,!=)
        • For atomic values,
          • use value comparison operators (eq, ne, lt, le, gt, ge), which do not work on sequences
    • Xquery supports if…then…else expression.
      • Used to return one expression or another depending on the value of the expression in the if. Similar to case.
  • FLWOR
    • A FLWOR expression is like a for-each loop used to iterate through a sequence of nodes returned by an XPath expression.
    • FLWOR stands for For..let..where..order by..return
      • For: used to bind iterator variables to input sequences. 
      • Let: Optional. Lets you assign a value to a variable for a specific iteration
      • Where: optional. lets you filter the iteration
      • order by: controls the order in which the elements of the input sequence are processed. The order is based on atomic values.
        • Expressions passed here must be compatible with the gt operator.
      • return: evaluated once per iteration. Also used to format the result XML.
        • Use {} around expressions to extract the value of the expression
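    • Example (a small sketch of a FLWOR expression inside the query() method, assuming a local @x variable of the xml type):

               Declare @x as xml = N'
               <Orders>
                 <Order id="2" amount="250"/>
                 <Order id="1" amount="100"/>
               </Orders>';

               Select @x.query('
                 for $o in /Orders/Order
                 where $o/@amount > 150
                 order by $o/@id
                 return <BigOrder id="{data($o/@id)}" amount="{data($o/@amount)}"/>
               ');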
The XML data type(using)
  • XML data type has 5 methods that accept XQuery as a parameter.
    • query(): use to query XML data.
    • value(): retrieving atomic values
      • It returns a scalar value. you must specify the position of the element in the sequence you are browsing.
      • value(@Xquery_Expression, @dataTypeToReturn)
    • exist(): checking existence of a node in an XML instance.
      • Commonly used in the where clause.
      • returns 1 where the node searched for exists in the XML instance
      • returns 0 where the node searched for doesn’t exist in the XML instance
      • returns null if the XML instance is NULL
    • modify(): modifying sections within the XML data
      • Due to the size of XML data, it’s more efficient to change a small portion of it.
      • Invoked in an update T-SQL statement
      • There are 3 keywords for data modification
        • insert
        • delete
        • replace value of
    • nodes(): shredding XML data into relational data. 
      • Usually faster than using the OPENXML function.
      • returns a result set that contains logical copies of the XML instance. 
      • it has to be invoked for every row in the table.
  • XML is a Large object(LOB) type. 
  • Using XML data type for Dynamic Schema
    • You can validate XML data against an XML schema collection. 
      • This is useful for a dynamic schema and allows you to support data of different schemas.
      • XML schemas allow you to constrain the possible nodes, the data types of the nodes, and the values of the nodes.
      • Validating XML against a schema result in a typed XML.
    • To prevent binding a product to the wrong schema, use a trigger or a check constraint to check whether the namespace is the same as the product category name.
  • XML indexes
    • They allow index seek operations instead of a table scan on XML data.
    • The first index on an XML column is the primary XML index. It contains a shredded, persisted representation of the XML values. For each XML value in the column, the index creates several rows of data. It can only be created on a table with a clustered primary key.
    • there are 3 types of secondary XML indexes
      • Path: useful if queries specify path expressions. 
      • value: useful if queries are value-based and the path is not fully specified or it includes a wildcard
      • Property: useful for queries that retrieve one or more values from xml instances using value()
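  • Example (a sketch of the value(), exist(), and nodes() methods described above, using a local @x variable of the xml type):

               Declare @x as xml = N'
               <Orders>
                 <Order id="1"><Amount>100</Amount></Order>
                 <Order id="2"><Amount>250</Amount></Order>
               </Orders>';

               -- value(): a scalar value; note the [1] to guarantee a single node
               Select @x.value('(/Orders/Order/@id)[1]', 'int') as FirstOrderId;

               -- exist(): 1 if the node exists, 0 if it doesn't
               Select @x.exist('/Orders/Order[@id=2]') as HasOrder2;

               -- nodes(): shred each Order element into a relational row
               Select T.c.value('@id', 'int')          as OrderId,
                      T.c.value('(Amount)[1]', 'int')  as Amount
               From @x.nodes('/Orders/Order') as T(c);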
Creating tables and enforcing data integrity.
Tables are created in two ways
  • Select into: creates a table automatically using the output of a query for the basic table definitions
  • Create table statement: you explicitly define the components of the table
    • The definition of a table (its schema) can be changed after the table’s creation using the ALTER TABLE command.
    • The database Schema  is a container that you can use to group database objects
      • A table can belong to only one schema.
      • Every table’s name has two parts @Database_schema.@table_name.
      • Schemas cannot be nested in other schemas.
      • Schemas are owned by users.
      • To move a table from one schema to another, use the format

                                   Alter schema @destination_Schema transfer  @table_name.

    • Naming tables and columns
      • The names of tables and columns must be valid SQL identifiers. There are 2 types of identifiers
        • Regular: comply with the rules for the format of identifiers (e.g. start with a letter or underscore, contain no spaces or special characters, and are not reserved keywords) and need no delimiters.
        • Delimited: identified by [] or "" surrounding the name.
      • As the maximum identifier length is 128 characters, make sure identifier names aren’t too long
    • Choosing column data types
      • Use the most efficient data type.
    • Null and default values.
      • You can specify whether a column allows nulls by specifying null or not null after the column’s data type.
      • If you don’t want to allow nulls, but there is a possibility of the data being unknown at insertion time, you can create a default constraint to assign a default value

                                   @column_name @datatype  not Null default(‘@default_value’).

    • Identity property and sequence numbers
      • The identity property can be assigned to a column to automatically generate a sequence of numbers upon insert
        • Only one column in a table can have it.
        • You specify both seed and increment values for the numbers generated.
        • To insert into an identity column, use the format

                                             set identity_insert @table_name On;

    • Computed columns
      • These are columns that are computed based on expressions.
      • The expression for a computed column cannot be non-deterministic (e.g. using getdate()) if the column is persisted or indexed.
      • Format is

                                    @computed_column_name As @Attribute1 @Operator @Attribute2

    • Table Compression
      • You can compress the data in a table to get more efficient storage in the enterprise versions of SQL server. Data can be compressed in
        • Row: SQL server applies a more compact storage format to each row of a table
        • Page: Include row-compression plus additional compression algorithms.
        • Format is

                                             Create table @Table_name

                                             (
                                                  Table_schema
                                             ) with (Data_compression = @CompressionType)

  • Altering a table: used to alter a table’s structure and add or remove table properties. 
    • To add a column, you use the format

                         Alter table @Tablename

                         Add @columnName @column_Definition

    • To alter a column, use the format

                         Alter table @tablename

                         Alter column @columnname @column_definition

    • Cannot be used to
      • Change a column name
      • add an identity property
      • remove an identity property
Enforcing data integrity
  • Tables in a database sometimes need to be validated to ensure
    • The range of values the column accepts
    • Whether the values should reference another column
    • Whether they should be unique
  • These validations are controlled using constraints.
    • Constraints are database objects and must have unique names. however their names should reflect their function(the table name they’re constraining).
    • Can be declared during the create table and alter table commands
    • With Check
      • This causes the constraint creation to fail if any data already in the column violates the constraint
    • Format is 
Create table @table_name
(
     @table_definition
     constraint @constraint_name @constraint_type(@column_name)
)
Alter table @table_name
Add constraint @constraint_name @constraint_type(@column_name)
Go
    • There are 5 types
      • Primary Key constraints
        • These are used to enforce uniqueness of values for every row in a column.
        • There can be only one primary key per table.
        • They can only be created on columns that don’t allow nulls and contain only unique values
        • SQL server enforces uniqueness using a unique index.
      • Unique constraint
        • Used to enforce uniqueness of values in a column or combination of columns.
        • Only one null can exist in a column with a unique constraint.
        • A unique non-clustered index is created for enforcing uniqueness.
      • Foreign Key constraints
        • A column or combo of columns in one table used to look up data in another table. 
          • In the other table, the referenced column has a primary key or unique constraint applied.
          • The referencing column can contain duplicate values.
        • Used to enforce that values entered into the referencing table are from a referenced table
        • Very useful for joining.
          • Creating a non-clustered index on foreign key columns helps query performance.
        • Format is 

                                              Alter table @table_name with check

                                              Add constraint @constraint_name foreign key(@column_name) references @Referenced_table(@referenced_column)
      • Check Constraint
        • Used to limit the range of values in a column, beyond the limits of the data type.
        • Format

                                             Add constraint @constraint_name check(@column @comparison_operator @value)

      • Default Constraint
        • This supplies a default value during an insert if no other value is supplied.
        • Values don’t have to be supplied to columns with default during insertion
        • Format is 

                                             Add constraint @constraint_name default(@default_value)
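    • Example (a sketch combining the constraint types above in a single hypothetical table; it assumes a Sales.Customers table with a CustomerID primary key already exists):

Create table Sales.Orders
(
     OrderID     int identity(1,1) not null
          constraint PK_Orders primary key,
     CustomerID  int  not null
          constraint FK_Orders_Customers references Sales.Customers(CustomerID),
     OrderDate   date not null
          constraint DF_Orders_OrderDate default (getdate()),
     Quantity    int  not null
          constraint CK_Orders_Quantity check (Quantity > 0),
     constraint UQ_Orders_Customer_Date unique (CustomerID, OrderDate)
);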

Designing and creating views, inline function and synonyms
Views, inline functions and synonyms are ways of presenting a logical view of a table without exposing the physical base table directly.
Views and inline functions
  • They allow you to present the contents of tables to users and to encapsulate complex query logic so a user doesn’t have to remember it.
  • Views
    • Appear as tables and can be used to store and re-use queries in the database.
    • Properties for a view can be 
      • With Encryption: this obfuscates the view text so the user can’t easily see the select text of the view
      • With Schemabinding: binds the view to the table schemas of the underlying tables. The underlying tables cannot be altered without dropping the view, and the view cannot have its schema changed unless the view is dropped.
      • With View_metadata: returns the metadata of the view instead of the base tables.
      • With Check option: ensures that rows modified through a view that has a where clause still satisfy that where clause, so they cannot be updated in a way that filters them out of the view.
    • A view can only have one result set, hence it can only have one select statement unless a union is used.
    • Views must return a relational result set
    • Views cannot create a table or reference a temporary table.
    • You can alter a view using the alter view command. You must reissue the entire view definition when you alter it.
    • Partitioned views
      • You can use views to partition large tables on one server, in one or more tables across different databases and across multiple servers.
      • A partitioned view uses a union across partitioned tables.
    • To carry out DML(insert, update, delete) with a view,
      •  you must reference exactly one table at a time, no matter how many tables the view references.
      • The view columns must not have expressions surrounding them e.g. aggregate functions.
      • You cannot modify a view column that is computed from a union, cross join, except, intersect.
      • you cannot modify a column whose result is from grouping or distinct.
      • you cannot modify through a view that combines top or offset-fetch with the with check option.
    • Format is
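                          -- a general sketch; the property names are those listed above, other names are placeholders
                          Create view @View_Name
                          [With Encryption, Schemabinding, View_metadata]
                          As
                          @Select_statement
                          [With check option]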
  • Inline Functions
    • They are parameterized views(views with parameters)
    • returns a rowset based on the select statement.
    • Properties of an inline function can be
      • With Encryption
      • With Schemabinding
    • Format is 

                         Create function @Function_name(@parameter1 @parameter1type, @parameter2 @parameter2type)

                         returns table    
                         As
                         return
                         (@select_statement)
Synonyms
  • These provide an abstraction layer for all schema-scoped database objects. They are names stored in a database that can be used as substitutes for other object names
  • You can refer to a synonym anywhere you’ll refer to a table except in DDL statements(e.g. alter)
  • they can refer to objects in a different database or a linked server.
  • Synonyms cannot refer to other synonyms.
  • Synonyms do not expose the metadata of the underlying object to users.
  • Synonyms have a late-binding behavior
    • The object referenced doesn’t have to exist and SQL server doesn’t test for its existence
    • This means you can use a single synonym for many different objects, creating the synonym for each object as you need it, or have the same synonym in different DBs referring to the same object.
  • Format is

               Create synonym @Synonym_name for @TableName

Inserting, Updating and deleting data
Inserting data
  • You can Specify the column names in the insert  statement
    • Lets you skip columns that have identity/default values defined.
      • To assign a value to an identity column, you need to issue set identity_insert @Table_name on.
      • During an insert, If a value or default isn’t specified for a column,
        • SQL checks whether the column gets its value automatically e.g. identity or default constraint. If it doesn’t,
        • SQL checks if the column allows nulls and inserts a null. If it doesn’t
        • An error is thrown
  • Data can be inserted into a table using
    • Insert values: Allows you insert rows into a table based on value expressions.
      • Allows you change the arrangement of values in the insert statement. 
      • Format is

                                   Insert into @Relation(@attribute1, @attribute2, @attribute3) 

                                   values(@Value11,@value12,@value13),(@value21,@value22,@value23)
    • Insert Select: Allows you insert the result set of a query into the specified target table.
      • Allows you change the arrangement of values in the insert statement.
      • Format is

                                   Insert into @Relation1(@attribute1, @attribute2, @attribute3) 

                                   Select @select_list
                                   From @Relation2
                                   where @predicateExpression
    • Insert Exec: Allows you insert the result set returned by dynamic SQL or a stored procedure into the specified table.
      • Allows you change the arrangement of values in the insert statement.
      • Format is

                                   Insert into @Relation1(@attribute1, @attribute2, @attribute3) 

                                   Exec @Stored_Procedure1

    • Select Into: Creates the target table based on the definition of the source and inserts the result rows from the query into the table.
      • Copies the data from the source.
      • Copies some aspects of the definition(e.g. column names, types, nullability and identity). If you don’t want them copied, you need to apply some modifications to the data. 
        • e.g. selecting @identityColumn + 1 removes the identity property from the destination column.
        • If you want to change the data type, use cast() or convert().
        • Modifying a column makes it nullable. To prevent it from being nullable, use ISNULL() e.g. isnull(cast(@Column1 as @datatype),’@AlternateValue’)
      • Some table definition like indexes, permissions, constraints and triggers aren’t copied. If you want to include them, you need to script them from the source to the target.
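      • Example (a concrete sketch using the techniques above to remove the identity property and keep a column non-nullable; table names are placeholders):

                                    Select OrderID + 0 as OrderID,                                   -- expression removes the identity property
                                           isnull(cast(OrderDate as date), '19000101') as OrderDate, -- cast changes the type; isnull keeps it non-nullable
                                           CustomerID
                                    Into Sales.OrdersArchive
                                    From Sales.Orders
                                    Where OrderDate < '20200101';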

Update data

  • The update statement is used to update existing rows in your tables.
  • Format is

               Update @target_table

               Set @Attribute1 = @Expression1, @Attribute2 = @Expression2, @Attribute3 =  @Expression3
               where @predicateExpression
  • Updates can be based on Joins.
    • Update rows in a table while referring to related rows in other tables for filtering and assignment
    • Format is

                         Update @Target_table_alias

                         Set @Attribute1 = @Expression1, @Attribute2= @Expression
                         From @Target_table as @target_table_alias
                         Inner Join @Source_table as @Source_table_alias
                         on @Source_table_alias.@Attribute2 = @target_table_alias.@Attribute1
                         where @predicateExpression
    • Nondeterministic Updates can happen when multiple source rows match a target row.
      • SQL server updates one of the rows.
  • To prevent non-deterministic updates, use the merge statement instead of updates based on joins. It fails if there are multiple source rows for a target row.
  • Update can be done through table expressions
    • Useful when you want to see the rows that will be modified before the actual update.
    • Define a Table expression based on the select, then modify using update
    • Useful when you want to modify rows with expressions not allowed in set clause(e.g. window functions)
    • Format is
                          With @TE_Alias as (@Table_Expression)
                          update @TE_Alias
                          set @Attribute1 = @Expression1, @Attribute2 = @Expression2
                          where @predicateExpression
  • Update Based on a variable
    •  Modify a row and collect the result. Instead of using an update and a select, use the short version
    • Format is

                         Declare @variable as @DataType

                         Update @relation1
                          Set @variable = @Attribute1 += @Expression
                         where @predicateExpression
Deleting Data
There are 2 statements you can use to delete rows from a table: 
  • Delete: Used to delete rows from a table. A predicate can be used to limit the rows deleted.
    • Fully logged and can take a long time to complete.
      • Causes the transaction log to increase in size.
    • Takes exclusive locks; a delete of a lot of data can cause lock escalation to an exclusive table lock.
      • To prevent this, split the delete into smaller chunks using TOP and use an infinite loop

                                    While 1=1

                                    Begin
                                         Delete Top(1000) from @relation1
                                         where @predicateExpression
                                         if @@rowcount < 1000 break
                                    End

    • Delete based on a join
      • Enables you delete rows from one table based on information that you evaluate in related rows in other tables.
      • You can also use a subquery instead(with in or exists)
      • Format is

                                   Delete from @relation1_alias

                                   From @relation1 as @relation1_alias
                                   Inner join @relation2 as @relation2_alias
                                   On @relation1.@attribute1 = @relation2.@attribute1
                                   where @predicateExpression

    • Delete based on table expressions
      • Use a table expression to define the rows you want to delete, then issue a delete against the table expression. 
      • Format is

                                   With @TableExpression1

                                   Delete from @TableExpression1

    • Format is 

                         Delete From @relation1

                         where @predicateExpression

  • Truncate: Deletes all rows from the target table.
    • Doesn’t allow an optional predicate filter.
    • Format is

                         Truncate table @relation1

  • Truncate is minimally logged and is faster than delete.
  • Delete doesn’t reset the identity property of a column and truncate does
  • Delete works for a table with a column referenced by a foreign key as long as the rows being deleted aren’t referenced. Truncate isn’t allowed on a table referenced by a foreign key
  • Delete is allowed against a table involved in an indexed view. truncate isn’t
Data Modification
Using sequence and identity column property
  • Identity column property
    • Automatically assigns a value to the column upon insertion. 
      • You can’t put a value in the identity column during an insert except you enable identity_insert on the table.
        • Set identity_insert @Table on
    • You cannot update an identity column.
    • Only one column can have this property.
    • The data type must have a scale of 0(no decimal). Numeric/Decimal with a scale of 0 is allowed
    • You can define a seed(first value) and increment( value to increment previous supplied value by). default for both is 1.
    • To query the last identity value generated use
      • Scope_identity(): returns the last identity value generated in your session in the current scope
      • @@identity: returns the last identity value generated in your session regardless of scope.
      • Ident_current(@relation): accepts a table as input and returns the last identity value generated in that table regardless of session.
    • Delete doesn’t reset the identity property of a column and truncate resets the identity column to its seed.
    • Identity doesn’t guarantee uniqueness as you can turn on identity_insert and can reseed the value. To enforce uniqueness, you need a unique index(primary or unique constraint)
    • Identity doesn’t guarantee there’ll be no gaps between values.
      • If an insert fails, the identity value isn’t reset, so the value the column returns is lost.
    • Identity doesn’t support cycling. After you reach the maximum value in the data type, the next insertion will fail. you need to reseed the identity value
      • To reseed an identity value, use the DBCC CheckIdent command.
        • dbcc checkident(‘@tableName’, reseed, @NewSeedValue)
  • Sequence Object
    • An independent object in the database.
    • Can be used for keys that shouldn’t conflict across different tables.
    • The value is generated before use.
    • Supports cycling.
    • The data type must have a scale of 0(no decimal). Numeric/Decimal with a scale of 0 is allowed
      • Default data type is bigint
      • E.g create sequence @sequenceName as @datatype;
    • Has the following properties
      • Increment by: The increment value. Default is 1
      • MinValue: the minimum value to support. Default is the minimum value in the type.
      • MaxValue. The Maximum value to support. Default is the maximum value in the type.
      • Cycle|No cycle: defines whether to allow the sequence to cycle or not. default is no cycle
      • Start with: sequence start value. Default is MinValue for an increasing sequence and MaxValue for a decreasing (increment is -1) sequence.
    • To request a new value from the sequence, use the Next Value For function.
      • The next value for is usually part of a default constraint to generate values automatically.
        • e.g. Add constraint @constraint_name default(next value for @SequenceName)
    • You can change all properties of a sequence except its data type.
      • Use the Alter Sequence command. 
        • To change the current value, use the restart with command
        • E.g. Alter sequence @sequence1 restart with @restartValue.
    • Has a cache option which controls how often the current sequence value is written to disk versus kept in memory. Values are served from memory until the defined number of cached values is used up.
      • This gives a very large performance improvement
      • e.g. alter sequence @sequenceName cache @numberTillDiskWrite
    • Doesn’t guarantee there’ll be no gaps between values.
      • If an insert fails, the value isn’t reset, so the value the sequence returns is lost.
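    • Example (a sketch of creating and using a sequence with the properties above; object, table, and column names are placeholders):

               Create sequence Sales.OrderNumbers as int
                    start with 1000
                    increment by 1
                    minvalue 1000
                    no cycle
                    cache 50;

               -- typically consumed through a default constraint
               Alter table Sales.Orders
                    Add constraint DF_Orders_OrderNumber default (next value for Sales.OrderNumbers) for OrderNumber;

               -- or requested directly
               Select next value for Sales.OrderNumbers as NextOrderNumber;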

 Merging Data

  • The merge statement allows you merge data from a source table or expression to a target table.
    • Has the following clauses
      • Merge Into @Target_Table: Defines the target table for the operation.
      • Using @source_Table: Defines the source table for the operation. operates similarly to a from clause. 
        • You can define joins, refer to table expressions, apply, pivot, unpivot or use table functions. The output is a table result that is used as the source of the merge
        • you can use any table as the source.
      • On @merge_Predicate: specify a predicate that matches rows between the source and target and defines whether a source row is or isn’t matched by a target row.
      • When Matched [and @predicate] Then @action: Defines an action to take when a source row is matched by a target row. 
        • Only Update and Delete are allowed because a target row already exists, so you can’t insert.
        • you can apply 2 when matched if you want different predicates to cause either update or delete.
      • When not matched [by target] [and @predicate] then @action: This defines what action to take when a source row is not matched by a target row. 
        • You can only use insert here as the row doesn’t exist.
      • When not matched by source [and @predicate] then @action: This defines an action to take when a target row exists but isn’t matched by a source row.
        • Only Update and Delete are allowed because a target row already exists, so you can’t insert.
    • At minimum, only one clause is required and it can be any of the when clauses.
    • It’s best to compare all the columns in the predicate of the when matched clause to ensure unnecessary updates aren’t carried out 
    • Format is
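                     -- a sketch following the clauses above; @names are placeholders
                     Merge into @Target_table as T
                     Using @Source_table as S
                     On T.@KeyColumn = S.@KeyColumn
                     When matched and (T.@Col1 <> S.@Col1) then
                          update set T.@Col1 = S.@Col1
                     When not matched by target then
                          insert (@KeyColumn, @Col1) values (S.@KeyColumn, S.@Col1)
                     When not matched by source then
                          delete;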
              
                    
The output Clause
  • The output clause is available for modification statements; you can use it to return information from modified rows for auditing and archiving.
  • Similar to Select
    • To select values from the modified rows, you prefix the column name with a keyword specifying the image you want (inserted or deleted). You can use it for
      • Insert
        • Used to return information from the inserted rows.
        • Add an output clause after the insert and select the attributes you want to return.
        • Format

                                             insert into @Destination_Table(@Attribute1, @Attribute2, @Attribute3)

                         output inserted.@Attribute1, inserted.@Attribute2, inserted.@Attribute3
      • Delete
        • Used to return information from deleted rows in a delete statement. you need to prefix the columns with the keyword deleted.
        • Add an output clause after the delete and select the attributes you want to return.
        • Format is 

                                             delete from @Destination_Table

                         output deleted.@Attribute1, deleted.@Attribute2, deleted.@Attribute3
      • Update
        • Used to return information from modified rows
          • You can return both the old and new images of the modified rows
          • To return columns before the update, use the prefix deleted
          • To return columns after the update, use the prefix inserted
          • Format is
Update @Destination_table
set @Attribute1 = @Expression1
output

 deleted.@attribute1, inserted.@Attribute2

      • Merge
        • Merge can carry out inserts, updates and deletes.
          • You can refer to columns from both the target and source.
          • Use $action to determine the action the merge carried out.
            • $action returns a string (‘INSERT’, ‘UPDATE’ or ‘DELETE’) to signify the action that was taken.
          • You use coalesce(inserted.@attribute1, deleted.@attribute1) to return the applicable one
          • Format is
@Mergestatement
output $action as The_action, coalesce(inserted.@Attribute1, deleted.@Attribute1)
      • Composable DML
        • Used to capture output from a modification statement when you only want a subset of the output rows and not all of them.
        • To do this
          • You define a table based on a modification with an output clause.
          • You have an outer insert select against a target table, with the source table being the derived table. 
        • Format is
Insert into @ArchivingTable(@Attribute1, @Attribute2, @Attribute3)
select @Attribute1, @Attribute2, @Attribute3
from(@ModificationStatement
          output $action as the_action, inserted.@Attribute2, inserted.@Attribute3) as @Alias
where the_action = ‘Insert’
  • You can use an into to write to a table.
    • If you use into, the target cannot participate in a foreign key relationship and cannot have triggers on it. 
    • Add the into after the output clause.
  • Format is

                    insert into @Destination_Table(@Attribute1, @Attribute2, @Attribute3)

output inserted.@Attribute1, inserted.@Attribute2, inserted.@Attribute3
into @ArchiveTable(@Attribute1, @Attribute2, @Attribute3)
Select @Attribute1, @Attribute2, @Attribute3
from @SourceTable

 

Implementing transactions, Error handling and dynamic SQL
SQL Server maintains transactional control over all changes to database data. This helps ensure that the integrity of database data will never be compromised by partially completed transactions, constraint violations, interference from other transactions, or service interruptions
Understanding Transactions
  • A transaction is a logical unit of work: either all of it completes as a whole unit, or none of it does.
  • Controlling the result of transactions
    • Commit: When the work of a transaction has been approved by the user, SQL server completes the transaction by committing them
    • Rollback: If an unrecoverable error occurs or the user decides not to commit, then the transaction is rolled back.
  • Properties of a transactions
    • Atomic: Every transaction is an atomic unit of work; either all database changes in the transaction succeed or none of them succeed.
      • SQL server treats every DML or DDL command individually and will not allow any command to only partially succeed.
      • If more than 1 command is present in a transaction, SQL server will not allow the entire transaction to be committed unless all commands are successful
        • This behavior can be controlled using error handling techniques.
    • Consistent: Every transaction whether successful or not, leaves the database in a consistent state as defined by all object and database constraints. If an inconsistent state results, SQL server will roll back the transaction to maintain a consistent state.
      • It ensures that all constraints in the database are enforced.
    • Isolation: Every transaction looks as though it occurs in isolation from other transactions in regards to database changes. The degree of isolation can vary based on isolation level.
      • It ensures that when a transaction makes multiple changes to the database, none of the objects being changed by that transaction are allowed to be changed by any other transaction. If another wants to change the data, it must wait for the first to conclude.
      • Maintained by locking and row-versioning
    • Durable: Every transaction endures through an interruption of service. when service is restored, all committed transactions are rolled forward( committed changes to the database are completed) and all uncommitted transactions are rolled back(Uncommitted changes are removed).
      • Maintained using the transaction log. 
      • Every database change is first written to the transaction log, along with the original version of the data, before being written to the database data files. When a transaction is committed, the fact that it has been committed is written to the log.
  • Transaction commands
    • Begin transaction(Begin tran): marks the start of a transaction.
    • Commit transaction/work(commit): Used to commit a transaction
    • Rollback transaction/work(rollback): Used to roll-back a transaction
  • Transactions can be nested 
  • Transaction levels and states
    • @@Trancount: This can be queried to find the level of transaction
      • 0 means the code is not within a transaction
      • >0 means that there is an active transaction
      • >1 indicates the nesting level of nested transactions
      • Doesn’t report uncommittable transactions
    • XACT_STATE(): this can be queried to find the state of the transaction
      • 0 indicates there’s no active transaction
      • 1 indicates that there is an uncommitted transaction and it can be committed, but the nesting level is not reported.
      • -1 means there’s an uncommitted transaction, but it cannot be committed due to a prior fatal error.
      • Doesn’t report the transaction nesting level.
  • Transaction modes: There are 3 ways of working with transactions
    • Autocommit:
      • It is the default mode
      • Single data modification and DDL are executed in the context of a transaction that will be automatically committed when the statement succeeds or automatically rolled back if the statement fails.
      • No transactional command is issued.
      • Changes to the database are automatically handled statement by statement as transactions.
    • Implicit transactions:
      • When you issue DML, DDL or select, A transaction starts, increments @@trancount for the session, but doesn’t automatically commit or roll back the statement. You must issue a commit or rollback to finish the transaction.
      • Turned on with : SET IMPLICIT_TRANSACTIONS ON
      • Running a command sets @@Trancount to 1(you are 1 level deep in the tran). You must manually issue a commit or rollback to finish the trans.
        • Any locks taken by your command are held till you commit. if you don’t commit, you can leave locks open.
        • It is set per session.
        • You can rollback a transaction if you notice mistakes.
    • Explicit transactions
      • Begins when you issue the begin tran command to start a tran.
        • This causes the value of @@Trancount to be 1.
        • When used with implicit transactions, @@trancount is set to 2 when Begin tran is run.
          • This makes it a nested transaction.
      • Nested Transactions: This occurs when explicit transactions are nested
        • The behavior of rollback and commit changes.
        • Every new Begin Tran increments @@trancount by 1. A commit decrements @@trancount by 1 while a rollback resets @@trancount to 0. Hence, only the commit that reduces @@trancount to 0 is relevant.
      • Marking a transaction: 
        • You can name an explicit transaction. 
          • The name must be a valid identifier
          • SQL only records tranName for the outermost tran
          • E.g. begin tran @TranName
        • Named transactions are used for placing a mark in the transaction log to which the database can be restored.
          • E.g. Begin tran @TranName with Mark
        • To restore to the mark, reference the mark when restoring the transaction log.
          • Use StopAtMark to reference the TranName.
          • Use StopBeforeMark to restore to just before the transaction.
          • E.g. 

                                                       Restore Database @DBName from disk = ‘@Backuppath\backup.bak’ with noRecovery
                                                       Restore Log @DBName from disk = ‘@Backuppath\backup.trn’ with StopAtMark = ‘TranName’

  • Transactions can span batches.
  • Transaction Options
    • Savepoints: Locations within transactions that you can use to roll back a selective subset of work.
      • Define a savepoint using the Save Transaction command.
      • Rollback must reference the savepoint. 
    • Cross-database transactions: a single transaction can span changes in more than one database on the same instance.
Basic Locking
  • Locking is used to preserve the isolation of transactions.
  • There are 2 general modes
    • Shared: For sessions that read data
      • Shared locks are compatible with each other
        • Several sessions can issue shared locks on an object
    • Exclusive: For sessions that write data
      • Used to exclusively lock an object so no other transaction can change the data until the locking tran commits or rolls back.
      • In most isolation levels, most sessions cannot even read data from exclusively locked objects.
      • Exclusive locks aren’t compatible with other locks
        • No other tran can read or write to the object
      • Blocking: 
        • If a session has an exclusive lock on an object, it blocks any other session from using the object until the first session either commits or rolls back.
        • If a session has a shared lock on an object, It blocks any others attempting an exclusive lock until the first session is done.
          • Exclusive locks aren’t compatible with shared locks
          • Transactions with READ COMMITTED isolation ensures shared locks are dropped as soon as a read is completed.
      • Deadlocking
        • A result of mutual blocking between 2 or more sessions. 
        • Occurs when the blocking cannot be resolved by waiting for another tran to finish, because of a cyclical relationship between the sessions’ locks. 
        • SQL server aborts one of the trans and returns error 1205 to the client.
Transaction Isolation Levels
  • This is a method of controlling the isolation requirements for readers.
  • Can reduce or increase the blocking level based on whether locks are required and how long locks are held
  • This is configured per session
    • They determine whether a transaction can read other transactions’ data, and whether data can be changed by other transactions while the current transaction is only reading it
  • Read Committed:
    •  Default. 
    • All readers in the session will only read data changes that have been committed
    • Select statements will use shared locks, and exclusive locks will block sessions using Read Committed.
    • To read uncommitted data, queries need to use the (nolock) or (readuncommitted) table hints.
  • Read Uncommitted
    • Allows readers to read uncommitted data.
    • Removes the shared locks taken by select statements so readers no longer block writers.
    • Dirty reads can occur.
      • the result of the select can include information that was rolled back by another tran.
  • Read Committed Snapshot(RCSI)
    • Default for Windows Azure SQL database.
    • A method of using Read Committed.
    • Uses row versioning in tempdb to store original versions of changed data. 
      • The versions are stored as long as needed to allow readers to read data in its original state; hence shared locks aren’t needed on the underlying resource while reading committed data
    • Set at the DB level.
  • Repeatable Read
    • guarantees that data read in a tran can be re-read later in the tran.
    • Updates and deletes of rows already selected are prevented.
    • Shared locks are kept until the end of a tran.
    • Phantom reads can occur
      • New data can be added after the first read.
  • Snapshot
    • Uses row versioning in tempdb.
    • Configured at the DB level and set per tran.
      • A tran using snapshot isolation will be able to repeat any reads and will not see any phantom reads.
      • New rows may be added, but the tran will not see them.
    • Doesn’t require shared locks as it uses row-versioning.
  • Serializable
    • Strongest isolation level
    • set per session
    • Doesn’t allow other transactions to insert rows that would satisfy the select (prevents phantom reads).
  • Format is

               Set Transaction Isolation level @TranIsolationLevel.

Error Handling
Detecting and raising Errors
  • When an error is encountered, An error condition is generated and SQL server takes an action.
  • If T-SQL code isn’t in a try/catch block , error messages will be passed through to the client and cannot be intercepted in T-SQL code. 
  • When SQL generates an error condition, the function @@ERROR will have a positive integer value indicating the error number.
  • Analyzing error messages
    • Error messages are usually in the form 

                         @ErrorNumber, @SeverityLevel, @State, @Line_Number

                         @Error Message
    • E.g.
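                          (A typical example: querying a table that doesn’t exist.)

                          Msg 208, Level 16, State 1, Line 1
                          Invalid object name 'dbo.NoSuchTable'.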

                    

    • Error Number: 
      • SQL server error messages are numbered 1 to 49999
      • Custom error messages are 50001 and higher
      • 50000 is for a custom message without an error number.
    • Severity Level
      • There are 26 levels(0 -25)
      • Errors with severity >=16 are logged to the SQL Server log and the Windows event log (Event Viewer)
      • Errors with severity >= 19 can only be specified by members of sysadmin
      • errors with severity >=20 cause the connection to be terminated and any open transactions to be rolled back
      • errors with severity between(0 -10) are informational only
    • State: 
      • Maximum value of 127.
      • Used internally by Microsoft
    • Error message: 
      • 255 unicode characters long.
      • all stored in sys.messages.
      • You can add using sp_addmessage
  • You can raise your own errors using RaisError and Throw
    • Raiserror()
      • Message (a message ID, string, or string variable) along with severity and state are required. E.g. raiserror(@ErrorMessage, 16, 0);
      • To raise errors with severity 19 or higher, you need to include the WITH LOG option (and you must be a sysadmin). 
        • Errors with severity 20 or higher terminate the connection
      • Use NOWAIT to send messages immediately to the client.
      • Doesn’t terminate the batch
        • Code after it will continue to run
      • Format is
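                          -- a sketch of the syntax; the message text and @OrderId below are illustrative
                          Raiserror (@MessageOrMsgId, @Severity, @State [, @Arguments]) [With Log, Nowait]

                          Declare @OrderId as int = 42;
                          raiserror (N'Order %d was not found.', 16, 1, @OrderId);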
                         
    • Throw command
      • Similar to Raiserror except for the following differences.
        • Parameters are error_number, message, state
          • Can be used without parameters in CATCH
        •  the error number doesn’t require a matching defined message in sys.messages.
        • Any parameter can be a variable
        • state can be a maximum of 255
        • The severity is always 16
        • It always terminates the batch except when in a try block.
      • Format is
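                                    -- a sketch of both forms; the error number and message are illustrative
                                    Throw [@Error_number, @Message, @State];

                                    Throw 50001, N'Order was not found.', 1;

                                    -- in a catch block, re-throw the original error (the preceding statement must end with a semicolon)
                                    Throw;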

                                   

  • Try_Convert and Try_Parse
    • Try_Convert: Attempts to cast a value as a target data type.
      •  if it succeeds, returns the value, and returns a null if it doesn’t succeed. 
      • E.g. Try_convert(@destination_data_Type, @ValueToConvert);
    • Try_parse: Attempts to convert data of a data type into a specified data type. 
      • If it succeeds, it returns the value, and returns a null if it doesn’t succeed.
      • E.g. try_parse(@ValueToConvert as @DestinationDataType)
  • Handling Errors after detection:
    •  There are 2 error handling methods available
      • Unstructured
        • You handle errors as they happen by accessing @@Error.
          • Consists of testing individual statements for their error status immediately after they execute. 
            • When an error occurs, the error number is recorded in @@error 
            • If no error occurs, the value in @@error is 0.
          • Querying @@error causes it to reset to a new number as it reports the status of the command last executed.
            • copy the value into a variable and test the variable.
        • Using XACT_ABORT with transactions
          • Affects the entire batch
            • you can make an entire batch fail if any error occurs by beginning it with SET XACT_ABORT ON.
          • Set per session.
            • All other transactions are subject to it until it’s turned off
          • Causes a transaction to roll back based on any error with severity >10
          • you cannot trap the error or the error number.
          • None of the remaining code is executed
      • Structured
        • You designate a central location (the catch block) to handle errors.
        • Done using TRY/CATCH
          • You wrap the code to test for errors in a try block
            • Every try block must be followed by a catch block where you handle its errors. they must be in the same batch
          • If an error is detected in a try block, control is passed to its catch
            • Remaining statements in the try aren’t executed
            • after the catch, control is passed to the first statement after the try_catch
          • No error message is sent to the client
            • Any error with a severity greater than 10 that does not close the connection is handled by the catch.
            • to test for the state of a transaction, query XACT_STATE()
              • XACT_STATE values
                • 1: An open transaction that can either be committed or rolled back
                • 0: no open transaction. same as @@trancount = 0
                • -1: an open transaction that can not be committed
          • You cannot trap error that are due to compilation errors, syntax errors or non-existent objects.
          • Functions to report error details
            • Error_number(): returns the error number
            • Error_Message(): returns the error message
            • Error_Severity(): returns the severity of the error
            • Error_line(): returns the line number of the batch the error occurred
            • Error_procedure(): Returns the function, trigger or procedure that was executing when the error occurred.
            • Error_state(): Returns the state of the error.
          • Raiserror used in a try block must have a severity between 11 and 19 to throw control to the catch.
          • In the catch, you can use 
            • Raiserror
            • throw with params
              • Causes execution in the catch to end.
            • Throw: raises the original error message and passes it to the client.
              • Causes execution in the catch to end.
            • XACT_ABORT
              • Transfers control to the Catch.
              • The transaction is uncommittable in the Catch.
                • XACT_State = -1
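          • Example (a sketch combining an explicit transaction with try/catch; the table and column names are illustrative):

                                    Begin try
                                         Begin tran;
                                              Update Sales.Orders set Quantity = Quantity - 1 where OrderID = 42;
                                              Insert into Sales.OrderAudit(OrderID, ChangedAt) values (42, sysdatetime());
                                         Commit tran;
                                    End try
                                    Begin catch
                                         if xact_state() <> 0 rollback tran;
                                         Select error_number() as ErrNum, error_message() as ErrMsg, error_line() as ErrLine;
                                         Throw;   -- re-raise the original error to the caller
                                    End catch;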
Dynamic SQL
The technique of using T-SQL to generate and potentially execute other T-SQL. Your code dynamically constructs code that is then executed batch by batch.
  • Used in scenarios where SQL doesn’t let you specify a variable in a clause. e.g FROM clause, USE clause, column names, contents of list in IN and PIVOT
  • Execute Command is used to Execute Dynamic SQL
    • You can use string variables in the execute command
      • The string variables must be a single batch(no GO)
  • SQL Injection
    • Dynamic SQL puts your Database at risk of SQL injection as the user can simply put his code instead of yours into the application
    • To prevent it, parameterize the dynamic SQL generation and execution using sp_executesql.
      • Sp_executesql is an alternative to Exec.
      • It supports parameters. The parameters must be unicode.
      • The format is
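                                    -- a sketch of a parameterized call; the statement, table, and parameter are illustrative
                                    Declare @sql as nvarchar(max) =
                                         N'Select OrderID, OrderDate From Sales.Orders Where CustomerID = @custid;';

                                    Exec sys.sp_executesql
                                         @stmt   = @sql,
                                         @params = N'@custid int',
                                         @custid = 42;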

                                   

Designing and Implementing T-SQL routines
Stored Procedures
  • These are routines that reside in a DB and encapsulate code.
  • A Stored procedure is a single batch of code.
  • Called using Execute.
  • They can be passed data through input params and receive data using output params.
    • Params can be passed by position
    • Params can be passed by association.
    • Default parameters(parameters that are assigned default values) do not need to be entered.
  • Can encapsulate T-SQL code.
  • They cannot have the Use statement
  • Created using create Proc command
  • Declare a variable as output for it to be returned to the caller
    • output variables are always optional.
    • after declaring the parameter, use the keyword output(out) to make it an output parameter.
    • During execution, the output keyword must be used after setting the value of the output variable.
  • Set nocount on
    • This prevents the messages that report how many rows were affected from being sent to the client.
  • RETURN returns an integer status code, not a result set; a SELECT in the procedure body sends its result set to the caller without any RETURN.
  • Return and return codes
    • This is used to cause a procedure to exit. 
    • Can be used with conditional statements to control the flow of queries
    • There can be several returns in a SP.
    • Return alone causes the return of 0 for successful and a negative number for an error. 
  • You can call one stored procedure from inside another.
    • temporary tables of a stored procedure are visible down, not up.
      • A SP it calls can see its contents
      • The calling SP cannot see the called SP’s contents
    • Variables and parameters are not visible to called SPs.
  • To Declare, Format is
               create proc @SP_Name
               @variable1 as @variable1type, @variable2 as @variable2type=@DefaultValue, @variable3 as @variable3type output
               AS
               Begin
               Set Nocount on
               @Select_Statement_including_Variables
               [Return]
               END
  • To Execute, Format is

               Exec @SP_NAME @variable1 = @Literal1, @Variable2 = @Literal2, @variable3 =@variable3 output
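
  • A concrete sketch of the declare/execute pattern above (Sales.Orders, custid, and the procedure name are illustrative):

                CREATE PROC dbo.GetOrderCount
                  @custid AS int, @ordercount AS int OUTPUT
                AS
                BEGIN
                  SET NOCOUNT ON;
                  SELECT @ordercount = COUNT(*) FROM Sales.Orders WHERE custid = @custid;
                  RETURN;
                END
                GO

                DECLARE @cnt AS int;
                EXEC dbo.GetOrderCount @custid = 1, @ordercount = @cnt OUTPUT;
                SELECT @cnt AS ordercount;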

Branching Logic
  • Branching is used to control the flow of your code, enabling it to handle complex situations that require different actions based on inputs.
  • Statement blocks are statements inside a BEGIN..END pair that allow branching statements to execute more than one statement after evaluating the input expression.
    • E.G.

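                (a minimal sketch; dbo.T1 is an illustrative object name)

                IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL
                BEGIN
                  PRINT 'dbo.T1 already exists';
                END
                ELSE
                BEGIN
                  PRINT 'dbo.T1 does not exist';
                  -- the BEGIN..END block lets this branch run more than one statement
                END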
                         

  • Statements used for branching logic
    • IF/ELSE
      • Allows you conditionally execute code.
      • If the expression after the IF keyword is true, the statement after the IF is executed; otherwise the statement after the ELSE is executed.
        • You can chain multiple ELSE IF branches
    • While
      • Allows you create loops to execute a statement block as long as a condition continues to evaluate to true.
      • The condition must be modified so that it eventually leads to a false.
      • You can use the break statement to end a while statement and go to the next block of code.
      • You can use the continue statement to end the current iteration and return control to the start of the loop
    • WaitFor
      • It causes the execution of statements to pause for a specified period of time
      • There are 3 options
        • WaitFor Delay: used to specify how much time to wait
          • E.G. Waitfor delay '00:00:00'
        • Waitfor Time: used to specify the time at which execution should continue
          • E.G. Waitfor time '23:00:00'
        • WaitFor Receive: used only with Service Broker
    • Goto
      • Causes your code to jump to a defined T-SQL label.
Triggers
  • Triggers are a kind of Stored procedure associated with DML events on a table or view. 
    • It cannot be explicitly executed
  • Triggers are fired when a DML event the trigger is associated with occurs. 
  • There are 2 types of trigger events
    • DML triggers: A TSQL batch associated with a table that is defined to respond to a particular DML event such as insert, update or delete, or a combination of those events.
      • A trigger executes only once for each DML statement, no matter how many rows were affected
      • The schema must be the same as that of the table or view it is associated with.
      • Triggers execute as part of the transaction associated with the event statement.
        •  Issuing a rollback in a trigger causes all changes to rollback. 
      • Exit with a return statement.
      • You can access the inserted and deleted tables.
        • They contain rows affected by the modification that caused the trigger to fire.
      • There are 2 types
        • After
          • Fires after the event it is associated with finishes 
          • Can only be defined on permanent tables
          • Checking if @@rowcount =0 in the first line of a trigger improves performance
          • The trigger executes after the DML statement has passed all constraints check
          • After triggers can be nested to a depth of 32.
          • Format is 

                                                       Create trigger @Trigger_name

                                                       On @Table_name
                                                       After @DML_event1, @DML_Event2, @DML_Event3
                                                       As
                                                       Begin
                                                       If @@rowcount = 0 return;
                                                       Set NOcount on
                                                       END
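
          • A concrete sketch following the format above (dbo.Orders and dbo.OrderAudit are illustrative tables):

                CREATE TRIGGER trg_Orders_Insert
                ON dbo.Orders
                AFTER INSERT
                AS
                BEGIN
                IF @@ROWCOUNT = 0 RETURN;
                SET NOCOUNT ON;
                -- the inserted virtual table holds the new rows
                INSERT INTO dbo.OrderAudit(orderid, audittime)
                SELECT orderid, SYSDATETIME() FROM inserted;
                END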
        • Instead of
          • Fires instead of the event it is associated with.
          • can be defined on temporary and permanent tables and views.
          • Executes a batch of code instead of the DML statement.
          • commonly used on views to take care of constraints that prevent direct actions from being taken (e.g. a foreign key constraint).

User defined functions(UDF)

  • Encapsulate reusable code and return a scalar value or table to the caller.
  • They accept parameters which are accessed in the function as variables.
  • they are embedded in SQL and they execute as part of a SQL command. 
    • They do not require execute command
  • They cannot perform DDL, or change data in permanent tables using DML statements
  • They can be nested
  • There are 5 options
    • Encryption: obfuscates the source code
    • Schemabinding: binds the schemas of all referenced objects
    • returns null on null input: any null parameters cause a scalar UDF to return Null without executing the body of the function.
    • Called on null input: default, Function will execute even if null is passed as a parameter
    • Execute as: execute under various contexts
  • There are 2 types
    • Scalar UDFs: return a single value to the caller
      • They can appear anywhere an expression that returns a single value can appear.
      • They are run for every row in the result set and need to be very efficient.
      • to create them, the Format is

                                   Create function @FunctionName

                                   (
                                        @param1 @param1DataType
                                        @param2 @param2DataType
                                   )
                                   returns int
                                   AS begin
                                   return @Param1 @operator @param2
                                   End
      • To execute them, just call them from a clause that accepts a scalar expression (WHERE, SELECT, etc.), e.g.
select @Attribute1, @Attribute2, @functionName(@Attribute1, @Attribute2)
From @Relation1
    • Table-valued UDFs: return a table to the caller. 
      • They can appear anywhere a table is expected in the from clause.
      • there are 2 types
        • Inline: 
          • They contain a single select statement that returns a table.
          • Format for creation is 
                                   Create function @FunctionName
                                   (
                                        @param1 @param1DataType
                                        @param2 @param2DataType
                                   )
                                   returns table as return
                                   (
                                   @Select_list From @relation @whereIncludingParams
                                   )
          • To execute, simply select from the function 
            • e.g. Select @Attribute1, @Attribute2, @Attribute3 from @functionName(@param1, @param2).
        • Multistatement
          • You define the table to be returned as a table variable and insert data into the table variable.
          • Return only ends the function and isn’t used to send data to the caller.
          • Format for creation is
Create function @functionName
(
@param1 @param1DataType
@param2 @param2DataType
)
returns @tableVariable table
(
@Attribute1 @Attribute1DataType,
@Attribute2 @Attribute2DataType
)
As
Begin
     insert @TableVariable
        @Select_list From @relation @whereIncludingParams
     Return
End
GO
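
  • A concrete sketch of the inline table-valued pattern above (Sales.Orders and the column names are illustrative):

                CREATE FUNCTION dbo.GetCustomerOrders (@custid AS int)
                RETURNS TABLE AS RETURN
                (
                  SELECT orderid, orderdate
                  FROM Sales.Orders
                  WHERE custid = @custid
                );
                GO

                SELECT orderid, orderdate FROM dbo.GetCustomerOrders(1);
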
Analyzing Query performance
Query Optimization
  • The number of execution plans grows with query complexity
  • Query Execution phases
    • T-SQL query:
      • this is the statement to execute.
      • Performed by the relational engine
        • This uses logical operators(e.g. Join)
    • Parsing: 
      • checks whether the query is syntactically correct. 
      • The output is a parse tree( a tree of logical operators)
      • Performed by the relational engine
    • Binding: carries out name resolution for the objects in the query, binding objects to logical operators. The output is an algebrized tree( a parse tree associated with objects)
      • Performed by the relational engine
    • Optimization: generation of candidate plans and selection of a plan: the output is the actual execution plan(a single tree with physical operators)
      • Performed by the relational engine
    • Execution: 
      • This executes the query, compiling the code to binary before execution.
      • The output is the result set.
      • SQL can cache an execution plan so it doesn’t have to do optimization again.
      • Performed by the storage engine
        • This uses physical operators(e.g. merge join operation)
  • The query optimizer is a cost-based optimizer.
    • It assigns a cost to each possible plan. A higher cost means the plan is estimated to be more expensive, and therefore slower. The optimizer tries to find the one with the lowest cost.
      • The cost of a query is calculated by determining the algorithm used by each physical operator and the cardinality estimation (estimating the number of rows to be processed). By combining the cost of all operators in a plan, it can calculate the cost of the whole plan.
        • The estimated number of rows comes from optimizer statistics
  • Caching the selected execution plan in the plan cache can speed up execution of the same query, or of a query that is equivalent from the execution perspective. SQL Server parameterizes the query so one plan can serve multiple similar queries
    • Use of stored procedures lets SQL Server parameterize queries.
  • SQL Server caches the execution plan separately from the actual values (the execution context), allowing it to reuse the execution plan many times.
System Monitoring for query optimization
  • Monitoring has an impact on the system being monitored.
  • SQL server Extended events is a very lightweight performance monitoring system
    • allows you to correlate data from SQL Server with data from the OS and application.
    • Packages serve as containers for all Extended Events objects:
      • Events:
      • Targets
      • Actions
      • Predicates
      • Types
      • Maps
  • SQL trace: an internal server mechanism for capturing events.
    • The source for a trace event can be a batch or an event like a deadlock.
    • After an event, the trace gathers the event information
    • SQL server profiler is the UI for trace. 
      • When monitoring a Production system, use trace without profiler as profiler uses system resources.
Using SET session options and analyzing query plans
  • Set Session Options
    • Set Statistics IO command
      • One of the targets of optimizing a query is to lower disk I/O
      • Lowering the number of pages SQL server has to read.
      • A session command( stays on till you complete the session or disconnect from SQL Server)
      • returns the following data
        • Scan count: the number of index or table scans performed
        • Logical reads: the number of pages read from the data cache. When you read a whole table, this gives an estimate of the table size.
        • Physical reads: the number of pages read from disk. Usually lower than logical reads because some pages are already cached
        • Read-ahead reads: the number of pages SQL reads ahead
        • LOB logical reads: the number of LOB pages read from the data cache.
        • LOB physical reads: the number of LOB pages read from the disk
        • LOB read-ahead reads: the number of LOB objects read ahead
    • Set Statistics time command
      • this returns time statistics, including the CPU time and total elapsed time needed for the operation; it shows times for both the execution and pre-execution (parse and compile) phases.
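      • E.g. (a sketch; Sales.Orders is an illustrative table):

                SET STATISTICS IO ON;
                SET STATISTICS TIME ON;

                SELECT custid, COUNT(*) AS numorders
                FROM Sales.Orders
                GROUP BY custid;

                SET STATISTICS IO OFF;
                SET STATISTICS TIME OFF;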
  • SQL caches data in memory to prevent it having to go to disk to pull data.
  • Execution Plans
    • This shows how a query is executing. 
      • the physical operators used during the execution
      • the relative cost of each operator as a percentage of the total query cost.
      • Arrows showing how data flows from one physical operator to another.
      • The actual and estimated cardinality (number of rows).
        • A large difference between them indicates errors in cardinality estimation
      • to see more details about an operator press F4.
    • the Execution plans(estimated and actual plans) are output as text, XML and graphically.
      • To see text
        • Set Showplan_text and Set Showplan_all for estimated plans
        • Set statistics profile for actual plans
      • To see XML
        • Set showplan_XML for estimated plans
        • set statistics XML for actual plans.
      • Graphically
        • Ctrl + L for estimated plans
        • Ctrl +M for actual plans
    • Some of the operators shown
      • Table scan: scan of a whole table stored as a heap(a table can be stored as a heap or a clustered index).
      • Clustered index scan: scan of a whole table stored as a clustered index (indexes are stored as balanced trees)
      • Clustered index seek: SQL server seeks for the first value in the seek argument(e.g. a column value in where) in a clustered index and performs a partial scan
      • Index scan: scan of a whole non-clustered index
      • Index seek: SQL server seeks for the first value in the seek argument(e.g. a column value in where) in a non-clustered index and performs a partial scan.
      • RID lookup: Lookup for a single row in a table stored as a heap using the row identifier.
      • Key lookup: Lookup for a single row in a table stored as a clustered index using the key of the index.
      • Hash match join: a join that uses the hash algorithm.
      • merge join: a join that uses the merge algorithm.
      • Nested loops: Joins that use the nested loops algorithm
      • Stream Aggregate: Aggregation of ordered rows.
      • Hash Match Aggregate: Hash algorithm used for aggregating
      • Filter: a filter of rows based on a predicate
      • Sort: a sort of incoming rows.
Dynamic management objects
  • SQL Server monitors itself and gathers information useful for monitoring the health of an instance, finding problems such as missing indexes, and optimizing queries. This information is exposed through dynamic management objects (functions and views).
  • DMOs are in the sys schema and start with dm_.
  • DMOs are virtual objects that give you access to data SQL server stores in memory.
  • They store cumulative data, which is reset when the instance restarts.
  • Common DMOs for query tuning
    • SQL server OS(SQLOS) related DMOs: manages OS resources specific to SQL server
      • sys.dm_os_sys_info: returns info about the system.
      • sys.dm_os_waiting_tasks: info about sessions that are currently waiting on something. E.g. they’re waiting on a block.
    • Execution-related DMOs: Info into queries that have been executed, including their query text, execution plan, number of executions. 
      • The sql_handle column is a hash of the batch text that was executed. You can use it to retrieve the text from sys.dm_exec_sql_text.
    • index-related DMOs: info about index usage and missing indexes.
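    • A sketch combining the execution-related DMOs above (system views only, no user objects assumed):

                SELECT TOP (5) qs.total_logical_reads, qs.execution_count, st.text
                FROM sys.dm_exec_query_stats AS qs
                CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
                ORDER BY qs.total_logical_reads DESC;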
Implementing indexes and statistics
Indexes
  • The logical structure of data
    • A page is an 8Kb unit and belongs to a single object. It is the smallest unit of reading and writing.
    • An extent is 8 consecutive pages
      • Mixed extent: an extent whose pages belong to multiple objects.
      • Uniform extent: if an extent consists of pages belonging to a single object. 
    • SQL server traces which pages and extents belong to an object through Index Allocation Map(IAM) pages. An IAM page can point to ~4GB of data.
      • Every table/index has a first IAM page.
      • IAM pages are organized as a doubly linked list (each page points to its predecessor and successor).
      • SQL server stores a link to first IAM pages in its own tables
      • When a table is created, no pages are allocated. After the first insert, the first page and IAM are assigned.
      • Internal fragmentation means that pages are not full. The more rows you have on a page, the fewer pages SQL Server must read to retrieve those rows, and the less memory is used for cached pages for the same number of rows.
    • SQL server organizes tables as heaps or as balanced trees(a clustered index/table).
      • Heaps
        • Data stored in any logical order.
        • If you do not create a clustered index, then the table is a heap.
        • Data is found by scanning the whole heap in physical or allocation order. 
        • New rows are stored anywhere in a heap. this leads to their becoming fragmented.
      • Clustered index
        • A table is organized as a balanced tree when you create a clustered index.
        • a balanced tree has a single root page and one or more leaf pages. 
          • It can have zero or more intermediate pages
          • Data is stored in leaf pages.
          • Data is stored in logical order of the clustering key.
          • Intermediate pages point to leaf-level pages. They contain a clustering key value and a pointer to the page where that value starts in the logically ordered leaf level.
          • If a single page can point to all leaf-level pages, then only a root page is allocated. 
          • If more than one page is needed to point to the leaf-level pages, an intermediate level is added.
          • The root page points to intermediate-level pages. If the root page cannot point to all first-level intermediate pages, a new intermediate level is created.
          • Pages on the same level are organized as a doubly linked list
            • SQL server can find the previous and next page in logical order for any specific page.
          • SQL server uses the IAM pages to track the physical allocation of the balanced tree pages.
        • Columns with unique or non-unique values can be the clustering key(key of a clustered index).
          • Uniqueness is maintained by adding a sequential integer to the clustering key’s value if it occurs again.
        • To seek for a row, SQL Server only needs to read one page per level (e.g. root, intermediate, and leaf) instead of scanning the whole table.
        • Inserting a new row into a full page causes SQL server to split the page into 2 pages and move half of the rows to the second page as it needs to maintain the logical order of the row. 
          • This leads to internal fragmentation.
        • New pages can be reserved anywhere in a data file as physical order of pages needn’t correspond to the logical order.
          • This means the pages are out of order and are logically/ externally fragmented.
        • You can control internal fragmentation with the FILLFACTOR option for leaf-level pages and PAD_INDEX for the higher-level pages; both are options of the CREATE INDEX statement.
        • You can rebuild or reorganize an index to get rid of the external fragmentation using alter index…reorganize and alter index…rebuild.
        • A short clustering key means more rows can fit on pages above leaf level and fewer lower levels are needed. 
          • Fewer levels mean a more efficient index as SQL server needs to read fewer pages to find a row. Hence shorter keys are better for seeks
          • Good for OLTP environments
        • A large clustering key is useful where queries read huge amount of ordered data.
          • This is better for scans.
          • Good for warehousing environments
    • Non-clustered indexes
      • Indexes that do not contain all of the data and serve as pointers to table rows for quick seeks are called non-Clustered indexes.
      • Similar in structure to clustered indexes.
        • Root and intermediate levels look the same as a clustered index.
      • The leaf level is different because it doesn’t hold all the data.
        • What is stored depends on the underlying table organization(either balanced tree or heap).
        • It contains the index keys and row locators.
          • Row locators point to a row in the underlying table. 
            • If the table is a heap, the row locator is called a row identifier(RID) 
              • An 8-byte pointer containing the database file ID and page ID of the target row, plus the target row's ID on that page
              • To seek for a row, SQL server needs to traverse the index to the leaf level, and read the appropriate page from the heap , retrieving the row from the page with a RID lookup.
                • RID lookups are efficient when your query searches for a small amount of rows only. 
              • If a row moves in a heap, all non-clustered indexes need to be updated to reflect their new physical location. However, SQL server is optimized to leave a forwarding pointer if a row has to move to another page.
            • If the table is a balanced tree, then the row locator is the clustering key.
              • To seek for a row, SQL server has to traverse all levels on a non-clustered index and also all levels of a clustered index in a key lookup
              • If a row moves, non-clustered indexes don’t need to be updated to reflect its new physical location because the row locator points to a logical location.
      • If the clustering key is narrow, SQL server can accommodate more rows on a leaf-level page than when RID(8-bytes) is used as a row locator.
      • You can have up to 999 non-clustered indexes on a table.
      • Filtered non-clustered indexes
        • These span a subset of column values only and apply only to a subset of table rows.
        • Useful when some values in a column occur rarely, whereas others occur frequently. You could just create a filtered index over the rare values only
        • SQL server only updates this index for a change in rare values.
        • Can be used to maintain uniqueness in a column that has some values and lots of NULLs (by filtering the unique index to the non-NULL values).
        • Created by adding a where to a create index statement. 
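        • E.g. (a sketch; Sales.Orders and shippeddate are illustrative names):

                CREATE NONCLUSTERED INDEX idx_Orders_shipped
                ON Sales.Orders(shippeddate)
                WHERE shippeddate IS NOT NULL;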
      • Columnstore Indexes
        • a method of storing index data column by column.
        • You need to create a column store index on the table.
        • It is stored compressed. when a query references a single column that is part of a columnstore index, SQL server fetches only that column from disk, not the entire row like in row storage.
        • However, to return rows, they must be reconstructed
        • Useful for queries that aren’t very selective, especially data warehouse queries.
        • Tables with columnstore indexes are read-only because rows must be reconstructed. To update a table, you must drop the columnstore index.
        • Divided into segments which are stored as LOB
          • it is the unit of transfer from disk to memory
          • Segments consist of multiple pages
          • Has metadata that stores the min and max value for each column for that segment. Only segments requested by a query are loaded into memory.
    • Indexed views
      • An indexed view is a materialized view.
      • The view must be created with schemabinding
      • You must use count_big() aggregate function
    • The clustering key’s uniqueness is maintained using a uniquifier(which is added to repeating values). this ensures lookups don’t return wrong values.
    • If you update the clustering key or change the table from a balanced tree to a heap, all non-clustered indexes must be rebuilt.
Using search arguments
  • This is a way of writing predicates to enable SQL server use indexes.
    • Indexes are used only when the arguments in the predicate are searchable (SARGable): the column name stands alone on one side of the expression and the calculated value appears on the other side.
    • A column that has an index on it must appear in the predicate alone, not as a function’s argument, e.g. as shown below.
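    • E.g. (a sketch; Sales.Orders and its columns are illustrative):

                -- not a search argument: the indexed column is wrapped in a function, so an index on orderdate cannot be used for a seek
                SELECT orderid FROM Sales.Orders WHERE YEAR(orderdate) = 2007;

                -- search argument: the column stands alone on one side of the predicate
                SELECT orderid FROM Sales.Orders WHERE orderdate >= '20070101' AND orderdate < '20080101';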
  • Efficient queries use the WHERE clause to filter rows.
    • The WHERE clause benefits from indexes. 
    • Where doesn’t guarantee that indexes will be used.
    • The clause has to be supported by an appropriate index and it must be selective enough. 
      • If it’s not very selective, it is less expensive to do a table/clustered index scan than to do a non-clustered index seek and then RID or key lookups. 
  • Join can benefit from indexes as well.
  • For queries that aggregate data and use group by, you should support group by with a index. 
    • Aggregating data is done either with a hash or a stream physical aggregate operator.
      • Stream is faster, however, it needs a sorted input.  
  • Index usage can be tracked using sys.dm_db_index_usage_stats.
  • If SQL server finds all data in nonclustered indexes, then the query is covered by the nonclustered indexes and the indexes are covering indexes.
    • covering more queries by adding more columns causes the key to be longer and the index less efficient. 
      • You can include a column in a nonclustered index at the leaf level only, not as part of the key.
      • You do this by using the include clause of the create index statement.
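      • E.g. (a sketch; Sales.Orders and its columns are illustrative):

                CREATE NONCLUSTERED INDEX idx_Orders_custid_incl
                ON Sales.Orders(custid)
                INCLUDE (orderdate, shipcity);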
Statistics
  • SQL server maintains statistics of the distribution of key values in system statistical pages.
    • The SQL Server optimizer uses these pages for cardinality estimation.
  • Auto-created statistics
    • SQL server maintains statistics automatically by default.
    • Statistics are created for indexes, and for columns used as search arguments.
    • There are 3 database options that influence auto creation of statistics
      • Auto_create_statistics: When On, SQL server creates statistics automatically.
        • On by default
      • Auto_update_statistics: When on, SQL Server automatically updates statistics when there are enough changes in the underlying tables and indexes. SQL Server updates out-of-date statistics during optimization. On by default.
      • Auto_update_statistics_async: determines whether SQL Server updates statistics synchronously or asynchronously during query optimization. If statistics are updated asynchronously, SQL Server does not wait for the update and cannot use the new statistics for the optimization of the query that triggered it.
    • Statistics are stored in a binary LOB and are created on one or more columns. They contain a histogram with the distribution of values in the first column.
  • Manual maintenance of statistics is done using the 
    • Create statistics command
    • Drop statistics command
    • update statistics command
    • sys.sp_updatestats: to update stats for all tables in a DB
    • Useful after a bulk import to maintain statistics, or after an upgrade of the DBMS.
  • To see the histogram of a statistics, use the DBCC show_statistics command
    • e.g. dbcc show_statistics(N'@TableName', N'@Index_Name') with histogram

Coursera: Getting and cleaning data (My Course Notes)


The goal of this course is the first three steps of this sequence

Raw Data-> Processing script-> tidy data -> data analysis -> data communication
Raw and processed data(1)
  • Data are values of qualitative or quantitative variables, belonging to a set of items.
    • variables are a measurement or characteristic of an item
  • Raw data
    • The original source of data
    • Has not been subjected to processing or any other manipulation.
    • No software has been run on the data
    • the numbers on the data have not been manipulated
    • Data has not been removed from the data set
    • The data hasn’t been summarized in any way.
  • Processed/Tidy data
    • Data that is ready for analysis
    • processing can include merging, subsetting, transforming
    • All steps should be recorded.
    • Each variable must be in one column
    • each different observation of a variable should be in a different row
    • There should be one table for each kind of variable
    • If you have multiple tables, they should include a column in the table that allows them to be linked.
Components of Tidy data(1)
  • When you go from raw to tidy data, you should have
    • The raw data
    • A tidy data set
    • A code book describing each variable and its values in the tidy data set
    • An explicit and exact recipe used to go from raw data to tidy data
  • Code book
    • Includes information about the variables(including units) in the data set not contained in the tidy data
    • Includes information about the summary choices you made
    • Information about the experimental study design you used
    • Should have a section called "Study Design" that has a thorough description of how you collected the data
    • There should be a section called "Code book" that describes each variable and its units
  • The instruction list
    • A script in R or python
    • The input for the script is the raw data
    • the output for the script is the processed, tidy data
    • there are no parameters in the script.
    • It might be impossible to script every step. Instead provide instructions like this
      • Step 1= Take the raw file, run version @3.1.2 of @summarize software with parameters a=1, b=2, c=3
      • Step 2= run the software separately for each sample
      • Step 3= take column three of outfile.out for each sample and that is the corresponding row in the output data set.
Downloading Files
  • R can be used to download files
  • This allows the downloading process to be put in the R script.
    • Getting/Setting your working directory
      • The 2 main commands are getwd() and setwd()
        • They both work with absolute and relative paths
          • Absolute paths: "C:\\Users\\Andrew\\Downloads"
          • Relative Paths: ".\\data" or "..\\"
            • \\ is used in windows instead of / in other OSs
    • Checking for and creating directories
      • file.exists("@directoryName") checks if the directory exists
      • dir.create("@directoryName") creates a directory if it doesn’t exist.
  • To download from the internet
    • use download.file()
      • parameters are url, destfile, method.

                          e.g. download.file(fileurl, destfile="./@Directory/@File.@Format"[, method = "@method"])

      • @method = "curl" may be needed for https URLs (e.g. on a Mac); it is not necessary on Windows
    • helps with reproducibility
    • Note the time the file was downloaded
      • use date()
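  • A minimal sketch of the whole flow (the URL and file names are illustrative, not from the notes):

                if (!file.exists("data")) { dir.create("data") }
                fileUrl <- "https://example.com/cameras.csv"
                download.file(fileUrl, destfile = "./data/cameras.csv", method = "curl")
                dateDownloaded <- date()
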
Reading local Data Files
  • read.table() is the main function for reading data into R
  • It reads the data into RAM
  • parameters are file, header, sep, row.names, nrows
    • quote: Used to tell R whether there are any quoted values. quote="" means no quotes
      • Sometime ‘ or " are placed in data values and cause issues when reading files, setting quote="" resolves this.
    • na.strings: sets the character that represents a missing value
    • nrows: how many rows to read from the files
    • skip: number of lines to skip before starting to read
  • Related functions are read.csv and read.csv2
    • read.csv sets sep="," and header=TRUE by default
Reading Fixed Width files
  • use read.fwf().
    • the widths option specifies an integer vector that is used to specify the width of each column.
Reading Excel files
  • requires the xlsx package
    • Install using install.packages("xlsx")
  • read.xlsx() 
    • parameters are files, sheetIndex, header
      • Sheetindex is used to state which Excel sheet the data is stored on.
    • Allows you read specific rows and columns
      • Use parameters colIndex and rowIndex
        • They accept vectors showing the columns and rows to read.
  • Write.xlsx() writes out an Excel file with similar arguments
  • XLConnect package has more options for writing and manipulating excel files.
    • XLConnect.pdf
Reading XML
  • XML consists of markup and content
    • markup are labels that give the text structure.
    • content are the actual text of the document.
  • Tags, elements and attributes
    • Tags correspond to labels. Applied to particular parts of the text so it’s structured.
      • start tags <section>
      • end tags </section>
      • empty tags <line-break /> 
    • Elements are specific examples of tags
      • <Greeting> Hello, world </Greeting>
    • Attributes are components of the label
      • <img src="jeff.jpg" alt="instructor"/>
        • src and alt are attributes
      • <step number = "3"> connect A to B </step>
  • Requires the xml package
    • xmlTreeParse(@FileURL, useInternal=TRUE/FALSE) to parse out the XML file.
      • Loads the document into memory for parsing
      • If you’re reading an html file, use htmlTreeparse() instead
      • useInternal = TRUE if you want to get all the nodes inside the document.
    • xmlRoot() is used to remove the XML version header of the XML document. output is a parsed xmlObject
    • xmlName() is used to get the tag of the document’s root.
    • use names(@RootTag) to get the tags that are just below @RootTag
  • You can directly access parts of an XML document
    • Accessed the same way a list is accessed
      • using [[@ElementNumber]]
      • To access subelements, 
        • [[@elementNumber]][[@subelementNumber]]
  • Can be done programatically
    • use xmlSApply(@ParsedXMLObject, @FunctionToApply)
      • @FunctionToApply can be xmlValue if you want to return all the contents of each children elements as a single string.
    • Xpath programming can be used to specify further what element to return.
      • XML.pdf
      • Use xpathSApply(@ParsedXMLObject, "@XpathNotationOfNodeToExtract", @FunctionToApply)
        • @FunctionToApply usually is xmlValue
        • @XpathNotationOfNodeToExtract
          • /node: Used to extract top level node
          • //@node: used to extract a node at any level
            • e.g. to extract the node named house use //house.
          • @nodePath[@attr-name]: @Node with an attribute name
          • @nodePath[@attr-name=’@Value’] @Node with tag @attr-name that has a value @Value.
            • @nodePath is the path to the node that contains the tag @attr_name. usually in the format //@OutputOfxmlName()/@ParentTag/@ParentTag
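  • A minimal sketch pulling the pieces above together (the file name and the price node are illustrative):

                library(XML)
                doc <- xmlTreeParse("./data/sample.xml", useInternalNodes = TRUE)
                rootNode <- xmlRoot(doc)
                xmlName(rootNode)                           # tag of the document root
                names(rootNode)                             # tags directly below the root
                xmlSApply(rootNode, xmlValue)               # contents of each child as a single string
                xpathSApply(rootNode, "//price", xmlValue)  # all <price> nodes at any level
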
Reading JSON
  • JavaScript Object Notation(JSON) is a lightweight data storage format often used for APIs.
  • It’s similar to XML but has a different syntax/format.
  • requires the jsonlite package
  • use fromJSON(@FileToRead) to read a JSON file.
    • returns a data frame
  • To access a variable
    • @dataFrame$VariableName
  • To write data frames to JSON
    • use toJSON()
      • the pretty parameter indents the JSON document
Using data.table
  • inherits from data.frame.
    • All functions that accept data.frame accept data.table
  • Faster than data.frame at subsetting, grouping and updating
  • requires data.table package.
  • use tables() to see all the data tables in memory
  • data tables allow subsetting rows
    • @DataTable[@RowNum,]
      • To return just the @rowNum row 
      • The same as @DataTable[@RowNum]
    • @DataTable[@DataTable$@columnName=="@value",]
      • To return all rows where the column @columnName has a value @Value
  • To subset columns in Data tables,
    • you cannot subset data tables the same way you subset data frames
      • @DataTable[,c(@columnIndex1, @columnIndex2)] doesn’t work.
    • requires expressions 
      • They can be used to summarize datasets
        • @DataTable[,@OutputDataClass(@function1(@ColumnName),@function2(@columnName2))]
          • @OutputDataClass can be list, table, vector, etc.
  • To add a new column to a Data table, use := operator

                    @DataTable[,@NewColumnName:=@ExistingColumnName @Operator @Value]

    • This allows you carry out multi step operations as expressions are involved.

                         @DataTable[, @NewColumnName:= { tmp<- (@ColumnName1+ @ColumnName2); @Function(tmp @Operator @value)}]

    • Allows you carry out plyr like operations

                    @DataTable[,@NewColumnName:=  @ColumnName1 > Value]

      • Creates a new column that has boolean values
      • also allows you use the by keyword

                                   @DataTable[,b:= mean(x+w), by=a]

        • Where a is a column with boolean values.
        • Hence rows where a = TRUE will have the same value in b, i.e. the mean calculation only considers x and w values from those rows; likewise, all rows where a = FALSE will share the same value in b
  • Data tables must be copied explicitly using copy(). 
  • Special variables
    • .N
      • Means count the number of times something occurs
      • E.g.

                                   @dataTable[, .N, by=x]

        • this counts the number of times the distinct elements in the x column occurs.
  • Keys
    • Allows rapid subsetting and sorting.

                         setkey(@dataTable,x)

                          @dataTable['a']

      • The above automatically subsets the datatable based on the key.
        • returns all rows where the value of x is a
  • Joins
    • Joins are based on keys
    • Data tables can be joined based on a similar element being set as the key.
  • For fast reading
    • done using fread()
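  • A minimal sketch of the operations above (DT and its columns are illustrative):

                library(data.table)
                DT <- data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
                DT[2, ]                       # subset the second row
                DT[DT$y == "a", ]             # rows where y is "a"
                DT[, list(mean(x), sum(z))]   # expressions that summarize columns
                DT[, w := z^2]                # add a new column by reference
                DT[, .N, by = y]              # count rows for each value of y
                setkey(DT, y)
                DT["a"]                       # key-based subsetting
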
Reading from MySQL
  • Requires the RMySQL package
    • To connect to an instance

                         @ConnectionName<- dbConnect(MySQL(),user="@userName", host="@Hostname")

    • To connect specifically to a database on an instance

                         @DBName<-dbConnect(MySQL(), user="@UserName", db="@DBName", host="@HostName")

    • To run a command

                         dbGetQuery(@ConnectionName, "@DBQuery")

    • To disconnect from an instance

                         dbDisconnect(@ConnectionName)

    • use dbListTables() to show the tables in a Database

                          dbListTables(@DBName)

    • To see the column names of a table, use dbListFields()

                         dbListFields(@DBName, "@TableName")

    • To run a query on a table

                         dbGetQuery(@DBName, "@SQLQuery")

    • To read a whole table to a R object

                         @RObjectToStoreTable<- dbReadTable(@DBName, "@TableName")

      • To read a subset of the Table

                                   @RObjectToSendQuery<- dbSendQuery(@DBName, " select * from @TableName @whereConditions")

                                    @RObjectToContainResult<- fetch(@RObjectToSendQuery[, n=@NoOfRowsToRead])

        • To clear the result of a query on the instance’s memory.

                                             dbClearResult(@RObjectToSendQuery);

Reading data from HDF5
  • Hierarchical Data Format is used for storing large data sets.
    • Data is stored in groups containing zero or more data sets and their metadata.
      • Each group has a group header with a group name and a list of its attributes.
      • Has a group symbol table with a list of objects in the group.
    • Data sets are multidimensional arrays of data elements with metadata
      • They have a header with name, datatype, dataspace and storage layout.
      • Have a data array with the data.
  • To interface with HDF5, you need to install packages from bioconductor

               source("http://bioconductor.org/biocLite.R&quot;)

    • Install the bio base packages

                         biocLite("rhdf5")

  • Use h5createFile() to create a hdf5 file.

               @RObjectForH5=h5createFile("@Filename.h5")

    • To create group

                         @RObjectForH5 = h5createGroup("@Filename.h5", "@GroupName[/@SubGroupName]")

  • To list the contents of an hdf5 file use h5ls()

               h5ls("@FileName.h5")

  • to write to a group, use h5write()

               h5write(@Robject, "@FileName.h5", "@GroupName[/@SubGroupName]")

    • If the destination group (@GroupName[/@SubGroupName]) doesn’t exist, then it is created.
    • You can write different data classes e.g. dataframes, arrays, matrices.
    • You can direct data to be written to specific places using the index option

                          h5write(@ObjectToWrite, "@FileName.h5", "@GroupName[/@SubGroupName]", index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo))

  • To read data, use h5read()

               readA=h5read("@FileName.h5", "@GroupName[/@SubGroupName]")

    • You can direct data specific data to be read using the index option

                         h5read("@FileName.h5", "@GroupName[/@SubGroupName]", index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo) )

  • The HDF5 tutorial rhdf5.pdf
Reading data from the Web
Scope is data scraping, authentication and interfacing with APIs
  • Web scraping is programmatically extracting data from the HTML code of websites
    • Read the data into R using
      • readLines()

                                   @ConnectionName<- url("@WebAddress")

                                   @RObject<- readLines(@ConnectionName)
        • Close the connection after you’re done

                                             close(@ConnectionName)

      • htmlTreeParse()

                                   @HTMLobject <- htmlTreeParse(@url, useInternalNodes=T)

        • htmlTreeParse is in the XML package
        • You can specify the nodes to read using xpathSApply()
      • GET() in the httr package

                                   library(httr)
                                    @RObjectToStoreTheHtml <- GET(@URL)

                                    @RobjectToStoreTheContent <- content(@RObjectToStoreTheHtml, as="text")
                                    @RobjectToStoreParsedHTML<- htmlParse(@RobjectToStoreTheContent, asText=TRUE)

        • GET() is in httr package
  • Accessing websites with passwords
    • GET allows you authenticate yourself for websites.

                         @DestinationRObject<- GET("@URL",authenticate("@UserName","@Password"))

    • Handles allow you save authentication to a website.

                         @RobjectForURL=handle(@URL)

                          @DestinationRObject= GET(handle=@RobjectForURL , authenticate("@UserName","@Password"))

      • Handle() is in the httr package
      • To access the website, simple refer to it using the handle and it’ll use the previous authentication via cookies.

                                   GET(handle=@RobjectForURL , path="/")

Reading data from APIs
  • Done using the httr package
  • You need to create an account with the API organization.
  • To access twitter form R

               myapp=oauth_app("Twitter", key= "@yourConsumerKeyHere", secret="@YourConsumerSecretHere")

               sig=sign_oauth1.0(myapp, token="@yourTokenHere", token_secret="@yourTokenSecretHere")
                homeTL=GET("https://api.twitter.com/1.1/statuses/home_timeline.json",sig)
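
  • A sketch of reading the response (json1/json2 are illustrative names; content() is from httr, toJSON/fromJSON from jsonlite):

                json1 = content(homeTL)                               # parse the JSON response into an R list
                json2 = jsonlite::fromJSON(jsonlite::toJSON(json1))   # reshape the list into a data frame
                json2[1, 1:4]                                         # first row, first four columns (assuming a data frame shape)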

Reading from other sources
  • Interacting with files
    • file(): to open a connection to a text file
    • url(): to open a connection to a url
      • To read a specific line from a url, use readLines()

                                   @connection<- url(@URL)

                                   @Robject<- readLines(@connection)
                                   @robject[@LineNumberToRead]

    • gzfile(): to open a connection to a .gz file
    • see ?connections for information on other connection mechanisms
    • always close connections when you are done reading from them
  • The foreign package allows you interact with other statistical programming languages
    • Loads data from minitab(read.mtp), SAS(read.xport), SPSS(read.spss), stata(read.dta)
    • library(foreign)
  • RODBC provides an interface to multiple DBMSs
  • You can read images directly into R and manipulate them using the jpeg, readbitmap, and png packages
  • Read Geographic Information Systems (GIS) data using the rgdal, rgeos, and raster packages
  • You can read MP3 files using the tuneR or seewave packages
Subsetting and sorting
  • Subsetting can be done based on rows and columns
    • Based on Columns

                         @Data[, @ColumnOrVectorOfColumns]

      • @columnOrVectorOfColumns specifies the columns to select from the dataset
    • Based on rows

                         @data[@rowOrVectorOfRows,]

      • @rowOrVectorOfRows specifies the rows to select from the dataset
      • You can also subset on rows based on conditions

                                   @data[@Data$Column @ConditionalOperator @VectorOfValues,]

    • Based on rows and columns

                         @data[(@Column1 @Operator @value @CombinationOperator @column2 @operator @value),@Column1]

      • @CombinationOperator can be & for and or | for or.
    • Dealing with NA’s
      • Using which is the most efficient way of subsetting without including NAs

                                   Data[which(@Column1 @Operator @Value),]

  • Sorting 
    • To sort a column, use sort()

                         sort(@column)

      • To sort in decreasing order, use the decreasing=TRUE option
      • To sort with NA’s at the end, use na.last=TRUE
    • To sort a data frame, use order() as the row index
      • This sorts the dataframe based on the values in the chosen column(s).

                                    @Data[order(@Column1[, @Column2]),]

        • You can specify a tie breaker
    • You can also sort based on arrange() in the plyr package
  • Adding columns to a dataframe

                    @Data$@NewColumnName<- @VectorOfValueToInsertIntoColumn

    • you can use cbind

                         cbind(@Data, @VectorOfValuesToInsert)

  • Adding rows to a dataframe
    • Use rbind()
Summarizing Data
  • summary() can be used to summarize dataFrames.
  • str() can also be used for summarization. It shows the structure of the data.
  • quantile() can be used to summarize a column
    • To view the default of 5 quantiles(0,25, 50,75 & 100)

                          quantile(@Data$@Column, na.rm=TRUE)

    • To view specific quantiles

                          quantile(@Data$@Column, probs=@vectorOfQuantiles)

      • a sample @vectorOfQuantiles is c(0.5, 0.75, 0.9)
        • 0.5 is median or 50% percentile
  • Table can be used to make tables to summarize data

                         table(@Data$@Column) 

      • This returns a count of the distinct elements in @column.
      • Use the useNA="ifany" option if you want to see a count of NA values.
    • To view the relationship between the columns of @Data, input 2 values into table

                          table(@Data$@Column1, @Data$@Column2)

      • Column1 becomes the row and @Column2 becomes the column. 
    • To view a count of all elements of a column that meet a particular condition

                         table(@Data$@Column @ConditionalOperator @VectorOfValues)

      • @conditionalOperator can be %in% to check if @vectorOfValues exists in @data
  • To count the number of NAs in a column

               sum(is.na(@Data$@Column))

  • To check if missing values exist in a column.

               any(is.na(@Data$@Column))

    • It returns TRUE if an NA exists
  • To check if every value in a column satisfy a condition, use all()

               all(@Data$@Column @Condition)

  • Row and column sums
    • You can calculate the sum of all the columns in a dataset using colSums()

                         colSums(@Data)

      • To check if any of the columns in a dataset contains an NA value,
        •  apply colSums to the is.na of the dataset

                                             colSums(is.na(@Data))

                          rowSums(@data)

  • Crosstabs
    • used to identify where the relationships in a dataset exists.

                         xtabs(@VariableToDisplayInResult ~ @Variable1ToGroupBy [+ @Variable2ToGroupBy], data=@Data)

      • To group by all other variables, set @Variable1ToGroupBy to .
  • FlatTables
    • Often used to summarize data output of xtabs() into a compact form.

                         ftable(@Data)

  • To see the size of a dataset

               object.size(@Data)

    • to specify the units of the output, print the result with the units option

                          print(object.size(@Data), units="Mb")

Creating New variables
  • Sometimes variables are missing in a dataset. 
    • You can transform the dataset to get the values you want
    • Common missing variables are
      • Missingness indicators
      • cutting up quantitative variables
      • applying transforms
  • Creating sequences
    • You use seq()

                         seq(@FirstValue, @LastValue)

      • use the by = @value option to indicate the increment value to use in creating the sequence
      • use the length =@value option to indicate how many values should be created for the sequence.
    • To create an index of the values in a column/vector

                         seq(along = @vector)

      • The output is a sequence from 1 to length(@vector)
  • Subsetting variables

               @Data$@NewColumn= @Data$@Column %in% @vectorOfValues

               table(@Data$@NewColumn)

    • This returns a count of the TRUE and FALSE observations of the condition used to create @NewColumn
  • Creating binary variables
    • use ifelse() to apply a value when a condition is true, and another when it is false

                          @Data$@NewColumn= ifelse(@Data$@Column @ComparisonOperator @Value, TRUE, FALSE)

      • To view a count of how many observations do not meet the condition,  use table() to compare @NewColumn to @Data$@Column @ComparisonOperator @Value

                                   table(@Data$NewColumn, @Data$@Column @ComparisonOperator @Value)

  • Creating categorical variables
    • Done using cut()
    • To split a variable based on its value, e.g. into quantiles

                         @Data$NewColumn = cut(@Data$@Column, breaks=@MeasureToSplitData)

                          table(@Data$NewColumn) 

      • e.g. @MeasureToSplitData can be quantile(@Data$Column) to split the data into 5 quantiles.
      •  table(@Data$NewColumn) is Used to view the count of the groups @Column is split into.
      • You can compare @NewColumn to @Column using table() to view the relationship between them.
    • To split a variable into group based on quantiles
      • use cut2() in the Hmisc package

                                   @Data$@NewColumn = cut2(@Data$@Column, g=@NoOfQuantilesToSplitInto)

    • The result of a cut/cut2 is a factor variable.
  • Creating Factor variables
    • You can convert a variable into a factor

                         @Data$@ColumnToRepresentFactors<- factor(@Data$@Column)

    • To convert factor variable to numeric, use as.numeric(@Data$Column)
    • Alternatively, use transform to convert the variable type to factor

                         @Data<- transform(@data, @VariableToConvertToFactor= factor(@VariableToConvertToFactor))

      • You can use labels to change the value of the observations to a character that represents the factor

                                    @Data$Variable<- factor(@Data$variable, labels=c("LabelForValue1", "LabelForValue2"…))

        • Sometimes a variable is continuous (e.g. decimal numbers measured in steps of 0.1 over a range of 10). You can split it into a limited number of factors using cut(). This involves creating tertiles.
          • First split the data into quantiles

                                                        @variableToStoreCutpoints<- quantile(@Data$variable, seq(0,1, length=@NumberOfFactors), na.rm=TRUE)

          • Cut the data and define a new variable using the quantiles

                                                       data$newVariable <- cut(@Data$variable, @variableToStoreCutpoints)


Reshaping Data
  • melt is used on a data frame to select which columns are id and which are measure.

               melt(@Data, id=@VectorOfColumnsThatAreID, measure.vars= @VectorOfColumnsThatAreMeasures)

    • The output is an object where the @VectorOfColumnsThatAreID are kept as identifiers and two new columns named variable and value are created: for each row, the name of each measure column is put in variable and its value in value.
  • A melted dataframe can be reshaped to a dataframe using dcast() or an array using acast()

               dcast(@Data, @Column ~variable[,@AggregationFunction])

    • This cross-tabulates the distinct values of @Column against the measure variables from the melt; the default summary shows just the count for each variable.
    • If @AggregationFunction is set to mean, it returns the mean of the variable for each distinct(@Column).
  • Averaging values
    • You can use tapply()

                         tapply(@Data$@ColumnToApplyFunctionTo, @Data$@ColumnToGroupBy, @functionToApply)

      • The result contains the sum of all ColumnToApplyFunctionTo grouped by ColumnToGroupBy if functionToApply is sum
    • You can use split()

                         split(@Data$@ColumnWithDataToMeasure, @Data$ColumnToGroupBy )

                         %>% lapply(sum)
                         %>% unlist

      • This returns a list with @ColumnWithDataToMeasure  grouped by @ColumnToGroupBy 
      • you can use lapply(sum) to get the sum of the groups
      • You can use unlist to convert the list to a vector.
    • You can use ddply
      • Part of the plyr package

                                   ddply(@Data, .(@columnToGroupBy), summarize, sum= sum(ColumnWithDataToMeasure ))
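
  • A minimal sketch of melt/dcast on the built-in mtcars data (the carname column is added for illustration):

                library(reshape2)
                mtcars$carname <- rownames(mtcars)
                carMelt <- melt(mtcars, id.vars = c("carname", "gear", "cyl"), measure.vars = c("mpg", "hp"))
                head(carMelt)                         # tall, skinny data: one row per car and measure
                dcast(carMelt, cyl ~ variable)        # default aggregation: count of values per cyl
                dcast(carMelt, cyl ~ variable, mean)  # mean mpg and hp per cyl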

reshapingdata20110406-110506090255-phpapp02.pdf

Managing data frames in R using dplyr
  • dplyr is a package developed to make it easy to work with data frames
  • dplyr expects that
    • There is one observation per row
    • Each column represents a variable, measure, or characteristic
  • It is an optimized version of plyr
  • The dplyr functions have a similar format
    • The first argument is a data frame
    • the subsequent arguments describe what to do with the first argument. 
    • You can refer to columns in the data frame directly without using $ operator
    • the result is a new data frame
  • The dplyr verbs
    • select()
      • returns a subset of the columns of a data frame
    • filter
      • extracts a subset of rows from a data frame based on logical conditions
    • summarize
      • generates summary statistics of different variables in the data frame.
    • mutate
      • add new variables/columns or transform existing columns
    • arrange 
      • reorders rows of a data frame
      • preserves the order of the columns
    • rename
      • rename variables in a data frame
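  • A minimal sketch chaining the verbs above (the chicago data frame and its columns are illustrative, not from the notes):

                library(dplyr)
                chicago %>%
                  select(city, pm25, temp, date) %>%               # keep a subset of columns
                  filter(pm25 > 30) %>%                            # keep rows meeting a logical condition
                  mutate(tempC = (temp - 32) / 1.8) %>%            # add a derived column
                  arrange(desc(date)) %>%                          # reorder rows
                  summarize(mean_pm25 = mean(pm25, na.rm = TRUE))  # collapse to a summary statistic
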
Merging data in R
  • Similar to linking as done in RDBMSs
  • merge() can be used to merge 2 dataframes x and y.
    • merge() has the following parameters which are used to specify which variable to use for the merge
      • x/y/by/by.x/by.y/all
    • By default it merges by the columns with a common name

                         merge(@DataFrameX, @DataFrameY, [by.x="@ColumnInXThatMatchesY", by.y="@ColumnInYThatMatchesX"]) 

      • if by.x and by.y aren’t specified, it merges on all columns that have the same name. If the values in the two data sets could differ for a same-named variable, use by.x and by.y to specify the columns to join on.
      • all=TRUE means if a value appears in one of the datasets but not the other, it should include the value with NA for the variables in the other dataset
  • join()
    • in the plyr package
    • default is a left join.

                         join(@DataFrame1, @DataFrame2)

      • It joins columns with the same name.
    • join_all allows you join multiple dataframes.

                         list(@DataFrame1, @DataFrame2, @DataFrame3) %>% join_all

      • First combine the dataframes into a list
      • Then perform join_all on the list
      • All the dataframes need a column with the same name to join on
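  • A minimal sketch of both approaches (the reviews/solutions data frames and their columns are made up for illustration; join() needs the plyr package):

                          library(plyr)
                          reviews <- data.frame(id = 1:3, solution_id = c(3, 2, 1))
                          solutions <- data.frame(id = 1:3, score = c(10, 9, 8))
                          merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE)
                          join(reviews, solutions)    # left join on the common column "id"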
Editing text variables
  • Manipulating text programmatically 
  • To see the column names of a dataframe,

               names(@Dataframe)

    • use tolower(names(@Dataframe)) to convert them to lower case.
  • To split a string when a character is encountered, 

               strsplit(@Data, "@CharacterToSplitOn")

    • If @CharacterToSplitOn is a special character, precede it with \\ e.g. "\\.".
    • strsplit() returns a list
    • You can use sapply to get the first elements from the list
      • Create a function that gets the first character

                                    firstElement <- function(x){ x[1] }

      • Use sapply to get the first elements from the list by applying a function to every element of the list.

                                   sapply(@list, firstElement)
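
      • e.g. a quick sketch using the function above (the input string is made up):

                                    parts <- strsplit("solutions.id", "\\.")   # list(c("solutions", "id"))
                                    sapply(parts, firstElement)                # "solutions"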

  • To substitute out characters in a vector for others
    • To replace the first appearance of the character
      • use sub()

                                   sub("@CharacterToRemove", "@CharacterToPut", @Data)

    • To replace multiple appearances of the character
      • use gsub()

                                   gsub("@CharacterToRemove", "@CharacterToPut", @Data)

  • To search for specific values in variables
    • use grep()

                          grep("@StringToSearchFor", @Data)

      • It returns the indices of the elements that contain the string
      • to return the actual observation, use the option value=TRUE

                                    grep("@StringToSearchFor", @Data, value = TRUE)

      • to see the number of elements that satisfy the function, use length(@grepCommand)
    • use grepl()

                          grepl("@StringToSearchFor", @Data)

      • This returns a logical vector with true for the observations with the string and false for observations without the string
      • You can use this to subset only the rows where the string appears in a given column.

                                    @Data[grepl("@StringToSearchFor", @Data$@Column), ]

  • The stringr package contains a lot of string manipulation functions
    • nchar(@Data)
      • To get the number of characters in a variable
    • substr(@Data, @firstPosition, @LastPosition)
      • This returns a string of the characters from the first to the last position
    • paste(@firstString, @lastString)
      • This joins 2 strings together and puts a space in between
    • paste0(@firstString, @lastString)
      • Joins the 2 strings together without a separating space
    • str_trim(@Data)
      • Removes whitespace from the start and end of a string
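    • A quick sketch of these (the example strings are for illustration; str_trim() needs library(stringr)):

                nchar("Jeffrey Leek")           # 12
                substr("Jeffrey Leek", 1, 7)    # "Jeffrey"
                paste("Jeffrey", "Leek")        # "Jeffrey Leek"
                paste0("Jeffrey", "Leek")       # "JeffreyLeek"
                str_trim("Jeff   ")             # "Jeff"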
  • Characteristics of text in data sets
    • Names of variables should be
      • All lower case when possible
      • Descriptive
      • Not duplicated
      • Free of underscores, dots or white spaces
    • variables with character values
      • should be made into factor variables(depends on application)
      • should be descriptive(use TRUE/FALSE instead of 0/1)
Regular Expressions
  • They are often used with grep, grepl, sub,gsub and other string manipulation functions
  • Regular expressions can be viewed as a combination of literals and metacharacters
    • Metacharacters provide a way to represent
      • whitespace and word boundaries
      • sets of literals
      • the beginning and end of a line
        • ^ represents the start of a line. 

                                             ^@String checks for observations with @String at the beginning of the line

        • $ represents the end of a line

                                             @String$ checks for observations with @String at the end of the line

      • alternatives("war" or "peace")
  • Character classes with []
    • To match a sequence of characters regardless of their cases, enclose the upper and lower cases in [] for each letter
      • [Bb][Uu][Ss][Hh] matches bush regardless of whether each of its letters is upper or lower case.
    • You can specify a range of characters i.e numbers or letters
      • to specify numbers 0 to 9, use [0-9]
      • to specify small case a to z, use [a-z] 
      • to specify capital case a to z, use [A-Z]
        • to specify either small or caps [a-zA-Z]
    • To match characters that do not satisfy a condition, use [^@C]
      • e.g. [^?.]$ matches lines that do not end with ? or .
    • You can combine e.g. 
      • ^[Ii] matches only observations whose first character is I or i.
      • ^[0-9][a-zA-Z] matches only observations whose first character is a number and whose second is a letter of either case.
  • MetaCharacters
    • "."
      • The . means any character/ wild character
        • e.g. 9.11 matches 9A11, 9-11, 9.11, 9911.
    • "|"
      • | means or. It can be used to combine expressions that serve as alternatives to each other
        • e.g. food|water matches either food or water.
        • flood|earthquake|hurricane|wildfire matches either of any of the elements.
      • ^[Gg]ood|[Bb]ad matches Good/good at the beginning of the observation or Bad/bad anywhere in the observation
      • ^([Gg]ood|[Bb]ad) matches Good/good or Bad/bad at the beginning of the observation.
    • "?"
      • This indicates that the expression is optional
        • e.g. [Gg]eorge( [Ww]\.)? [Bb]ush matches "george bush", "george W. Bush"
    • "*" and "+"
      • They’re used to indicate repetition
      • "*"
        • This means the character can repeat any number of times including not appearing at all
          • e.g. (.*) matches both elements that don’t have any characters between () as well as those that have characters there
            • it matches () or (a) or (asdfas)
        • It matches the longest possible string that satisfies the regular expression.
          • e.g. ^s(.*)s returns a string containing the characters between the first and last s in an observation. 
          • To prevent greediness, use *? instead.
      • "+"
        • This means the character can repeat any number of times but must appear once.
          • e.g. [0-9]+(.*)[0-9]+ matches at least one number, followed by any number of characters, followed by at least one number.
    • "{@minimumNumberOfRepetition[, @maximumNumberOfRepetition]}"
      • They are interval qualifiers and are used to specify the minimum and maximum number of matches of an expression.
      • specifying {@minimumNumberOfRepetition} means exactly @minimumNumberOfRepetition times
      • specifying {@minimumNumberOfRepetition,} means at least @minimumNumberOfRepetition times
      • e.g. ( +[^ ]+ +){1,5} means space, not-space, space, repeated between one and five times.
        • e.g. " Apple Camera food " 
      • e.g. [Bb]ush( +[^ ]+ +){1,5} debate matches Bush/bush followed by between one and five words and then debate
    • "()"
      • Used to remember text matched by the subexpression enclosed
        • we refer to the matched text as \1,\2
        • E.g.  +([a-zA-Z]+) +\1 + matches a repeated word; \1 refers back to the word captured by the parentheses.
  • To search for a meta character, escape it with "\"
    • e.g. to search for ".", use "\."
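  • A short sketch of using these patterns with grepl() and gsub() (the character vector is made up):

                x <- c("good morning", "Good day", "bad night")
                grepl("^[Gg]ood|[Bb]ad", x)    # TRUE TRUE TRUE
                gsub("o{2}", "OO", x)          # "gOOd morning" "GOOd day" "bad night"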
Working with dates
  • date() is used to return the date and time. It’s of class character
  • Sys.Date() returns the date. It’s of date class
  • formatting dates
    • use format(@dateObject , "@FormatString")
      • The Format string values are 
        • %d is day as a number (01-31)
        • %a is abbreviated weekday (Mon-Sun)
        • %A is unabbreviated weekday (Monday to Sunday)
        • %m is month (01-12)
        • %b is abbreviated month (Jan-Dec)
        • %B is unabbreviated month(January to December)
        • %y is 2 digit year
        • %Y is four digit year
    • e.g. format(@DateObject, "%d %m %y")
  • as.Date()
    • used to convert a character string to a date.

                         as.Date(@dateObject, "@DateFormatOfCharString")

      • @DateFormatOfCharString tells as.Date what format to expect. e.g. for "1Feb1999", @DateFormatOfCharString is "%d%B%Y"
  • You can find details about a day
    • weekdays(@dateObject) shows you what day of the week it is (Monday – Sunday)
    • months(@dateObject) shows you what month it is (January – December)
  • you can find difference between days if they’re of date class

                @DateObject1 - @DateObject2

  • Lubridate
    • A package for handling date objects
    • You can use it to convert a number to a date. 
      • it parses the number according to the order encoded in the function name
      • use the function whose name matches the format of the variable:
        • ymd(@Number)
        • mdy(@Number)
        • dmy(@Number)
        • ymd_hms(@Number)
      • You can specify timezone using the tz option

                                   ymd_hms(@Number, tz="@timeZone")

        • @timeZones can be checked using ?Sys.timezone
    • Use wday(@DateObject) to get the weekday
      • This returns the number of the day in the week.
      • use the option label = TRUE to return the abbreviated weekday name instead.
    • Dates and times are preferred in the Date, POSIXct or POSIXlt classes.
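  • A minimal sketch of these functions (the dates are arbitrary; the last two lines need the lubridate package, and weekday names depend on the locale):

                d <- as.Date("01/02/1999", "%d/%m/%Y")
                format(d, "%d %m %y")                  # "01 02 99"
                weekdays(d)                            # "Monday"
                library(lubridate)
                ymd("20240115")                        # parses to the Date 2024-01-15
                wday(ymd("20240115"), label = TRUE)    # Mon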
  • Data resources
    • Open Government sites
    • Some APIs with R interfaces

Swirl
  • Manipulating Data with Dplyr
    • You need to install and load the dplyr package
    • In dplyr, when selecting columns, you don’t need the @Data$@Column format to select @Column; just use @Column
    • dplyr allows you to work with tabular data from a variety of sources
    • To work with data in dplyr, load the data into a tbl_df(data frame tbl)

                         tbl_df(@Data)

      • Data of class tbl_df prints better than data of other classes.
    • dplyr supplies 5 verbs that cover most data manipulation tasks
      • select()
        • Select keeps only the columns you specify
        • To select Columns @Column1, @Column2 from tbl_df @data

                                             select(@Data, @Column1, @Column2…)

          • To select a sequence of columns, use the : operator between the first and last columns

                                                        select(@Data, @FirstColumn:@LastColumn)

          • To remove specific columns from @data, use -@ColumnToRemove

                                                       select(@Data, -@ColumnToRemove)

          • To remove a sequence of columns 

                                                        select(@Data, -(@FirstColumn:@LastColumn))

      • filter()
        • filter keeps only rows you specify
        • To filter rows where @column == @value

                                             filter(@Data, @column1 == @value[,@column2 ==@value…])

          • To filter on multiple AND conditions, separate the conditions with ,
          • To filter with an OR condition, separate the conditions with |.

                                                       filter(@Data,@column1==@value| @column2==@value)

        • You can filter on the value of a column being NA using the is.na().
          • To filter only where the column is not NA, use !is.na(@column).
      • arrange()
        • This allows you sort a tbl_df based on the values in a column or multiple columns in ascending order

                                             arrange(@Data, @Column1ToSortBy[,@Column2ToSortBy])

          • To sort in descending order, use desc(@ColumnToSortBy)

                                                       arrange(@Data, desc(@Column1ToSortBy))

      • mutate()
        • Allows you create a computed column based on the value of an existing column in a tbl_df and appends it as the last column.

                                             mutate(@Data, @NewColumn1 = @ExistingColumn @Operator @value)

        • You can compute multiple columns in a single statement and reference a column computed in the same statement

                                             mutate(@Data, @NewColumn1=@ExistingColumn1+1, @NewColumn2 = @NewColumn1+8)

      • summarize()
        • Collapses the dataset into a single row.
        • Can be used to give you a value based on groups in a dataset.
  • Grouping and chaining data with dplyr
    • Grouping allows the breaking up of datasets based on the values of one or more variables.
      • Done using group_by()
        • Used to group a tbl_df by one or more variables
                                   group_by(@Data, @GroupingColumn)
          • In the result set, all operations are applied on a per group basis.
            • Performing summarize() with mean() gives the mean per group                                                       
        • n() is used to count the number of observations in a group
        • n_distinct() is used to count the number of unique observations in a group
                                   summarize(@GroupedTbl, @count=n(), @Unique=n_distinct(@column))
      • To find the value that can be used to split data into a particular percentage, use quantile().
        • The value is called @percentage quantile.
          • For the 99% quantile
                                             quantile(@Data$@Column, probs= 0.99)
      • To show all the rows in a tbl_df, use View(@Tbl_df)
    • Chaining allows you string multiple function calls in a way that is compact and readable.
      • %>% is the operator used to chain functions.
        • The value before the operator is used as the input for the function on the right. 
      • This allows us work without saving the results of all the operations for each step.
      • E.g. To select columns from a tbl_df and print,

                                   @tbl_df %>% select(@Column1, @Column2, @Column3)%>%print
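
      • A minimal sketch of grouping combined with chaining (assumes dplyr and the built-in mtcars data set):

                                    mtcars %>%
                                         group_by(cyl) %>%
                                         summarize(count = n(), avg_mpg = mean(mpg)) %>%
                                         arrange(desc(avg_mpg))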

  • Tidying Data with tidyr
    • Tidy data is formatted in a standard way that facilitates exploration and analysis.
    • Tidy data must satisfy 3 conditions
      • Each variable forms a column
      • Each observation forms a row
      • Each type of observational unit forms a table
    • Messy data has one or more of the following characteristics
      • Column headers are values, not variable names
        • This problem is solved using gather()
          • Gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed
          • Used when you notice you have columns that are not variables.

                                                       gather(@Data, @Key, @Value, -@ColumnToExclude)

            • @key is the name of the output column that will store a variable that can have multiple values. e.g. sex variable with values Male and Female
            • @value is the Value that stores the measure of the key in the observation.
            • @column to exclude is the column in @Data that is already stored properly
            • If there are NA values, use the option na.rm = TRUE to remove them while doing gather()
            • to gather a sequence of columns, use : between the first and last columns
      • multiple variables are stored in one column
        • This problem is solved using separate()
          • This uses a regex or a vector of character positions to turn a single char column into multiple columns

                                                       separate(@Data, col=@ColumnToSplit, into=@vectorofColsToSplitInto, sep=@valueToUseToSplit)

            • @vectorofColsToSplitInto is a vector that gives the names of the variables to split @ColumnToSplit into.
            • @valueToUseToSplit contains the value in @ColumnToSplit to use to split it into the destination number of variables. By default, it splits by any character that is not alphanumeric.
      • Variables are stored in both rows and columns
        • This requires use of gather() and separate()
        • Also involves the use of spread()
          • Use to spread a key-value pair across multiple columns.

                                                        spread(@Data, @ColumnWithVariable, @ColumnWithValue)

            • Each distinct value of @ColumnWithVariable becomes a new column. It is the key in this relationship.
            • @ColumnWithValue supplies the values used to fill each of the columns created. It is the value in this relationship.
      • a single observational unit is stored in multiple tables
        • To solve this problem, you need to merge both datasets.
          • First add an additional column to the datasets to signify their difference.

                                                       @Dataset1<- mutate(@Dataset1, @ColumnToDistinguish = @value)

                                                       @Dataset2<- mutate(@Dataset2, @ColumnToDistinguish = @value)

          • Use bind_rows() to join the 2 datasets

                                                        bind_rows(@Dataset1, @Dataset2)

      • multiple types of observational units are stored in the same table
        • We can separate out values that occur multiple times into their own table, as in a relational database, and look up the other values using a unique identifier.
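    • A minimal sketch of gather() and separate() on a made-up data frame (tidyr assumed; the column names are purely illustrative):

                          library(tidyr)
                          grades <- data.frame(id = 1:2, male_1 = c(5, 3), female_1 = c(4, 6))
                          long <- gather(grades, key, count, -id)                    # key holds the old column names
                          separate(long, key, into = c("sex", "class"), sep = "_")   # splits "male_1" into sex and class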
  • To extract a number from a character string, use extract_numeric()

                extract_numeric("Value5") returns 5

  • To aggregate a data frame

                ag <- aggregate(. ~ @GroupingColumn, @Data, function(x) c(mean = mean(x), sd = sd(x)))

    • this calculates the mean and sd of every other variable in the data frame, grouped by @GroupingColumn

Coursera: An Introduction to R (My Course Notes)

Setting up your Working Directory(WD)
  • To get your WD, use getwd()
  • To list the files in your WD, use dir() or list.files().
  • To change your WD, use setwd("@Directory") or your GUI's menus.
  • R scripts can be created in any editor.
    • To import to R
      • Copy the text and paste in the Console
      • Create the RScript and import it to the session
        • Save the script as a .r file in the WD.
        • Use source(@.rFile) to import the file to the session.
    • To see the objects in a session, use ls()
  • To see the contents of the WD, use dir() or list.files()
R Input and Evaluation
  • Entering Input
    • We type expressions at the R prompt.
    • <- is the assignment operator
    • # indicates a comment. Anything after # is ignored
  • Evaluation
    • This occurs when you press enter after an expression and causes the result of the evaluation to be displayed.
    • To print the contents of a vector
      • Auto-printing: Enter the name of the vector and its contents are shown on the screen
      • use print(@vector)
    • When a vector is printed, the value in [@numberOfChar] at the beginning of the line shows the ordinal position of the first value on that line in the vector.
R Data types and operations
  • R has 5 basic classes of objects
    • Character
    • numeric(real numbers)
    • integers
    • complex: 1+4i
    • logical: TRUE/FALSE
  • The most basic object is a vector
    • It contains objects of the same class
    • Create an empty vector with vector(); otherwise use c() with the assignment operator (<-).
  • A list is a vector that can have objects of multiple classes.
  • Numbers
    • Numbers are numeric objects(double precision real numbers).
    • If you want to input an integer, you must specify by putting L after the value
      • x <- 1L
    • The number Inf represents infinity. It can be used in calculations
    • The value NaN represents an undefined value (not a number).
      • Also used to represent a missing value
  • Attributes
    • Every object has attributes
      • names/ dimnames
      • Dimensions(e.g. matrices, arrays)
      • class
      • length
      • other user-defined attributes
    • The attributes of an object can be accessed using attributes()
R data types( vectors and lists)
  • Creating Vectors
    • c() can be used.
      • c() concatenates
        • e.g. c(@value1, @value2)
      • A shortcut for inputting a series of numbers is @firstnumber:@lastnumber
        • e.g. @vector <- 9:19
    • vector()
      • Used to create an empty vector
      • Has the format vector(@Class, length = @LengthOfVector)
      • The values are initialised to the default for the class (0 for numeric, FALSE for logical, "" for character).
  • Creating a vector with mixed objects
    • R coerces so every element of the vector is of the same class.
    • Coercion follows the order logical → numeric → character; the class that can represent every element wins.
      • TRUE becomes 1 and FALSE becomes 0 when coerced to numeric.
    • Explicit coercion
      • as.@destinationClass() is used to convert an object to a specific class.
      • Nonsensical coercion results in NAs
        • e.g. coercing a vector of characters to numeric.
        • This results in a warning
  • Elements of a vector have [@Element] format
  • Lists
    • A vector that can contain elements of different classes
      • G <- list(1, "a", TRUE, 1+4i)
    • Elements of a list have [[@Element]] format
    • Created using list()
    • The elements of a list can be named when they’re populated
      • list(@ElementName1 = @Element1, @ElementName2 = @Element2, @ElementName3 = @Element3)
Matrices
  • They’re vectors with a dimension attribute.
    • The dimension attribute is a vector of length 2 (nrow, ncol)
    • They’re created using
      • matrix()
        • matrix(nrow = @NumberOfRows, ncol = @NumberOfColumns): This creates an empty matrix
        • matrix(@listOfValues, nrow = @NumberOfRows, ncol = @NumberOfColumns)
        • e.g. matrix(1:6, nrow = 2, ncol = 3)
      • Transforming a vector using dim()
        • dim(@vectorToBeConverted) <- c(@NumberOfRows, @NumberOfColumns)
      • Column binding and row binding
        • done using cbind() or rbind()
          • cbind(@VectorToBeCol1, @VectorToBeCol2 ):
            • The vectors are converted to columns
          • rbind(@VectorToBeRow1, @VectorToBeRow2):
            • The vectors are converted to rows
    • dim(@matrix) is used to show the dimensions of a matrix
  • Every element must be of the same class
  • Matrices are constructed column-wise
    • Entries start in the upper-left and go down the column. When the column is filled, they start on the next column
  • Matrices can be named using dimnames()
    • dimnames(@matrix) <- list(c(@Row1Name, @Row2Name), c(@Column1Name, @Column2Name, @Column3Name))
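  • A quick sketch (the values are arbitrary):

                m <- matrix(1:6, nrow = 2, ncol = 3)    # filled column-wise
                dim(m)                                  # 2 3
                dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
                cbind(1:2, 3:4)                         # binds the two vectors as columns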
Factors
  • They’re vectors used to represent categorical data.
    • Ordered factors: Factors where their values have orders: e.g. small, medium, large
    • Unordered factors: Factors where their values do not have orders e.g. Male and Female
  • They’re treated specially by modeling functions like lm() and glm()
  • Factors are best represented with labels instead of values
    • E.g. use Male and Female instead of 1 and 2
  • Created using
    • factor(c(@value1, @value2, @value3))
  • Functions
    • Levels(@Factor) is used to show the distinct values of the factor
    • table(@Factor) is used to show the count of the distinct values of the factor
    • unclass(@Factor) is used to show the underlying integer codes of the factor's values. Every value belonging to the same level gets the same integer code.
  • A factor is basically an integer vector with a levels attribute.
  • The order of the levels can be set using the levels argument to factor()
    • This is important because in linear modeling the first level is used as the baseline level. If levels aren't specified, they're ordered alphabetically and the alphabetically first level becomes the baseline.
    • factor(c(@value1, @value2, @value3), levels = c(@value1, @value2, @value3))
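  • A quick sketch (values chosen arbitrarily):

                f <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
                table(f)       # yes: 2, no: 1
                unclass(f)     # 1 2 1 with a levels attribute of "yes" "no"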
Missing values
  • Denoted by either NA or NaN( For undefined mathematical operation)
  • is.na() is used to test objects to see if they’re NA or NaN
  • is.nan() is used to test for NaN
  • NA values also have a class
    • They can be integer NA, character NA etc
  • A NaN value is also NA, but the converse is not true (an NA value is not necessarily NaN)
Data Frames
  • Used to store tabular data
  • Represented as a special type of list where every element has the same length
  • Each element of the list can be thought of as the column, and the length of an element shows the number of rows
  • Unlike a matrix, data frames can store a different class of objects in each column
  • They have a special attribute row.names
    • It is used to annotate the data.
  • Created using
    • read.table()
    • read.csv()
    • data.frame(@FirstElement = @Values, @SecondElement = @Values)
  • A data frame can be converted to a matrix using data.matrix()
    • Implicit conversion is applied to convert the values to the same class
Names
  • Objects in R can have names
    • Useful for writing self-describing objects and readable code
  • created by populating names(@object)
    • names(@object) <- c(@element1Name, @element2Name)
  • names(@object) prints the names of the elements of the object
Reading and writing Data in R
For reading Data into R
  • read.table: Read text files that contains data in rows and columns and return a data frame
    • For small and moderate data sizes, it's usually called with only the file parameter; the other parameters assume their default values.
    • For large data sets,
      • The size of the data frame mustn't exceed the size of your system's memory
      • Some parameters must be specified as they help read operations perform faster and more efficiently.
        • nrows
          • Enable R allocate memory better
        • colClasses
          • This prevents R from scanning every column to determine their data type
          • An easy method to get this is to read in the first 100 rows, let R calculate the classes for those, save them to a vector, and then use the vector as colClasses
                                                        Data <- read.table("@File", nrows = 100)
                                                        classes <- sapply(Data, class)
                                                        DataFrame <- read.table("@File", colClasses = classes)
  • You can calculate the memory requirements for a data frame
    • using the formula nrows *( (@ncolsOfColClassA * @SizeOfColClassA) +  (@ncolsOfColClassB * @SizeOfColClassB))
      • This gives the size in bytes
        • To convert to MB, divide by 2^20
          • To convert MB to GB, divide by 2^10
      • You need 2 times the amount of memory as the size of the data frame to import a data set
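      • A rough worked example (the numbers are made up for illustration): a data frame with 1,500,000 rows and 120 numeric columns (8 bytes each) needs about 1,500,000 * 120 * 8 = 1,440,000,000 bytes, roughly 1,373 MB (divide by 2^20) or about 1.34 GB (divide again by 2^10), so plan for roughly twice that (about 2.7 GB) of free memory.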
    • Has the following params
      • file: name of a file or a text-mode connection
      • header: logical param indicating if the file has a header line or not. if true, the first line is used to set variable Names
      • sep: string param indicating how the columns are separated
      • colClasses: character vector whose length equals the number of columns; indicates the class of each column in the dataset. default is for R to calculate.
      • nrows: the number of rows in the data set. default is for R to calculate it
      • comment.char: string param indicating the comment character. default is #
      • skip: the number of lines to skip from the beginning
      • stringsAsFactors: logical param stating whether variables should be coded as factors. The default is TRUE.
  • read.csv
    • same notes as read.table, except the default sep is "," and header = TRUE
  • scan: For reading matrices
  • readLines:
    • Reads a file and returns text
    • Can also be used to read a webpage
  • source: For reading R code e.g. functions
  • dget: For reading R codes for R objects that have been deparsed to text files
  • load: For reading in saved workspaces
  • unserialize: For reading single R objects in binary form.
For Writing Data from R
  • write.table
  • writeLines: used to write to a file
    • Each element is written to the file one line at a time
  • Dump and Dput
    • dput
      • see notes for dump.
      • Used to deparse a single R object.
        • The output of this is the code required to reconstruct the R object
          • dput(@vector, file = @destinationFile)
        • dget is used to read the data back.
          • dget(@OutputFileOfDput)
    • Dump
      • Used to deparse multiple R objects
        • dump(c("@vector1","@vector2"), file = @destinationFile)
        • source is used to read the data back
          • source(@OutputFileOfDump)
    • They’re text formats, so the output is editable
    • They preserve the metadata of the data set so it doesn’t have to be specified again
    • They’re not very space efficient
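    • A quick sketch of dput()/dget() (writes a small file named y.R in the working directory):

                y <- data.frame(a = 1, b = "a")
                dput(y, file = "y.R")
                new.y <- dget("y.R")    # reconstructs the data frame, including its metadata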
  • save
  • serialize
Connection
  • Used to interface external objects.
  • Data is read in using connection interfaces.
    • file: standard uncompressed file
      • accepts the following parameters
        • description: The name of the file
        • open: a code indicating what operations are allowed on the file
          • r : read only
          • w: writing
          • a: appending
          • rb, wb, ab: reading, writing or appending in binary mode
            • The binary modes only matter on Windows
    • gzfile: file of gzip compression
    • bzfile: file of bzip2 compression
    • url: webpage
  • A connection is usually opened implicitly, e.g. when read.csv() is given a file name
  • Connections have to be closed after they’ve been used
               con<- file(@FIle, @allowedOperations)
               close(con)
  • Connections can be used to read part of a file
                    con <- gzfile(@File)
                    x<- readLines(con, 10)
Subsetting
Subsetting operations extract subsets of R objects
  • There are 3 operators for performing subsetting
    • []
      • This returns an object of the same class as the original
      • can be used to select more than one element (the one exception: single-bracket subsetting of a matrix drops to a vector by default)
    • [[]]
      • is used to extract elements of a list or a data frame
      • Can only extract a single element
      • the class of the returned object will not necessarily be a list or a data frame
    • $
      • used to extract elements of a list or a data frame by name
      • similar to [[]]
  • There are several ways to perform subsetting
    • Use a numeric index
      • Enter the ordinal position of the element in the object to extract that element
        • @vector[3]
      • can be used to extract a sequence of objects
        • @vector[1:5]
    • Use a logical index
      • Enter a logical proposition to select elements in the object that satisfy the proposition
        • @vector[@vector > "a"]
        • @newLogicalVector <- @vector > "a"
          • The output of this is a logical vector that shows a logical element for all the elements of the object based on the proposition.
          • @vector[@newLogicalVector] returns an object of all the elements that satisfy the proposition
  • Subsetting lists
    • Reference list is x<- list(food=1:4, bar = 0.6, baz = c(3.14, 2.81))
    • [] returns a list
      • x[1] returns list
        • $food 1,2,3,4
      • x["elementName"] returns a list
      • Allows you return more than one element of the list
        • x[c(1,3)] returns food and baz
    • [[]] returns the element itself, without the enclosing list or its name
      • x[[1]] returns 1,2,3,4
      • x[["bar"]] returns an object of class numeric
      • Allows you work with computed index(where the name of the element in the object is the result of a computation)
        • e.g. if name <- "food"
          • x[[name]] returns the food object.
      • Can be used to subset nested elements of a list
        • x[[c(@Element, @Subelement)]]
          • E.g. x[[c(1,3)]] or x[[1]][[3]] returns 3( the third element of the first element)
    • $@elementName: returns the element itself without the name, like [[]]
      • x$food returns a numeric vector object
  • Subsetting Matrices
    • Reference matrix is x<- matrix(1:6,2,3) a 2 row by 3 column matrix
                         1     3     5
                         2     4     6
    • Format is @matrix[@row, @column]
    • To select the whole row use @matrix[@row,]
      • e.g. x[1,] returns 1,3,5
      • To select the whole column use @matrix [,@column]
        • e.g. x[,1] returns 1,2
    • The Drop parameter
      • The drop parameter drops the matrix orientation when a subset operation is carried out on a matrix.
        • Default is drop = TRUE
        • When a single element of a matrix is retrieved, it returns a vector of length 1 instead of a 1x1 matrix
          • x[1, 2, drop = FALSE] keeps the 1x1 matrix
        • When a single row or column is retrieved, it returns a vector, not a matrix
          • x[1, , drop = FALSE] keeps the row as a 1x3 matrix
  • Subsetting data Frames
    • To subset only observations(rows) 
      • use subset(@DataFrame, @column @operator @value)
        • E.g. subset(iris, Species == "virginica") #This returns all rows where Species is virginica
      • use the [] operator
        • @DataFrame[@column @operator @value, ]
        • E.g. iris[iris$Species == "virginica", ]
    • To subset only variables(columns)
      • @DataFrame[, @vectorOfColumnNumbersToReturn]
        • E.g. iris[, 1:3]
        • iris[, c(1,2)]
  • Subsetting with partial matching of names
    • Partial matching of name is allowed with [[]] and $
    • It allows you return objects with a name that start with the characters you supply and saves you typing in the command line
    • Reference object is x<- list(aardvark = 1:5)
    • $ looks for a name in the list that matches the element
      • x$a returns aardvark
    • [[]]
      • using the exact = FALSE parameter on [[]] allows it to carry out partial matching
      • x[["a", exact = FALSE]] returns aardvark
        • exact = TRUE by default
  • Removing NA values
    • The most common way is to create a logical vector showing which elements are NA
      • use is.na() on the object
      • return the non NA objects
        • e.g.
                                                  x<-c(1,2,NA,4,NA,5)
                                                   bad <- is.na(x) #bad is a logical vector F,F,T,F,T,F
                                                  x[!bad] #returns 1,2,4,5
                                    good <- complete.cases(@DataFrame)
                                    @DataFrame[good, ][1:6, ] # keeps only the complete rows (and shows the first 6)
Vectorized operations
  • Operations are vectorized to make coding more efficient and easier to read
    • Carrying out operation on vectors of the same length
      • Elements at the same position in both vectors interact based on the operation applied to the vector
        • E.g.
                                             x<- 1:4; y<-6:9
                                             x+y # returns 7     8       9      10
                                             x*y # returns 6     14     24     36
  • Carrying out comparison operations
    • Compares all the vectors returning a logical vector
      • E.g.
                                             x<- 1:4; y<-6:9
                                              x>2 # returns a logical vector F     F     T     T
  • Carrying out operations on vectors of different lengths
    • The shorter vector is recycled (its elements are reused in order) until it matches the length of the longer one
  • Vectorized matrix operations
    • Reference is x<- matrix(1:4,2,2); y<-matrix(rep(10,4),2,2)
    • x*y carries out element-wise multiplication
                         x*y #multiplies all elements in the same position together and returns a matrix
  • To carry out matrix multiplication
    • x %*% y
seq_along(): takes a vector as input and creates an integer sequence equal to the length of that vector
seq_len(): takes an integer and creates a sequence from 1 to the value of the integer
Control Structures in R
  • Used to control the flow of execution of a program, depending on runtime conditions
  • The types of control structures are
    • If…else: Allows you to test for a condition
      • If(@Condition) { Do something} else if(@Condition) {Do something} else{ Do something}
        • E.G.
                                   y <- if(x>3){10} else {0}
    • for: execute a loop a fixed number of times
      • Takes an iterator variable and assigns it successive values from a sequence or vector.
      • For loops can be nested.
    • while: Execute a loop while a condition is true
      • Begin by testing a condition. If it is true, then execute the loop body. Once the loop body is executed, test the condition again.
      • Can execute infinitely if the condition never becomes false.
      • conditions are evaluated from left to right
    • repeat: execute an infinite loop
      • Initiates an infinite loop. Can only be exited with break
    • break: break the execution of a loop
    • next: skip an iteration of a loop
    • return: exit a function
      • Signals that a function should exit and return a given value.
  • The above are primarily for writing programs, for interactive work, use the *apply functions
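  • A minimal sketch of a for and a while loop:

                for(i in seq_len(5)) {
                     print(i)              # prints 1 through 5
                }
                count <- 0
                while(count < 3) {
                     count <- count + 1    # stops once count reaches 3
                }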
Writing functions
  • Functions are usually prefaced with the keyword function
               function(@Argument1, @Argument2)
               { @Argument1 @Operator @Argument2 }
  • Functions are R objects
    • Functions can be passed as arguments to other functions
    • Functions can be nested.
  • Functions have named arguments which potentially have default values.
    • Formal arguments are the arguments included in the function definition
      • formals() returns a list of all the formal arguments of a function
    • Function arguments can be missing or might have default values
  • Function arguments can be matched positionally or by name
    • You can mix positional matching with matching by name.
      • When an argument is matched by name, it is taken out of the argument list and the remaining unnamed arguments are matched in the order they are listed in the function definition
  • Function arguments can be partially matched( only first few characters are used to match the arguments)
    • The order of operations when given an argument is
      • Check for exact match for a named argument
      • Check for a partial match
      • Check for a positional match
  • Arguments value can be set to NULL
  • Arguments in R are evaluated lazily
    • They are evaluated only as needed. If the argument is never used, its presence is never tested.
    • No error is thrown until the argument is actually evaluated and a problem is found.
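    • e.g. a quick sketch of lazy evaluation:

                          f <- function(a, b) {
                               a^2    # b is never used in the body
                          }
                          f(2)        # returns 4; no error, because the missing b is never evaluated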
  • The … Argument
    • Used to indicate a variable number of arguments that are usually passed on to other arguments
    • used when extending another function so you don’t copy the entire argument list of the original function
                          myplot <- function(x, y, type = "l", ...){
                               plot(x, y, type = type, ...)} ##This passes the remaining argument list of plot through myplot
  • Generic functions use … so that extra arguments can be passed to methods.
    • Necessary when the number of arguments passed to the function cannot be known in advance
      • E.g see args(paste), args(cat)
    • Any arguments that appear after … on the argument list must be named explicitly and cannot be positionally or partially matched.
Scoping rules
  • How R binds values to symbols
  • R searches through a series of environments(list of objects/symbols and values) to find the appropriate value.
    • In command line, when attempting to retrieve the value of an R object, the order is
      • Search the global environment for a symbol name matching the one requested
      • Search the namespaces of each of the packages on the search list(see search())
    • The global environment or the user’s workspace is always the first element of the search list and the base package is always the last( with regards to namespaces, see search()).
      • The order of packages on the search list matters
      • Users can configure which packages get loaded on startup so you cannot assume a set of packages are available
      • When a user loads a package with library, the package is moved to position 2 in search() and everything gets shifted down.
      • R has separate namespaces for functions and non-functions.
        • you can have an object and function named the same.
  • Scoping Rules
    • They’re the main difference between S and R
    • They determine how a value is associated with a free variable in a function
    • R uses lexical/static scoping
      • The scoping rules of a language determine how values are assigned to free variables.
        • Free variables are not formal arguments and are not local variables( assigned inside the function body).
      • E.g.
                                   f<-function(x,y)
                                   {
                                        g <-9 ##g is a local variable
                                        g*(x^2) + y/z ##x and y are formal arguments, z is a free variable. The scoping rules state how we assign a value to z.
                                   }
    • Lexical scoping states that the values of free variables are searched for in the environment in which the function was defined.
      • An environment is a collection of symbol,value pairs
      • An environment has a parent environment which can have multiple children. The only exception is an empty environment
      • A function + an environment = a closure or function closure.
      • To search for the value of a free variable
        • If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
        • The search continues down the sequence of parent environments until we hit the top-level environment; this is usually the global environment (workspace) or the namespace of a package.
        • After the top-level environment, the search continues down the search list until the empty environment is reached; if the symbol still hasn't been found, an error is thrown
      • Typically a function is defined in the global environment, so that the values of free variables are just found in the user’s workspace
        • In R you can have functions inside other functions 
          • In this case, the environment in which a function is defined is the body of another function.
            • E.g. a function defined inside another function (see the sketch below)
              • n is a free variable with respect to pow, hence its value is found in the parent environment (the environment of the enclosing function).
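              • A minimal sketch of such a constructor (the make.power/cube names are assumptions based on the surrounding notes about pow and cube):

                                    make.power <- function(n) {
                                         pow <- function(x) {
                                              x^n    ## n is free in pow; its value comes from make.power's environment
                                         }
                                         pow
                                    }
                                    cube <- make.power(3)
                                    cube(2)                        ## 8
                                    ls(environment(cube))          ## "n"   "pow"
                                    get("n", environment(cube))    ## 3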

          • To see what's in a function's environment (the objects defined in it),
            • run ls(environment(function))
              • e.g. ls(environment(cube))
          • To get the closure environment,
            • run get("@variable", environment(function))
    • Difference between Lexical and dynamic scoping
      • In lexical scoping, the value of a variable is looked up in the environment in which the function was defined
      • In dynamic scoping, the value of a variable is looked up in the environment from which the function was called( calling environment). 
      • E.G.

                                   y<- 10

                                   f<- function(x)
                                   {
                                        y<-2
                                        y^2 + g(x)
                                   }
                                   g<- function(x)
                                   {
                                        x*y
                                   }
                                   
                                    #Using lexical scoping, the value of y used by g() is 10: y is looked up in the environment where g was defined (the global environment)
                                    #Using dynamic scoping, the value of y used by g() would be 2: y is looked up in the environment g was called from (the body of f)

    • Consequence of Lexical scoping
      • All objects must be stored in memory
      • All functions must carry a pointer to their respective defining environments, which could be anywhere
      • Makes it easy for objective functions to be built
        • Functions that carry around all their data and you’re sure the values used are always the same
Dates and times in R

  • Dates are represented by the Date class
    • Dates are stored internally as the number of days since 1970-01-01
    • Format is YYYY-MM-DD
    • Dates can be coerced from a character string using as.Date()
  • Times are represented by the POSIXct and POSIXlt class
    • Times are stored internally as the number of seconds since 1970-01-01
    • They keep track of leap year, seconds, daylight savings and time zones
    • POSIXct stores the time as a very large number; useful for storing times in a data frame
    • POSIXlt is a list underneath and stores other useful information e.g. day of week, day of year, month, day of month.
      • A component can be extracted using $.
        • @POSIXltVariable$sec extracts the value of seconds in the variable.
    • Times can be coerced from a character string using as.POSIXlt() or as.POSIXct()
    • Sys.time() is used to print the system time.
  • There are generic functions that work on dates and times
    • weekdays(): gives the day of the week
    • months(): give the month name
    • quarters(): gives the quarter number("Q1", "Q2","Q3","Q4")
  • unclass() can be used to show the content of date and time classed variables in R
    • Date class
    • POSIXct
      • the output is an integer showing the number of seconds since 1970-01-01
    • POSIXlt
      • The output is a named list showing the components of the variable (sec, min, hour, mday, mon, year, etc.)
  • strptime() is used to correct the format of a date/time variable if it’s written in a different format from YYYY-MM-DD
    • A format string is used to indicate the format the date is stored in.
    • To get the format string, get the format from ?strptime.
  • Operations on dates and times
    • You can perform mathematical operations + and –
    • You can perform comparison operations 
    • For time objects, they both have to be of the same class
Loop Functions
  • They are usually in the *apply format
    • lapply(): loop over a list and evaluate a function on each element of the list.
      • Always returns a list
      • implemented using
        • lapply(@list, @function, …)
          • if the first argument is not a list, it is coerced to one with as.list(); if that isn't possible, an error occurs.
        • To pass an argument to @function, specify them in the …
          • lapply(x, runif, min=0, max=10) ## this passes arguments min and max to runif.
    • sapply(): same as lapply except it simplifies the result
      • Simplifies the result of lapply if possible.
        • If the result is a list where every element is length 1, then a vector is returned
        • If the result is a list where every element is a vector of the same length(>1), a matrix is returned
        • if it can’t figure things out, a list is returned.
        • implemented using
          • sapply(@list, @function, ...)
    • apply(): apply a function over the margins of an array
      • Used to apply a function to the rows or columns of a matrix
      • Can be used with general arrays e.g. taking the average of an array of matrices
      • Implemented using
        • apply(@array, @marginToBeRetained, @function, ...)
          • @marginToBeRetained is a vector of the array dimensions to keep, taken from the dimensions of the array i.e. @array(@Dimension1, @Dimension2)
            • When @marginToBeRetained = 1, the function is applied across the second dimension (over each row) and a result with length(@Dimension1) elements is returned
            • When @marginToBeRetained = 2, the function is applied across the first dimension (over each column) and a result with length(@Dimension2) elements is returned
            • If the array has more than 2 dimensions, @marginToBeRetained can keep several dimensions, and an array with those dimensions is returned.
              • e.g. apply(x, c(1,2), mean) returns a 2-dimensional array (matrix).
      • Shortcuts already exist for calculating the sum and mean of arrays, and they operate much faster than apply()
        • rowSums, rowMeans, colSums, colMeans.
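      • e.g. a quick sketch:

                          x <- matrix(rnorm(200), 20, 10)
                          apply(x, 2, mean)    # column means: a vector of length 10
                          apply(x, 1, sum)     # row sums: a vector of length 20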
    • tapply(): apply a function over subsets of a vector
      • implemented using 
        • tapply(@Vector, @listOfFactors, @function, ..., simplify = TRUE)
        • length(@Vector) must equal length(@listOfFactors)
    • mapply(): multivariate version of lapply()
      • It applies a function in parallel over a set of arguments.
      • Can be used to apply a function to multiple lists.
      • implemented using 
        • mapply(@function, ..., MoreArgs = @MoreArgsToFunction, SIMPLIFY = TRUE)
          • … contains the arguments to apply over
          • the number of arguments that @function takes must be equal to the number of lists passed to mapply()
      • Used to vectorize a function 
        • Allows a function that doesn’t usually take vectors to do so.
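      • e.g. a quick sketch:

                          mapply(rep, 1:4, 4:1)    # same as list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))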
  • split(): used to split objects in groups determined by a factor or a list of factors
    • used in conjunction with lapply() and sapply()
    • implemented using
      • split(@vector, @ListOfFactors, drop = FALSE, ...)
      • split() breaks @vector into one group for each level (or combination of levels) of @ListOfFactors; drop = TRUE drops empty levels
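    • e.g. a quick sketch combining split() and sapply() (uses the built-in airquality data frame):

                          s <- split(airquality, airquality$Month)
                          sapply(s, function(x) colMeans(x[, c("Ozone", "Temp")], na.rm = TRUE))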
  • loop functions use anonymous functions
    • Anonymous functions are functions without names
      • implemented in format
        • @LoopFunction(@variable, @FunctionDefinition)
        • e.g. 
          • lapply(x, function(elt) elt[,1]) #This creates a function that accepts the argument elt and returns the first column of the argument
Debugging
  • For figuring out what’s wrong.
  • Indications that something is not right
    • message: A generic notification/diagnostic message produced by the message function; execution of the function continues
    • warning: An indication that something is wrong but not necessarily fatal; execution of the function continues; generated by the warning function
    • error: An indication that a fatal problem has occurred; execution stops; produced by the stop function
    • condition: Programmer created.
  • Debugging tools
    • traceback(): Prints out the function call stack after an error occurs; does nothing if there’s no error. 
    • debug
      • Flags a function for debug mode which allows you to step through execution of a function one line at a time
    • browser
      • suspends the execution of a function wherever it is called and puts the function in debug mode
    • trace: allows you to insert debugging code into a function at specific places
    • recover: allows you to modify the error behavior so that you can browse the function call stack.
  • str()
    • A diagnostic function and an alternative to summary().
    • Answers the question what’s in this object? in a compact format.
    • str means structure.

Simulation

  • Generating random numbers
    • Functions for probability distributions in R
      • rnorm(): generate random normal variates with a given mean and standard deviation
      • dnorm(): evaluate the normal probability density( with a given mean/SD) at a point(or vector of points)
      • pnorm(): evaluate the cumulative distribution function for a normal distribution
      • rpois(): generates random poisson variates with a given rate
    • probability distribution functions usually have four functions associated with them. Their prefix shows their use
      • d for density
      • r for random number generation
      • p for cumulative distribution
      • q for quantile function
    • set.seed(@integer) is used to ensure reproducibility
      • It is best to set this so the randomization is reproducible.
  • Generating random numbers from a linear model.
    • to generate binary (0/1) outcomes, use rbinom()
  • Generating random numbers from a generalized linear model
  • Random sampling
    • sample() allows you draw randomly from arbitrary distributions
      • implemented using sample(@vector, @NumberOfSamples, [replace = FALSE])
        • the replace argument controls whether sampling is done with or without replacement.
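      • e.g. a quick sketch:

                          set.seed(1)
                          sample(1:10, 4)                            # 4 values drawn without replacement
                          sample(c("H", "T"), 10, replace = TRUE)    # simulates 10 coin flips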
R profiler
  • Useful for understanding why an operation is taking time to complete
  • A way to examine how much time is spent in different parts of a program
  • Useful when trying to optimize code.
  • system.time()
    • Takes an R expression as input and returns the amount of time taken to evaluate the expression
      • returns an object of class proc_time that includes 
        • user: the time charged to the CPU(s) for this expression
        • elapsed: the wall-clock time the expression took to run.
    • Usually user time and elapsed time are close for straight computing tasks
      • elapsed time may be greater than user time if the CPU spends a lot of time waiting around
      • elapsed time may be smaller than user time if your machine has multiple cores and is capable of using them
        • multi-threaded libraries
        • parallel processing via the parallel package
    • implemented using system.time(@Expression)
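    • e.g. a quick sketch:

                          system.time({
                               x <- rnorm(1e6)
                               sum(x)
                          })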
  • Rprof()
    • Start the profiler in R
    • summaryRprof() summarizes the output from Rprof()
      • tabulates the output and calculates how much time is spent in which function
      • There are two methods for normalizing the data
        • by.total: divides the time spent in each function by the total run time
        • by.self: first subtracts the time spent by functions lower in the call stack, then divides by the total run time
          • shows the actual time spent per function
    • Do not use system.time() and Rprof() together 
    • Keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function.
      • default sampling interval is 0.02 seconds

Psychology of Human Misjudgement

Being aware of psychological ploys is not a perfect defense

  1. Reward and punishment superresponse tendency:
    • If you would persuade, appeal to interest and not to reason.
    • People’s actions will line up with what they’re rewarded for and away from what they’re punished for according to the rule of nature “repeat behavior that works”. It is extremely stupid to reward behavior that one doesn’t want.
    • Prompt rewards work better than delayed rewards in changing and maintaining the behavior.
    • The behavior is best reinforced with a random distribution of withdrawal of rewards( rewards must not be consistent).
    • Bad behavior is extremely habit-forming when it is rewarded
    • Granny’s rule (Do DAILY the unpleasant and necessary tasks first before rewarding yourself with the pleasant tasks) is an extreme example of reward superpower.
    • Consequences
      • Incentive Caused bias: Man is driven both consciously and subconsciously by incentives and will carry out whatever behavior is necessary to get that incentive. He’ll also rationalize his bad behavior towards getting that incentive.
    • Antidotes: For Consultants
      • Fear professional advice when it is especially good for the adviser
        • Learn and use the basic elements of your advisor’s trade as you deal with him
        • Double check, disbelieve, or replace much of what you’re told, to the degree that seems appropriate after objective thought
    • Antidotes: For employers
      • The use of sound accounting theory and practice. Accounting systems should accurately cater for operational reality. If you create an accounting system that encourages frauds, your employees will commit fraud because of their incentive caused bias.
      • The incentive structure shouldn’t intensely trigger incentive caused bias.
      • Prevent bad behavior from starting as it often becomes the case of “whose bread I eat, his song I sing”.
      • Tough internal audit systems
      • Severe public punishment for identified miscreants to ensure they do not occur anymore.
      • Misbehavior preventing routines and systems involving anti-gaming features are a necessity in system design “avoid rewarding people for what can be easily faked”.
  2. Liking/Loving Tendency:
    • Human beings are genetically programmed to like/love being liked and loved.
    • The phenomenon of liking and loving causing admiration also works in reverse. Admiration also causes or intensifies liking or love, often creating a vicious cycle.
    • A man constructed to love admirable persons and ideas with a special intensity has a huge advantage in life.
    • Consequences
      • It acts as a conditioning device that makes the liker or lover tend to
        • ignore faults of the object of his/her affection
        • favor people, products and actions associated with the object of his affection
        • distort other facts to facilitate love
  3. Disliking/Hating Tendency
    • The Opposite of the above.
    • Consequences:
      • It acts as a conditioning device that makes the disliker/hater tend to
        • Ignore virtues in the object of dislike
        • Dislike people, products and actions merely associated with the object of his dislike
        • Distort other facts to facilitate hatred.
      • Distortion of facts often make mediation between opponents locked in hatred impossible
  4. Doubt-avoidance Tendency:
    • Man is programmed with a tendency to quickly remove doubt by reaching some decision.
    • It is usually triggered by some combination of puzzlement and stress.
  5. Inconsistency-Avoidance Tendency
    • It is important not to put one’s brain in chains before one has come near his full potentiality as a rational person.
      • Man’s brain conserves space by being reluctant to change decisions we’ve earlier made.
      • This is evident in all habits(good and bad).
      • Works alongside the doubt-avoidance tendency to ensure hastily made conclusions and decisions are stuck to even when disconfirming evidence comes along.
      • This tendency also maintains previously made conclusions, loyalties, reputations, commitments etc
      • New ideas are often not accepted because they are inconsistent with old ideas already in place (ideas often brain-block their holders from accepting newer/better ideas).
      • Pretending to have an identity, habit or conclusion will cause it to prevail.
      • Consequences
        • A person making big sacrifices will intensify his devotion to the identity as it would be inconsistent to make big sacrifices for something that was not good.
        • Initiation ceremonies are often public in nature to intensify the commitment made. They’re often also tough to increase their costs to the individual.
      • Antidote:
        • “A drop of prevention is worth a river of cure”.
        • Prevent first-conclusion bias by listening to arguments from both sides and considering all counterarguments.
        • Prevent confirmation bias: Charles Darwin trained himself to intensively consider any evidence tending to disconfirm a hypothesis of his, especially the hypotheses he thought especially good.
        • Be prepared to destroy your own ideas.
  6. Curiosity Tendency
    • Man is programmed with a curiosity tendency.
    • Culture greatly increases the effectiveness of curiosity in advancing knowledge.
    • Curiosity provides fun and wisdom after formal education has ended.
    • Consequences
      • Curiosity enhanced by the best of modern education helps man prevent or reduce bad consequences arising from other psychological tendencies.
  7. Kantian Fairness Tendency
    • Kant’s categorical imperative requires that humans follow behavior patterns that, if followed by all others, would make the surrounding human system work best for everybody.
    • Consequences
      • Fair sharing is often expected, and when it doesn’t occur, hostility often results. e.g. At a bus stop, people form a queue (fair sharing) and are hostile to people who don’t join the queue (hostility).
  8. Envy/Jealousy Tendency
    • Triggered when something we desire is seen in the possession of another person.
    • The world is driven by envy not greed.
    • Kantian fairness tendency often contributes to it.
  9. Reciprocation tendency
    • The natural automatic tendency of man to reciprocate both favors and disfavors is extreme and often works subconsciously.
    • It facilitates group co-operation for the benefit of members and commercial trade.
    • Reciprocation tendency causes many extreme and dangerous consequences all the time.
    • Guilt is the conflict triggered in one direction by reciprocate-favor tendency and in the opposite direction by reward super-response tendency pushing one to enjoy one hundred percent of some good thing
    • Consequences
      • Hatred to a level causing brutal conduct as seen in wars.
      • Joins forces with inconsistency-avoidance tendency in causing 1. the fulfillment of promises made as part of a bargain 2. correct behavior expected from people in a role e.g. priests, doctors, engineers etc.
      • Gift giving by salesmen to purchasing agents of companies to make them carry out “reciprocatory” purchases.
      • Concessions by a requester cause the other party to make a “reciprocatory” concession. This is exploited when requesting something outlandish: a request for something even more outlandish is made first, and the requester then “concedes” to the merely outlandish request he intended all along.
      • Antidotes
        • Defer reaction
        • Do not accept, or let employees accept, any favors or gifts from vendors.
  10. Influence from mere association tendency
    • A conditioned reflex where mere association triggers a response.
    • Consequences
      • People associate a high priced item with its having higher quality. Companies will also raise the price of items to ensure this association takes place
      • Adverts associate products with things that have strong, peculiar effects on purchasers, e.g. power equipment being advertised with beautiful women.
      • Some of the most important miscalculations come from what is accidentally associated with one’s past success, or one’s liking and loving, or one’s disliking and hating. (e.g. Because you wore a cap on a day the ladies responded well to you, you attribute their reaction to the cap and wear a cap every time.)
      • Hating and disliking something because of a previous negative association.
      • Liking and loving because of a previous positive association.
      • Persian messenger syndrome: Killing the bearer of bad news. This often causes people to hide negative outcomes or let them fester instead of bringing them into the limelight for fast resolution.
      • It sometimes works against reciprocate favor tendency. A man in misfortune who gets help from someone fortunate may experience an envy driven dislike for the fortunate person. He then associates the favor giver with his misfortune and dislikes him
      • The use of stereotypes.
    • Antidotes
      • Carefully examine each past success looking for accidental, non-causative factors associated with such success that might mislead an appraisal of a new venture.
      • Look for dangerous aspects of a new venture that weren’t there when the past success occurred.
      • Objectively examine before making decisions.
      • Develop through exercise of will the habit of welcoming bad news: “Always tell us the bad news promptly; it is only the good news that can wait”. It also helps to be so wise and informed that people fear not telling you bad news because you are so unlikely to get it elsewhere.
      • To prevent stereotypes, one must learn that trend doesn’t always predict destiny, and that the average dimension of some group will not reliably guide him to the dimension of some specific item.
  11. Simple, pain-avoiding psychological denial
    • The reality is too painful to bear, so one distorts the facts until they become bearable. It usually has the worst outcomes when mixed with love, death and chemical dependency.
    • Consequences
      • Addicted people tend to believe they’re still in respectable condition with respectable prospects even as they deteriorate further.
    • Antidotes
      • Flee from all activities that can cause addiction
  12. Excessive Self-Regard Tendency
    • Man mostly mis-appraises himself on the high side.
    • Consequences
      • People appraise things they own higher than things they don’t own.
      • Man’s over-appraise-your-own-possessions phenomenon: “A man’s decisions are suddenly regarded by him as better than would have been the case just before he made them”, even when it is obvious to him that it was a bad decision.
      • Man’s preference for people like himself. This explains cliquishness. People in an organization often select people like themselves, hence fraudulent people select fraudulent people, making a problem resistant to change and progressively worse.
      • A higher proclivity to gamble when a person can pick his own number or can make complicated conclusions (e.g. football betting).
      • An over-appraisal of one’s talent/skill causing one to bet against higher skilled/talented players.
      • Bad hiring decisions as employers often over-appraise their own conclusions that rely on impressions in face-to-face contact. This often favors candidates who are marvelous presenters.
      • Man always makes excuses for his fixable poor performance instead of providing the fix. The worst criminals don’t appraise themselves as all that bad. They either believe
        • They didn’t commit their crimes
        • that considering the pressures and disadvantages of their lives, it is understandable and forgivable that they behaved as they did and became what they became.
      • Success from overconfidence that happens to cause success “never underestimate the man that overestimates himself”.
    • Antidotes
      • A willingness to “clean house” when one is in control of a group of cliquish problematic people.
      • Employers should under-weigh face to face contacts and over-weigh the applicant’s past record.
      • Be more objective when you are thinking about yourself, your family and friends, your property and the value of your past and future activity.
      • On a personal level, a man should face these simple facts
        • fixable but unfixed bad performance is bad character and tends to create more of itself causing more damage to the excuse giver with each tolerated instance
        • In demanding places, you will surely be discarded if you keep giving excuses instead of behaving as you should.
      • On an institutional level
        • A fair, meritocratic, demanding culture plus personnel-handling methods that build up morale
        • Severance of the worst offenders and when that is impossible, fix them as best as you can.
  13. Overoptimism tendency
    • “What a man wishes, that also will he believe”.
      • Consequences
        • Man is usually excessively optimistic.
      • Antidotes
        • The trained habitual use of the simple probability math of Fermat and Pascal.
  14. Deprival-Super-reaction Tendency
    • Loss seems to hurt more than gain seems to help. If a man almost gets something he greatly wants and it’s taken from him at the last moment, he reacts as if he had long owned the thing and it was taken from him.
    • This includes the reactions to both kinds of loss experience
      • The loss of the possessed reward
      • The loss of the almost possessed reward.
      • Consequences
        • Man often incurs disadvantage by mis-framing his problems, comparing against what is near at hand instead of what really matters.
        • Man reacts with irrational intensity to even a small loss, or threatened loss of property, love, friendship, or any other valued thing. As seen in bureaucratic infighting which is usually to the detriment of an organization as a whole.
        • This protects ideological or religious views by triggering dislike and hatred towards vocal nonbelievers, as the ideas of nonbelievers, if allowed to spread, will diminish the influence of the believers’ view.
        • It works with inconsistency-Avoidance tendency to ensure that extremes of ideology are maintained with great intensity and with great antipathy to non-believers, causing extremes of cognitive dysfunction(Also called group-think).
        • Labor’s resistance to reduced wages even when it’s clearly in their interest to make a different decision.
        • Ruin from compulsion to gamble.
          • It causes the gambler to have a passion to get even once he has suffered loss, and the passion grows with the loss
          • The most addictive forms of gambling provide a lot of near misses with each triggering Deprival-Super-reaction tendency
        • Overpriced purchases at open outcry auctions as man is convinced that the last price from another bidder was reasonable and this tendency prompts him to top the last bid.
        • Combines with inconsistency-Avoidance tendency to cause a man to use all his good assets to attempt to rescue a big venture going bad.
        • Take-aways are difficult to get in negotiations, causing less rationality.
    • Antidotes
      • To prevent group-think,
        • Keep in place an extreme culture of courtesy in spite of ideological differences.
        • Bring in able and articulate disbelievers of incumbent groupthink.
      • Never go to open outcry auctions
      • Prevent its combination with inconsistency avoidance tendency by learning poker.
  15. Social-Proof Tendency
    • The automatic tendency of man to think and act as he sees others around him thinking and acting.
    • It is genetic for young people to emulate their peers and not their parents. Hence it is more effective for parents to manipulate the quality of their children’s peers than to try to manipulate the children directly.
    • It is mostly triggered in the presence of either puzzlement or stress or both of them.
      • Consequences
        • Companies doing similar things as their peers
        • As it is triggered by a combination of stress and isolation, disreputable salesmen use it. This is also used by cults.
        • Advertisers pay to place their products in movies (this combines mere-association tendency (Brad Pitt!) with social proof).
        • Combines with envy/jealousy and deprival-superreaction tendency in scenarios where several people are fighting for one instance of something when there are several instances of the same thing available.
    • Antidotes
      • Stop any bad behavior before it spreads
      • Foster and display all good behavior
      • learn how to ignore the examples from others when they are wrong.
  16. Contrast-Misreaction Tendency
    • Man’s senses do not measure in scientific units. They mostly register changes by noting contrast with their last recollected state (a $1000 addition to a $50000 car is barely noticeable). Cognition tends to follow this perception.
    • A little leak will often sink a great ship
    • Consequences
      • Small steps towards disaster will often cause a person to go too far towards disaster to avoid it.
      • Cognition, misled by tiny changes involving low contrast, will often miss a trend that is destiny.
      • An embezzler never embezzles a large amount his first time around. He starts small, and each amount is only a little larger than the one before, so neither he nor his peers notice the contrast.
    • Examples
      • A salesman showing a prospect some extremely ridiculously priced items first, then a moderately ridiculously priced item that now looks reasonable by contrast.
      • Advertisers posting a false “before” price so the sale price looks like a bargain by contrast (ASOS!).
    • Antidote: force your mind to notice the small leaks and gradual changes.
  17. Stress-Influence Tendency
    • Sudden stress causes a rush of adrenaline prompting faster and more extreme reactions. It also makes social-proof tendency more powerful.
    • Consequences
      • Light stress can slightly improve performance, whereas heavy stress causes dysfunction (an extreme of pessimism, extended in duration and accompanied by activity-stopping fatigue, i.e. stress-induced depression).
      • Pavlov’s last work (adequate stress can change an individual’s behavior by causing stress-induced nervous breakdowns)
        • Individuals can be classified to predict how easily they’ll breakdown.
        • The individuals hardest to break down were the hardest to return to their pre-breakdown state
        • Any individual could be broken down
        • A breakdown couldn’t be reversed without reimposing stress
  18. Availability-Misweighing Tendency
    • The brain overweighs what is easily available (only the factors I can see or know seem to be the ones that matter).
    • Man’s brain easily drifts into working with what’s easily available to it.
    • The brain cannot use what it can’t remember, or what it is blocked from recognizing because one or more psychological tendencies are heavily influencing it.
    • An idea or a fact is not worth more merely because it is easily available to you
    • Consequences
      • Extra-vivid evidence, being so memorable and consequently more available in cognition, is usually overweighed. (Use this to persuade someone to reach a correct conclusion, and as a device for improving one’s memory by attaching vivid images one after the other to items one doesn’t want to forget.)
    • Antidotes
      • The use of procedures e.g. Checklists to ensure other factors are taken into account
      • Emphasize disconfirming evidence: emphasize factors that don’t produce lots of metrics.
      • Include skeptical, articulate people with far-reaching minds to advocate ideas that are the opposite of the status quo.
      • Under-weigh extra-vivid evidence and over-weigh less vivid evidence.
  19. Use-It-or-Lose-It Tendency
    • All skills attenuate with disuse
    • Consequences
      • The fewer the possessed skills practiced, the fewer the number of skills retained. This is the cause of the man-with-a-hammer tendency.
      • Learning capacity shrinks as gaps are created in the latticework of theory needed for understanding new experiences.
      • If an individual is fluent in a skill, the skill will be lost more slowly and will come back faster when refreshed with new learning.
    • Antidotes
      • The use of simulators to practice rarely used skills
      • The dutiful practice of all useful, rarely used skills especially those outside his discipline. Some skills require daily practice
      • Assemble skills into a checklist that is routinely used. This is the only way to prevent missing important stuff
      • Never stop learning an important skill until you’re fluent at it.
  20. Drug-Misinfluence Tendency
    • See Simple, pain-avoiding psychological denial
  21. Senescence-Misinfluence Tendency
    • There is a natural cognitive decay that comes with old age.
    • Consequences
      • Complex new skills are very difficult to learn at an old age
    • Antidotes
      • Only intensely practiced skills are maintained in old age(refer to Use-It-or-Lose-it Tendency)
      • Continuous thinking and learning done with joy can delay its effect
  22. Authority-Misinfluence Tendency
    • Follow-the-leader tendency is a consequence of the dominance hierarchy we’re born into
    • Consequences
      • Man suffers greatly when his leader is wrong or when the leader’s ideas are misunderstood.
    • Antidotes
      • Be careful who you appoint to power because a dominant authority figure will often be hard to remove, aided as he will be by authority-mis-influence tendency.
  23. Twaddle-Tendency
    • Man has a tendency to prattle and pour out twaddle that does much damage when serious work is being attempted.
    • The work of an administrator is to keep the people who don’t matter from interfering with the work of the people that do.
    • Antidotes
      • Keep prattling people, pouring out twaddle far away from serious work
  24. Reason-Respecting Tendency
    • Man has a natural love of accurate cognition and a joy in its exercise.
    • Few practices are wiser than not only thinking through reasons before giving orders, but also communicating those reasons to the recipient of the order.
    • Consequences
      • Man learns well when a teacher gives correct reasons for what is taught instead of teaching without reasons given.
      • Ideas get through best when reasons for the ideas were meticulously laid out.
      • Learning is most easily assimilated and used when, life long, people consistently hang their experience, actual and vicarious, on a lattice work of theory answering the question: Why?
      • If a person gives incorrect or meaningless reasons, it still increases compliance with his orders and requests.
    • Antidotes
      • Tell who is to do what, where, when and why.
  25. Lollapalooza Tendency
    • The tendency to get extreme consequences from confluences of psychological tendencies acting in favor of a particular outcome.

Prescription for guaranteed misery in life

  • Ingesting chemicals
    • Addiction can happen to anyone through a subtle process where the bonds of degradation are too light to be felt until they’re too strong to be broken.
  • Envy
  • Practice intense ideology
    • The antidote to this is an iron prescription: I am not entitled to have an opinion unless I can state the arguments against my position better than the people who are in opposition.
  • Self serving bias
    • Thinking you are entitled to do what you want to do.
    • Thinking what’s good for you is good for everyone else, and rationalizing foolish or evil conduct based on your subconscious tendency to serve yourself.
    • Don’t allow for the self-serving bias of others.
  • Self-pity
    • Pretty close to paranoia and very hard to reverse. Very counterproductive
  • Be subject to perverse incentives and associations
    • Working in a perverse incentive system that rewards more and more foolish behavior. Perverse incentives are so powerful as controllers of human cognition and behavior that one should avoid their influence.
    • Working under someone you don’t admire or want to be like.
  • Resentment
    • For people who can’t quit resentment, they should practice the Disraeli compromise (give up vengeance as a motivation for action, but maintain a list of people who’ve wronged you and check up on them from time to time to note the way the world has taken down your enemies without your assistance).
  • Be unreliable
    • Do not faithfully do what you have been engaged to do. This single habit will counterbalance the combined effect of all your virtues however great.
    • It also guarantees distrust and exclusion from the best human contribution and company.
    • If however you’re reliable, it’ll be hard to end up miserable even if you start disadvantaged.
    • It is essential for progress in life.
  • Learn everything you possibly can from your own experience minimizing what you learn vicariously from the good experience of others, living and dead.
      • They that won’t be counseled can’t be helped.
      • Experience keeps a dear school, but fools will learn in no other and scarce in that; for it is true, we may give advice but we cannot give conduct.
      • Do not learn from the best work done before yours.
  • Go down and stay down when you get your first, second or third severe reverse in the battle of life.
  • Ignore the saying "I wish I knew where I was going to die, and then I’d never go there"
    • Approach the study of how to create X by turning the question backward and studying how to create non X
    • Invert, always invert. It is in the nature of things that many hard problems are best solved only when they are addressed backwards.
  • Minimize objectivity
    • Einstein said that his successful theories came from curiosity, concentration, perseverance, and self criticism(the testing and destruction of his well-loved ideas)
    • It also helps you minimize the compromises and burden of being wealthy by keeping you in poverty.
P.S. Darwin’s life result was largely due to his working method, which violated all the above rules and emphasized inversion by giving priority attention to evidence tending to disconfirm his cherished and hard-won theories.

On their shoulders

  • The most valuable result of all education is the ability to make yourself do the thing you have to do, when it ought to be done, whether you like it or not.
  • Intense interest in a subject is indispensable if you’re really going to excel in it. If possible, maneuver yourself into doing something in which you have an intense interest.
  • Take a simple idea and take it seriously
  • An appetite for hard work, insatiable curiosity and a patient demeanor will enable a man to overcome most of life’s challenges and become successful.
  • The most important characteristics are
    • Preparation
    • Patience: The art of waiting without tiring of waiting
    • Discipline
    • Objectivity
  • Einstein’s 4 causes of his achievements
    • Self-criticism
    • Curiosity
    • Concentration
    • Perseverance
  • People who specialize in some niche usually do well.
  • Winning systems usually go ridiculously far in maximizing or minimizing one or a few variables
  • Just as multiple factors shape almost every system, multiple models from a variety of disciplines, applied with fluency, are needed to understand that system
  • Any year you do not destroy one of your best-loved ideas is probably a wasted year
  • A willingness, even an eagerness, to identify and acknowledge your mistakes and learn from them
  • Look at your ideas and approaches as tools. When better ones come along, swap the existing ones for them
  • The Lollapalooza effect
    • When several forces from the multiple-models framework combine and they’re all operating in the same direction.
    • The forces coming out of the models sometimes conflict, causing tradeoffs. It is important to recognize the tradeoffs and how they combine.
    • If there are twenty factors and they interact, you’ll have to learn to handle it. It won’t be hard if you go at it step by step with curious persistence.
  • Hard work, intellect and experience are insufficient for greatness.
  • Eliminate unpromising parts of the chessboard so you can focus on the productive ones
  • It’s not the bad ideas that do you in, it’s the good ones. They tend to be overdone
  • Quickly eliminate the big universe of what not to do followed up with a fluent, multidisciplinary attack on what remains, then act decisively when and only when the right circumstances appear. You’ll consistently out think people that are smarter than you as you’ll be more objective and multidisciplinary.
  • Investing is the equivalent of going out and betting against the pari-mutuel system. Look for a horse with one chance in two of winning and which pays you three to one (basically a mis-priced gamble; the arithmetic is worked out below). Value investing is knowing enough to know whether the gamble is mis-priced.
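    • A quick worked check of that arithmetic (illustrative numbers only, not from the source): per $1 staked on a horse with a 1-in-2 chance of winning that pays 3 to 1, expected profit = (0.5 × $3) − (0.5 × $1) = $1. A fairly priced 1-in-2 horse would pay about 1 to 1, for an expected profit of $0, so the 3-to-1 price is the mispricing being described.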
  • You need to have a passionate interest in why things are happening. That state of mind gradually improves your ability to focus on reality. Without this cast of mind, you’re destined for failure.
  • The structure of the human mind requires that the way to complete competency of any kind is to learn it all to fluency.
  • The habit of committing far more time to learning and thinking than to doing is no accident.
  • You must be synonymous with honesty and integrity. Doing the right thing pays dividends personally and professionally.
  • Example is not the main thing in teaching, it is the only thing.
  • Don’t go to auctions of any kind
    • Open-bid: social proof tendency ensures you believe other bids are fair assessments of the value of the item even when they’re not. This combines with deprival tendency to cause great mispricing errors.
    • Closed-bid: they’re often won by people making a technical mistake. There is an opportunity for big mispricing errors.
  • Compound interest is the magic that turns all your lead to gold.
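    • A small illustrative calculation of that magic (numbers are my own, not from the source): $1,000 compounding at 10% per year grows to about $1,000 × 1.1^30 ≈ $17,450 after 30 years, while simple interest at the same rate gives only $1,000 + 30 × $100 = $4,000.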
  • It is occasionally possible for a tortoise, content to assimilate proven insights of his best predecessors, to outrun hares that seek originality, don’t wish to be left out of some crowd folly, or ignore the best work of the past. Enormous long-term advantages can be gotten from consistently trying not to be stupid, instead of trying to be very intelligent.
  • There should be a huge area between what you’re willing to do and what you can do without significant risk of criminal penalty or causing losses. There should be things that you won’t do even though they’re perfectly legal.
  • Tell the truth so you won’t have to remember your lies
  • Anytime you sit waiting for a deal to come by, you’re in a very dangerous seat
  • Passion is more important than competence/brain power
  • It is dangerous to rely on special talents. It’s better to own lots of monopolistic businesses with unregulated prices
  • On investing: For successful investing, you need
    • Temperament: the disposition to own stocks without fretting/ overt worrying.
    • Patience: being very patient but aggressive when it’s time.
    • Curiosity: you need a passionate interest in why things are happening. This interest over time improves your ability to focus on reality. Without it, you’re destined for failure regardless of the advantages bestowed upon you by nature. The more hard lessons you learn vicariously instead of by experience, the better.
    • Focus investing: it is so hard to find good investments that you should concentrate your investments in a few. You need the discipline of waiting for a fat pitch.
  • The game is to recognize a big idea when it comes along, because big ideas don’t come along very often. Opportunity comes to the prepared mind.
  • Learning: the game is to keep learning. you’ve gotta like the learning process
  • Once you start doing something bad, it’s easy to take the next step and in the end you’re a moral sewer.
  • People with loose accounting standards invite perfectly horrible behavior in others
  • Multiple mental models
    • You must know the big ideas in the big disciplines and use them routinely. All of them not just a few.
    • You have to learn the models so they become part of your ever used repertoire.
  • Don’t fool yourself
  • Every person is going to have a circle of competence. And it’s going to be very hard to enlarge that circle. You can rise quite high by developing a circle of competence
  • You need to figure out what your own aptitudes are. If you play games where others have aptitude and you don’t, you’re going to lose.
  • It’s ridiculous the way a lot of people cling to failed ideas. It’s not bringing the new ideas that’s so hard. It’s getting rid of the old ones.
  • Refuse to delude yourself that your ideas will change the world. Instead seek to learn from your betters practical mental tricks that help avoid some of the worst miscognitions of your age cohort.
  • Organized common sense is a very powerful tool. There is too much calculation and too little thinking.
  • Reading a lot isn’t enough. You need the temperament to grab ideas and do sensible things. Most people don’t grab the right ideas or don’t know what to do with them.
  • You must be cautious of the effect of legislation(and government activities) on your business and industry.
  • The five essential entrepreneurial skills for success are concentration, discrimination, organization, innovation and communication.
  • The fundamental full attribution ethos of hard science
    • You’ve got to know all the big ideas in all the disciplines more fundamental than your own.
    • Never make any explanation that can be made in a more fundamental way in any other way than the most fundamental way.
    • Always give full attribution to the most fundamental ideas that you are required to use.
  • Extreme success is likely to be caused by some combination of the following factors
    • Extreme maximization or minimization of one or two variables
    • Adding success factors so that a bigger combination drives success, often in nonlinear fashion, as defined by the concept of breakpoint and critical mass. Results are often non-linear. You get a little more mass then you get a lollapalooza effect.
    • An extreme of good performance over many factors.
    • Catching and riding some sort of big wave.
  • Worldly wisdom: you learn to make fewer mistakes than other people and how to fix your mistakes after when you do make them.
    • One trick of life is to get so you can handle mistakes without psychological denial.
    • Beware of ideology-based mental misfunctions: ideology distorts cognition terribly. It pounds ideas in better than it convinces out. The only heavy ideologies you should hold are in favor of accuracy, diligence and objectivity.
    • You must have the confidence to override people with more credentials than you whose cognition is impaired by some psychological bias that is obviously present. But there are instances when you have to recognize that you have no wisdom to add and that your best course is to trust some expert.
    • When you don’t know and don’t have any special competence, don’t be afraid to say so. Nobody expects you to know everything about everything. Flee people that always confidently answer questions about which they don’t have any real knowledge.
    • To teach effectively, instead of just pounding it in, teach in a way that requires a slight mental reach (indirection). Because you have to reach for it, you hold it better. This takes advantage of commitment and consistency tendency.
    • The right lessons are easily learned if you’ll work at it. And if you learn to, you can be especially useful at crucial moments when others fail.
    • The human mind is not constructed so that it works well without having reasons. You need to hang reality on a theoretical structure with reasons.
    • Multiple mental models: You can’t really know anything if you just remember isolated facts and try and bang them back. If the facts don’t hang together in a lattice work of theory, you don’t have them in a usable form. You need multiple models upon which to hang your experience.
      • You need multiple models because if you have just a few, you’ll torture reality till it fits your models, or at least you’ll think it does. (To the man with a hammer, everything is a nail).
      • Take all the main models and use them as a checklist in reviewing complex systems. It’s essential to get the main models and use them in a multimodular way.
      • For best results, we need to have multidisciplinary coverage of immense amplitude, with all needed skills raised to an ever-maintained practice based fluency, including considerable power of synthesis at boundaries between disciplines, with the highest fluency levels being achieved where they are most needed with forward and reverse thinking techniques being employed in a manner reminding one of inversion in algebra, and with "checklist" routines being a permanent part of the knowledge system.
      • For enhanced multidisciplinarity, we need to
        • Make more courses mandatory not optional.
        • Practice problem solving across several disciplines, including practice that prevents the loss of skills through disuse.
        • Increased use of the best business periodicals to prompt practice in relating events to their multidisciplinary causes, which are often intertwined.
        • Avoid ideology of every form
        • Intensely imitate the fundamental organizing ethos of hard science (the fundamental four-discipline combination of maths, physics, chemistry and engineering).
          • You must both rank and use disciplines in order of fundamentalness
          • You must master to tested fluency, and routinely use, the truly essential parts of all four constituents of the fundamental four-discipline combination, with particularly intense attention given to disciplines more fundamental than your own.
          • You must never practice either cross-disciplinary absorption without attribution or departure from a "principle of economy" that forbids explaining in any other way anything readily explainable from more fundamental material in your own or any other discipline
          • When the above approach doesn’t produce much new and useful insight, you should hypothesize and test to establish new principles, by using methods similar to those that created successful old principles. However you may not use any new principle inconsistent with an old one unless you can prove that the old one is not true.
      • The models need to come from multiple disciplines- all the wisdom of the world is not to be found in one little academic department.
      • If you want to be a good thinker, you must develop a mind that can jump the jurisdictional boundaries of fields. Take in the best big ideas from all these disciplines.
      • The models that are the most reliable come from the hard sciences and engineering
      • The models and techniques MUST include
        • Mathematics:
          • Compound interest
          • Probability
          • Permutations and combinations
          • Decision tree theory
        • The five Ws
          • Tell Who was going to do What, Where, When and Why.
          • If you ask people why, they’ll understand it better, they’ll consider it more important and they’ll be more likely to comply. Even if they don’t understand the reason, they’ll be more likely to comply.
          • In communicating with other people about everything, you must include why
        • Accounting: the language of practical business life
          • Double entry bookkeeping
          • Accounting has its limitations (useful life and depreciation rate are just estimations). It’s a starting place.
        • Engineering
        • Physiology
        • Biology
        • Psychology: the perceptual apparatus of man has shortcuts in it due to its limited circuitry.
          • Helps prevent manipulation from cognitive dysfunction. You must know the limitations of your cognitive apparatus.
          • You must understand the psychology of misjudgment, the tendencies that cause the human mind to reach wrong conclusions. They always interact and they’re all very important.
        • Microeconomics: less reliable
          • A free market economy or partly free market economy is the equivalent of an ecosystem.
          • People who narrowly specialize can get very good at occupying some little niche. And they often find good economics they wouldn’t get any other way.
          • Scale:
            • Cost reduction along the experience curve: doing something complicated in more and more volume enables people, who are trying to improve and are motivated by the incentives of capitalism, to do it more and more efficiently.
            • If you get a whole lot of volume through your operation, you get better at processing that volume.
            • Scale lets you get greater value from volume. It also grants huge purchasing advantages.
            • Grants success in an all or nothing scenario.
            • Scale grants an informational advantage: you get more customers if you’re the well-known brand.
            • Social proof: If everybody’s doing something, we think it’s better.
            • It allows great specialization within the firm, allowing each person to become better at what he does.
            • Sometimes, scaling down and intensifying gives you the big advantage: a narrower specialization enables more efficiency.
            • The biggest problem of scale is that as you get bigger, you get bureaucracy, which comes with territoriality: you think the work is done when it leaves your desk for another’s. You also get corruption.
            • Building scale lets you have a fighting chance against the incumbents.
        • Beware of the effect of competition on an industry. In some industries everyone competing makes money and does well in spite of the competition, while in others shareholder funds are wasted just to get by.
        • Be able to distinguish when technology is going to help you and when it’s going to kill you. Is the advantage of new technology going to the owner or to the customers? Employees will always show how much you’ll save without showing how much of the savings will stick.
        • Competitive destruction: when new technology comes that totally makes old technology invalid.
        • Surfing: when you’re the early bird in a business, you get huge advantages. If you can ride the wave, you get long runs.
      • Charlie’s 4 types of checklists
        • Two track analysis
          • What are the factors that really govern the interests involved, rationally considered.
          • What are the subconscious influences, where the brain at a subconscious level is automatically doing things which by and large are useful but which often malfunction?
          • You can do these by
            • Rationality: by evaluating the real interests, real probabilities etc
            • Evaluate the psychological factors that cause subconscious conclusions – many of which are wrong.
        • Ultra-simple, general problem solving notions
          • Decide the big no-brainer questions first
            • Problems are often greatly simplified by deciding the big no-brainer questions first.
          • Apply numerical fluency
            • Scientific reality is revealed only by math as if math was the language of God.
          • Invert
            • Invert, always invert. It is not always enough to think through problems forward. You must also think in reverse.
          • Apply elementary multidisciplinary wisdom, never relying entirely upon others
            • The best and most practical wisdom is elementary academic wisdom. However you must think in a multidisciplinary manner using all the easy-to-learn concepts from the freshman course in every basic subject. If in your thinking, you rely too much on others, you will suffer much calamity.
          • Watch for combinations of factors- the lollapalooza effect
            • The really big effects, lollapalooza effects, will often come only from large combinations of factors.
        • Psychology based tendencies: see note psychology of human misjudgement.
        • Investing and decision making checklist
          • Risk: all investment evaluations should begin by measuring risk, especially reputational.
            • Incorporate an appropriate margin of safety
            • Avoid dealing with people of questionable character
            • Insist upon proper compensation for risk assumed
            • Beware of inflation and interest rate exposure
            • Avoid big mistakes, shun permanent capital loss
          • Independence
            • Objectivity and rationality require independence of thought
            • People’s opinions don’t make you right or wrong. The only thing that matters is the correctness of your analysis and judgment.
            • Mimicking the herd invites regression to the mean.
          • Preparation: The only way to win is to work, work, work, work and hope to have a few insights.
            • Develop into a lifelong self-learner through voracious reading; cultivate curiosity and strive to become a little wiser every day.
            • More important than the will to win is the will to prepare.
            • Develop fluency in mental models from the major academic disciplines
            • If you want to get smart, the question you have to keep asking is why, why, why?
          • Intellectual humility: acknowledging what you don’t know is the dawning of wisdom
            • Stay within a well defined circle of competence
            • Identify and reconcile disconfirming evidence
            • Resist the craving for false precision, false certainties etc.
            • Above all, never fool yourself, and remember you’re the easiest person to fool.
          • Analytic rigor: use the scientific method and effective checklists to minimize errors and omissions
            • Determine value apart from price, progress apart from activity, wealth apart from size.
            • It is better to remember the obvious than to grasp the esoteric
            • Be a business analyst, not a market, macroeconomic, or security analyst.
            • Consider totality of risk and effect; look always at potential second order and higher level impacts
            • Think forward and backwards: invert, always invert.
          • Allocation: proper allocation of capital is an investor’s number one job
            • The highest and best use is always measured by the next best use(opportunity cost)
            • Good ideas are rare: when the odds are in your favor, allocate heavily.
            • Don’t fall in love with an investment: be situation dependent and opportunity driven.
          • Patience: resist the natural human bias to act
            • Never interrupt Compound interest unnecessarily
            • Avoid unnecessary transactional taxes and frictional costs; never take action for its own sake.
            • Be alert for the arrival of luck
            • Enjoy the process along with the proceeds, because the process is where you live
          • Decisiveness: when proper circumstances present themselves, act with decisiveness and conviction
            • Be fearful when others are greedy and greedy when others are fearful
            • Opportunity doesn’t come often, so seize it when it does.
            • Opportunity meets the prepared mind. That’s the game.
          • Change: live with change and accept unremovable complexity
            • Recognize and adapt to the true nature of the world around you; don’t expect it to adapt to you.
            • Continually challenge and willingly amend your best-loved ideas
            • Recognize reality even when you don’t like it- especially when you don’t like it.
          • Focus: Keep things simple and remember what you set out to do
            • Remember that reputation and integrity are your most valuable assets and can be lost in a heartbeat
            • Guard against the effect of hubris and boredom
            • Don’t overlook the obvious by drowning in minutiae
            • Be careful to exclude unneeded information or slop
            • Face your big troubles; don’t sweep them under the rug.
    • Worldly wisdom as applied to stock picking.
      • A good model to simplify what goes on in the market for common stock is the pari-mutuel system at racetracks. Everyone goes there and bets, and the odds change based on what’s bet. It’s difficult to beat the system. Also the house takes a percentage off the top. Not only do you have to beat all the other bettors, you’ve got to have such a large margin that you can afford to pay the house. Intelligence alone cannot assure success. You need a combination of fanaticism and discipline. You also need to bet seldom.
      • The market is efficient the way a pari-mutuel system is efficient: the favorite is more likely than the long shot to do well in racing, but that doesn’t necessarily give any betting advantage to those that bet on the favorite. Once you put the price into the formula, it isn’t so clear anymore what is going to work best for a buyer choosing between the stocks. So it’s a lot like a pari-mutuel system. And, therefore, it gets very hard to beat.
      • It’s not given to human beings to have such talent that they can just know everything about everything all the time. But it is given to human beings who work hard at it-who look and sift the world for a mispriced bet-that they can occasionally find one. The wise ones bet heavily when the world offers them that opportunity. They bet big when they have the odds, otherwise they don’t bet.
      • The way to win is to work, work, work and hope to have a few insights.
      • With stock managers, what makes sense for the investor rarely makes sense for the manager. What determines the behavior are incentives for the decision maker, and getting the incentives right is a very important lesson.
      • To beat the market, a stock picker can do the following
        • Ben Graham’s approach (the concept of value to the private owner): calculate the price the whole enterprise would sell for if it were available. Then take the stock price, multiply by the number of shares, and if the result is a third or less of the sellout value, you’ve got a lot of edge going for you. You have a huge margin of safety from the big excess value going for you. It’s very difficult in a modern market.
        • Berkshire’s approach: sometimes a company that was selling at 2 or 3 times its book value could still be a huge bargain because of momentums implicit in its position, sometimes combined with an unusual managerial skill plainly present in some individual or system. Load up on the very few good insights you have instead of pretending to know everything about everything at all times. Over the long term, it’s hard for a stock to earn much better return than the business which underlies it earns even if you buy it at a huge discount. If you buy a business that earns 18 percent on capital over twenty or thirty years, even if you pay an expensive looking price, you’ll end up with one hell of a result. The trick is getting into better businesses.
          • You could find them small
          • Finding them big: get into a great business which has a great manager because management matters. When you get an opportunity to get into a wonderful business being run by a wonderful manager. Occasionally, you’ll find a human being who’s so talented that he can do things that ordinary skilled mortals can’t. However, betting on the quality of a business is better than betting on the quality of management.
        • Cancer-surgery formula: the sick-business fix game. Look at the mess and figure out if there’s anything sound left that can live on its own if you cut away everything else. If there is, cut away everything else.
      • Taxes. Take care to run from tax shelters. Also avoid paying transactional taxes by making good decisions so that you don’t need to make another. Beware of the do-something psychological tendency.
      • The dangers in this philosophy: companies can fall off the wave, and investment disasters can result from too-high prices.
      • There are a few businesses where any manager could raise the return enormously just by raising prices, and yet they’ve not done it. There are actually people out there who don’t price everything as high as the market will easily stand.
      • The dream is to get into a situation where you had the top hand in a game that is clearly going to end up with one winner and a management with a lot of integrity and intelligence.
  • For coping with life’s challenges
    • Have low expectations
    • Have a sense of humor
    • Surround yourself with the love of friends and family
  • The best way to avoid envy is to plainly deserve the success we get.
  • The best way to try to get what you want is to try to deserve what you want. You have to deliver to the world what you would buy if you were on the other end.
  • There’s no love as right as admiration based love, and such love should include the instructive dead.
  • The acquisition of wisdom is a moral duty. Without lifetime learning, you’re not going to do very well. You can only progress when you learn the method of learning.
  • When dealing with a system that involves many factors, it’s best to look at it as a whole rather than examining the components in isolation. This brings to fore the importance of the multidisciplinary approach.
  • If you would persuade, appeal to interest, not to reason.
  • Engaging in routines that allow you to maintain objectivity is very helpful to cognition. They are required to maximize correct thinking
  • Maximizing non-egality often works wonders
    • Maximizing the experience of the people who have the most aptitude and determination as learning machines. It’s necessary if you want to reach the highest of human achievement.
    • Provide a lot of playing time for your best players
  • Assiduity is very important to success
    • Sit down on your ass until you do it.
    • Select partners full of assiduity.
  • Life is likely to provide terrible and unfair blows. Some people recover and others don’t. Epictetus’s morals come in handy here
    • Every mischance in life creates an opportunity to behave well and learn something useful
    • My duty is not to become immersed in self-pity, but to utilize each terrible blow in a constructive fashion.
  • Manage your income with prudence
    • Net worth enables you to serve better
    • Always underspend your income.
  • Go through life anticipating trouble and be ready to perform adequately if trouble comes.
  • The job of managing people often involves getting responsibility into the hands of people with real knowledge and away from people who feign knowledge.
    • People who are selling something often rate their goods above average regardless of the evidence to the contrary.
  • The highest form of civilization is a seamless, non-bureaucratic web of deserved trust
    • In life, maximize the seamless web of deserved trust.
  • An excess of professionalism often hurts, because the careful procedures themselves often lead to overconfidence in their outcome.
  • Extra care in thinking is not all good, but also introduces extra error. The best defense is systematic self criticism to an extreme degree. You must not fool yourself, and you’re the easiest person to fool.
  • If you must depend on external advice, you must suitably control the high talents of consultants so that the servant becomes the useful tool of the master instead of serving itself under perverse incentives.