SQL Server Administration Review

Planning and installing SQL Server

  • Planning
    • Operating System Requirements
      • SQL Server 2012 enterprise, business intelligence and web can only be installed on Windows server 2008 and higher OSs
      • SQL Server 2012 standard can be installed only on 2008 and higher Server OSs and enterprise/professional/business versions of client OSs
      • SQL Server 2012 developer tools can be installed on any windows OS (vista/ 2008 and higher)
    • An installation of SQL server on a core OS allows only
      • Database engine services, replication, Full-text search, analysis services, client tools connectivity and integration services server
    • Hard ware requirement
      • Processor
        • x64: Processor must suppport EM64T
        • x86: Pentium 3 or higher
      • Processor speed
        • x64: min 1.4 GHz, recommended 2.0 GHz
        • x86: min 1.0 GHz, recommended 2.0 GHz
      • Minimum RAM
        • min 1GB, recommended 2 GB
    • Database File management
      • Shrinking and growing databases
        • For fixed size databases, you increase their size using
        • Decreasing Database size
          • A data base can be configured to AutoShrink
            • This causes the database to move data pages from the end of the file towards the front of the file and return free space to the OS’s file system
            • Causes index fragmentation.
            • Best practice is to manually shrink and immediately rebuild indexes.
            • DBCC SHRINKDATABASE (N’@DBName’)
      • File Types
        • SQL Server has 3 file types
          • Primary:
            • Data Files with .mdf extension
            • Contain startup information for the database and pointers to other files the DB uses.
            • Can store data and objects
          • Secondary:
            • Optional files used to spread data across multiple disks and increase the size or performance of a database .
            • Multiple files can be used.
            • extension is .ndf
            • Added using
              • Alter Database [@DBName] add file (Name = N’@FileAlias’, Filename= N’@DIrectory\@Filename.ndf’, Size = 4096KB, FileGrowth= 1024KB) To filegroup [primary]
          • Transaction Log:
            • Stores information that can be used to recover the database
            • extension is .ldf
      • FileGroups
        • Used to increase performance when you use partitioning.
          • If the files of a DB are on the same filegroup, you can create a table on the filegroup. This allows the table perform like it’s written on a striped RAID and improves performance.
        • Reduce backup time for partial backups.
        • Add Filegroup using
          • Alter database [@DBName] add filegroup [@FileGroupName]
    • SQL Server Capacity constraints
      • A DB cannot exceed 524,272TB
      • A single DB file cannot exceed 16TB
      • Maximum DBs on an instance is 32,767
      • maximum of 16 columns in a primary key
      • maximum of 253 foreign key references per table
      • maximum of 32767 user connections
    • For windows server 2008 R2 and windows 7, when using a Managed service account for a SQL Server component, You need to create the account before commencing installation.
    • Benchmarking a server for SQL Server usage
      • Run SQLIO to identify hardware or I/O configuration issues before deploying SQL Server.
        • Used to calculate the saturation point of your I/O subsystem( Your I/Os subsystem’s capacity).
      • Run SQLIOSIM to simulate read, write, checkpoint, backup, sort and read ahead activities.
    • Firewall permissions
      • Grant access to the SQL server from the Group Policy management Console.
        • Create a Group policy object and link it to the designated servers organizational Unit
        • Add all the servers to an organizational unit to ease administration
        • Computer Configuration\Policies\Windows Settings\Security Settings\Windows Firewall with an advanced security node.
        • Create an inbound Program rule and allow connection to the SQL Server path(C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\sqlservr.exe).
        • Limit the scope of the inbound rule to the remote IP addresses relevant from the rule’s properties
  • Installing
    • From command prompt
      • Setup.exe /qs /IAcceptSQLServerLicenseTerms /action=@Action /InstanceName= @InstanceName /Features=@Feature /SQLSYSADMINACCOUNTS=”@Domain\@Username” /updateenabled=0
        • @Action can be Install, uninstall, patch, upgrade, rebuilddatabase(to rebuild all system DBs)
        • @Feature can be SQLEngine, IS(integration services), conn(connectivity tools), AS(Analysis services), RS(Reporting services).

Configuring and managing SQL server instances

  • An SQL server instance is a separate deployment of Analysis services, database engine and reporting services hosted on a computer.
    • There can be several instances hosted on one computer
    • Each instance has unique program files, security configuration and directory
    • Up to 50 database engine instances can exist on a host.
  • Configuring instances
    • Instance level settings
      • Memory: Contains settings for minimum and maximum server memory.
        • Limits and ensures the amount of memory is within a specified range.
        • Configured to ensure that the instances play nice with each other.
        • modified using

Exec sys.sp_configure ‘min server memory’, @MinSizeInMB;

Exec sys.sp_configure ‘max server memory’, @MaxSizeInMB;

          • This requires show advanced options to be on

exec sys.sp_configure ‘show advanced options’, 1;


      • Processor and I/O Affinity
        • Processor affinity ties an instance to a specific processor rather than the default of all processors on the host.
          • This eliminates processor reloads and reduces thread migration across processors
          • Modified using

Alter server configuration set process affinity CPU = ([Auto] or [1,5,8])

        • I/O affinity binds an instance’s disk I/O to a specific set of CPUs.
        • You cannot configure processor and affinity and I/O affinity for the same processor.
      • Fill factor
        • This determines the percentage of space on each leaf-level page that is filled with data when an index is created or rebuilt. The remaining space is used for future growth.
        • Can be modified

sp_configure ‘fill factor’, @destinationPercentage;


        • Requires advanced options to be on.
    • The model database serves as the template for all new databases that are created on an instance.
      • All the options and contents are copied over to the new database.
    • Distributed Transaction Coordinator(MSDTC)
      • A distributed transaction is a transaction that runs across multiple servers.
        • Not needed if they’re different instances on the same server.
        • Should be configured as a cluster resource before installing Database engine on the cludter.
    • Database Mail
      • User must have msdb role DatabaseMailUserRole
      • Can be enabled using

sp_configure ‘Database Mail XPs’, 1;


      • Options can be configured using msdb.dbo.sysmail_configure_sp stored procedure.
      • Database Mail profiles determine which users are able to send messages using database.
  • Managing Instances
    • Installing additional instances
      • You can have a maximum of 50 instances on a host or 25 on a failover cluster.
      • All instances have unique files.
      • An instance can contain database engine services, analysis services and reporting services.
    • Deploying patches
      • Shared features must be updated if a single instance is updated
      • Patches can applied/ removed to instances individually or to all the instances.
      • can be done using

.exe /qs /IAcceptSQLServerLicenseTerms /Action= (Patch or RemovePatch) [((/InstanceName = @Instance1, @Instance2) OR /Allinstances)]

        • /InstanceName Or /Allinstances should not be used if you intend to patch only shared components.
  • Resource Governor
    • A feature that enables you limit CPU and memory resource for Database engine in an instance.
    • Resource pools are a subset of the physical resources available to a database engine instance.
      • Hosts one or more workload groups.
        • Workload groups are containers for session request that share similar classification criteria.
        • There are 2 types
          • Internal workload group
          • Default workload group
        • Created using

Create workload group @WorkloadGroupName

using @ResourcePoolName;
Alter resource governor reconfigure;

      • Each session is assigned to a workload group and runs using the resource in the workload group through the resource pool.
        • There are 2 types of resource pools
          • The internal resource pool represents the resources the instance uses.
            • It cannot be altered and workloads in this pool are critical to the server.
            • It can seize resources from other pools
          • The User defined resource pool.
            • Hosts user sessions
        • A resource pool Consists of 2 components
          • Exclusive: doesn’t overlap with other pools. used to define minimum resource reservation for each resource pool. The sum of min across all pools cannot exceed 100%.
          • shared: used to define maximum resource consumption. The max value is 100- @TheSumOfMinAcrossAllPools
        • Created using

create resource pool @resoucepoolName

with @Option;
Alter resource governor reconfigure;

          • @option can be (min_cpu_percent = @Value) and any other option
    • Resource governor classification
      • Allocates incoming sessions to a workload group based on session properties.
      • You can classify sessions into workload groups based on classifier functions.
    • Windows System Resource Manager
      • allows you allocate system resources to specific processes
      • Used to manage resource usage across several instances.
      • Percentage based resource management allows you manage hosts with one processor.
      • Resource allocation policies are can be used to manage resource allocation.
    • SQL Server error logs
      • This stores information about processes that have completed successfully or failed.
      • Sysadmins can cycle the error log using

exec sp_cycle_errorlog;

        • For SQL server agent

exec msdb.dbo.sp_cycle_agent_errorlog

Configuring SQL server components
  • Deploying and configuring analysis services
    • Analysis services can be configured in 2 modes
      • Multidimensional and data mining mode:
        • Default
        • supports OLAP databases and data mining models
      • Tabular mode
        • Supports tabular modeling features
        • Allows hosting of solutions built in the tabular model designer
        • allows access of tabular data model data over the network.
    • The modes can be set during installation using the /ASServerMode option
      • can be set to Multidimensional, tabular or powerpivot.
    • The mode is specific to the instance. A different mode requires a different instance.
  • Deploying and configuring reporting services
    • During installation, the install and configure option is only available if you’ve installed the web server role and are installing/ have installed a database engine on the instance.
    • You can choose to install the reporting services configuration manager.
  • Deploying and configuring sharepoint integration
    • Deploying analysis services and reporting services as shared services in a sharepoint farm allows you use PowerPivot for sharepoint and power view( a reporting services report designer)
      • TO deploy RS, power view and powerPivot requires the following are installed
        • Sharepoint server 2010 sp1
        • SQL server 2012 SQLEngine
        • SQL server 2012 RS and RS add in
        • SQL server 2012 PowerPivot for sharepoint
  • Configuring SQL server integration services security.
    • Integration services is a shared component that allows you run and schedule integration services packages in SSMS.
    • Only members of the built-in administrators group can run integration services.
      • Use DCOM configuration tool to grant user access to SSIS
        • Run dcomcnfg.exe and configure access for the SSIS application.
  • Managing Full-text indexing
    • Full text indexes store information about important words and their locations in the columns of a database table.
      • Only one full-text index can be created per table/indexed view.
      • Created using

create fulltext catalog @FulltextCatalogName;

Create fulltext index on @TableName
( @columnToMonitor1, @columnToMonitor2, @columnToMonitor3)
Key index @UniqueKeyIndexName
on @FulltextCatalogName;

      • Removed using

drop fulltext index on @TableName.

  • Configuring Filestream
    • Filestream allows SQL-Server based applications store unstructured data(e.g images, documents) on the host’s file system.
    • Filestream requires that
      • objects to be stored are greater than 1MB
        • for objects smaller than 1MB, use the varbinary(max) Blob data type
      • The system architecture(x86 or x64) of SQL server is the same as that of the system.
    • To Enable
      • edit the instance in SQL server configuration manager to enable filestream access
      • Set filestream access level to 2

Exec sp_configure filestream_access_level, 2


        • setting it to 0 means no access, 1 allows TSQL only and 2 allows TSQL and win-32 streaming
      • Create a filestream filegroup for the database.

Alter database @DBname ADD filegroup @FileStreamFilegroupName contains filestream

      • Add filestream files to the filestream filegroup by specifying a folder that doesn’t currently exist

Use master

Alter database @DBName add file(
Name= @FileName, FIleName=’@NonExistentDirectory’)
To filegroup @FileStreamFileGroupName

  • Configuring filetables
    • File table are tables that allow you store files and documents within SQL server 2012.
      • Provides a hierarchy of files and directories.
    • To enable
      • Set filestream access level to 2
      • Enable Non-transactional access at the database level

Alter database @DBName set filestream (Non_transacted_access = Full, Directory_name=N’@DirectoryName’

        • Directory_name is the alias for the filestream directory.
    • To create a filetable

Create table @TableName as FileTable;

  • Configuring and managing databases
    • Designing and managing filegroups
      • Each database has a primary filegroup.
        • This host the primary data file and any secondary files not allocated to other filegroups.
        • The system tables are hosted in the primary filegroup
      • File groups are used to group together host data files.
      • When a filegroup contains more than one file, the database engine will write data across the files proportionally; depending on how much free space is available in each file
      • You can specify the filegroup a table or index is created in
      • To add a filegroup

alter database @DBName add filegroup @NewFileGroupName

      • You can move indexes from one filegroup to another.
        • You cannot move indexes created using a unique/primary key.
    • Contained Databases
      • This are databases that include all the settings and metadata required to define the database and have no dependency on the instance.
        • A database can be partially or fully contained.
          • Partially contained databases provide a high degree of isolation from the instance but aren’t fully contained.
      • They allow you move a database between instances easily.
      • To enabled contained databases on an instance,

Exec sp_configure ‘contained database authentication’,1;


      • To convert a database to a contained database

Alter database @DBNAME set containment = @Containmenttype

        • @ContainmentType is either partial or full
      • To create a contained database

create database @DBName

containment = @ContainmentType

    • Using Data Compression
      • Compression allows you save storage space by reducing the size of the database
        • It can be configured on the following
          • Clustered tables
          • Heap tables
          • Non-clustered indexes
          • Indexed views
          • Individual partitions of a partitioned table or
      • It increases the CPU usage because the data must be compressed and decompressed when being accessed.
      • Only enterprise and developer editions support data compression.
      • There are 3 types
        • Row-level:
          • Works by using more efficient storage formats for fixed-length data.
            • Stores fixed length data types as though they are VAR data types
            • not storing NULL or 0 values
            • reducing metadata required to store data
          • Requires less CPU than page-level compression
          • Maximizes the number of rows stored in a page
        • Unicode
          • Enables the engine to compress unicode values stored in page or row compressed objects.
          • used for nchar and nvarchar data types
        • Page-level
          • Compresses data by storing repeating values and common prefixes only once and then making references to those values from other locations within the table.
            • Row compression maximizes the number of rows in the page
            • Column prefix compression is applied by replacing repeating data patterns with references. the data is stored on the page header.
            • Dictionary compression scans for repeating values and then stores this information in the page header.
          • When it is applied, row compression is also applied
      • Done using
        • Tables

Alter table @TableName rebuild with (Data_compression=@compressionType)

        • Index

Alter index @indexname on @tablename rebuild partition all with (data_compression=@Compression Type)

          • @Compressiontype is either page or row.
      • Estimating compression savings can be done with sp_estimate_data_compression_savings
    • Encrypting databases with Transparent data encryption
      • Transparent Data Encryption enables you encrypt an entire database when when backed up.
        • uses a Database Encryption Key(DEK) which is stored in the database boot record.
          • The DEK is protected by the database master key which is protected by the service master key
      • To encrypt a database
        • Create the master encryption key

Use master;

Create master key encryption by password= ‘@Password’;

        • Create the certificate protected by the master key

Create certificate @CertificateName with Subject= ‘@CertificateAlias’;

          • Backup the certificate in the master database. If you lose the DB server, and don’t have the certificate, you cannot access data in the database
            • You need to create a backup of the certificate and private key both of which are required for certificate recovery
Backup certificate ServerCertificate
To file = ‘@DestinationFile’
with private key
( file = ‘@privateKeyName’,
Encryption by password = ‘@PrivateKeyPassword’
); Go
            • The files are written to \mssql\data for the instance.
        • Create a DEK and protect it with the certificate

Use @DBToBeEncrypted

Create database encryption key with algorithm = AES_128
Encryption by server certificate @CertificateName

        • Encrypt the database

alter database @DBName

set encryption on;

    • Partitioning indexes and tables
      • Partitioning divides index and table data across more than one filegroup. Data is partitioned so that groups of rows are mapped to individual partitions.
      • All partitions of a table or index must reside in the same database.
      • Available in enterprise and developer editions.
      • x64 versions support up to 15000 partitions while x86 are recommended to support up to 1000 partitions.
      • Concepts
        • Partition functions define how the rows of an index or table map to specific partitions based on the values of the partitioning columns
        • Partition scheme maps the partitions of a partition function to a collection of filegroups.
        • Partitioning column is the column of an index or table that a partition function uses to partition the index or table
        • Aligned index is an index that uses the same partition scheme as the table to which it belongs
        • Non aligned index is an index that is partitioned independently from the table to which it belongs
        • Partition elimination is the process by which the query optimizer accesses only the appropriate partitions to satisfy a query’s filter criteria.
      • To create
        • Create the filegroup(s) for the partitions
        • Create the partition function:

Create partition function @PartitionFunctionName(int) as range left for values(@ValuetoUseToSplit1,@ValuetoUseToSplit2 )

          • This creates a partition function to split into 2 partitions.
          • @ValueToUseToSplit are values use to partition
        • Create the partition scheme

create partition scheme @PartitionSchemeName

As partition @PartitionFunctionName
To(@Filegroup1, @Filegroup2);


        • Create/modify the table/index and specify the partition scheme

create table @TableName @TableSchema on @PartitionScheme(@PartitioningColumnDefinedInTableSchema)

      • You can alter and merge partitions functions using Alter partition function
        • This allows you split one partition into two or merge 2 partitions into 1.
      • You use alter partition scheme to modify an existing partition scheme.
    • Managing log files
      • Log truncation is the process by which SQL frees space in the logical log for reuse by the transaction log.
        • It occurs if
          • In simple recovery, When a checkpoint occurs.
            • Automatic checkpoints occur when the number of log records reach the number SQL determines it can process during the recovery interval server configuration option.
            • Automatic checkpoints occur when the virtual log becomes 70% full.
            • Manual checkpoints are triggered using the checkpoint statement.
          • In full/Bulk-logged recovery,
            • After a log backup, as long as a checkpoint has occurred after the previous backup.
      • Use DBCC SQLPERF (LOGSPACE) to monitor the amount of log space used.
    • Using Database Console commands(DBCC)
      • they allow you perform Server administration using queries.
      • They are divided into
        • Maintenance statements
          • DBCC Cleantable reclaims space from dropped var-length columns in indexed tables or views
          • DBCC DBreindex: rebuilds one or more indexes for a table
          • DBCC Dropcleanbuffers: removes all clean buffers from the buffer pool
          • DBCC Freeproccache: removes all elements from the plan cache, a specific plan from the plan cache or all cache entries related to a specific resource pool
          • DBCC Indexdefrag: defragments indexes
          • DBCC Shrinkdatabase: shrinks the size of all data and log files of the specified database.
          • DBCC ShrinkFile: shrinks a specified data or log file
          • DBCC UpdateUsage: updates page and row count data for catalog views to remove inaccuracies.
        • Informational Statements
          • DBCC inputbuffer: shows the last statement forwarded from a client to SQLEngine
          • DBCC Opentran: shows information about the longest running transaction
          • DBCC outputbuffer: displays the current output buffer in hex and ascii for a specific session id
          • DBCC procache:provides information about the procedure cache
          • DBCC show_statistics: provides current query optimization statistics for a table or indexed view.
          • DBCC showconfig: provides fragmentation information for tables, views or indexes
          • DBCC sqlperf: displays transaction log space usage statistics for all databases hosted by an instance.
          • DBCC tracestatus: provides information about trace flags
          • DBCC useroptions: provides information about currently set options on the connection.
        • Validation statements
          • DBCC checkalloc: performs a consistency check of disk space allocation structure
          • DBCC checkcatalog: checks catalog consistency of online databases
          • DBCC Checkconstraints: verifies the integrity of a specific constraint or all constraints on a table within the current databases
          • DBCC checkDB: Checks the physical and logical integrity of all objects in a database.
          • DBCC checkFileGroup: verifies the allocation and structural integrity of indexed views and tables in a specified filegroup
          • DBCC checkident: verifies and if necessary, changes the identity value for a specific table.
          • DBCC checktable: verifies the integrity of all pages and structures that make up a table or indexed view.
        • Miscellaneous Statements
          • DBCC dllname(Free): unloads a specific extended SP DLL from memory
          • DBCC freesessioncache: flushes the distributed query connection cache
          • DBCC Freesystemcache: flushes all unused cache entries from all caches
          • DBCC Help: provides infromation on a specific dbcc command
          • DBCC traceoff: disables specific trace flags
          • DBCC traceon: enables specific trace flags
Migrating, importing and Exporting
  • Migrating and upgrading to SQL server 2012
    • Upgrading an instance
      • From SQL server 2005.: Ensure the host operating system can run SQL 2012
        • Apply 2005 SP4
      • From SQL server 2008: Apply SP2
      • From 2008 R2: Apply SP1
    • Upgrade advisor
      • A tool that allows you identify issues that will potentially block an upgrade
      • It checks the services, but not applications or encrypted SPs.
      • Can be run only on hosts that support SQL server 2012.
    • Distributed replay utility
      • Simulates how a test server will function before upgrading to SQL server 2012 by simulating mission-critical workload.
      • Allows you determine if any incompatibilities might occur between the application and SQL server 2012.
    • Performing an upgrade
      • For X64 instances, you need to upgrade AS before database engine.
    • Migrating a Database to a SQL server 2012 instance
      • Allows you move to a SQL server of a different architecture
      • Detach a database
        • This involves removing the database from the instance while keeping the DB and log files.
        • Requires that
          • The DB is not replicated and published
          • The DB is not mirroed
          • The DB has no snapshots
          • It is not a system DB
        • Done using

exec sp_detach_db @dbname= @DBName;

      • Attach a database
        • You can attach a database only if all data files are available.
          • Done Using

create database @databaseName on(filename=’@Directory\@DatafileName), (Filename=’@Directory2\@datafilename2), (Filename=’@directory3\@logFileName) for attach;

        • attaching a database sets the compatibility level to 90 if it was less, otherwise it leaves it as is.
      • Copying a database to other servers.
        • This allows you retain a copy of the database while you deploy a copy to the target.
        • This ensures the original database is still available
        • 3 ways to copy a database
          • Copy database wizard:
            • The database becomes immediately available on the target instance after the transfer completes and is automatically upgraded.
            • Full text indexes are unavailable during the upgrade.
            • Allows you move SQL logins, jobs, SPs and custom error messages
            • Allows you schedule the copy.
            • Cannot copy
              • system DBs
              • replicated, offline, suspect, loading, recovering, inaccessible and emergency mode DBs
            • Requires SQL server agent to be started on destination.
            • There are 2 transfer methods
              • Detach and attach method: Fastest but requires the DB @ the source be taken offline.
                • The DB is reattached for the copy
              • SQL management object method
                • slower but allows the DB @ source to be online
            • Create a credential and an integration Service proxy account to use for the copy database wizard.
          • Backup and restore
            • Prior to this, create the same backup devices on the destination as you use on the source.
            • When restoring a DB backup from a previous version, use the move option
            • The principal who performs the restore becomes the owner of the restored database.
            • Backups don’t have all metadata for the database
              • You must migrate Logins and jobs
          • Publish database using generate and publish scripts wizard
            • Used to publish a DB to a web host provider.
            • Can also publish a script you can use to transfer the DB.
      • Migrating SQL Logins
        • You can use 2 methods besides the copy database wizard
          • Use the generate a script function on the login
            • This allows you to recreate an object in the DB.
            • Allows you transfer windows authenticated SQL Logins between instances
            • for SQL-authenticated logins, the account are disabled and will need to have their passwords reset.
          • Use the microsoft script.
  • Exporting and importing data
    • Large scale data import and export operations are called bulk imports
    • SQL server has several tools
      • The BCP command line utility
        • Allows you bulk export data from a DB to a file or import from a file into tables in a DB.
        • Cannot transfer from DB to DB.
        • The result is unordered as SQL server uses parallel scans unless queryout and orderby are selected.
        • Befroe importing data, ensure the destination table has columns configured that are compatible with the data type you want to import.
        • Y
        • Export done using

bcp “@QueryWithOrderByClause” queryout @DestinationFile -T

          • -T signifies you want to useuse trusted connections.
        • Import done using

bcp @Database.@Schema.@TableName in @FileContainingData -T

        • You can use a format file with bcp by specifying the -f option
      • Bulk insert statement
        • Used to load data from a file.
        • Allow you use a format file
        • Done using

bulk insert @Database.@Schema.@TableName from ‘@Directory\@File’

        • Allow you connect to an OLE DB data source to retrieve data.
        • used in a insert into select from OPENROWSET(Bulk…) clause.
          • … are the parameters
        • Most suitable for one-off importations
        • Allow you use a format file
      • Integration services import and export wizard
        • To migrate data from SQL server and OLE DB providers such as Microsoft access and MS excel
        • Creates an integration services package and allows you perform complex data transformation.
        • Allows you schedule it to run automatically using SQL server data tools.
      • Select into.
        • The into clause allows you to create a new table by using the results of a select statement.
        • The columns of the new table inherit the characteristics of the output columns.
        • Indexes, triggers and constraints aren’t transferred and must be manually created @ the target.
    • Format files are used to map data file fields to the columns of a table.
      • Can be used with bcp, BULK INSERT or OPENROWSET(BULK…) when importing data from a file that uses fixed-with or character terminated fields.
      • Format files can be XML or non-XML.
      • Must be used if you are performing bulk import from a data file with fixed width or fixed-length fields.
      • bcp is used to create a format file

bcp @TableName format nul -c -x -f @Filename.xml-t, T

    • Preparing data for bulk operations
      • To improve performance
        • Make the database logging model not full recovery for the duration of the import.
        • Perform the bulk import operations in parallel
          • If the destination table has no indexes, specify the TABLOCK option.
            • This causes the DB to perform bulk optimization and prevents blocking issues.
        • Disabling triggers and constraints
        • If the destination table has a clustered index, import data from the data file using the same ordering as in the table
SQL Server Logins, Roles and Users
  • Logins are credentials that authenticate connections to an instance.
  • SQL Logins
    • A database user must map to an existing SQL Server login except the instance supports contained databases.
    • There are 4 types
      • Windows-authenticated login
        • Instance logins in which the OS handles authentication.
        • The SQL login is mapped to a OS user account
        • Done Using

Create Login “@DomainOrHostName\@UserName” from windows;

      • SQL server-authenticated login
        • Logins are authenticated by the database engine.
        • Done using

Create Login @UserName with password = ‘@Password’;

        • Can be configure to use password expiration and complexity policies of the host OS using 2 create login option
          • Check_Expiration: use the password expiration of the host OS. can be set to on or off
          • Check_policy: use the password complexity policy of the host OS. can be set to on or off
      • Certificate
        • You can create a certificate to be associated with a login if a database master key is present.
          • To create a master key

create master key encryption by password = ‘@password’

        • To do this
          • Create a certificate and configure it’s expiry

Create certificate @CertificateName

with Subject=’@DetailsOfCertificate_E.G._TheUserOfTheCertificate’

          • Create the login from the certificate

Create login @UserName from Certificate @CertificateName

      • Asymmetric Key
        • Asymmetric keys contain both a public key and a private key.
        • If there is no database master key, you must provide a password when creating an asymmetric key.
        • To do this
          • Create the asymmetric key

create asymmetric key @KeyName with algorithm=RSA_2048;

          • Create the Login from the asymmetric key

create login @Username from asymmetric Key @KeyName

    • Altering existing logins
      • Done using

Alter login @UserName @Option

        • @Option can be disable, unlock and Must_change
    • Viewing logins
      • Query sys.server_principals
      • For sql server authenticated logins, query sys.sql_logins
    • Removing logins
      • Done using

Drop login @UserName.

      • The login musn’t be connected, own a server-level object,SQL server agent or a securable.
      • best practice is to disable the login.
    • Denying server access
      • used to prevent some windows security principals from connecting to the database engine
      • done using

Deny Connect SQL “@Domain\@User”

  • Server Roles
    • They allow you simplify the assignment of permissions at the database instance level.
    • Permissions are assigned to server roles rather than individual logins
    • TO grant a login permissions, you add the login to the role
    • There are 9 fixed server-level roles
      • Sysadmin: Members can perform all activities possible on the database engine instance
      • Serveradmin: Members can perform instance-wide configuration tasks.
      • Securityadmin: Members can carry out instance-level permissions management.
      • Processadminn: Members can terminate process running on an instance
      • Setupadmin: members can add and remover linked servers
      • Bulkadmin: members can use bulk insert on databases on the instance.
      • diskadmin: members can manage instance-related files.
      • dbcreator: Members can create, alter, drop and restore databases on the instance
      • pubic: All logins are members of this role. You can add permissions to this role.
    • You use Alter server role to modify the membership
      • use the add member option

Alter server role @RoleName add member “@Domain\@UserName”

    • To manage server roles
      • sp_helpsrvrolemember
        • Provides fixed server role membership
      • sp_srvrolepermission
        • provides fixed server role permissions
      • is_srvrolemember
        • Enables you check whether a login is a member of a specific fixed or user-defined server role
      • sys.server_role_members
        • Provides information about role members
    • User defined server roles
      • Used to create custom server roles.
      • To create User defined server roles,
        • Create the server role

Create server role @ServerRoleName [Authorization = “@domain/Username”]

          • You can use the authorization to set who owns the server role.
        • Assign permissions to the server role

grant @Permission to @ServerRoleName

          • @Permission is the permission to grant the role e.g. alter any database
        • Add users to the server role
  • Credentials
    • They store authentication information that facilitates a connection to a resource external to the DB engine instance e.g. an FTP server.
    • You can map a single credential to multiple SQL logins.
    • You can map a login to a single credential only.
    • Created using

create credential @CredentialName with identity = ‘@UserName’, secret= ‘@Password’

  • Managing Users and DB roles
    • Database users
      • Database users represent the identity of SQL server logins when connected to a DB.
      • Each DB user maps to a login except for contained databases.
      • You can map a login to one user per DB but different users in different DB
      • Add a database user using

Use @DBName

create user “[@Domain\]@UserName” for Login “[@domain\]@Username”

      • Drop a database user using

drop user “@domain\@Username”

        • Dropping a user doesn’t remove the SQL login
        • Deleting an SQL Login doesn’t delete the database users associated with the SQL login
        • Orphaned users are database users whose corresponding SQL logins has been dropped or the database been moved to a different instance
          • To get orphaned users ,use sp_change_users_login @Action=’report’
    • Database roles
      • They allow you assign permissions to DB users on a collective basis. There are 9 fixed DB roles
      • Fixed DB roles: Have specific permissions that cannot be altered
        • db_owner: Assigned to principals who need to perform all DB config and management tasks. Allows members drop DBs
        • db_securityadmin: Members manage membership of roles.
        • db_acccessadmin: Members manage database access for logins
        • db_backupoperator: Members can back up the DB
        • db_ddladmin: Members can run any DDL command in the DB
        • db_datawriter: Members can insert, delete or modify data in a DB user table
        • db_datareader: members can read all data from all user tables in a DB
        • db_denydatawriter: Blocks members from inserting, deleting or modifying data in a DB user table
        • db_denydatareader: Blocks members from reading data stored in a DB’s user table
      • Flexible DB roles
        • This have custom permissions
        • Created using

create role @RoleName [Authorization=[@Domain\]@Username]

          • Authorization is used to set ownership of the role
        • Assign permissions to a role using

grant @Permission to @role

      • To add a DB user to a role,

Use @DBName

Exec sp_addrolemember ‘@Role’, “[@Domain\]@Username”

        • You cannot use alter role @DBroleName add member as used for instance level users(Alter server role @InstanceRoleName add member)
      • msdb roles
        • msdb has a set of special DB roles in addition to the 9
          • They allow you assign permissions to IS, data collector, server groups, DB mirroring, Server agent and policy-based management
      • User and role catalog views
        • Used to determine information about users, roles and permissions at the DB level
          • sys.database_principals: provides information about DB principals
          • sys_database_role_members: provides information about roles and the security principals who are members of those roles
          • sys_database_permissions: provides information about permissions assigned to security principals at the DB level.
    • Contained Users
      • Contained DBs allow you move a DB to a different instance/ cloud and deploy without additional config
      • A contained user is a DB user who does not have a SQL server login.
        • They connect by specifying credentials in the connection string
      • To create contained users
        • Enable contained DB authentication on the instance

sp_configure ‘show advanced options’,1;

reconfigure with override;
sp_configure ‘contained database authentication’,1;
reconfigure with override;

        • Create the contained users

create user @Contained_user with password = ‘@Password’

          • Partially contained DBs can have windows authenticated users

create user [@Domain\]@Username

    • Least privilege
      • The principle of least privilege is a principle of information security that dictates granting a security principal only privileges necessary to perform required tasks.
    • Application roles
      • This enable you grant permissions to a specific application.
      • Application access data using permissions assigned the application role instead of using a DB user’s permissions.
        • Happens in this way
          • User interacts with a client application that connects to an instance with the user’s credentials
          • The application executes sp_setapprole, authenticating by using a password configured for the application
          • When authenticated, the application role is enabled, and the connection uses the permissions assigned to the application role.
        • Can be created using

Create application role @ApplicationRoleName with password=’@Password

        • can alter the name,password or default schema using alter application role
        • Can be dropped using drop application role
Securing SQL server
  • Database permissions
    • This enable you control which actions can and cannot be performed by security principals at the database level.
    • It involves understanding permissions and how they can be managed
    • Securables
      • An item for which you can assign permissions e.g. users, schemas,roles, objects
      • Permission keywords
        • Alter: provides the ability to modify the properties of a securable.
        • Backup/dump: permission to back up or dump the securable
        • control: all permissions on the securable; can grant permissions on the securable; grants control permissions on all securables within the scope
        • create: Permission to create the securable
        • delete: Permission to delete the securable
        • execute: permission to execute the securable
        • impersonate: permission to impersonate the securable principal.
        • insert: permission to insert data into the securable
        • receive: permission to receive messages from service broker
        • reference: permission to reference the securable
        • restore/load: permission to restore the securable
        • select: Permission to view data in the securable
        • take ownership: permission to take ownership of the securable
        • update: Permission to modify the data stored in the securable
        • view definition: permission to view the securable definition
        • connect: permission to connect to the securable
      • Assigning permissions on objects
        • Use the following to manage permissions on objects
          • Grant: to allow the permission to be used

grant @PermissionKeyword on @DBObject to @DBUser

          • Deny: to block the permission from being used

Deny @PermissionKeyword on @DBObject to @DBUser

          • Revoke: to remove an existing grant or deny permission

revoke @permissionKeyword on @DBObject to @DBUser

    • Managing permissions by using database roles
      • Assigning fixed DB roles
        • The permissions on the fixed DB roles cannot be changed
        • among the fixed DB roles, you can only add the public role to another database role.
          • Other Fixed DB roles do not support this feature

sp_addrolemember @RoleWithDesiredPermission, @RoleToBeEnhanced

alter role @RoleWithDesiredPermission add member @RoleToBeEnhanced

    • Using Flexible DB roles
      • This allows nesting of roles.
    • Protecting objects from modification
      • This involves configuring permissions so only authorized principals can make modifications.
      • To carry it out on the DB level, add security principals to the db_denydatawriter role.
        • This applies to the whole DB
      • To protect specific database objects, use deny with alter, control, delete, insert, take ownership and update permissions

Deny alter on @DBObject to @SecurityPrincipal

Deny control on @DBObject to @SecurityPrincipal

      • Placing objects in read-only filegroups doesn’t stop all forms of modification
    • Using Schemas
      • Schemas simplify application of permissions by enabling you collect securables into containers.
      • Objects belong to schemas and schemas belong to security principals.
      • You can create a flexible role, create a schema owned by the role then add principals to the role to grant them permissions over objects within the schema.
        • You can alter the schema of securables in the same databases

Alter schema @DestinationSchema transfer @SourceSchema.@Object;

    • Determining effective permissions
      • Use HAS_PERMS_BY_NAME() to determine a specific principal’s effective permission on a certain securable.
        • An effective permission can be one of the following
          • Permission granted directly to the security principal
          • permission is granted to a role of which the security principal is a member
          • Permission is implied by a higher-level permission held by the security principal.
          • Permission is not denied to the principal directly or indirectly through role membership.
        • HAS_PERMS_BY_NAME() returns 1 if the principal has an effective permission and 0 if it doesn’t. It returns Null if neither the permission nor securable class is valid.
        • It runs in the context of the current principal.
  • Troubleshooting SQL server security
    • This involves ensuring that the security settings configured for an instance or database match the security requirements for that instance or database.
    • Troubleshooting authentication
      • query sys.server_principals to determine if a login is disabled
      • use alter login to configure login permissions
        • use must_change parameter to force a password change upon next logon
        • use unlock parameter to unlock a user’s account
      • Authentication modes
        • Windows authentication mode disables SQL login. When selected during install, the sa account is disabled
        • Mixed authentication mode allows both windows and SQL login. This requires a strong password be assigned the sa account
    • Client connection problems
      • The browser service helps clients connect to instances that are not using fixed ports.
      • If users cannot connect to named instances, this is the most likely cause
    • Troubleshooting certificates and keys
      • A common problem is expired certificates.
      • Ensure keys and certificates are regularly backed up because encrypted data is unrecoverable if keys and certificates become corrupt.
      • To see information about them, query
        • sys.asymmetric_keys:
        • sys.certificates:
        • sys.key_encryptions: provides info about each symmetric encryption specified when using create symmetric key with the encryption by parameter
        • sys.symmetric_keys:
    • Troubleshooting endpoints.
      • You need to understand the precise configuration of the endpoint that is causing the issue. To do this, query
        • sys.database_mirroring_endpoints
        • sys.endpoints
        • sys.http_endpoints
        • sys.service_broker_endpoints
        • sys.tcp_endpoints
      • When troubleshooting security endpoints, you might need to alter the authentication or encryption settings.
      • An endpoint can be in a stopped state.
      • You can configure an existing endpoint using alter endpoint
        • To start a stopped endpoint, use state parameter
        • To change the endpoint authentication, use authentication parameter
        • To enable, disable or require encryption, use the encryption parameter
      • use the grant statement to grant permissions on an endpoint to a login.
      • When configuring permission on endpoints, ensure the remote instance can authenticate the local instance using windows authenticated logins or certificates.
    • Security catalog views
      • They are used to view the properties of server and database security. They include
        • sys.server_permissions
        • sys.server_principals
        • sys.server_role_members
        • sys.sql_logins
        • sys.system_components_surface_area_configuration
      • To view security config at the individual DB level, use the following
        • sys.database_permissions
        • sys.database_principals
        • sys.database_role_members
        • sys.master_key_passwords
  • Auditing instances
    • auditing enables you track activity on a SQL server instance.
    • SQL Server audit
      • It enables you to configure sophisticated auditing at the instance and database levels
      • It forwards audit results to a target/Audit destination. the target can be
        • A flat file
        • Application log
        • security log
      • To create a server audit

Create server audit @AuditName

To @Target
( Queue_delay=@DelayDuration, On_Failure=@AuditFailureAction)

        • @Target can be the directory to the file, application_log or security_log
        • @DelayDuration is the delay in ms before audit actions myst be processed.
        • @AuditFailureAction determines the action to take when there is an audit log failure. If set to shutdown, the instance shuts down and might require the -f service parameter to startup again.
      • SQL server audit action groups and actions
        • This are used to determine the kind of actions Server audit records.
        • you can audit the use of specific actions on DB schema and schema objects with DB audit specifications
      • To create a server audit specification

create server audit specification @AuditSpecificationName

For server audit @AuditName
Add (@ServerAuditActionGroup)

      • To alter a server audit specification

alter server audit specification @AuditSpecificationName


        • To turn on the audit specification

alter database audit specification with (State = on)

      • To create a DB audit specifciations

create database audit specification @AuditSpecificationName

For server audit @AuditName
ADD ( @Action on Object::@Schema.@Object by @SecurityPrincipal)

        • To turn on the audit,

alter database audit specification with (State = on)

        • To create, alter or drop a DB audit specification, a security principal must have permission to connect to the db and must be assigned either alter any DB audit permission or alter or control permission on the DB
      • Viewing audit views and functions
        • sys.dm_audit_actions
        • sys.server_audits
        • sys.dm_server_audit_status
        • sys.server_audit_specifications
        • sys.server_audit_specifications_details
        • sys.database_audit_specifications
        • sys.database_audit_specifications_details
        • fn_get_audit_file()
    • C2 Audit Mode
      • This configures the instance to record both successful and failed attempts to access statements and objects as per the C2 security standard
      • Enabled using

sp_configure”c2 audit mode”,1;


      • c2 audit data is store in the instance data directory.
    • Common criteria compliance
      • Common criteria compliance supersedes c2 audit mode. This information is viewed by querying sys.dm_exec_sessions.
      • To enable

sp_configure “common criteria compliance enabled”, 1;


    • Policy based management
      • This provides you with a method of managing more than one instance across multiple servers.
      • Can be used to manage, instances, databases, and other objects like backup devices
      • Objects to configure policy-based management
        • Facet: A collection of properties that can be configured
        • Condition: The settings that can be configured for the property
        • Policy: A condition that should be enforced.
        • Category: A collection of collections that should be enforced together.
        • Target: Defines the scope of policy application. can include instances, databases or DB objects
      • Policies can be exported and imported between instances.
      • You can create a central management server for applying policies on one instance to all instances.
      • To view information about policy based management
        • syspolict_conditions
        • syspolicy_policies
        • syspolicy_policy_execution_history
        • syspolicy_policy_execution_history_details
        • syspolicy_policy_group_subscriptions
        • syspolicy_policy_categories
        • syspolicy_system_health_plan
Mirroring and replication
  • Mirroring is the process of creating and maintaining an always up-to-date copy of a database on another SQL-Server instance. If a witness instance is configured, automatic fail over is possible if the principal instance becomes unavailable
  • DB mirroring
    • A strategy for increasing the availability of a database.
    • DB mirrors are paired copies of a single DB hosted on separate instances of SQL server.
    • All operations performed against the principal DB is redone on the mirror through a stream of active transaction logs the engine applies in sequence immediately after it receives them.
    • Mirroring can be configured in 2 modes
      • High-safety mode: Enables failover to occur with data loss from committed transactions. Transactions are committed on both partners after they are synchronized.
      • High-performance mode: data loss is possible. The primary instance doesn’t wait for the mirror instance to confirm it has received the log record
    • Mirroring can only occur for user databases
    • Only databases with full-recovery model can be mirrored
    • Mirroring supports only a single principal instance and a single mirror instance.
  • Mirroring pre-requisites
    • All instances must run the same version and edition, except the witness which can run any edition.
    • Mirroring restrictions
      • Mirrored databases cannot be renamed
      • Mirroring cannot work for a DB with Filestream filegroups.
      • Filestream filegroups cannot be created on the principal server.
      • Cross database and distributed transactions are not supported for mirrored databases
    • Before mirroring can occur, you must restore a database using the Norecovery option on the mirror instance.
      • Apply transaction logs with Norecovery option before mirroring is enabled.
    • Firewall rules
      • traffic must be permitted through the endpoint using port permissions for the port configured
    • Configuring mirroring with Windows authentication
      • This is possible if the service accounts on all servers are in the same active directory domain.
        • Otherwise, a login in the master database for the accounts used by the other must be created on all sides.
    • Configuring mirroring with certificate authentication(This sets the database to mirror in high-safety mode)
      • Used when the the service accounts aren’t on the same domain.
      • Create certificates and the endpoint
        • Create the certificate, back them up and create a user on the partner instance
          • Create the certificate

create certificate @CertificateName with subject=@CertificateDescription

          • Backup the certificate on both instances

backup certificate @certificateName to file=’@directory\@File.cer’;

          • Create a user on the partner instance after moving the backed up certificate to @directory on the partner

Create login @SQL_Login with password=@Password;

create user @UserName for login @SQL_Login;
create certificate @PartnerCerficateName
Authorization @UserName
From file = ‘@directory\@FileName.cer’

        • Create the endpoint on both instances

create endpoint @EndpointName

as tcp( listener_port= @PortNumber, listener_ip=all)
for database_mirroring(authentication= certificate @CertificateName, encryption = required algorithm AES, role=all);

          • Grant connection permissions to the endpoint on partner instances

grant connect on endpoint::@endpointName to @partnerSQL_Login

      • Configure mirroring on the database on partner instances

Alter database @DBName

Set partner = ‘TCP://@partnerName.Domain.com:@PortNumber

    • Changing Operating Modes
      • You can change mirroring operating mode on the mirroring page of the database’s properties. You can change to high safety with automatic failover mode only if a witness server is present.
      • Done using Alter database statement with the safety option
        • Setting safety to full configures the database for high-safety mode. This is the default mode for DB mirroring
        • Setting the safety option to off configures the DB for high-performance mode.
      • You can add a witness to a mirroring pair by creating the endpoint on the witness, creating logins for the principal and mirror instance service accounts and granting connect privileges on the endpoint to that login.
        • Also, create a login on the principal and mirror instances for the domain account used for the SQL Server service on the witness.
        • To specify the witness

Alter database @DBName

Set witness = ‘TCP://@WitnessServerName:@PortNumber

    • Role switching and failover
      • There are 3 types of role switching
        • Manual failover
          • Available if the mirroring session is configured in high-safety mode.
          • Can be initiated when the mirroring session is synchronized.
          • Initiated on the principal instance using the command

alter database @DBName set partner failover

        • Automatic failover
          • Possible is a witness is present and the mirroring session is configured for high-safety mode.
          • Requires the mirroring session is in a synchronized state
          • If all instances lose communication, automatic failover will not occur even if the witness and the mirror establish communication.
        • Forced service( with possible data loss)
          • Can be used in high-safety mode when no witness is present or in high-performance mode. The mirror is forced to become the principal when the original principal is not available.
          • Can result in data loss.
          • Initiated on the mirroring instance using the command

alter database @DBNames set partner force_service_allow_data_loss

    • Monitoring mirrored DBs
      • Query sys.database_mirroring on either principal or mirroring instance.
      • Use the Database mirroring monitor.
        • Allows you monitor how data is being transmitted between the primary and mirrored instances in a mirroring session and other details of the mirroring session
      • use SPs
        • sp_dbmmonitorresults
        • sp_dbmmonitorchangemonitoring
        • sp_dbmmonitordropmonitoring
        • sp_dbmmonitoraddmonitoring
    • Upgrading mirrored DBs
      • Requires the following
        • Change the operating mode to high-safety without automatic failover
        • Perform a full backup of the principal server
        • Run DBCC CheckDB on the principal server
        • Upgrade the mirror
        • Manually failover to the upgraded mirror
        • Upgrade the Witness
        • resume mirroring the DB
        • Switch back to high-safety without automatic failover
        • Upgrade the new instance to SQL server 2012.
Database replication
  • Replication enables SQL server distribute and synchronize data and DB objects from one DB to another
  • Replication components
    • Publisher: An instance that makes data available through publication
    • Article: A published object
    • Publication: A collection of articles
    • Distributor: An instance that manages the transmission from publisher to subscriber.
    • Subscriber: An instance that receives publications
    • Agent: A service that enables the publisher, distributor or subscriber perform replication-related tasks.
  • Replication Types
    • Snapshot replication:
      • Allows complete refreshes of data rather than incremental updates
      • Publisher synchronizes with subscriber by creating and transmitting a DB snapshot.
        • The snapshot are generated using bcp
        • They are stored in a folder that the subscriber need access to for pull subscriptions instead of push subscriptions.
      • Suitable for DBs with infrequent use
    • Transactional replication:
      • Allows subscriber to stay up to date with changes on the publisher
      • Changes to the publisher propagate to the subscriber as they occur.
      • Replication occur from the publisher to the subscriber.
      • Update occur at the publisher
      • Allows Oracle database publish to MSSQL
      • Changes must be applied to the subscriber in the order they occurred in the publisher
    • Peer to peer replication:
      • Enables peer nodes to read and write changes and have those changes propagate to other nodes in the replication topology
      • You prevent conflicts by partitioning the data so the same row will not be updated at separate locations.
        • if the data isn’t partitioned properly and a conflict occurs, the replication fails or the changes will not replicate.
      • Requirements
        • All instances must run SQL-Server enterprise edition.
        • Row and column filtering aren’t allowed
        • Publication names must be identical on all instances participating in the replication
        • Each participant should use its own distribution DB to ensure there is no single point of failure
        • Tables in multiple peer to peer publications cannot be included in the same publication DB
        • peer to peer subscriptions cannot be reinitialized. You must restore a backup at a node to force data update
        • If you add new nodes and have to perform a restore, you must only use backups created with the new nodes in the topology
    • Merge replication:
      • For scenarios where data conflict is possible and any node may issue data changes that will eventually consistent across all nodes.
      • Allows databases hosted on instances at separate locations to be updated and those changes replicated to all other DBs in the replication topology.
      • Publisher and subscribers track changes made using triggers.
      • synchronization involves exchanging all rows that have changed between the publisher and subscriber since the last synchronization.
      • Requirements
        • Subscriber can make offline changes to data and synchronize those changes back when they connect again.
        • You can detect and resolve update conflicts
        • You must support non-SQL server nodes participating in the replication
        • Applications do not require tables to be transactionally consistent
      • It uses SQL server snapshot agent and the merge agent for conflict resolution.
        • If the merge agent detects a conflict, e.g a row being updated with different values at different locations, the conflict resolver determines which data is accepted.
        • When configuring a subscription in merge replication, you specify a subscription type and a conflict resolution priority.
          • Servers assigned a higher priority override servers assigned a lower priority.
          • When you use client subscription type, the priority for conflict resolution is first to publisher wins,
          • When a conflict occurs between a publisher and subscriber, the publisher change is kept and the subscriber change is dropped.
  • Replication monitor
    • Can be used to monitor replication performance.
    • Can be used to configure alerts based on performance benchmarks.
  • Size limitations for Replication
    • 256 merge publication articles
    • 32767 Snapshot or transactional publication articles
    • 246 Merge publication columns in a table
    • 1000 snapshot or transactional publication columns in a table
    • 1024 merge replication bytes for a column used in a row filter
    • 8000 snapshot or transactional publication bytes for a column used in a row filter
  • Controlling replication of constraints, columns and triggers
    • The not for replication option enables you configure Foreign key constraints, check constraints, identity column values and triggers behave differently when an insert, update or delete operation is performed by a replication agent than they behave in normal operation.
      • They will not be enforced for the operation with that option.
  • Heterogeneous Data.
    • You can integrate to Oracle and DB2
      • Oracle can publish to DBs on SQL server
        • The data can be published on the SQL server to serve other DBs on Oracle, DB2 or SQL Server.
      • SQL can publish or distribute data for replication on oracle and DB2
    • This doesn’t support Merge or peer to peer replication.
Clustering and Always On
  • Failover clustering is an approach to high availability that ensures a database remains available in the event of a server failure.
    • A failover cluster instance is a deployment of SQL server that stores DB files on a shared storage device. If the server that hosts an instance fails, another Instance in the cluster takes control of the files and seamlessly continues to provide service.
      • The DB files must be on a SAN
      • Connect the server to the SAN using the iSCSI initiator
    • Failover clustering requires the windows failover cluster feature.
    • Installing a SQL failover cluster
      • During installation of SQL server, select advanced then advanced cluster preparation
      • Conclude installation of the instance
    • Adding a node to the cluster
      • During installation of SQL server, select advanced then advanced cluster completion
    • To perform a manual failover,
      • using powershell

move-clustergroup @ClusterResource @DestinationServer

    • Troubleshooting failover clusters
      • If failover occurs and a node is irrecoverable,
        • evict the node from the failover cluster
        • Replace the hardware and add the node back to the cluster
        • run SQL setup to admit the node to the failover cluster instance.
  • AlwaysOn is a technology that replaces DB mirroring. It allows clients read-only access to the secondary replica
    • An availability group is a collection of user DBs(availability DBs) that can fail over together.
    • Supports a set of read-write primary databases and up to four sets of secondary databases.
    • They allow you configure one or more sets of secondary databases so that they are accessible for read-only operations
    • Failover occurs on a per-replica basis, and all DBs in the replica fail over.
    • Availability group prerequisites.
      • All hosts must be nodes in a failover cluster.
    • Availability modes
      • The availability mode selected depends on data loss and transaction latency requirements.
      • There are the following availability modes
        • Asynchronous-commit : Suitable when you must place availability replicas at different location.
          • The primary will not wait for the secondaries to write log records to disk. This reduces transaction lateny.
        • Synchronous-commit : increases transaction latency but minimizes the chance of data loss in the event of an automatic failover
          • Each transaction is applied to the secondary replica before being written to the local log file.
      • You can alter availability mode using the alter availability group statement

alter availability group @AvailabilityGroupName modify replica on ‘@Host\InstanceName’ with (availability_mode=@availabilityMode)

    • Selecting failover modes
      • Availability groups fail over at the availability replica level.
      • Failover involves another instance becoming the primary replica, with the original primary replica being demoted to become a secondary replica. There are 3 types
        • Automatic failover:
          • Occurs without administrator intervention.
          • Involves no data loss
          • requires the primary and a secondary to be configured with a failover mode of automatic and be synchronized.
          • Can occur only if the primary and replica are in synchronous-commit mode.
        • Planned Manual failover
          • Triggered by an administrator
          • Involves no data loss
          • Can occur if at least one of the secondary replicas is in a synchronized state.
          • Require primary and replica instances to be in synchronous-commit mode.
          • To carry out manual failover,
            • TSQL(run the following on the proposed primary instance)

alter availability group @AvailabilityGroup failover;

            • Powershell(run on the proposed primary instance

switch-sqlavailabilitygroup -path SQLSERVER:\SQL\@HostName\@FailoverClusterName\AvailabilityGroups\@AvailabilityGroupName

        • Forced manual failover
          • This form of failover involves the possibility of data loss. Use forced manual failover when no secondary replica is in the synchronized state or when the primary replica is unavailable.
          • Can work if either node is set to asynchronous-commit mode.
          • Done using
            • TSQL(run the following on the proposed primary instance)

alter availability group @AvailabilityGroup force_failover_allow_data_loss;

            • Powershell(run on the proposed primary instance

switch-sqlavailabilitygroup -path SQLSERVER:\SQL\@HostName\@FailoverClusterName\AvailabilityGroups\@AvailabilityGroupName -AllowDataloss

    • Configuring readable secondary replicas
      • Readable secondary replicas can service read-only requests for DB access
    • Deploying Alwayson availability groups
      • Requires the following tasks
        • create a mirroring endpoint:
          • Not necessary if using a domain account for all SQL server service, as it is automatically created
        • enable alwaysOn on the instance
          • From configuration manager. Enable AlwaysOn on the instance.
          • Powershell:

enable-sqlalwayson -path SQLSERVER:SQL\@Hostname\@Instance

        • create an availability group
          • From SSMS
            • Use the always on availability node.
            • To add an encrypted DB or a DB with an encryption key, use add a secondary replica.
          • TSQL

create availability group @AvailabilityGroupName

for database @DbName
replica on ‘@ReplicaHostName1\@InstanceName’ with
( Endpoint_URL= ‘TCP://@ReplicaHostName1[.@Domain.com]:@Port’,
availability_mode= @AvailabilityMode, Failover_mode= @FailoverMode),

‘@ReplicaHostName2\@InstanceName’ with

( Endpoint_URL= ‘TCP://@ReplicaHostName2[.@Domain.com]:@Port,’
availability_mode= @AvailabilityMode, Failover_mode= @FailoverMode);
        • create an availability group listener
          • An availability group listener is a network connectivity endpoint for an availability group.
          • Clients connect to the listener, which connects them to the availability group’s primary instance.
          • To create,

alter availability group [@AvailabilityGroupName] add listener ‘@ListenerName’ (with IP((‘@ListenerIP’, ‘ListenerSubnet’)), port=@portNumber);

        • add a secondary replica
          • Requires some conditions
            • fewer than four secondary replicas
            • primary replica is online
            • Secondary replica’s instance can connect to the primary replica mirroring endpoint
            • Secondary replica instance has enabled alwaysOn
          • To add

alter availability group @AvailabilityGroupName join;

Troubleshooting SQL server
  • Performance monitor
    • A windows utility for capturing statistics about the hardware, OS and apps to expose properties and counters.
    • Collects and displays real-time performance data in the form of counters for server resources e.g. processor and memory use.
    • Data collector sets provide the ability to group data collectors into reusable elements for different performance-monitoring scenarios.
      • They organize data collection points into a single component that can be used to review or log performance.
      • Can be recorded individually, grouped with other into logs, and used to generate alerts.
      • They can contain
        • Performance counters
        • Event trace data
        • System configuration information(registry key values)
    • Counters are organized into a 4 level hierarchy
      • An object is a resource that can be monitored.
        • a component, application or subsystem within an application.
      • an object exposes one or more counters
      • a counter might have several instances if more than one resource of that type exists e.g. multiple processors
    • When you define counters to capture, you can specify criteria at any level within the hierarchy.
    • Counters are used in conjunction with other counters and environmental info to diagnose problems
    • 3 Counters indicate a system problem on their own. Their values determine how many requests are waiting for the resource to become available. When there values are greater than 0 for a long period, it indicates a hardware bottleneck
      • System processor Queue Length
        • Indicates when the processor is overwhelmed with activity
      • Network Interface output queue length
        • Indicates when the network interface is overwhelmed with activity
      • Physical disk avg. disk queue length
        • Indicates when the disk is overwhelmed with activity
  • SQL server profiler
    • Most performance issues can be improved by tuning poor performing queries. SQL server profiler helps identify where to focus query tuning efforts.
    • SQL server profiler and extended events profiler provide the ability to trace the activity that is occurring in SQLEngine.
      • Server profiler can also trace AS activity.
      • Traces captured by both can be used to diagnose performance issues and replay workloads.
      • Profiler can be configured to filter events but they have a performance cost.
        • However, they minimize the overhead that is incurred during tracing
    • Captured events are displayed graphically and can be captured to a file or DB table.
      • creating trace files is an efficient way to store traces using minimal system resources.
    • A category is a group of related event classes. An event class contains all the data columns that can be reported by an event.
      • An event is the occurrence of an action within an instance of the SQLEngine and is defined by its attributes.
    • Trace columns represent data that can be captured when an event occurs. Data columns contain the attributes of events
    • Runs on a machine and communicates to the SQLEngine using SQL Trace procedures.
    • SQL Trace is a feature that runs in SQLEngine and runs traces using sys.SPs.
      • It is lightweight
      • Trace must be defined using system stored procedure calls
      • runs inside the SQLEngine
      • Do not automatically restart if the instance is restarted. To auto start it, script it and put it in a startup procedure.
      • SQL has a default trace that starts with the instance.
        • This includes SQL server events
        • objects being created or dropped.
      • Traces can be opened using profiler and imported into a table.
  • Monitoring SQL server
    • Activity monitor is a graphical representation of monitoring information that uses DMVs and DMFs
      • It provides an overview of the system performance and information about processes, waits, I/O resources and expensive queries.
    • Dynamic management views(DMVs) and dynamic management functions(DMFs) provide insight into current and historical SQL server activity.
      • They return server state information that can be used to monitor the health of an instance, diagnose performance and tune performance.
      • There are 2 types of DMVs and DMFs.
        • Server-scoped DMVs and DMFs
          • They require View server state permission on the server
        • Database-scoped DMVs and DMFs
          • they require view Database state permission on the database
      • common DMO categories
        • sys.dm_exec_*: info about connections, sessions, requests and query execution
        • sys.dm_os_*: info about OS related stuff
        • sys.dm_tran_*: access to transaction management
        • sys.dm_io_*: provide information on I/O processes
        • sys.db_db_*: provide database scoped information.
    • Querying DMVs
      • DMV/Fs cannot be referenced using one part names.
      • They exist in the sys schema and are named in the convention dm_*
      • They expose in-memory structures, which are cleared if the instance is restarted.
    • There are 2 types of DMOs
      • Objects that provide real-time state information
      • Objects that provide recent historical information
  • Data collector
    • Provides a high-level view into aggregated trends over time and information to identify issues.
    • Allows you capture historical data and access detailed data when investigating issues.
    • Provides a central point for data collection across DB servers and applications.
    • Stores collected data in a relational database known as the management data warehouse(MDW).
      • Data can be collected constantly or on a schedule
    • When you configure the MDW, some system data collection sets are created.
    • Data collection sets
      • This define the data that needs to be collected, how often they are uploaded to the warehouse and how long they are retained in the warehouse.
      • They can query DMOs to retrieve detailed information about the operation of teh system
      • Can retrieve performance counters that provide metrics about the performance of both SQL server and the entire server
      • Can capture SQL trace events.
    • Data collector topology: The data collection system consists of 3 components
      • The data collector: a set of jobs that run on the local server and collect data from DMOs, performance counters and SQL trace events and upload to the MDW
      • The MDW which can consolidate data from multiple instances
      • 3 standard reports and several subreports to analyze data in the DMW. You and also write your own reports.
    • You create a MDW for
      • access to reports that combine information for all server instances in your enterprise
      • Offload the need to hold collected data and report on it from production servers
      • Persist data stored in DMOs
    • configuring the data collector: this involves 2 actions
      • Configure the MDW
      • configure the instances to upload the data
        • Lower volume data is sent immediately to the warehouse
        • Higher-volume information is cached locally and uploaded to the warehouse using SSIS
    • Monitoring data collector
      • configuration and log information for the data collector is written to msdb and is implemented by SPs and SSIS logging features.
    • Analyzing collected performance data
      • After performance data has been collected from a number of instances and been consolidated into a MDW, the data can be analyzed using reports built with SSRS.
      • There are 3 default reports
        • Server activity history
          • contains CPU, memory, disk and network I/O. SQL server waits and SQL server activity
        • Disk Usage History
          • Trends and details of disk and file usage
        • Query statistics history
          • Most expensive queries ranked by CPU, duration, reads and writes
  • Identifying Bottlenecks
    • Monitoring Disk Usage
      • There are 2 counters for disk activity on the OS level
        • PhysicalDisk:% Disk Time
        • PhysicalDisk:Avg. Disk Queue length
      • There are 2 counters for disk activity by SQL server
        • SQL server: Buffer manager: Page reads/sec
        • SQL server: Buffer manager: page writes/sec
        • If the values for these counters approach the capacity limit of the hardware I/O subsystem, tune you application or DB to reduce I/O Operations( better indexing, normalization).
    • Monitoring Memory usage
      • There are 2 counters for memory at the OS level
        • Memory: Available Bytes
        • Memory: Pages/Sec
          • this indicates the number of pages that were either retrieved from disk due to hard page faults or written to disk to free space in the working set due to page faults.
      • Counters for memory usage by SQL server
        • Process: working set
        • SQL Server: Buffer Manager: Buffer Cache Hit Ratio
        • SQL Server: Buffer Manager: Total pages
        • SQL Server: Memory Manager: Total Server Memory(KB)
    • Monitoring CPU Usage
      • CPU usage can be monitored using
        • Single Processor systems
          • processor: %Processor time counter
            • Measures the amount of time the CPU spends executing a thread that isn’t idle.
        • Multi processor systems
          • System: % Total processor time counter
            • This takes the average for all the processors. To monitor a single processor, use the instances of the counter
Indexes and concurrency
  • Implementing and maintaining indexes
    • A balanced tree structure consists of the start of the tree i.e. the root node.
      • Beneath the root node are 1…n intermediate level pages
      • each intermediate-level page can have additional intermediate-level pages.
      • the last level comprises the leaf-level pages.
      • Query optimizer knows at most how many page accesses are needed to reach a leaf page.
      • When a page is full, it is split to maintain this balanced condition.
    • A heap is an unorganized collection of data. It is appended to the last data page when presented for storage.
      • This makes it terrible for data retrieval but wonderful for data loading.
    • Clustered and non-clustered indexes can be partitioned to provide better performance.
    • Types of Indexes
      • Clustered
        • A structure in which the actual table data resides in the leaf-level pages of the index. The data is physically sorted in the order of the clustering key.
        • The clustered index key(CIK) is the row identifier for the table data.
          • It is added to any non-clustered indexes created on the table.
          • The key should have the following specifications.
            • Unique
            • Narrow
            • Static: Not updated.
      • Non-clustered
        • It contains only keys and no actual data.
        • In addition to the key columns, SQL server stores the clustering key for that row in the table.
          • This allows the storage engine to use a non-clustered index to reduce the number of rows returned and then use the clustering key to go to the clustered index to retrieve the data requested.
        • They can be applied to heaps or clustered table. The difference is in how rows are identified
          • In clustered tables, the non-clustered index contains the clustered index key to locate the rows in the base table
          • In heaps, The non clustered index uses a row identifier in place of the CIK.
      • Covering indexes
        • A non-clustering index that contains all the information needed to satisfy a query
        • Allow adding information to the non-clustered index data pages and avoid having to look up the row in the clustered index.
      • Filtered indexes
        • Basically an index with a where clause. This allows you limit the amount of data in the B-Tree.
      • Primary XML and secondary XML indexes
        • To provide reasonable performance when querying XML data.
        • If not created, the entire XML document must be loaded into an XML DOM and searched
      • Spatial Indexes
        • Improve performance when querying data of geography and geometry data types
      • Full-text indexes
        • Enable the efficient searching of text data, often using words that occur near each other in the text
      • Columnstore indexes
        • The index uses a column oriented storage that is very efficient for data warehouse operations.
        • Doesn’t use the B-tree structure.
    • Designing indexes for efficient retrieval
      • When constructing multicolumn indexes, list equality columns( columns instead in a where clause with =) before inequality columns. List the most selective columns for the above types first before the least selective columns.
      • When joining, index the column used to join to the other table.
      • Clustered indexes
        • A clustered index is the most efficient index for retrieval with
          • Queries that return a majority of the columns in the table
          • Queries that return a single row based on the CIK
          • Queries that return range-based data
      • Non-clustered indexes
        • A Non-clustered index is the most efficient index for retrieval with
          • Queries that return few rows
          • Queries that can be covered by the index.
        • They have a maximum limitation of 16 key columns and 900 bytes
      • Covering indexes
        • Consider using include columns for columns in the select list that aren’t already part of the key columns
      • Filtered indexes
        • A filtered index is the most efficient index for retrieval
          • When combined with sparse columns to locate specific non-null rows
          • When queries that are a small subset of the rows are selected often
      • Primary XML and secondary XML indexes
        • The first XML index must be a primary XML index type.
          • The primary XML index builds a nodes table.
          • For each element in the XML document, there will be one row in the nodes table. If you have many rows, each of which contain an XML document with many nodes, the size of the XML index can be quite large.
        • There are 3 secondary XML index types
          • Path
            • Consists of a regular non-clustered index.
            • supports operations such as .exist()
          • Property
            • Consists if a regular non-clustered index
            • Useful for searching for nodes that have particular values
          • Value
            • Consists of a regular non-clustered index
            • The reverse of the path index
            • most useful for searching for descendant nodes for a given path
      • Full-text indexes and semantic searches
        • Special indexes on char, text, image, xml and varbinary data types.
        • These are token based functional indexes that store information about significant words, where they exist within text within a column or table in the DB
        • Only one is allowed per table, and it contains all the tokens for all columns that were indexed for that table.
          • This allows you search for words in one column that have significant words in other columns in an efficient manner
        • Allows you carry out inflectional searching.
      • Columnstore indexes
        • They are a non B-tree type of index.
        • used to provide performance improvements for data warehouse type queries that perform aggregations over large data sets.
          • This queries mostly use a few of a table’s columns to satisfy queries. They use column-based index structures
        • They optimize storage by compressing repeating data values
    • Statistics
      • Statistics are objects in the database that contain statistical information about how the values in a column are distributed. Query optimizer uses this information to determine the cardinality of the rows.
        • cardinality refers to number of rows that exist for a given value.
      • When you create an index, SQL Server creates a histogram of the distribution of values in the leading column of the index.
      • Statistics can be either a sampling of the rows or all rows, based on how they were created or updated and the number of rows in the table.
    • Updating statistics
      • statistics are manually updated using update statistics

update statistics @TableName with @Option

        • @option can be
          • fullscan[[, index][,Columns]]
          • Sample [@NoOfRows Rows]|[@percent percent]
    • Index internals
      • There are DMVs that can be used to retrieve information about indexes. e.g. sys.dm_db_index_physical_stats
    • Creating Indexes

create @Indextype Index @Indexname

On @Table(@ColumnList);

    • Fill factors
      • An option when creating indexes
      • They determine how much free space will be left on the leaf data pages.
      • Default is 0(Leave no free space)
        • When considering change, determine the data access pattern of the table.
          • If the CIK is an identity column, this increases the amount of space used by the system.
          • If the CIK is a random value, it might limit page splits.
      • To specify an fill factor, rebuild the index

alter index all on @TableName @DefragmentationType

with (fillfactor=@PercentageToFillPageTo)

        • @PercentageToFillPageTo can be 80 if you want the pages to fill to 80%
        • @DefragmentationType is either rebuild or reorganize
    • Fragmentation
      • Fragmentation in index pages is when the logical ordering(based on the key value) doesn’t match the physical ordering within the data file.
      • Fragmentation information is available in DMV sys.dm_db_index_physical_stats
      • Removing index fragmentation.
        • When fragmentation is bad, rebuild or reorganize the index.
          • They are
          • Indexes should be rebuilt when fragmentation is above 30 %
          • Indexes should be reorganized when the fragmentation is between 10 and 20 %.
          • Reorganize
            • Reorganize doesn’t update the statistics histogram for the object.
            • Reorganize is online and doesn’t hold any blocking locks
            • Requires 8Kb free space in the DB
            • Single threaded
          • Rebuild
            • Rebuild is used when heavier fragmentation levels exist or when out of order extents are limiting read-ahead operations.
            • Rebuild requires long-term locks and can cause blocking unless performed online
            • Builds a second copy of the index and requires space to hold this copy.
            • This can use parallel operations
            • Requires Enterprise edition to perform online Rebuild
              • Online rebuild cannot be performed on indexes with XML and spatial data types.
    • Tracking missing indexes
      • Each time query optimizer compiles, SQLEngine tracks 500 latest indexes that the optimizer could have used to satisfy the query more efficiently had they existed.
      • It keeps tracks of queries since the last SQLEngine Restart
      • They are calculated using 4 DMOs
        • sys.dm_db_missing_index_columns
          • lists the columns that would comprise the missing index
        • sys.dm_db_missing_index_details
          • list detail on the columns used for equality, inequality and included columns
        • sys.dm_db_missing_index_groups
          • A table that joins the above 2
        • sys.dm_db_missing_index_group_stats
          • Provides the metrics on a group of missing indexes
    • Reviewing unused indexes
      • Each time a table is InsUpDeled, all indexes for the table must be maintained to reflect those changes
      • If the cost for maintaining an index is more than the overhead of maintaining it, disable the index.
        • See the cost using this script.
i.name AS [IndexName],
i.type_desc AS [IndexType],
ius.user_updates AS [UserUpdates],
ius.last_user_update AS [LastUserUpdate]
FROM sys.indexes i
INNER JOIN sys.dm_db_index_usage_stats ius
ON ius.OBJECT_ID = i.OBJECT_ID AND ius.index_id = i.index_id
WHERE OBJECTPROPERTY(i.OBJECT_ID, ‘IsUserTable’) = 1 — User Indexes
AND NOT(user_seeks > 0 OR user_scans > 0 or user_lookups > 0)
AND i.is_primary_key = 0
AND i.is_unique = 0
ORDER BY ius.user_updates DESC, SchemaName, ObjectName, IndexName
  • Identifying and resolving concurrency problems
    • Concurrency refers to the availability of SQL Server to service requests from the maximum amount of users at the same time
    • Defining transactions and transaction scope
      • A transaction is a way to group, or batch a series of updates to a data so that either all updates succeed and are committed at once or if one fails, none are committed and the entire transaction is rolled back.
        • To preserve the consistency of the data involved in the batch of updates, locks are held on the rows that are being updated.
        • The length of time that these locks are held is know as the transaction scope.
        • Concurrency can be improved by ensuring transaction scopes are as short as possible.
      • SQLEngine will lock the minimum number of resources needed to accomplish its goal(multi granular locking).
        • To do this, it must announce its intent to take locks to higher levels.
          • To modify a row, an exclusive row must be obtained for the row, and an intending exclusive to the page and higher levels.
          • This is done for lock management. If another transaction wants a lock, SQLEngine checks locks from the highest level down to determine if the resource is locked.
      • Lock modes
        • Shared: For read operations that do not change or update data. E.g. select
        • Update: Used on resources that might be updated
        • Exclusive: Used of Data modification operations. Ensures multiple updates cannot be made to the same resource at the same time. E.g. Exclusive
        • Intent: Used to establish a lock hierarchy. The types are
          • intent shared(IS),
          • intent exclusive(IX)
          • shared with intent exclusive(SIX).
        • Schema: Used when an operation dependent on the schema of a table is executing. The types are schema modification(Sch-M) and schema stability(Sch-S)
        • Bulk update(BU): Used when bulk copying data into a table and the tablock hint is specified
        • Key-range: Protects the range of rows read by a query when using a serializable transaction isolation level. Ensures other transactions cannot insert rows that would qualify for the queries of the serializable transaction if the queries were run again.


      • Lock management
        • SQLEngine uses lock management to increase concurrency.
        • Query optimizer determines the scope of locks to be taken on a database at the beginning of the transaction.
        • SQLEngine can dynamically escalate the lock type during execution.
      • Transaction Isolation levels
        • There are 4 ANSI isolation levels
          • Serializable
          • Repeatable read
          • Read committed(default)
          • Read uncommitted
        • SQL server introduces 2 custom isolation level that use row versioning in tempdb
          • Snapshot Isolation
            • This improves concurrency and has the same characteristics as the serializable isolation level.
            • It specifies that the data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction.
              • Data modification made by other transactions after the start of the current transaction are not visible to statements running in the current transaction.
            • Configured on the database and requires you to set the transaction isolation level at the beginning of the transaction
Alter Database @DBName set allow_snapshot_isolation on

Set transaction isolation level repeatable read;

          • Read Committed snapshot isolation
            • Similar to snapshot isolation level as it allows readers read data without being blocked by locks acquired by writers.
            • doesn’t require the user to ask for it explicitly as it is automatically applied whenever the transaction is at a read committed isolation level.
        • The characteristics of the isolation levels
          • Dirty read
            • When a transaction is allowed to read data from a row that has been modified by another running transaction that has not yet committed
          • Non-repeatable read:
            • Occurs when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads
          • Phantom read:
            • Occurs when in the course of a transaction, two identical queries are executed, and the collection of rows returned by the second query is different from that returned by the first.

Image [1]

          • To query the isolation level of a transaction, use

DBCC useroptions

        • Lock duration is based on
          • The lock owner.
          • A server side cursor
          • The session

Image [2]

        • Monitoring locks
          • DMVs provide you information about locks and transactions currently in effect.
          • sys.dm_tran_locks displays the locks being held by open transactions
          • sys.dm_tran_database_transactions displays the open transactions at the DB level
          • sys.dm_tran_session_transactions correlates associated sessions and transactions
      • Blocking
        • Blocking occurs when transactions are waiting for a resource to be freed due to another transaction’s lock on it.
          • Can be identified when
            • request_status in Sys.dm_tran_locks DMV shows the value of wait.
            • If the blocking_session_id of sys.dm_exec_requests or sys.dm_tran_locks > 0, it indicates blocking
        • Blocking can also occur due to compiling or recompiling query execution plans.
          • Can be identified when
            • wait_resource in sys.dm_exec_requests contains compile
            • resource_subtype in sys.dm_tran_locks
        • Blocks can be avoided by
          • keeping transaction scope as short as possible
          • Not allowing user interactions during transactions.
          • Practicing proper indexing to limit locks acquired and reduce the chance of blocking
          • Elevating the transaction isolation level above the default only for a good reason
          • Examining the T-SQL code to see whether developers have added locking hints, index hints, or join hints
        • Using AlwaysOn Replicas to improve concurrency
          • Reporting queries can reduce concurrency in OLTP systems as they often carry out aggregation which blocks InsUpDel operations.
          • A new feature allows you offload reporting load to a read-only replica created with always on.
          • To do this, add a clause to the connection string ApplicationIntent=ReadOnly.
        • Deadlocks
          • Deadlocks occur when 2 or more tasks permanently block each other because each has a lock on a resource that the other tasks are trying to lock.
          • Resources that could cause a block are
            • Locks.
            • Worker Threads.
            • Memory.
            • Parallel query execution-related resources.
            • Multiple active result sets(MARS) resources.
          • Types of deadlocks
            • conversion deadlocks
              • They occur when two connections both own shared locks on a resource and they both try to convert their locks to exclusive locks.
            • writer-writer deadlocks
              • caused by resource ordering
              • Fixed by ensuring that all code access resources in the same order
            • Reader-writer deadlocks
              • Occur when both connections hold exclusive locks on different resources, and each tries to request a shared lock on the other connection’s resource
            • Cascading deadlocks
              • They occur when there are more than 2 connections involved in the deadlock. Even after a deadlock victim has been chosen, another deadlock still exists.
              • Deadlock victims will continually be chosen until all the deadlocks in the chain have been resolved.
          • SQL server has a thread just for looking for deadlocks
          • Deadlocks raise a 1205 error upon which one of the processes in the deadlock is terminated and rolled back.
            • The victim is usually the one with the fewest transaction log records
              • This behavior can be overridden by setting deadlock_priority on the connection
          • Capturing deadlock information
            • Using trace flags
              • Turn on trace flag 1204
                • Writes deadlock information into SQL server activity log in a text format
              • Turn on trace flag 1222
                • Writes the deadlock information to XML
              • The trace flags can be turned on using DBCC Traceon

DBCC Traceon(@TraceFlagNo, -1)

                • This allows you carry out tracing to the SQL server activity log without the overhead of a profiler trace
            • For more detailed information, use
              • SQL trace to capture
                • lock:deadlock chain
                • deadlock graph
                • lock:deadlock
              • Extended events
                • xml_deadlock_report events in the system_health session
      • Activity monitor
        • Used to monitor the health of an instance.
        • It has 4 views to show real time metrics
          • %Processor time
          • Waiting tasks
          • Database I/O
          • Batch requests per second
      • Diagnosing bottlenecks
        • Microsoft recommends using the waits and queues methodology to solve bottlenecks


          • First determine what is causing the bottleneck
            • Examine sys.dm_os_wait_stats so you can see the cumulative wait times for every possible wait type since SQL server was restarted.
              • To reset it run DBCC sqlperf(‘sys.dm_os_wait_stats’, clear);
            • Examining the wait type can reflect a bottleneck on the same resource.
            • A task is a unit of work scheduled for execution.
              • It can be in 1 of 3 stages
                • Running: currently executing
                • Runnable: It has the resources it needs and is waiting for a scheduler to execute
                • Suspended: Waiting for a resource
              • You monitor task status using sys.dm_os_waiting_tasks
                • The wait_duration_ms column shows how long
                • The wait_type shows what the wait type is for that session
SELECT wt.session_id,wt.wait_duration_ms,wait_type, blocking_session_id, status
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_sessions AS s ON wt.session_id = s.session_id
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_sessions AS s ON wt.session_id = s.session_id
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_sessions AS s ON wt.session_id = s.session_id
      • Using reports for performance analysis
        • The standard reports provide a lot of information in graphical format.
        • Custom reports can be added for additional reports access.
        • SQL server 2012 performance dashboard report shows a simple method of seeing current and historical reports for a database.
      • The Kill process
        • Used to remove a session that is not responsive or is the head blocker in a blocker chain.
        • Can be done using
          • Activity monitor: Select the process and click kill
          • TSQL:

Kill { @SessionID } [with statusonly]

            • the With statusonly option shows you the status of the kill operation as the DB is moved to a transactionally consistent state.
            • Kill cannot be used to kill your own user processes
SQL server agent, backup and restore
Managing SQL server Agent
  • Server Agent enables you automate execution of scheduled or event-driven administrative jobs.
    • A job is a collection of tasks(Job steps).
    • The Server agent account determines how the system interacts with resources(Both internal and external).
      • If The agent account is on the domain, it needs
        • Logon as a service right
        • membership of the access security group so jobs owned by domain users that aren’t in the host’s local admin group don’t fail.
        • be a member of sysadmin role to
      • To support SQL server agent proxies, the service account must have additional permissions
        • This allows users that aren’t part of sysadmin role create jobs
      • If you want to support multi-server job processing, you must add the account used for agent to the TargetServersRole in msdb on the master server.
    • Server agent security
      • If a user is not a member of the sysadmin role, he must be a member of either of the following roles to use server agent.
        • SQLAgentUserRole: Users have permissions only on the local jobs and job schedules they own. They cannot run multi-server jobs
        • SQLAgentReaderRole: Members have SQLAgentUserRole permissions and can view the properties of all jobs and job schedules including multi-server jobs
        • SQLAgentOperatorRole: Members can execute, enable, disable, stop or start all jobs and can delete the job history for any local job.
    • SQL server agent mail
      • Database mail is used to transmit alerts as email messages.
    • SQL server agent error log
      • This stores warnings and error messages from server agent.
    • SQL server can be configured to generate alerts when it detects events and trigger conditions
      • Generating an alert involves
        • Creating an alert name
        • Specifying an alert trigger
        • Specifying an alert action
    • Jobs are used to automate routine tasks.
      • They can be scheduled to run at particular times
      • They can be monitored with the job activity monitor.
      • Notifications can be sent to operators from SQL agent.
      • In multi server environments
        • Server are either targets or master servers.
          • Target servers report to master servers
        • Server agent’s service account should be on the domain.
        • To link a target to a master, on the target, run sp_msx_enlist, and to delist, run sp_msx_defect
  • Backup strategy
    • SQL server supports the following backup types
      • Full DB backups
        • Includes all DB objects, system tables and data.
        • Transactions that occur during the backup are also recorded
        • Allow you to completely restore a database.

Backup database @DBName to @Destination

      • Differential backups
        • Backs up data altered since the last full backup regardless of whether other differential backups have been taken.
        • Require less time than full DB backups
        • Transactions that occur during the backup are also recorded

Backup database @DBName to @Destination with differential

      • Transaction log backups
        • This records the changes that have occurred since the previous transaction log backup then truncates the transaction log thus removing transactions from the log that have been committed or rolled back.
        • They represent the state of the transaction log at the time the backup starts.
        • They are incremental backups and must be restored in sequence of being taken.

Backup log @DBName to @Destination with differential

      • File and filegroup backups
        • This backs up individual DB files and filegroups rather than performing a full DB backup.
        • This backs up very large databases
        • The transaction log must also be backed up
        • The truncate log on checkpoint must be disabled.
        • Useful for very large DBs
      • Copy only backups
        • Functionally the same as full DB or transaction log backups.
        • They do not affect the backup sequence
          • It doesn’t truncate the transaction log and is ignored by differential backups as a reference point.
    • You can use multiple backup types in a backup strategy.
    • Backup compression reduces the amount of space required to store a backup
      • CPU load is increased during the compression.
      • Resource governor can be used to limit CPU usage.
      • Compressed and uncompressed backups cannot be stored in the same media set.
    • Recovery models
      • A recovery model determines how you backup a database and the strategy you implement to meet your recovery point objective(RPO).
        • RPOs enable you measure the amount of data that might be lost in the event of failure. Often times the time referenced in SLAs
        • RTOs (Recovery Time Objective) measure the maximum amount of time it can take to restore data to the RPO.
      • There are 3 recovery models
        • Simple: No transaction log backups are taken. It is only possible to recover to the most recent DB backup only.
        • Full: This enables you take transaction log backups and full backups.
          • This allow you recover to the point of the last full/ transaction log backup.
          • Allow point-in-time recovery
        • Bulk-logged: a recovery model that minimizes transaction log activity during bulk operations. You cannot perform point-in-time recovery using this model
    • System DB backups
      • The preferred backup strategy for each system DB is
        • master: Frequently
        • model: only after modifications
        • msdb: Only after you make changes to jobs and alerts
        • tempdb: Not backupable
        • Resource DB:
          • Holds copies of all system objects
          • Cannot be backed up directly. Require a File-based backup. Should be backed up after patching or upgrading SQL server
        • Configure distribution.
          • Used when the instance is the distributor in a replication topology.
          • Should be backed up when the publisher and subscription DBs are backed up.
    • Backing up replicated DBs
      • When backing up the replicated DBs, you should also back up the system DBs used by the replication process.
      • The following strategies are required
        • At the publisher: master, msdb and publication
        • At the distributor: master, msdb and distribution
        • At the subscriber: master, msdb and subscription
      • When backing up databases that use snapshot and transactional replication, set the sync with the backup option on the publication and distribution databases.
        • This ensures transactions in the publication DB log are not truncated until they have been backed up at the distribution DB
    • For mirrored DBs, you can only back up the principal DB. On it, you cannot use the backup log with norecovery option
    • Backing up alwaysOn replicas
      • This depends on the automated backup preference setting applied at the availability group level
      • There are 4 options for availability group backup
        • Prefer secondary
          • Default
          • Backups occur only on secondary except when only the primary is online in which case it uses the primary
        • Secondary only
          • Backups occur only on secondary and do not occur if the primary is the only replica online
        • Primary
          • Backups should be taken on the primary replica.
          • This allows differential backups to be taken
        • Any Replica
          • Backups can occur on any replica in the availability group.
      • The backup preference can be changed using the alter availability group statement with the automated_backup_preference option

Alter availability group @AvailabilityGroupName set (automated_backup_preference = secondary_only)

      • Log backups and copy only backups are supported on replicas
        • You can only backup when the replica is in synchronized or synchronizing state
    • Database checkpoints
      • Checkpoints write all modified pages held in memory to disk then records that information in the transaction log.
      • checkpoints can be configured on the instance using sp_configure with the recovery interval (min) option.
    • Backup devices
      • To add a backup device, use the sp_addumpdevice SP

exec sp_addumpdevice ‘disk’, ‘@BAckupDeviceName’, ‘@directory\file.bak’

        • This allows you to backup to the device directly buy specifying @BAckupDeviceName

backup database @DBName to @BAckupDeviceName

    • Backing up media sets
      • A media set can include a single or multiple files OR tapes.
      • A media set can contain up to 32 backup devices.
    • Backup events are written to the Application event log.
      • Successful backup are written to the msdb.dbo.backupset table.
  • Restoring DBs
    • Restore options depend on the backups that exist
      • Full backups only: this allow you restore to the last time the backup was carried out
      • Full and differential backups: Allow you restore to the point where the differential base taken
        • Restore the full backup with the norecovery option
        • restore the differential with the recovery option.
      • Full and transactional log backups.
        • Backup the current transaction log with the no_truncate option.
        • restore the most recent full backup with the Norecovery option
        • restore transaction logs in sequence with the NoRecovery option till you get to the newest
        • Restore the newest transaction log with the recovery option
      • Full, differential and transaction logs
        • Backup the current transaction log with the no_truncate option.
        • restore the most recent full backup with the Norecovery option
        • restore the differential with the Norecovery option.
        • restore transaction logs in sequence with the NoRecovery option till you get to the newest
        • Restore the newest transaction log with the recovery option
    • File restores
      • If a DB contains multiple files or filegroups, and uses the full or bulk-logged recovery model, you can perform a file restore rather than a full restore.
      • Done if the volume that hosts a particular file fails or a file becomes corrupt while other files used by the DB do not have any problems.
      • You can perform an online restore only on enterprise editions.
        • It enables you restore files in secondary filegroups as long as the primary filegroup is online.
    • Page restores
      • Used to repair a small number of damaged pages in an otherwise healthy database.
      • If a DB file constantly suffers from damaged pages, the volume that hosts it is faulty.
      • Possible under the following conditions
        • The Database uses full or bulk-logged recovery
        • Only on read-write filegroups
        • Only Db pages can be restored.

Restore DB @DBName page=’@IDOfFileHostingPage:PageNo1, @IDOfFileHostingPage:PageNo2′ from @BackupFile with Norecovery.

        • Restore any logs taken after the backup of BackupFile with Norecovery option
        • Perform a log backup
        • Restore the log with the recovery option.
      • AlwaysOn Availability Groups and DB mirrors support automatic page repair. If an error occurs on the primary, it broadcasts a request to the secondary and retrieves the page.
        • To verify the status of automatic page repair
          • Always on: Query the sys.dm_hadr_auto_page_repair DMV
          • DB mirrors: Query sys.dm_db_mirroring_auto_page_repair DMC
    • Restoring a DB protected with transparent Data encryption
      • You can restore a DB With TDE as long as you have the certificate and private keys.
        • Include backing up the certificate and private keys in your backup scheme.
      • Before performing a restore on a system without the certificate, use the create certificate statement with the from file option.
    • Restoring system DBs
      • Model and msdb are restored like user DBs
      • Tempdb doesn’t need to be restored
      • Master:
        • Start the DB in single user mode( specify -m in the startup options of the service)
        • Run the restore DB command

restore database master for @Backup with replace

        • If it is unsuccessful, rebuild the system databases and restore recent copies of all databases

setup.exe /Q /Action=RebuildDatabase /InstanceName=MSSQLServer

    • Restoring replicated DBs
      • When you restore a publication, distribution or subscription DB, you must also restore the master and msdb Databases.
        • This ensures all the DBs are consistent as regards replication configuration and settings
    • Checking DB status
      • Run the following against master to see the DB status

select databasepropertyex(‘databasename’, ‘status’);

        • Possible statuses are
          • Online
          • Offline
          • Restoring
          • Recovering
          • Suspect: A recovery operation failed and the DB might be corrupt
          • Emergency: The DB is in a read-only state and accessible only to members of the sysadmin role.
      • You can check which users have access to a DB by running the following against master

select databasepropertyex(‘databasename’,’useraccess’)

SQL server querying revision

  • Relational Model is the foundation of SQL. A relation has a head(columns ) and a body(rows) The relational model is based on 2 mathematical models.
    • Set Theory
      • A set is a collection M into a whole of definite, distinct objects m( the "elements" of M).
    • Predicate logic
      • A predicate is an expression that when attributed to some object, makes a proposition true or false or unknown (in TSQL ).
  • Logical Query processing:
    • The phases are  From ->where ->group by -> having-> select-> order by. The output of each phase is a virtual table which serves as the input of the next phase.
      • From: where the tables you want to query is specified and table operators(apply, Join) also.
      • Where: Filters rows based on the predicate specified. Only rows for which the predicate evaluates to true are returned.
      • Group By: defines a group for each distinct combination of values in the grouped elements from the input table. All expressions processed in subsequent phases must guarantee a single value per group.
      • Having: Filters based on a predicate but is only evaluated per group and filters groups as a whole. Only groups for which the predicate evaluates to true are returned
      • Select: If a query is a grouped query, each group is represented by a single row in the result. It has two steps
        • Evaluating the expressions in the select list and producing the result attributes. Aliasing is also carried out here. Aliases defined here aren’t visible to phases processed earlier and to items in the same select list.
        • using the Distinct clause to remove duplicates.
      • Order By: returns the result in presentation order according to the expressions that appear in the order by list. The output of this clause isn’t relational
Using the from and select Clauses
  • From Clause
    • Where you indicate the tables that you want to query
    • where you can apply table operators like joins to input tables.
  • Select Clause
    • Evaluates expressions that define the attributes in the query’s result, assigning them aliases if needed
    • Eliminates duplicate rows using the DISTINCT clause(if specified) .
Data Types and Built in functions
  • Data Types
    • They do the following to objects
      • constrain the data that is supported: It has a domain of supported values and will not allow values outside that domain.
      • Encapsulates behavior that operate on it: If data is stored in the wrong data type, it will need to be converted before it can be used. e.g. + is addition for Int, but concatenation for char.
      • Assign the size of elements: The bigger the type used, the more storage it uses and the more I/O is required to process queries that include it.
      • Exposes the data through operators.
    • For Fixed types, updates will not require the row to physically expand, requiring no data shifting as they use the storage for the indicated size. Hence they offer better update performance.Variable types on the other hand use the storage for what you enter+offset information. They let you save a lot of storage, and offer better read performance.
    • Choosing a data type for keys
      • When choosing keys, we either go with Intelligent keys(those generated by the application we interact with), or surrogate keys(attributes chosen solely for the purpose of being used as keys). To generate surrogate keys, the most common techniques are
        • Identity column property: Generates keys in an attribute(of a numeric type) with a scale of 0.
        • Sequence object: An independent object from which you can obtain sequence values. It is not tied to any column and can be generated before use.
        • Non-sequential GUID(Globally Unique ID): type is UniqueIdentifier. generated with function NEWID(). GUID are globally unique across space and time.
        • Sequential GUID: This is used to generate sequential GUIDs using the NewsequentialID(). Can only be used as a constraint on an attribute.
      • When choosing surrogate keys, the size matters for performance(read performance) reasons as larger attribute(data size) consume more storage. If indexes are built on the key, the storage requirement is increased.
    • Date and time functions
      • To get current date and time
        • Getdate(): TSQL only, Datetime attribute
        • current_timestamp: standard, datetime  attribute
        • Sysdatetime(): standard, datetime2
        • sysdatetimeoffset(): standard, datetimeoffset
      • To get the date and time component of a datetime attribute
        • cast the attribute as either date or time.
        • datepart() is used to extract a part of a datetime attribute and return it as an integer. e.g. datepart(month, current_timestamp). can also calculate the week number e.g. datepart(week, current_timestamp). The day(), month() and year() are abbreviations of datepart().
        • datename() is similar to datepart() but it returns a character string. Mostly used to extract the month from a datetime attribute.e.g. datename(month, current_timestamp)
      • To build a datetime attribute from components.
        • Datefromparts(): build a date from a numeric type.
        • datetime2fromparts()
        • datetimefromparts()
        • datetimeoffsetfromparts()
        • smalldatetimefromparts()
        • timefromparts()
      • To get the last day of a month, use EOMONTH().
      • To add to a datetime attribute use DateAdd(). You must select which of the components you want to add to. format is Dateadd(component, difference, datetime). e.g. dateadd(day, -100, current_timestamp).
      • To calculate the difference between 2 datetime attributes use datediff(). You must select which of the components you want to calculate the difference for e.g. datediff(day, current_timestamp, dateadd(week, -100, current_timestamp))
    • Character Functions
      • Concatenation: You can use either + or concat
        • +: If any of the inputs is null, the operator returns a null.
        • Concat(): If any of the inputs is null, it substitutes the null with an empty string.
      • Substring extraction and position
        • Substring(): To extract a substring from a string. format is substring(string, start_position, no_of_characters). If no_of_characters >Len(string), it just returns the chars available till the end of string(Len-Start_position).
        • Left() and Right(): To extract a specified number of characters from the left of right of a string. Format is left(string, No_of_Characters).
        • Charindex(): Returns the position of the first occurrence of a string provided as the first argument in the string provided as the second. format is charindex(string_to_search_for, String_to_be_searched).
        • PatIndex(): returns the position of the first occurrence of a pattern provided as the first argument in the string provided as the second. Use % for any string, _ for a character and [] to represent a range of characters. Format is PatIndex(pattern, String_to_be_searched). e.g. Patindex(‘%[0-9]%’, @String_to_be_searched).
      • String Length
        • LEN(): to measure the number of characters in a string. Ignore spaces at the end of a string.
        • Datalength(): To measure the length of a string in bytes. Doesn’t ignore spaces at the end
      • String Alteration
        • Replace(): Replace in a string provided as the first argument, all occurrences of the string provided as the second argument, with the string provided as the third argument. Format is replace(@String_to_be_used, @string_to_replace, @replacement_string)
        • Replicate(): Replicate an input string a requested number of times. Format is Replicate(@string, @number_of_times).
        • Stuff(): From the string input as the first argument, it starts at the position input as the second argument, removing the number of characters input as the third argument and replacing them with the string input as the 4th argument. Format is Stuff(@String, @Start_position, @Number_of_characters_to_remove, @replacement_string).
      • String Formatting
        • Upper(): returns the input string in upper case.
        • Lower(): returns the input string in lower case.
        • Ltrim(): returns the input string with the leading spaces removed.
        • RTrim(): returns the input string with the ending spaces removed.
        • Format(): Format an input string based on a format string. Can also format into a culture as the third parameter. E.g. format(1759, ‘00000000’) returns 00001759.
    • Case Expressions and related Function
      • Case returns a specified value depending on what is used as the input string. It is either simple or searched.
        • Simple when the Case simply compares an input expression to multiple when expressions and returns a result expression. Format is
                                             Case @input_expression
                                             When @value then @result_expression
                                             When @value2 then @result_expression2
                                             Else @result_expression3
                                             End as @Alias.
        • Searched when case uses predicates in the when clauses to determine the result expression to return.
                                             When @input_expression @operator(e.g. >) @value(e.g. 10) then @result_expression
                                             When @input_expression @operator(e.g. >) @value2(e.g. 10) then @result_expression2
                                             When @input_expression @operator(e.g. >) @value3(e.g. 10) then @result_expression3
                                             Else @result_expression4
                                             End as @Alias
      • Coalesce(): Standard. Accepts a list of expressions and returns the first that is not null. the type of the return expression is determined by the return expression. It also carried over the nullability of  the first element if used in a select…into statement.
      • NullIf(): Standard. Accepts 2 expressions and returns null if they’re equal. returns the first if they’re not.
      • IsNull(): Accepts 2 expressions and returns the first that is not null. the type of the return expression is determined by the first input. It set the attribute as not null when used in a select…into statement.
      • IIF(): returns one value if input predicate is true and another value otherwise. Format is IIF(@predicate, @true_result, @false_result)
      • Choose(): allows you provide a position and a list of expressions and returns the expression in the indicated position.Format is Choose(@postion, @expression1, @expression2, @expression3, @expression4)
Filtering Data with Predicates
  • T-SQL supports 3 query clauses that enable you filter data based on predicates.
    • Where
    • On
    • Having
  • Predicates, 3-valued logic and search arguments.
    • Predicates follow 2 Valued logic when an attribute is not nullable. An expression can either evaluate to true or false only.
    • Predicates follow 3 valued logic when an attribute is nullable. An expression can either evaluate to true, false or unknown(when comparing with a null mark). for a statement to capture all tuples in 3 valued logic, you must take care of null. You can do that using the following predicates
      • Is Null: returns true when the tested operand is null.
      • Is not null: returns true when the tested operand is not null.
    • Filtering queries have a performance side. with the right filters, the DBMS can choose to use indexes without doing a full table scan.
    • Search arguments
      • This refers to how predicates are arranged to maximize the efficiency of queries by making them use indexes.
      • Predicates that are Search arguments are in the format @attribute @operator @value OR @value @operator @attribute. e.g. col1 = 10.
      • Applying manipulation to the filtered column prevents the predicate from being a search argument e.g where coalesce(col1, ‘3’) = coalesce(@value, ‘3’) ensures that indexes defined on col 1 aren’t used. Instead. consider using where col1 = @value or (col1 is null and @value is null);
    • Manipulating predicates
      • Combining predicates: you can use AND and OR to combine predicates.
      • Negating predicates: negating true = False and vice versa. However negating null still yields null. To include Null in a result, use the is null operator.
      • When combining operators, there’s precedence. Highest is NOT the AND then OR. You can use the () operator to control evaluation.
    • Filtering Character Data
      • When filtering character data, ensure the literal in the filter is of the same data type as the attribute being filtered. Otherwise, SQL will have to carry out implicit conversion to get them to the same data type.
      • Like Predicate: this supports filtering string data base on pattern matching using wildcards. The wildcards are
        • %: Any string including an empty one
        • _: A single character
        • []: A character list. E.G. ‘[AC]%’
        • [-]: A range of characters. E.G. ‘[A-C]%’
        • [^]: A single character that is not in the list or range. E.G. ‘[^A-CG]%’
      • If you want to look for a character that is a wildcard, you indicate it after the wildcard as an escape character using the ESCAPE keyword. you also use the escape character(!) before the character e.g. Like ‘!_%’ Escape’!’ lets you search for _ in a string.
    • Filtering Date and Time Data
      • For date literals,
        • it’s best to express them in the language-neutral pattern YearMonthDay i.e. ‘20140319’. This is neutral for all date and time types. It ensures the date is the same even if a users language sets the dateFormat to something else.
        • You an also use the convert or parse to specify the format
          • Convert(): this lets you specify the style number as the conversion style
          • Parse(): this lets you specify a culture name
      • If only date is specified during an insert into a Datetime attribute, the time is assumed to be midnight.
      • When specifying the range, use the format where @date >= @firstdate and  @date < @lastdate+1. If the time value is specified, SQL server can round up and cause wrong data to be included in the result.
Sorting Data
  • A table represents a relation and as it is a set, no order is guaranteed for its elements. A query that doesn’t have an explicit instruction to return the rows in a particular order doesn’t guarantee the order of rows in the result.
  • Order BY clause
    • This is the only way to guarantee the rows are returned in an order.
    • The default sorting order is ascending. otherwise, specify desc.
    • Determinism: When ordering isn’t done by a unique attribute, the order of the query’s output isn’t deterministic. The order is deterministic when the order by list is unique. Deterministic queries are the only ones where the result set is repeatable. You can specify ordering by a unique attribute for the ordering to be deterministic. format is
                         Order by @Attribute1 desc, @attribute2
    • You can order by elements you’re not selecting as long as the rows would have been normally allowed there. This doesn’t apply when distinct is specified as the result rows don’t map to the source rows in a one to one manner, but in a one-many manner.A attribute can represent a number of tuples in this scenario.
    • It is executed after select so column aliases in the select clause are visible here.
    • Nulls are sorted before non-nulls in ascending order. to sort them after, use a coalesce to replace them with a cheat value.
    • Indexes help SQL server avoid actually sorting the data to address a order by request.
Filtering Data with TOP and OFFSET-FETCH
  • Top: Used to filter a requested number or percent of rows based on indicated ordering. Specified in select clause along with the requested number of rows in parentheses. The ordering is dependent on the order by clause. Format is
                    select top (@No_Of_Rows) @attribute1, @attribute2, @attribute3
                    select top (@percent_of_rows) percent @attribute 1, @attribute2, @attribute3
    • You can use a variable as the @no_of_rows.
    • To ensure the result is deterministic, you can either
      • Use the with ties option: This ensures that all ties are included. It usually results in more rows being included than were requested. Format is
                                   Select top(@No_of_rows) with ties @attribute1, @attribute2, @attribute3
      • Guarantee determinism by adding a tiebreaker: add an element to the order by that makes the result deterministic.
  • Offset-Fetch: A standard alternative to TOP. Can be used to skip a certain number of rows.
    • Appears right after the order by clause which must be present.
    • When offset-fetch is used, the order by clause plays an additional role of telling the offset-fetch which rows it needs to filter.
    • Offset
      • You must specify the offset clause which allows you select the number of rows you want to skip(or 0 if you don’t). Format is
                                   Offset @no_of_rows_to_skip rows
      • When used without Fetch, it skips @no_of_rows_to_skip and fetches all the remaining rows.
    • Fetch
      • The Fetch clause is optional and is used to set how many rows you want to filter. You must specify first or next before @number_of_rows_to_retrieve. Format is
                                   Offset @no_of_rows_to_skip rows fetch first @number_of_rows_to_retrieve rows only
    • You can use expressions as input for the @no_of_rows_to_* value.
Using Joins
  • Joins are used to query data from different tables and match the rows that need to be related. There are 3 types.
  • Cross Joins
    • Performs a cartesian product of two tables, multiplying between the tables and yielding a row for each combination of rows from both sides. If @RelationA has @X tuples and @RelationB has @Y tuples, then their cross join will have @X * @Y tuples.
    •  The format is
                         Select A.@Attribute1, B.@Attribute2
                         From @relationA as A
                              cross join @relationB as B
                         where A<= @Value
  • Inner Joins
    • Allows matching of rows from two tables based on a predicate(usually comparing a primary key to a foreign key on another table). Rows for which the predicate evaluates to false or unknown are discarded.
    • The join’s matching predicate is specified in the ON clause.
    • If an attribute of either table will be ambiguous, then aliasing must be used.
    • Filtering predicates can be specified in either the Where or On clauses . It is the equivalent of concatenating the predicates with an and operator.
    • Format is
                         Select A.@Attribute1, B.@Attribute2
                         From @relationA as A
                              inner join @relationB as B
                                   On A.@Attribute1 = B.@Attribute2
                         where A<= @Value
  • Outer Joins
    • Allows you preserve all rows from one or both sides of the join, irrespective of their being matching rows on the other side based on the ON predicate. e.g. Left outer join preserves the @first_relation, inserting nulls for the attributes that aren’t matched in the @second_relation.
    • The ON and Where clauses play different roles. 
      • ON: This plays a strictly matching role as it purely matches rows on the preserved side with the other side, inserting nulls into the attributes on the other side if it doesn’t find a match. This isn’t final with respect to the preserved side of the join.
      • Where: This carries out filtering. This is final with respect to the preserved side of the join.
    • The full outer join is an outer join that preserves both sides.
    • The Format is 
                         Select A.@Attribute1, B.@Attribute2
                         From @relationA as A
                              left outer join @relationB as B
                                   On A.@Attribute1 = B.@Attribute2
                         where A<= @Value
  • Multi Join Queries
    • A join takes place between 2 tables(physical or virtual) at a time from left to right. The result of one is used as the left input of the next.
    • You can use parentheses to separate joins into different logical phases. e.g.
                         Select A.@Attribute1, B.@Attribute2
                         From @relationC as C
                              Left outer join
                                        @relationA as A
                                        @relationB as B
                                        inner join
                                        on A.@Attribute1 = B.@Attribute2
                               On C.@Attribute1 = B.@Attribute2
Using Subqueries, Table Expressions and Apply
  • Nesting queries allows you refer to one query’s result from another, removing the need to store the result of one query in a variable before another query can access it.
  • Exists Predicate
    • This accepts a subquery as input and returns true when the subquery returns at least one row and false otherwise.
    • Usually used in co-related subqueries with where for filtering. Filters out rows where attributes from the @input_query do not exist in the result set of the subquery specified in the exists predicate.
    • format is

                         Select A.@Attribute1

                         from @Relation1 as A
                         where EXISTS in
                              (    select B.@Attribute1
                                   From @Relation2 as B
                                   where B.@Attribute1 <> @Value 
                                        and A.@Attribute1 = B.@Attribute1)
  • In Predicate
    • This accepts a subquery as input and returns true when the subquery returns at least one row and false otherwise.
    • Usually used in co-related subqueries with where for filtering. Filters out rows where the specified attribute from the @input_query do not exist in the result set of the subquery specified in the IN predicate.
    • Format is
                Select A.@Attribute1, B.@Attribute2
                from @Relation1 as A
                where A.@Attribute1 IN
                (    select B.@Attribute1
                     From @Relation2 as B
                     where B.@Attribute1 <> @Value 
                          and A.@Attribute1 = B.@Attribute1)
  • Subqueries can be categorized in terms of their containment and result
    • Containment:
      • Self-contained: Subquery that has no dependency on the outer query. They can be run independently of the outer query. E.g.
                                   Select  A.@Attribute1, A.@Attribute2
                                   From @relationA as A
                                   where A.@Attribute3 >= (select Min(@Attribute4) from @relationB);
      • Correlated: Subqueries where the inner query has a reference to a column from the table in the outer query. They cannot be run independently of the outer query.

                                   Select  A.@Attribute1, A.@Attribute2

                                   From @relationA as A
                                   where A.@Attribute3 >= 
                                       (select Min(@Attribute4) 
                                        from @relationB as B
                                        where B.@Attribute2 = A.@Attribute2 );
    • Result:
      • Scalar: Subqueries that return a single value and can be used where a single-valued expression is expected. If the subquery returns more that one value, the query fails, if it returns an empty set, it is converted to a Null  e.g.
                                   Select  A.@Attribute1, A.@Attribute2
                                   From @relationA as A
                                   where A.@Attribute3 >= (select Min(@Attribute4) from @relationB);                                   
      • Multi-valued: Subqueries that return multiple values in a single column. Can be used where a multi-valued result is expected(e.g. when using the IN predicate)

                                   Select A.@Attribute1, A.@Attribute2

                                   From @relationA as A
                                   where A.@Attribute1 not IN
                                        (select @Attribute4 from @relationB where @Attribute3=@Literal)

      • Table-valued: See Table Expressions for expatiation.
  • Table Expression: This are named queries. You write an inner query that returns a relational result set, name it and query it from an outer query. Their result must be relational. There are 4 types
    • Derived Tables: This are visible only to the statement that defines them. 
      • The derived table’s query is defined in parentheses in the from Clause of the outer query and the inner query is named.
      • Due to the all-at-once property of SQL, the name assigned a derived table isn’t visible to other elements in the same query processing phase as the derived table’s definition. To join multiple instances, you must duplicate the code.
      • Format is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal
                         Select A.@Attribute1, A.@Attribute2
                         From (@Inner_Query) as A                         
    • Common Table Expressions(CTEs): This are visible only to the statement that defines them.
      • As the CTE name is assigned in another query processing phase, the same CTE can be referenced several times.
      • Format is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal

                                   With @CTE_NAME

                                   Select @Attribute1, @Attribute2, @Attribute3
                                   From @CTE_NAME
      • To define multiple CTEs, use the format

                                   With @CTE_NAME1

      • CTEs have a recursive form. 
        • The body of the recursive query has 2 or more queries separated by a union all operator( for a union all, the attribute type for both sides has to be the same).
        • One of the Queries in the CTE serves as the anchor member and returns a valid relational result. It is invoked only once.
        • The other queries have a reference to the CTEname in their recursive member. This query is repeatedly invoked until it returns an empty set.  
        • In each iteration, the reference to the CTE name from the recursive member represents the previous result set. 
        • The reference to the CTE name from the outer query represents the unified results of the invocation of the anchor member and all invocations of the recursive member.
        • Format is, @Anchor_member = ‘select @Attribute1, @Attribute2 from @relation1 as @Alias1 where @Attribute5=@literal’ and @recursive member = ‘ select @Attribute1, @Attribute2 from @CTE_NAME1 Join @relation1 as @Alias2 on @Alias1.@Attribute2 = @Alias2.@Attribute2

                                             With @CTE_NAME1

                                                  Union ALL

    • Views: These are database objects, hence they’re re-usable. They don’t accept parameters.
      • Queries can select from them as if they’re tables, but they’re not persisted in the database
      • Format for creation is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal 

                                   Create view @View_name

      • Format for execution is

                                   Select @Attribute1, @Attribute2

                                   From @View_Name                  

    • Inline table-valued functions: These are database objects, hence they’re re-usable. 
      • They support input parameters.
      • Format for creation is below where  @Inner_Query = select @Attribute4 from @relationB where @Attribute3=@Literal

                                   Create Function @Function_Name (@Variable as Int) returns table

      • Format for execution is

                                   Select @Attribute1 

                                   From @Function_name(@variable);
  • Apply
    • The apply operator allows you apply a table expression given to it as the right input to each row from a table expression given to it as the left input.
    • The inner query in the right table expression can have a reference to an element from the left table. the right table expression is evaluated separately for each left low.
    • Sometimes used to replace cursors.
    • There are 2 types
      • Cross apply
        • The right table expression is applied to each row from the left input.
        • If the right table expression returns an empty set for a left row, the left row isn’t returned.
        • The Format is 

                                             Select A.@Attribute1, A.@Attribute2, C.@Attribute2

                                             From @RelationA as A
                                             Cross Apply ( 
                                                  Select B.@Attribute2, B.@Attribute3
                                                  From @RelationB as B
                                                  Where A.@Attribute3 = B.@Attribute4) as C

      • Outer apply
        • The right table expression is applied to each row from the left input
        • if the right table expression doesn’t return a set for a left row, the right expression returns null as a place holder. 
        • The format is 
                                             Select A.@Attribute1, A.@Attribute2, C.@Attribute2
                                             From @RelationA as A
                                             outer Apply ( 
                                                  Select B.@Attribute2, B.@Attribute3
                                                  From @RelationB as B
                                                  Where A.@Attribute3 = B.@Attribute4) as C
Set Operators
  • Set Operators operate on two result sets of queries, comparing rows between the results. Depending on the result of the comparison and the set operator used, the operator determines whether to return the row or not.
  • They have the following guidelines
    • As complete rows are matched between the result sets, the number of columns in the queries must be the same and the column types of corresponding columns need to be compatible(implicitly convertible).
    • They consider 2 nulls as equal
    • The individual queries must return a relation.
    • the column names of result columns are determined by the first query
    • They have an implied distinct. They do not return duplicate rows
  • The format for set operators is
  • They are
    • Intersect: It returns only rows common to both result sets.
    • Union: It unifies the results of 2 input queries. 
    • Except: It performs a set difference between the first result set and the second returning only rows present in the first but not in the second. The order of the input queries matters
    • Union all: A multiset operator. It unifies the result of set 2 input queries, while keeping duplicates( the implied distinct for set operators doesn’t apply).
      • Format is


                                   Union all

  • Their order of precedence is Intersect>Union/Intersect and Union = Intersect, however this can be controlled with parentheses
Data Analysis Operations
  • Data analysis functions are functions applied to a set of rows and return a single value.
  • They are either Group functions or window functions.
Grouped Queries
  • Grouping allows you to define groups in your data, and then perform data analysis computations(e.g. aggregate functions) per group.
  • Grouped queries include the group function or the group by clause.
    • Aggregate Functions: They arrange all rows in one group if no group by clause is specified and perform computation for the group.
      • Examoles include MIN, MAX, SUM, COUNT, AVG
      • You can work with distinct occurrences by specifying the DISTINCT option
        • Format is

                                                       select @Attribute1, count(distinct @Attribute2)

                                                       from @Relation1
                                                       Group by @Attribute3.
    • Group By clause: 
      • Allows grouping the rows based on a specified grouping set of expressions specified in the group by clause.
      • All elements with the same value in the grouping set’s attribute are grouped and subsequent computation give a value per group.
        • Count(): Specifying count(*) instead of count(@Attribute1) returns a smaller count as * causes the count to ignore nulls.
      • The grouping set can be made of multiple elements(Attributes).
      • All Attributes with Null are considered equal.
      • Having Clause
        • This uses a predicate that evaluates per group unlike WHERE that’s evaluated per row.
      • Format is 

                                    Select @Attribute1, @Attribute2, @Attribute3, @Attribute4

                                   From @Relation1
                                   Group By @Attribute2, @Attribute1

      • All clauses that operate after a group by clause must be guaranteed to return a single result  value per group. 
        • You can refer to elements from the group by list as they return one distinct value per group.
        • If you want to refer to Attributes from the underlying table that aren’t in the group by list, 
          • apply an aggregate function to them to show they’re guaranteed to return one value per group 
          • add the attribute to the group by clause.
          • Carry out the grouping in a CTE and join the CTE to the base table adding the elements during the join.
      • Multiple Grouping Sets
        • They allow you group data in more than one way.
        • They’re specified in the group by clause.
        • There are three ways to define them
          • Grouping Sets: 
            • Used to list all grouping sets you want to define in the query.
            • the empty (         ) grouping set defines one group with all rows for computation of total aggregates.
            • Format is 

                                                                 Select @Attribute1, @Attribute2, @AggregateFunction(@Attribute3) as @Attribute4

                                                                 From @Relation1
                                                                 Group by grouping sets
                                                                 (@Attribute1, @Attribute2), –GroupingSet1
                                                                 (@Attribute2), –@GroupingSet2
                                                                 (     ) –@GroupingSet3

          • Rollup:
            • Used when there’s a hierarchy formed by the input elements. where only a subset of the possible grouping sets is interesting.
            • When the elements form a hierarchy, you use rollup to avoid computing unnecessary aggregates.
            • E.G. rollup(@Attribute1, @Attribute2, @Attribute3) groups by
              • (@Attribute1, @Attribute2, @Attribute3)
              • (@Attribute1, @Attribute2)
              • (@Attribute1)
              • (                 )
          • Cube:
            • Accepts a list of expressions as inputs and defines all possible grouping sets that can be generated from the inputs including the empty grouping set.
            • E.G. Group by Cube(@Attribute1, @Attribute2) groups by
              • (@Attribute1, @Attribute2)
              • (@Attribute1)
              • (@Attribute2)
              • (                 )
        • Nulls are used are as place holders in rows where an attribute isn’t part of the grouping set. 
          • If all grouped attributes are defined as not null, you can identify the rows that are associated with a single grouping set, based on a unique combination of nulls and non-nulls in those columns. 
          • However, if the attributes allow nulls, you need to use either  grouping() or grouping_ID()to differentiate.
            • Grouping():Accepts a single elements as input and returns 0 when the element is part of the grouping set and 1 when it is not.
              • You can identify a grouping set by looking for 0s in the elements that are part of the grouping sets and 1 in the others.
              • Format is
Select @Attribute1,grouping(@Attribute1), @Attribute2, grouping(@Attribute2), @Attribute3, grouping(@Attribute3), @AggregateFunction(*)
from @Relation1
Group by rollup(@Attribute1, @Attribute2, @Attribute3);
            • Grouping_ID(): 
              • This accepts the list of grouped columns as inputs and returns an integer representing a bitmap. Calculate the bitmap to determine which of them(having 0) is part of the grouping set and which(having 1) is not part of the grouping set.
              • The result integer is the sum of the values representing elements that are not part of the grouping set because their bits are turned on
              • Format is

select @Attribute1, @Attribute2, @Attribute3, grouping_id(@Attribute1, @Attribute2, @Attribute3), @Aggregate_Function(*)

From @relation1
group by rollup(@Attribute1, @Attribute2, @Attribute3)
Pivoting and Unpivoting data: done using table operators.
  • Pivoting: A technique of grouping and aggregating data, transitioning it from a state of rows to a state of columns. 
    • You use the Pivot operator.
    • In all pivots, you need to specify
      • Grouping element(on rows): what do you want to see on the rows
      • Spreading element( on columns): what do you want to see on the columns
      • Aggregation element( Data): what do you want to see where columns and rows intersect.
    • It is recommended to use a CTE because the grouping element is determined by elimination(i.e. the elements in the table not specified in the pivot statement). If the base table is queried directly, other elements will be included as grouping elements. In the example below, grouping is done by the @grouping_column. 
    • The @spreading_column and @aggregation_column cannot be the results of an expression. They must be column names from the CTE. However, an expression can be used and aliased in the CTE.
    • Count(*) cannot be used as the aggregate function.if you want a count, define a dummy column in the CTE and reference that.
    • Only one aggregate function can be used.
    • The IN clause only accepts a list of static values. you can use dynamic SQL if you don’t know the values during query definition.
    • @select list = @Grouping_column, @distinct_spreading_value1, @Distinct_Spreading_Value2, @Distinct_Spreading_Value3
    • Format is 

                         With Pivotdata as 

                              Select @grouping_column, @spreading_column, @aggregation_column
                              from @Relation1
                         Select @Select_list     
                         from Pivotdata
                              Pivot (@Aggregate_function (@aggregation_column)
                                   For @spreading_column in (@distinct_Spreading_Value1, @Distinct_Spreading_Value2, @Distinct_Spreading_Value3))as @Alias;
  • UnPivoting: A technique for rotating the input data from a state of columns to a state of rows.
    • It takes a set of source columns and rotates them to multiple rows, generating 2 target columns: One to hold the source column names, another to hold the values.
    • You need to specify 3 elements.
      • The set of source columns you’re unpivoting
      • The name to assign to the target names column
      • The name to assign to the target values column
    • The unpivot operator filters out rows in the @value_column with NULLS as they represent inapplicable cases.
    • The types of all columns that you’re pivoting must be the same
    • The Format is
                         Select @column_list_alias, @name_columns_alias, @values_column_alias
                         From @Relation1
                         Unpivot(@values_column for @name_columns in (@source_Column1, @Source_column2, @Source_column3) as @Alias;
Window Functions
  • Window Functions enable you to perform data analysis computations. You define the set of rows per function and then return one result value per each underlying row and function. You define the set of rows for the function to work with using the over clause.
  • They use aggregate function which are applied to a window of rows defined by the over clause. 
  • Over Clause
    • Used to define a window of rows for aggregate functions. The window is defined in respect to the current row. 
    • an empty over clause Over() represents the underlying query’s result set.
    • You can use window partition clause to restrict the window.
    • Format is

                         Select @Attribute1, @Attribute2, @Attribute3, @Aggregate_function(@Attribute1) over(partition by @Attribute2)

                         From @Relation1;
  • Framing: 
    • Used to define ordering within the partition using a window order clause, and based on that order, confining a frame of rows(rows or range) between 2 window frame extents/delimiters.
    • You define the Windows frame extents using a window frame clause, which requires a window order clause to be present.
    • Window Frame extent/ delimiters define the range of elements in a result set as a result of an aggregate function.
    • Window frame units
      • Rows: 
        • Allows 3 Window Frame Extents/delimiters.
          • Unbounded preceding/following: the beginning or end of the partition
          • Current Row: representing the current row. assumed as default if none is specified.
          • @number Rows preceding/following: @number rows before or after the current respectively
          • Format is

Select @Attribute1, @Attribute2, @Attribute3,@Attribute4 @Aggregate_function(@Attribute2) over( partition by @attribute1 order by @Attribute1, @Attribute3 rows between @delimiter1 and @delimiter2) 

      • Range
        • Allows 2 Window Frame Extents/delimiters.
          • Unbounded preceding/following: the beginning or end of the partition
          • Current Row: representing the current row. assumed as default if none is specified.
Window Ranking
  • Allow you rank rows within a partition based on a specified ordering specified in the over clause.
  • There are 4 ranking functions
    • Row_number(): computes a unique sequential integer within the window partition based on the window ordering.
      • Isn’t deterministic if the ordering isn’t unique.
    • Rank(): Assigns the same ranking value to all rows with the same ordering value. If n of the ordering value have the same value, the next ordering value is N+@rank_count.
    • Dense_rank(): Assigns the same ranking value to all rows with the same ordering value.  Assigns the same ranking value to all rows with the same ordering value. However, it only ranks distinct values hence if n of the ordering value have the same value, the next ordering value is @rank_count + 1.
    • Ntile(@NumOfTiles): Allows you arrange rows within the partition in a requested number of equally sized tiles, base on the specified ordering.
      • Isn’t deterministic if the ordering isn’t unique.
  • Format is
               Select @Attribute1, @Attribute2, @Window_Ranking_function() over (order by @Attribute1 as @ordering_value partition by @Attribute2).
Window offset Functions
  • Return an element from a single row that is in a given offset from the current row in the window partition, or from the first or last row in the window frame
  • There are 4 of them
    • LAG & LEAD : 
      • relies on an offset with respect to the current row. Supports Window partition and ordering clauses, but not window frame clauses. For Lag, Returns an element from the row in the current partition that is a requested number of rows before the current row(based on window ordering). For Lead, Returns an element from the row in the current partition that is a requested number of rows after the current row(based on window ordering).
      • Allow you specify the offset and the value to return if the attribute is null.
      • Format is

                                   Select @Attribute1, @Attribute2, @Attribute3, Lag(@Attribute1, @offset, @value_If_Attribute_is_null)over (partition by @Attribute2 order by @Attribute3)

                                   From @Relation1

      • Returns the first/last value from the first or last rows in the window frame.
      • Supports Partitions, order and frame clauses.
Querying Full-Text Data

Full-Text search 
  • Full text search allows approximate searches. They require full-text indexes inside full-text catalogs. Full text indexes can be created over Char, Text,image, XML and varbinary(max) type attributes.
  • Full-Text search components
    • You can store documents in binary or XML columns and use full-text queries on them.
    • Word breakers and stemmers perform linguistic analysis on all full-text data. they’re language specific.
      • Word breakers identify individual words(tokens) which are inserted into full-text indexes in compressed format.
      • Stemmers generate inflectional forms of a word based on the rules of a language.
    • You can specify a different language for each full-text indexed column.
    • Stopwords: used to define words that shouldn’t be indexed because they’re irrelevant.
    • Full-text queries can also search for synonyms as defined in the thesaurus file. The following can be configured if thesaurus files
      • diacritics_sensitive: set to 0 if a language is accent insensitive and 1 if it is.
      • Expansion: to add expansion words( words to search for in addition e.g. student, scholar) for a word.
      • replacement: To define replacement words or terms for a specific word or term e.g. Windows 2008 instead of win2k8.
    • To search a document’s properties, you can specify a search property list and add properties to the list.Document properties have predefined GUID and integer IDs.
    • Full Text Catalogs and indexes
      • Full text indexes are stored in full-text catalogs which are virtual objects that contain full-text indexes. 
      • Statistical semantics are an option for full-text indexes
        • They create additional key phrase and document similarity indexes that are part of statistical semantic indexing. Full-text search uses the key phrases to identify and index documents that are similar.
        • They allow you to query the meaning of the documents
        • Allow you to create text-mining solutions.
        • You need to install the semanticsDB
Using the contains and freetext predicates
Contains and freetext are predicates for limiting the result set of a query using full-text indexes
  • Contains 
    • Allows you search for Words.
    • You can add custom weights to words you are searching for.
  • FreeText
    • Less specific that contains. 
    • Searches for the values that match the meaning of a phrase not just exact words. 
    • The engine performs word breaking of the search phrase, stems and identifies expansions and replacements for the words in the searched term with words from the thesaurus.
    • Format is 

                         Select…..From…..Where freetext(Ftcolumn, ‘Searchword1’).

Full-text search functions
  • Containstable() and freetexttable() returns 2 columns: Key and Rank. 
    • The key column is the unique key from the index used in the key index clause of the create fulltext index statement. Rank returns an value between 0 and 1000 which states how well a row matches your search criteria.the higher the rank, the higher the relevance.
    • Freetexttable supports a fewer number of parameters.
    • Format is 
                         CONTAINSTABLE ( table , { column_name | ( column_list ) | * } , ‘ <contains_search_condition> ‘ [ , LANGUAGE language_term] [ , top_n_by_rank ] )
    • Search conditions are the same as for a contains predicate.
  • Semantic search functions
    • There are 3 functions for this
      • SemanticKeyPhraseTable
        • returns a table with key phrases associated with the full-text indexed column from the column list.
        • The source_key parameter specifies the unique key from the index used in the key index clause of the create fulltext index.
        • Format is 

                                                            SEMANTICKEYPHRASETABLE ( table, { column | (column_list) | * } [ , source_key ] )

      • SemanticSimilarityDetailsTable
        • Returns a table with key phrases that are common across 2 documents.
        • Format is

                                        SemanticSimilarityDetailsTable(table, source_column, source_key,matched_column, matched_key

      • Semanticsimilaritytable
        • Returns a table with documents scored by semantic similarity to the searched document specified with the source_key parameter. 
        • Format is
                                             SEMANTICSIMILARITYTABLE( table, { column | (column_list) | * }, source_key )

Querying and Managing XML Data

Intro to XML
  • XML uses tags to name part of an XML document. The parts are called elements.
  • XML documents are ordered: The position of elements matters
  • XML is case-sensitive unicode text
  • Some characters in XML have special meanings as they are considered markup. to include them,
    •  you must escape them using ampersand(&) followed by a special code, followed by semicolon(;).
    • you can use the XML CDATA section, written as <![CDATA[@String_with_character_to_insert]]> which prevents the special characters in the string being parsed.
  • processing instructions are written in <?@processing_instruction?>
  • Comments are in the format <!–@Comment–>
  • XML is a document when it has a single root node and a fragment when it doesn’t.
  • Elements have attributes
    • Attributes have their own names and values. They can be presented in 2 ways
      • attribute-centric: Attributes values are enclosed in quotation marks. 
      • element-centric: Attribute are nested elements of the original elements. Use the keyword elements to ensure the output is element-centric.
  • Namespaces are used to distinguish between elements from different sources. 
    • they are declared in the root element of an XML document.
    • They can be aliased. then you prefix element names with the alias 
  • XML data should have a fixed schema(metadata) as the actual data is mixed with elements and attribute names. The most common  schema is XSD.
    • Validating an XML document is checking whether the document complies with a schema.
    • XMLSCHEMA directive: 
      • This is used to return the XSD schema of the document. It is returned before the actual XML data. It accepts a parameter that defines a target namespace.
Producing XML from Relational Data.
  • You can convert the result set to XML using the FOR XML clause of the select T-SQL statements. 
  • The FOR XML clause comes after the order by clause.
  • There are several modes
    • For XML raw
      • The output is similar to the tabular expression of the data. Every row from the returned rowsets converts to a single element named row, and the columns become the attributes of this element.
      • You must use the elements keyword for the output to have element-centric attributes
      • You can enhance the RAW mode by renaming the row element, adding a root element, including namespaces and making the XML returned element centric.
      • The XMLSchema directive can be used here to return the XML schema
    • For XML auto
      • Elements are nested if a previous element shares an attribute specified earlier. 
      • You must use the elements keyword for the output to have element-centric attributes
      • Due to the nesting, the order by clause must be selected carefully as well as the select clause so the nesting is properly done.
        • The ordering should follow a one-to-many relationship
      • The XMLSchema directive can be used here to return the XML schema
    • For XML Path
      • Allows you manually define the XML returned.
      • It uses standard XML XPath expressions to define the elements and attributes you are creating.
      • Column names and aliases serve as Xpath expressions defining the path to the element in the XML generated. 
        • Path is expressed in a hierarchical way; levels are delimited with slash /
        • Every column becomes an element. If you want attribute-centric XML, prefix the alias name with @.
      • To create XML with nested elements for child tables, you need subqueries in the select part of the query in the path mode. The subqueries have to return a scalar value in a select clause.
        •  You return a scalar value by returning XML from the subquery. Then the result is returned as a single scalar XML value. You format nested XML
          from the subquery with the FOR XML clause, like you format XML in an outer query. Additionally,you have to use the TYPE directive of the FOR XML clause to produce a value of the XML data type, and not XML as text, which cannot be consumed by the outer query.
  • The With Namespaces clause 
    • Used before the select clause to define namespaces and aliases in the returned XML.
    • A colon is used to separate the namespace from the element name.
  • The Root Keyword is used to specify a root element.
Shredding XML to tables
You can convert XML to tables. Their are 2 methods
  • OPENXML rowset function: this provides a rowset over in-memory XML documents by using the Document Object Model(DOM) presentation.
    • Useful when a documents will be shredded multiple times.
    • Before parsing the DOM, you prepare the DOM presentation of XML using sys.sp_xml_preparedocument. After shredding the document, you remove the DOM presentation using sys.sp_xml_removedocument.
    • OpenXML uses the following parameters.
      • An XML DOM document handle, returned by sp_xml_preparedocument. 
        • This is an integer.
      • An XPath expression to find the nodes you want to map to rows of a rowset returned. 
        • Specified as rowpattern and defines how XML nodes translate to rows. The path to a node is used as a pattern
        • nodes below the selected node define rows of the returned rowset.
      • A description of the rowset returned
      • Mapping between XML nodes and rowset columns.
        • Done using the WITH clause of the openXML function. 
          • In the With clause, you specify an existing table, which is used as a template for the rowset returned or you define a table syntax.
        • You can use the flag parameter of the openXML function to specify the mapping. 
          • 1 is for attribute-centric mapping
          • 2 is for element-centric mapping
          • 11 is for both mappings
Querying XML data with XQuery
XQuery allows you navigate XML, loop over nodes, shape the returned XML instance and much more.
  • Xquery is case-sensitive
  • Variable start with $ e.g. $i
  • Xquery returns sequences which can include atomic or complex values(XML nodes).
    • Any node(element, attribute, text, processing instruction) can be included in the sequence.
    • Sequences can be formatted to get well-formed XML. 
  • Query() is used to query XML.
    • What to return from a query can be specified with the following
      • query(‘*’) returns the complete XML instance.
      • data() is used to specify what to return when you query XML. 
        • Used with query(). e.g. @XML_Variable.query(‘data(@attributes_to_return)’)
        • It returns the data values not the attributes.
        • data(*) returns all the atomic data values in the xml.
        • data(root/@Attribute1/@Attribute2) is used to return the atomic data values in the path(root/@Attribute1/@Attribute2) specified.
  • Every identifier is a qualified name(Qname). A Qname consists of a local name and optionally a namespace prefix.
    • Standard namespaces are
      • xs: namespace for an XML schema
      • xsi: The XML schema instance namespace. Used to associate XML schemas with instance documents
      • xdt: namespace for data types.
      • fn: functions namespace
      • sqltypes: namespace for mapping for SQL data types
      • xml: default XML namespace.
    • You can use standard namespaces in queries without defining them. 
  • You define your own data types in the prolog(at the beginning of your XQuery) which is seperated from the query body with a semicolon. 
  • XQuery Data Types
    • Xquery data types are either 
      • node types: attribute, comment, element, namespace, text, processing-instruction and document
      • atomic types: xs:string, xs:boolean, xs:QName, xs:date, xs:time, xs:datetime, xs:float, xs:double, xs:decimal, xs:integer.
  • Xquery functions
    • Xquery functions are organized into categories.
  • Navigation
    • With Xquery, you can specify a path absolutely or relatively from the current node. XQuery takes care of the current position in the document.
    • Every path consists of a sequence of steps, listed from left to right.
      • Steps are separated with slashes
      • A step may consist of 3 parts
        • Axis: the direction of travel. there are 6 axes
          • Child::
            • children of the current node. default and can be skipped
          • descendant::
            • all descendants of the node. Direction is down
          • self::
            • the current node. direction is here
          • descendant or self::
            • Retrieves the context node and all its descendants. direction is here and then down
          • attribute::
            • Retrieves the specified attribute of the current node. Direction is right
          • parent::
            • Retrieves the parent of the current node. direction is up.
        • Node test: specifies the criterion for selecting nodes.
          • This follows the axis you specify. Can be as simple as a name test.
          • Specifying a name means you want nodes with that name.
          • You can use wildcards.
            • Asterisk(*) means you want any principal node with any name.
            • A principal node is the default node kind for an axis. It’s an attribute if the axis is attribute:: and an element for all others
          • You can also perform node type tests
            • They help you query nodes that are not principal nodes.
            • Types of node type tests 
              • comment(): select comment nodes
              • node(): returns any kind of node
              • processing-instruction(): retrieve a processing instruction node
              • text(): retrieve text nodes or nodes without tags
        • Predicate: narrows down the search
      • The format of a path is


  • Predicates: There are 2 types
    • Numeric:
      • Select nodes by position. they’re included in brackets.
      • e.g. (/x/y)[1] means the first y child of elements x
    • Boolean:
      • Select all nodes for which the predicate evaluates to true.
      • Can work on sequences and atomic values
        • For sequences, if one atomic value in a sequence leads to a true exit of an expression, the whole expression is evaluated to true.
          • Done using general comparison operators(=, <, > ,!=)
        • For Atomic values,
          • use value comparison operators do not work on sequences
    • Xquery supports if…then…else expression.
      • Used to return one expression or another depending on the value of the expression in the if. Similar to case.
    • They’re a for each loop used to iterate through a sequence of nodes returned by an XPath expression.
    • Signify For..let..where..order by..return
      • For: used to bind iterator variables to input sequences. 
      • Let: Optional. let you assign a value to a variable for a specific iteration
      • Where: optional. lets you filter the iteration
      • order by: control the order in which the elements of the input sequence is processed. the order is based on atomic values.
        • Expressions passed here must be compatible with the gt operator.
      • return: evaluated once per iteration. Also used to format the result XML.
        • Use {} around expressions to extract the value of a the expression
The XML data type(using)
  • XML data type has 5 methods that accept XQuery as a parameter.
    • query(): use to query XML data.
    • value(): retrieving atomic values
      • It returns a scalar value. you must specify the position of the element in the sequence you are browsing.
      • value(@Xquery_Expression, @dataTypeToReturn)
    • exist(): checking existence of a node in an XML instance.
      • Commonly used in the where clause.
      • returns 1 where the node searched for exists in the XML instance
      • returns 0 where the node searched for doesn’t exist in the XML instance
      • returns null if the XML instance is NULL
    • modify(): modifying sections within the XML data
      • Due to the size of XML data, it’s more efficient to change a small portion of it.
      • Invoked in an update T-SQL statement
      • There are 3 keywords for data modification
        • insert
        • delete
        • replace value of
    • nodes(): shredding XML data into relational data. 
      • Usually faster than using the OPENXML function.
      • returns a result set that contains logical copies of the XML instance. 
      • it has to be invoked for every row in the table.
  • XML is a Large object(LOB) type. 
  • Using XML data type for Dynamic Schema
    • You can validate XML data against an XML schema collection. 
      • This is useful for a dynamic schema. and allows you support data of different schemas.
      • XML schemas allow you constrain the possible nodes, the data type of the nodes, the values of the nodes.
      • Validating XML against a schema result in a typed XML.
    • to prevent binding a product to the wrong schema, use a trigger or (a check constraint to check whether the namespace is the same as the product category name.
  • XML indexes
    • They allow index seek operation instead of a table scan on XML data.
    • The first index on an XML column is the primary XML index. It contains a shredded persisted representation of the XML values. For each XML value in the column, the index creates several rows of data. It can only be created on tables with a clustered primary keys.
    • there are 3 types of secondary XML indexes
      • Path: useful if queries specify path expressions. 
      • value: useful if queries are value-based and the path is not fully specified or it includes a wildcard
      • Property: useful for queries that retrieve one or more values from xml instances using value()
Creating tables and enforcing data integrity.
Tables are created in two ways
  • Select into: creates a table automatically using the output of a query for the basic table definitions
  • Create table statement: you explicitly define the components of the table
    • The definition of a table(table Schema) can be defined after the table’s creation using the ALTER TABLE command.
    • The database Schema  is a container that you can use to group database objects
      • A table can belong to only one schema.
      • Every table’s name has two parts @Database_schema.@table_name.
      • Schema’s cannot be nested in other schemas.
      • Schema’s are owned by users.
      • To move a table from one schema to another, use the format

                                   Alter schema @destination_Schema transfer  @table_name.

    • Naming tables and columns
      • The name of tables and columns must be valid SQL identifiers.There are 2 types of identifiers
        • Regular: 
        • Delimited: identified by []or "" surrounding the name.
      • As the maximum identifier length is 128 characters, make sure identifier names aren’t too long
    • Choosing column data types
      • Use the most efficient data type.
    • Null and default values.
      • You can specify whether a column allows null in its value or not by specifying null or not null after the column’s data type.
      • If you don’t want to allow nulls, but there is a possibility of the data being unknown at insertion time, you can create a default constraint to assign a default value

                                   @column_name @datatype  not Null default(‘@default_value’).

    • Identity property and sequence numbers
      • The identity property can be assigned to a column to automatically generate a sequence of number upon insert
        • Only one column in a table can have it.
        • You specify both seed and increment values for the numbers generated.
        • To insert into an identity column, use the format

                                             set identity_insert @table_name On;

    • Computed columns
      • This are columns that are computed based on expressions.
      • The value of a computed table cannot be non-deterministic e.g. from using getdate().
      • Format is

                                   @computed_table_name As @Attribute1 @Operator @attribute2

    • Table Compression
      • You can compress the data in a table to get more efficient storage in the enterprise versions of SQL server. Data can be compressed in
        • Row: SQL server applies a more compact storage format to each row of a table
        • Page: Include row-compression plus additional compression algorithms.
        • Format is

                                             Create table @Table_name

                                             ) with (Data_compression = @CompressionType)

  • Altering a table: used to alter a table’s structure and are or remove table properties. 
    • To add a column, you use the format

                         Alter table @Tablename

                         Add @columnName @column_Definition

    • To alter a column, use the format

                         Alter table @tablename

                         Alter column @columnname @column_definition

    • Cannot be used to
      • Change a column name
      • add an identity property
      • remove an identity property
Enforcing data integrity
  • Tables in a database sometimes need to be validated to ensure
    • The range of values the column accepts
    • Whether the values should reference another column
    • Whether they should be unique
  • The validation are controlled using constraints.
    • Constraints are database objects and must have unique names. however their names should reflect their function(the table name they’re constraining).
    • Can be declared during the create table and alter table commands
    • With Check
      • This causes the constraint creation to fail if any data already in the column
    • Format is 
Create table @table_name
     constraint @constraint_name @constraint_type(@column_name)
Alter table @table_name
Add constraint @constraint_name @constraint_type(@column_name)
    • There are 5 types
      • Primary Key constraints
        • This are used to enforce uniqueness of values for every row in a column.
        • there can only one primary key per table.
        • They can only be created on columns that don’t allow nulls and have only unique values
        • SQL server enforces uniqueness using a unique index.
      • Unique constraint
        • To enforce uniqueness of rows in a table.
        • Only one null can exist in a column with a unique constraint.
        • A unique non-clustered index is created for enforcing uniqueness.
      • Foreign Key constraints
        • A column or combo of columns in one table used to look up data in another table. 
          • In the other table, the referenced column has a primary key or unique constraint applied.
          • The referencing column can have the attribute duplicated.
        • Used to enforce that values entered into the referencing table are from a referenced table
        • Very useful for joining.
          • Creating a non-clustered index on foreign key columns helps query performance.
        • Format is 

                                             Alter table @table_name with check

                                             Add constraint @constraint_name foreign key(@column_name) references @Referenced_table(@referenced_column) with check
      • Check Constraint
        • Used to limit the range of values in a column, beyond the limits of the data type.
        • Format

                                             Add constraint @constraint_name check(@column @comparison_operator @value)

      • Default Constraint
        • This supplies a default value during an insert if no other value is supplied.
        • Values don’t have to be supplied to columns with default during insertion
        • Format is 

                                             Add constraint @constraint_name default(@default_value)

Designing and creating views, inline function and synonyms
Views, inline functions and synonyms are ways of presenting a logical view of a table without exposing the physical base table directly.
Views and inline functions
  • They allow you present the contents of tables to user and present complex query logic so a user doesn’t have to remember them.
  • Views
    • Appear as table and can be used to store and re-use queries in the database.
    • Properties for a view can be 
      • With Encryption: this obfuscates the view text so the user can’t easily see the select text of the view
      • With Schemabinding: binds the view to the table schemas of the underlying tables. The underlying tables cannot be altered without dropping the view and the view cannot have it’s schema changed unless the view is dropped.
      • With View_metadata: returns the metadata of the view instead of the base tables.
      • With Check option: this ensures that if you modify the rows of a table with a view that has a where statement, it ensures you can’t update rows so they’ll not be filtered in by the where statement.
    • A view can only have one result set. Hence it can only have one select statement except a union is used.
    • Views must return a relational result set
    • Views cannot create a table or reference a temporary table.
    • You can alter a view using the alter view command. You must reissue the the view when you alter it.
    • Partitioned views
      • You can use views to partition large tables on one server, in one or more tables across different databases and across multiple servers.
      • A partitioned view uses a union across partitioned tables.
    • To carry out DML(insert, update, delete) with a view,
      •  you must reference exactly one table at a time, no matter how many tables the view references.
      • The view columns must not have expressions surrounding them e.g. aggregate functions.
      • You cannot modify a view column that is computed from a union, cross join, except, intersect.
      • you cannot modify a column whose result is from grouping or distinct.
      • you cannot modify a view with a( top or offset-fetch and a with check option.)
    • Format is
  • Inline Functions
    • They are parameterized views(views with parameters)
    • returns a rowset based on the select statement.
    • Properties of an inline function can be
      • With Encryption
      • With Schemabinding
    • Format is 

                         Create function @Function_name(@parameter1 @parameter1type, @parameter2 @parameter2type)

                         returns table    
  • This provide an abstraction layer for all schema-scoped database objects. They are names stored in a database that can be used as substitutes for other object names
  • You can refer to a synonym anywhere you’ll refer to a table except in DDL statements(e.g. alter)
  • they can refer to objects in a different database or a linked server.
  • Synonyms cannot refer to other synonyms.
  • Synonyms do not expose metadata of to users.
  • Synonyms have a late-binding behavior
    • The object referenced doesn’t have to exist and SQL server doesn’t test for its existence
    • This means you can use a single synonym for many different objects, creating the synonym for each object as you need it. or have it in different DBs refering to the same object.
  • Format is

               Create synonym @Synonym_name for @TableName

Inserting, Updating and deleting data
Inserting data
  • You can Specify the column names in the insert  statement
    • Lets you skip columns that have identity/default values defined.
      • To assign a value to a identity column, you need to issue the set identity_insert @Table_name on.
      • During an insert, If a value or default isn’t specified for a column,
        • SQL checks whether the column gets it’s value automatically e.g. identity or default constraint. If it doesn’t
        • SQL checks if the column allows nulls and inserts a null. If it doesn’t
        • An error is thrown
  • Data can be inserted into a table using
    • Insert values: Allows you insert rows into a table based on value expressions.
      • Allows you change the arrangement of values in the insert statement. 
      • Format is

                                   Insert into @Relation(@attribute1, @attribute2, @attribute3) 

    • Insert Select: Allows you insert the result set of a query into the specified target table.
      • Allows you change the arrangement of values in the insert statement.
      • Format is

                                   Insert into @Relation1(@attribute1, @attribute2, @attribute3) 

                                   Select @select_list
                                   From @Relation2
                                   where @predicateExpression
    • Insert Exec: Allows you insert the result set returned by dynamic SQL or a stored procedure into the specified table.
      • Allows you change the arrangement of values in the insert statement.
      • Format is

                                   Insert into @Relation1(@attribute1, @attribute2, @attribute3) 

                                   Exec @Stored_Procedure1

    • Select Into: Creates the target table based on the definition of the source and inserts the result rows form the query into the table.
      • Copies the data from the source.
      • Copies some aspects of the definition(e.g. column names, types, nullability and identity). If you don’t want them copied, you need to apply some modifications to the data. 
        • e.g @identityColumn +1 removes the identity property from the destination.
        • If you want to change the data type, use case() or convert().
        • Modifying a column makes it nullable. To prevent it from being nullable, use ISNULL() e.g. isnull(cast(@Table1 as @datatype),’@AlternateValue’)
      • Some table definition like indexes, permissions, constraints and triggers aren’t copied. If you want to include them, you need to script them from the source to the target.

Update data

  • The update statement is used to update existing rows in your tables.
  • Format is

               Update @target_table

               Set @Attribute1 = @Expression1, @Attribute2 = @Expression2, @Attribute3 =  @Expression3
               where @predicateExpression
  • Updates can be based on Joins.
    • Update rows in a table while referring to related rows in other tables for filtering and assignment
    • Format is

                         Update @Target_table_alias

                         Set @Attribute1 = @Expression1, @Attribute2= @Expression
                         From @Target_table as @target_table_alias
                         Inner Join @Source_table as @Source_table_alias
                         on @Source_table_alias.@Attribute2 = @target_table_alias.@Attribute1
                         where @predicateExpression
    • Nondeterministic Updates can happen when multiple source rows match a target row.
      • SQL server updates one of the rows.
  • To prevent non-deterministic updates, use the merge statement instead of updates based on joins. It fails if there a multiple source rows for a target row.
  • Update can be done through table expressions
    • Useful when you want to see the rows that will be modified before the actual update.
    • Define a Table expression based on the select, then modify using update
    • Useful when you want to modify rows with expressions not allowed in set clause(e.g. window functions)
    • Format is
                         @Table_Expression as @TE_Alias
                         update @TE_Alias
                         set @TE_Alias.Attribute1 = @Expression1, @TE_Alias_Attribute2 = @Expression2.
                         where @predicateExpression
  • Update Based on a variable
    •  Modify a row and collect the result. Instead of using an update and a select, use the short version
    • Format is

                         Declare @variable as @DataType

                         Update @relation1
                         Set @variable1 = @Attribute1+=@Expression
                         where @predicateExpression
Deleting Data
There are 2 statements you can use to delete rows from a table: 
  • Delete: Used to delete rows from a table. A predicate can be used to limit the rows deleted.
    • Fully logged and can take a long time to complete.
      • Causes the transaction log to increase in size.
    • Exclusively locks the table hence a delete of a lot of data can cause lock escalation.
      • To prevent this, split the delete into smaller chunks using TOP and use an infinite loop

                                   While 1=1

                                        Delete Top(1000) from @relation1
                                        if @@rowcount<1000 break
                                        where @predicateExpression

    • Delete based on a join
      • Enables you delete rows from one table based on information that you evaluate in related rows in other tables.
      • You can also use a subquery instead(with in or exists)
      • Format is

                                   Delete from @relation1_alias

                                   From @relation1 as @relation1_alias
                                   Inner join @relation2 as @relation2_alias
                                   On @relation1.@attribute1 = @relation2.@attribute1
                                   where @predicateExpression

    • Delete based on table expressions
      • Use a table expression to define the rows you want to delete, then issue a delete against the table expression. 
      • Format is

                                   With @TableExpression1

                                   Delete from @TableExpression1

    • Format is 

                         Delete From @relation1

                         where @predicateExpression

  • Truncate: Deletes all rows from the target table.
    • Doesn’t allow an optional predicate filter.
    • Format is

                         Truncate table @relation1

  • Truncate is minimally logged and is faster than delete.
  • Delete doesn’t reset the identity property of a column and truncate does
  • Delete works for a table with a column referenced by a foreign key as long as the row being deleted isn’t referenced. Truncate isn’t
  • Delete is allowed against a table involved in an indexed view. truncate isn’t
Data Modification
Using sequence and identity column property
  • Identity column property
    • Automatically assigns a value to the column upon insertion. 
      • You can’t put a value in the identity column during an insert except you enable identity_insert on the table.
        • Set identity_insert @Table on
    • You cannot update an identity column.
    • Only one column can have this property.
    • The data type must have a scale of 0(no decimal). Numeric/Decimal with a scale of 0 is allowed
    • You can define a seed(first value) and increment( value to increment previous supplied value by). default for both is 1.
    • To query the last identity value generated use
      • Scope_identity(): returns the last identity value generated in your session in the current scope
      • @@identity: returns the last identity value generated in your session regardless of scope.
      • Ident_current(@relation): accepts a table as input and returns the last identity value generated in that table regardless of session.
    • Delete doesn’t reset the identity property of a column and truncate resets the identity column to its seed.
    • Identity doesn’t guarantee uniqueness as you can turn on identity_insert and can reseed the value. To enforce uniqueness, you need a unique index(primary or unique constraint)
    • Identity doesn’t guarantee there’ll be no gaps between values.
      • If an insert fails, the identity value isn’t reset, so the value the column returns is lost.
    • Identity doesn’t support cycling. After you reach the maximum value in the data type, the next insertion will fail. you need to reseed the identity value
      • To reseed an identity value, use the DBCC CheckIdent command.
        • dbcc checkident(‘@tableName’, reseed, @NewSeedValue)
  • Sequence Object
    • An independent object in the database.
    • Can be used for keys that shouldn’t conflict across different tables.
    • The value is generated before use.
    • Supports cycling.
    • The data type must have a scale of 0(no decimal). Numeric/Decimal with a scale of 0 is allowed
      • Default data type is bigint
      • E.g create sequence @sequenceName as @datatype;
    • Has the following properties
      • Increment by: The increment value. Default is 1
      • MinValue: the minimum value to support. Default is the minimum value in the type.
      • MaxValue. The Maximum value to support. Default is the maximum value in the type.
      • Cycle|No cycle: defines whether to allow the sequence to cycle or not. default is no cycle
      • Start with: sequence start value. default is MinValue for an increasing sequence and MaxValue from a decreasing(increment is -1) sequence.
    • To request a new value from the sequence, use the Next Value For function.
      • The next value for is usually part of a default constraint to generate values automatically.
      • e.g. Add constraint @constraint name default(next value for @SequenceName)
    • You can change all properties of a sequence except it’s data type.
      • Use the Alter Sequence command. 
        • To change the current value, use the restart with command
        • E.g. Alter sequence @sequence1 restart with @restartValue.
    • Has a cache Option which controls how ofter the current sequence value is written to disk versus to memory. Data is written to memory until the defined number of values are assigned.
      • This gives a very large performance improvement
      • e.g. alter sequence @sequenceName cache @numberTillDiskWrite
    • Doesn’t guarantee there’ll be no gaps between values.
      • If an insert fails, the value isn’t reset, so the value the sequence returns is lost.

 Merging Data

  • The merge statement allows you merge data from a source table or expression to a target table.
    • Has the following clauses
      • Merge Into @Target_Table: Defines the target table for the operation.
      • Using @source_Table: Defines the source table for the operation. operates similarly to a from clause. 
        • You can define joins, refer to table expressions, apply, pivot, unpivot or use table functions. The output is a table result that is used as the source of the merge
        • you can use any table as the source.
      • On @merge_Predicate: specify a predicate that matches rows between the source and target and defines whether a source row is or isn’t matched by a target row.
      • When Matched [and @predicate] Then @action: Defines an action to take when a source row is matched by a target row. 
        • Only Update and Delete are allowed because since a target row exists, you can’t insert.
        • you can apply 2 when matched if you want different predicates to cause either update or delete.
      • When not matched [by target] [and @predicate] then @action: This defines what action to take when a source row is not matched by a target row. 
        • You can only use insert here as the row doesn’t exist.
      • When not matched by source [and @predicate] then @action: This defines an action to take when a target row exists, by isn’t matched by a source row.
        • Only Update and Delete are allowed because since a target row exists, you can’t insert.
    • At minimum, only one clause is required and it can be any of the when clauses.
    • It’s best to compare all the columns in the predicate of the when matched clause to ensure unnecessary updates aren’t carried out 
    • Format is
The output Clause
  • The output clause for modification statements, which you can use to return information from modified rows for auditing and archiving.
  • Similar to Select
    • To select the previous values from the columns, use specify the modification before the column name. You can use for
      • Insert
        • Used to return information from the inserted rows.
        • Add an output clause after the insert and select the attributes you want to return.
        • Format

                                             insert into @Destination_Table(@Attribute1, @Attribute2, @Attribute3)

                         output inserted.@Attribute1, inserted.@Attribute2, inserted.@Attribute3
      • Delete
        • Used to return information from deleted rows in a delete statement. you need to prefix the columns with the keyword deleted.
        • Add an output clause after the delete and select the attributes you want to return.
        • Format is 

                                             delete from @Destination_Table

                         output deleted.@Attribute1, deleted.@Attribute2, deleted.@Attribute3
      • Update
        • Used to return information from modified rows
          • You can return both the old and new images of the modified rows
          • To return columns before the update, use the prefix deleted
          • To return columns after the update, use the prefix inserted
          • Format is
Update @Destination_table
set @Attribute1 = @Expression1

 deleted.@attribute1, inserted.@Attribute2

      • Merge
        • Merge can carry out inserts, updates and deletes.
          • You can refer to columns from both the target and source.
          • Use $action to determine the action the merge carried out.
            • $action returns a string("insert", update or delete") to signify the action that was taken.
          • You use coalesce(inserted.@attribute1, deleted.@attribute1) to return the applicable one
          • Format is
output $action as The_action, coalesce(inserted.@Attribute1, deleted.@Attribute1)
      • Composable DML
        • to capture output from a modification statement when you only want a subset of the output rows and not all of them.
        • To do this
          • You define a table based on a modification with an output clause.
          • You have an outer insert select against a target table, with the source table being the derived table. 
        • Format is
Insert into @ArchivingTable(@Attribute1, @Attribute2, @Attribute3)
select @Attribute1, @Attribute2, @Attribute3
          output $action as the_action, inserted.@Attribute2, inserted.@Attribute3) as @Alias
where the_action = ‘Insert’
  • You can use an into to write to a table.
    • If you use into, the target cannot participate in a foreign key relationship and cannot have triggers on it. 
    • Add the into after the output clause.
  • Format is

                    insert into @Destination_Table(@Attribute1, @Attribute2, @Attribute3)

output inserted.@Attribute1, inserted.@Attribute2, inserted.@Attribute3
into @ArchiveTable(@Attribute1, @Attribute2, @Attribute3)
Select @Attribute1, @Attribute2, @Attribute3
from @SourceTable


Implementing transactions, Error handling and dynamic SQL
SQL maintains transactional control over all changes to database data. this helps ensure the integrity of database data will never be compromised by partially completed transactions, constraint violations, interference from other transactions and service interruptions
Understanding Transactions
  • A transaction is a logical unit of work: either all of it completes as a whole unit, or none of it does.
  • Controlling the result of transactions
    • Commit: When the work of a transaction has been approved by the user, SQL server completes the transaction by committing them
    • Rollback: If an unrecoverable error occurs or the user decides not to commit, then the transaction is rolled back.
  • Properties of a transactions
    • Atomic: Every transaction is an atomic unit of work, either all database changes in the transaction succeeds or none of the succeed.
      • SQL server treats every DML or DDL command individually and will not allow any command to only partially succeed.
      • If more than 1 command is present in a transaction, SQL server will not allow the entire transaction to be committed unless all commands are successful
        • This behavior can be controlled using error handling techniques.
    • Consistent: Every transaction whether successful or not, leaves the database in a consistent state as defined by all object and database constraints. If an inconsistent state results, SQL server will roll back the transaction to maintain a consistent state.
      • It ensures that all constraints in the database are enforced.
    • Isolation: Every transaction looks as though it occurs in isolation from other transactions in regards to database changes. The degree of isolation can vary based on isolation level.
      • It ensures that when a transaction makes multiple changes to the database, none of the objects being changed by that transaction are allowed to be changed by any other transaction. If another wants to change the data, it must wait for the first to conclude.
      • Maintained by locking and row-versioning
    • Durable: Every transaction endures through an interruption of service. when service is restored, all committed transactions are rolled forward( committed changes to the database are completed) and all uncommitted transactions are rolled back(Uncommitted changes are removed).
      • Maintained using the transaction log. 
      • Every database change is first written to the transaction log, with the original version of the data before being written to the database data. when a transaction is committed, the fact that it has been committed is written to the log.
  • Transaction commands
    • Begin transaction(Begin tran): marks the start of a transaction.
    • Commit transaction/work(commit): Used to commit a transaction
    • Rollback transaction/work(rollback): Used to roll-back a transaction
  • Transactions can be nested 
  • Transaction levels and states
    • @@Trancount: This can be queried to find the level of transaction
      • 0 means the code is not within a transaction
      • >0 means that there is an active transaction
      • >1 indicates the nesting level of nested transactions
      • Doesn’t report uncommittable transactions
    • XACT_STATE(): this can be queried to find the state of the transaction
      • 0 indicates there’s no active transaction
      • 1 indicates that there is an uncommitted transaction and it can be committed, but the nesting level is not reported.
      • -1 means there’s an uncommitted transaction, but it cannot be committed due to a prior fatal error.
      • Doesn’t report the transaction nesting level.
  • Transaction modes: There are 3 ways of working with transactions
    • Autocommit:
      • It is the default mode
      • Single data modification and DDL are executed in the context of a transaction that will be automatically committed when the statement succeeds or automatically rolled back if the statement fails.
      • No transactional command is issued.
      • Changes to the database are automatically handled statement by statement as transactions.
    • Implicit transactions:
      • When you issue DML, DDL or select, A transaction starts, increments @@trancount for the session, but doesn’t automatically commit or roll back the statement. You must issue a commit or rollback to finish the transaction.
      • Turned on with : SET IMPLICIT_TRANSACTIONS ON
      • Running a command sets @@Trancount to 1(you are 1 level deep in the tran). You must manually issue a commit or rollback to finish the trans.
        • Any locks taken by your command are held till you commit. if you don’t commit, you can leave locks open.
        • It is set per session.
        • You can rollback a transaction if you notice mistakes.
    • Explicit transactions
      • Begins when you issue the begin tran command to start a tran.
        • This causes the value of @@Trancount to be 1.
        • When used with implicit transactions, @@trancount is set to 2 when Begin tran is run.
          • This makes it a nested transaction.
      • Nested Transactions: This occcurs when explicit transactions are nested
        • The behavior of rollback and commit changes.
        • Every new Begin Tran increments @@trancount by 1. A commit decrements @@trancount by 1 while a rollback resets @@trancount to 0. Hence, only the commit that reduces @@trancount to 0 is relevant.
      • Marking a transaction: 
        • You can name an explicit transaction. 
          • The name must be a valid identifier
          • SQL only records tranName for the outermost tran
          • E.g. begin tran @TranName
        • Named transactions are used for placing a mark in the transaction log to which the database can be restored.
          • E.g. Begin tran @TranName with Mark
        • To restore to the mark, reference the mark when restoring the transaction log.
          • Use StopAtMark to reference the TranName.
          • Use StopBeforeMark to restre to just before the transaction.
          • E.g. 

                                                       Restore Database @DBName from disk = ‘@Backuppath\backup.bak’ with noRecovery
                                                       Restore Log @DBName from disk = ‘@Backuppath\backup.trn’ with StopAtMark = ‘TranName’

  • Transactions can span batches.
  • Transaction Options
    • Savepoints: Locations within transactions that you can use to roll back a selective subset of work.
      • Define a savepoint using the Save Transaction command.
      • Rollback must reference the savepoint. 
    • Cross-database transaction
Basic Locking
  • Locking is used to preserve the isolation of transactions.
  • There are 2 general modes
    • Shared: For sessions that read data
      • Shared locks are compatible with each other
        • Several sessions can issue shared locks on an object
    • Exclusive: For sessions that write data
      • Used to exclusive lock an object so no other transaction can change the data until the locking tran commits or rolls back.
      • In most isolation levels, most sessions cannot even read data from exclusively locked objects.
      • Exclusive locks aren’t compatible with other locks
        • No other tran can read or write to the object
      • Blocking: 
        • If a session has an exclusive lock on an object, it blocks any other session from using the object until the first session either commits or rolls back.
        • If a session has a shared lock on an object, It blocks any others attempting an exclusive lock until the first session is done.
          • Exclusive locks aren’t compatible wth shared locks
          • Transactions with READ COMMITTED isolation ensures shared locks are dropped as soon as a read is completed.
      • Deadlocking
        • A result of mutual blocking between 2 or more sessions. 
        • Occurs when sessions cannot be resolved by waiting for another tran to finish because of a cyclical relationship between several commands. 
        • SQL server aborts one of the trans and returns error 1205 to the client.
Transaction Isolation Levels
  • This is a method of controlling the isolation requirements for readers.
  • Can reduce or increase blocking level based on whether they require locks, how long locks are held
  • This is configured per session
    • A transaction can read other trans data or allows data to be changed by other trans while the current trans only reads
  • Read Committed:
    •  Default. 
    • All readers in the session will only read data changes that have been committed
    • Select will use shared locks and exclusive locks will block sessions with Read Commited.
    • TO read uncommitted data, queries need to use the (nolock) or (readuncommited) table hints.
  • Read Uncommitted
    • Allows readers to read uncommitted data.
    • Removes the shared locks taken by select statements so readers no longer block writers.
    • Dirty reads can occur.
      • the result of the select can include information that was rolled back by another tran.
  • Read Committed Snapshot(RCSI)
    • Default for Windows Azure SQL database.
    • A method of using Read Committed.
    • Uses row versioning in tempdb to store original versions of changed data. 
      • The versions are stored as long as needed to allow readers read data in its original state. hence shared locks aren’t needed on the underlying resource while reading committed data
    • Set at the DB level.
  • Repeatable Read
    • guarantees that data read in a tran can be re-read later in the tran.
    • Updates and deletes or rows already selected are prevented.
    • Shared locks are kept until the end of a tran.
    • Phantom reads can occur
      • New data can be added after the first read.
  • Snapshot
    • Uses row versioning in tempdb.
    • Configured at the DB level and set per tran.
      • A tran using snapshot isolation will be able to repeat any reads and will not see any phantom reads.
      • New rows may be added, but the tran will not see them.
    • Doesn’t require shared locks as it uses row-versioning.
  • Serializable
    • Strongest isolation level
    • set per session
    • Doesn’t allow any inserts that would satisft the select.
  • Format is

               Set Transaction Isolation level @TranIsolationLevel.

Error Handling
Detecting and raising Errors
  • When an error is encountered, An error condition is generated and SQL server takes an action.
  • If T-SQL code isn’t in a try/catch block , error messages will be passed through to the client and cannot be intercepted in T-SQL code. 
  • When SQL generates an error condition, the function @@ERROR will have a positive integer value indicating the error number.
  • Analyzing error messages
    • Error messages are usually in the form 

                         @ErrorNumber, @SeverityLevel, @State, @Line_Number

                         @Error Message
    • E.g.


    • Error Number: 
      • SQL server error messages are numbered 1 to 49999
      • Custom error messages are 50001 and higher
      • 50000 is for a custom message without an error number.
    • Severity Level
      • There are 26 levels(0 -25)
      • Errors with severity  >=16 are logged to SQL server log and eventVwr
      • Errors with severity >= 19 can only be specified by members of sysadmin
      • errors with severity >=20 cause the connection to be terminated and any open transactions to be rolled back
      • errors with severity between(0 -10) are informational only
    • State: 
      • Maximum value of 127.
      • Used internally by Microsoft
    • Error message: 
      • 255 unicode characters long.
      • all stored in sys.messages.
      • You can add using sp_addmessage
  • You can raise your own errors using RaisError and Throw
    • Raiserror()
      • Message(messageID, string or string variable) along with severity and state are required. E.g. raiseerror(@ErrorMessage, 16, 0);
      • TO issue errors with severity >20, you need to include the WITH LOG option(you must be a sysadmin). 
        • The connection will be terminated
      • Use NOWAIT to send messages immediately to the client.
      • Doesn’t terminate the batch
        • Code after it will continue to run
      • Format is
    • Throw command
      • Similar to Raiserror except for the following differences.
        • Parameters are error_number, message, state
          • Can be used without parameters in CATCH
        •  the error number doesn’t require a matching defined message in sys.messages.
        • Any parameter can be a variable
        • state can be a maximum of 255
        • The severity is always 16
        • It always terminates the batch except when in a try block.
      • Format is


  • Try_Convert and Try_Parse
    • Try_Convert: Attempts to cast a value as a target data type.
      •  if it succeeds, returns the value, and returns a null if it doesn’t succeed. 
      • E.g. Try_convert(@destination_data_Type, @ValueToConvert);
    • Try_parse: Attempts to convert data of a data type into a specified data type. 
      • If it succeeds, it returns the value, and returns a null if it doesn’t succeed.
      • E.g. try_parse(@ValueToConver as @DestinationDataType)
  • Handling Errors after detection:
    •  There are 2 error handling methods available
      • Unstructured
        • You handle errors as the happens by accessing @@Error.
          • Consists of testing individual statements for their error status immediately after the execute. 
            • When an error occurs, the error number is recorded in @@error 
            • If no error occurs, the value in @@error is 0.
          • Querying @@error causes it to reset to a new number as it reports the status of the command last executed.
            • copy the value into a variable and test the variable.
        • Using XACT_ABORT with transactions
          • Affects the entire batch
            • you can make an entire batch fail if any error occurs by beginning it with SET XACT_ABORT ON.
          • Set per session.
            • All other transactions are subject to it until it’s turned off
          • Causes a transaction to roll back based on any error with severity >10
          • you cannot trap the error or the error number.
          • None of the remaining code is executed
      • Structured
        • You designate a central location (the catch block) to handle error.
        • Done using TRY/CATCH
          • You wrap the code to test for errors in a try block
            • Every try block must be followed by a catch block where you handle its errors. they must be in the same batch
          • If an error is detected in a try block, control is passed to its catch
            • Remaining statements in the try aren’t executed
            • after the catch, control is passed to the first statement after the try_catch
          • No error message is sent to the client
            • Any error with a severity >-20  that do not close connection are handled by the catch.
            • to test for the state of a transaction, query XACT_STATE()
              • XACT_STATE values
                • 1: An open transaction that can either be committed or rolled back
                • 0: no open transaction. same as @@trancount = 0
                • -1: an open transaction that can not be committed
          • You cannot trap error that are due to compilation errors, syntax errors or non-existent objects.
          • Functions to report error details
            • Error_number(): returns the error number
            • Error_Message(): returns the error message
            • Error_Severity(): returns the severity of the error
            • Error_line(): returns the line number of the batch the error occurred
            • Error_procedure(): Returns the function, trigger of procedure executing when the error occurred.
            • Error_state(): Returns the state of the error.
          • Using raiserror in try must have a severity between(11 -19) to throw control to the catch.
          • In the catch, you can use 
            • Raiserror
            • throw with params
              • Causes execution in the catch to end.
            • Throw: raises the original error message and passes it to the client.
              • Causes execution in the catch to end.
            • XACT_ABORT
              • Transfers control to the Catch.
              • The transaction is uncommitable in the Catch.
                • XACT_State = -1
Dynamic SQL
The Technique of using T-SQL to generate and potentially execute other T-SQL. Your code dynamically constructs code that will dynamically be executed batch by batch.
  • Used in scenarios where SQL doesn’t let you specify a variable in a clause. e.g FROM clause, USE clause, column names, contents of list in IN and PIVOT
  • Execute Command is used to Execute Dynamic SQL
    • You can use string variables in the execute command
      • The string variables must be a single batch(no GO)
  • SQL Injection
    • Dynamic SQL puts your Database at risk of SQL injection as the user can simply put his code instead of yours into the application
    • To prevent it, parameterize the dynamic SQL generation and execution using sp_executesql.
      • Sp_executesql is an alternative to Exec.
      • It support parameters. The parameters must be unicode.
      • The format is


Designing and Implementing T-SQL routines
Stored Procedures
  • This are routines that reside in a DB and encapsulate code.
  • A Stored procedure is a single batch of code.
  • Called using Execute.
  • They can be passed data through input params and receive data using output params.
    • Params can be passed by position
    • Params can be passed by association.
    • Default parameters(parameters that are assigned default values) do not need to be entered.
  • Can encapsulate T-SQL code.
  • They cannot have the Use statement
  • Created using create Proc command
  • Declare a variable as output for it to be returned to the caller
    • output variables are always optional.
    • after declaring the parameter, use the keyword output(out) to make it an output parameter.
    • During execution, the output keyword must be used after setting the value of the output variable.
  • Set nocount on
    • This lets you prevents messages that show how many rows were returned from being sent to the client.
  • Return is used only if you want to return an output parameter. to return a result set, do not specify a return in the procedure definition.
  • Return and return codes
    • This is used to cause a procedure to exit. 
    • Can be used with conditional statements to control the flow of queries
    • There can be several returns in a SP.
    • Return alone causes the return of 0 for successful and a negative number for an error. 
  • You can call one stored procedure from inside another.
    • temporary tables of a stored procedure are visible down not up.
      • A SP it calls can see it’s contents
      • The calling SP cannot see the called SPs contents
    • Variables and parameters are invisible to called SPs.
  • To Declare, Format is
               create proc @SP_Name
               @variable1 as @variable1type, @variable2 as @variable2type=@DefaultValue, @variable3 as @variable3type output
               Set Nocount on
  • To Execute, Format is

               Exec @SP_NAME @variable1 = @Literal1, @Variable2 = @Literal2, @variable3 =@variable3 output

Branching Logic
  • Branching is used to control the flow of your code. They are used t enable your code handle complex situations that require different actions based on inputs.
  • Statement blocks are statements inside a Begin.. END that allow branching statements execute more than one statement after evaluation of input expressions.
    • E.G.


  • Statements used for branching logic
    • IF/ELSE
      • Allows you conditionally execute code.
      • If the expression after the if keyword is true, the statement after the if is executed. If it isn’t, the statement after the else is executed. 
        • You can have multiple else statements
    • While
      • Allows you create loops to execute a statement block as long as a condition continues to evaluate to true.
      • The condition must be modified so that it eventually leads to a false.
      • You can use the break statement to end a while statement and go to the next block of code.
      • You can use the continue statement to end a loop and cause control to start at the beginning of the statement block again
    • WaitFor
      • It causes the execution of statements to pause for a specified period of time
      • There are 3 options
        • WaitFor Delay: used to specify how much time to wait for
          • E.G. Waitfor delay ’00:00:00′
        • Waitfor Time: used to specify the time execution should continue
          • E.G. Waitfor time ’23:00:00′
        • WaitFor Recieve: used only with service broker
    • Goto
      • Causes your code to jump to a defined T-SQL label.
  • Triggers are a kind of Stored procedure associated with DML events on a table or view. 
    • It cannot be explicitly executed
  • Triggers are fired when a DML event the trigger is associated with occurs. 
  • There are 2 types of trigger events
    • DML triggers: A TSQL batch associated with a table that is defined to respond to a particular DML event such as insert, update or delete, or a combination of those events.
      • A trigger executes only once for each DML statement, no matter how many rows were affected
      • The schema must be the same as that of the table or view it is associated with.
      • Triggers execute as part of the transaction associated with the event statement.
        •  Issuing a rollback in a trigger causes all changes to rollback. 
      • Exit with a return statement.
      • You can access the inserted and deleted tables.
        • They contain rows affected by the modification that caused the trigger to fire.
      • There are 2 types
        • After
          • Fires after the event it is associated with finishes 
          • Can only be defined on permanent tables
          • Checking if @@rowcount =0 in the first line of a trigger improves performance
          • The trigger executes after the DML statement has passed all constraints check
          • After triggers can be nested to a depth of 32.
          • Format is 

                                                       Create trigger @Trigger_name

                                                       On @Table_name
                                                       After @DML_event1, @DML_Event2, @DML_Event3
                                                       If @@rowcount = 0 return;
                                                       Set NOcount on
        • Instead of
          • Fires instead of the event it is associated with.
          • can be defined on temporary and permanent tables and views.
          • Executes a batch of code instead of the DML statement.
          • commonly used on views to take care of constraints that prevent direct actions from being taken(e.g. foreign constraint).

User defined functions(UDF)

  • Encapsulate reusable code and return a scalar value or table to the caller.
  • They accept parameters which are accessed in the function as variables.
  • they are embedded in SQL and they execute as part of a SQL command. 
    • They do not require execute command
  • They cannot perform DDL, or change data in permanent tables using DML statements
  • They can be nested
  • There are 5 options
    • Encryption: obfuscates the source code
    • Schemabinding: binds the scemas of all referenced objects
    • returns null on null input: any null parameters cause a scalar UDF to return Null without executing the body of the function.
    • Called on null input: default, Function will execute even if null is passed as a parameter
    • Execute as: execute under various contexts
  • There are 2 types
    • Scalar UDFs: return a single value to the caller
      • They can appear anywhere an expression that returns a single value can appear.
      • They are run for every row in the result set and need to be very efficient.
      • to create them, the Format is

                                   Create function @FunctionName

                                        @param1 @param1DataType
                                        @param2 @param2DataType
                                   returns int
                                   AS begin
                                   return @Param1 @operator @param2
      • TO execute them , just call them from a capable clause(Where, select etc) e.g
select @Attribute1, @Attribute2, @functionName(@Attribute1, @Attribute2)
From @Relation1
    • Table-valued UDFs: return a table to the caller. 
      • They can appear anywhere a table is expected in the from clause.
      • there are 2 types
        • Inline: 
          • They contain a single select statement that returns a table.
          • Format for creation is 
                                   Create function @FunctionName
                                        @param1 @param1DataType
                                        @param2 @param2DataType
                                   returns table as return
                                   @Select_list From @relation @whereIncludingParams
          • To execute, simply select from the function 
            • e.g. Select @Attribute1, @Attribute2, @Attribute3 from @functionName(@param1, @param2).
        • Multistatement
          • You define the table to be returned as a table variable and insert data into the table variable.
          • Return only ends the function and isn’t used to send data to the caller.
          • Format for creation is
Create function @functionName
@param1 @param1DataType
@param2 @param2DataType
returns @tableVariable table
@Attribute1 @Attribute1DataType,
@Attribute2 @Attribute2DataType
     insert @TableVariable
        @Select_list From @relation @whereIncludingParams
Analyzing Query performance
Query Optimization
  • The number of execution plans grows with query complexity
  • Query Execution phases
    • T-SQL query:
      •  this it the statement to execute.
      • Performed by the relational engine
        • This uses logical operators(e.g. Join)
    • Parsing: 
      • checks whether the query is syntactically correct. 
      • The output is a parse tree( a tree of logical operators)
      • Performed by the relational engine
    • Binding: carries out name resolution for the objects in the query, binding objects to logical operators. The output is an algebrized tree( a parse tree associated with objects)
      • Performed by the relational engine
    • Optimization: generation of candidate plans and selection of a plan: the output is the actual execution plan(a single tree with physical operators)
      • Performed by the relational engine
    • Execution: 
      • This executes the query, compiling the code to binary before execution.
      • The output is the result set.
      • SQL can cache an execution plan so it doesn’t have to do optimization again.
      • Performed by the storage engine
        • This uses physical operators(e.g. merge join operation)
  • The query optimizer is a cost-based optimizer.
    • It assigns a cost to each possible plan. A higher cost means a more complex plan which means a slower query. The optimizer tries to find the one with the lowest cost. 
      • The cost of a query is calculated by determining the algorithm used by a physical operator and the cardinality estimation(estimating the number of rows to be processed).By calculating the cost of all operators in a plan, it can calculate the cost of the whole plan.
        • The number of rows is gotten from optimizer statistics
  • Caching the selected execution plan in the plan cache can speed up the execution of the same query or of an equivalent query from the execution perspective. SQL server parameterizes the query so one plan can serve multiple similar queries
    • Use of Stored procedures lets SQL parameterize queries.
  • SQL caches the execution plan separately from the actual value(the execution context) allowing SQL server reuse the execution plan many times.
System Monitoring for query optimization
  • Monitoring has an impact on the system being monitored.
  • SQL server Extended events is a very lightweight performance monitoring system
    • allows you correlate data from SQL server with data from the OS and application.
    • Server as a container for all extended event objects
      • Events:
      • Targets
      • Actions
      • Predicates
      • Types
      • Maps
  • SQL trace: an internal server mechanism for capturing events.
    • The source for a trace event can be a batch or an event like a deadlock.
    • After an event, the trace gathers the event information
    • SQL server profiler is the UI for trace. 
      • When monitoring a Production system, use trace without profiler as profiler uses system resources.
Using SET session options and analyzing query plans
  • Set Session Options
    • Set Statistics IO command
      • One of the targets of optimizing a query is to lower disk I/O
      • Lowering the number of pages SQL server has to read.
      • A session command( stays on till you complete the session or disconnect from SQL Server)
      • Sample result is 
      • returns the following data
        • Scan count: the number of index or table scans performed
        • Logical reads: the number of pages read from the data cache.when you read a whole table, it shows you an estimate about table size.
        • Physical reads: the number of pages read from the disk. Lower than the number of pages because some pages are cached
        • Read-ahead reads: the number of pages SQL reads ahead
        • LOB logical reads: the number of LOB pages read from the data cache.
        • LOB physical reads: the number of LOB pages read from the disk
        • LOB read-ahead reads: the number of LOB objects read ahead
    • Set Statistics time command
      • this returns statistics of time including CPU time and total time needed for the operation. it shows time for execution and pre-execution phases.
  • SQL caches data in memory to prevent it having to go to disk to pull data.
  • Execution Plans
    • This shows how a query is executing. 
      • the physical operators used during the execution
      • the relative cost of each operator in the total query cost as a percentage of the total query cost.
      • Arrows showing how data flows from one physical operator to another.
      • The total and estimated cardinality of rows. 
        • Their difference shows errors in cardinality estimation
      • to see more details about an operator press F4.
    • the Execution plans(estimated and actual plans) are output as text, XML and graphically.
      • To see text
        • Set Showplan_text and Set Showplan_all for estimated plans
        • Set statistics profile for actual plans
      • To see XML
        • Set showplan_XML for estimated plans
        • set statistics XML for actual plans.
      • Graphically
        • Ctrl + L for estimated plans
        • Ctrl +M for actual plans
    • Some of the operators shown
      • Table scan: scan of a whole table stored as a heap(a table can be stored as a heap or a clustered index).
      • Clustered index scan: Scan of a whole table as a clustered index/ Indexes are stored as balanced trees
      • Clustered index seek: SQL server seeks for teh first value in the seek argument(e.g. a column value in where) in a clustered index and performs a partial scan
      • Index scan: scan of a whole non-clustered index
      • Index seek: SQL server seeks for the first value in the seek argument(e.g. a column value in where) in a non-clustered index and performs a partial scan.
      • RID lookup: Lookup for a single row in a table stored as a heap using the row identifier.
      • Key lookup: Lookup for a single row in a table stored as a clustered index using the key of the index.
      • Hash match join: a join that uses the hash algorithm.
      • merge join: a join that uses the merge algorithm.
      • Nested loops: Joins that use the nested loops algorithm
      • Stream Aggregate: Aggregation of ordered rows.
      • Hash Match Aggregate: Hash algorithm used for aggregating
      • Filter: a filer of rows based on a predicate
      • Sort: a sort of incoming rows.
Dynamic management objects
  • SQL monitors itself and gathers information useful for monitoring the health of an instance, find problems such as missing indexes and optimize queries. The items are stored in Dynamic management objects(function and views).
  • DMOs are in the sys schema and start with dm_.
  • DMOs are virtual objects that give you access to data SQL server stores in memory.
  • They store cumulative data which are reset after the restart of an instance.
  • Common DMOs for query tuning
    • SQL server OS(SQLOS) related DMOs: manages OS resources specific to SQL server
      • sys.dm_os_sys_info: returns info about the system.
      • sys.dm_os_waiting_tasks: info about sessions that are currently waiting on something. E.g. they’re waiting on a block.
    • Execution-related DMOs: Info into queries that have been executed, including their query text, execution plan, number of executions. 
      • The sql_handle column is a hash map of the batch text that is executed. you can use it to retrieve info from the sys.dm_exec_sql_text.
    • index-related DMOs: info about index usage and missing indexes.
Implementing indexes and statistics
  • The logical structure of data
    • A page is an 8Kb unit and belongs to a single object. It is the smallest unit of reading and writing.
    • An extent is 8 consecutive pages
      • Mixed extent: If an extent consist of pages belonging to multiple object.
      • Uniform extent: if an extent consists of pages belonging to a single object. 
    • SQL server traces which pages and extents belong to an object through Index Allocation Map(IAM) pages. An IAM page can point to ~4GB of data.
      • Every table/index has a first IAM page.
      • IAM pages are organized as a doubly linked list(each page points to its descendant and antescedent.
      • SQL server stores a link to first IAM pages in its own tables
      • When a table is created, no pages are allocated. After the first insert, the first page and IAM are assigned.
      • Internal fragmentation means that pages are not full. The more rows you have on a page, the fewer pages SQL server must read to retrieve these rows. and the less memory used for cached pages for the same number of rows.
    • SQL server organizes tables as heaps or as balanced trees(a clustered index/table).
      • Heaps
        • Data stored in any logical order.
        • If you do not create a clustered index, then the table is a heap.
        • Data is found by scanning the whole heap in physical or allocation order. 
        • New rows are stored anywhere in a heap. this leads to their becoming fragmented.
      • Clustered index
        • A table is organized as a balance tree when you create a unique index.
        • a balanced tree has a single root page and one or more leaf pages. 
          • It can have zero or more intermediate pages
          • Data is stored in leaf pages.
          • Data is stored in logical order of the clustering key.
          • Intermediate pages point to lead-level pages. they contain a clustering key value and a pointer to a page where this value starts in logically ordered leaf level.
          • If a single page can point to all leaf-level pages, then only a root page is allocated. 
          • If more than one page needs to point to a lead-level page, an intermediate-level page is added.
          • The root page points to intermediate level pages. If the root-page cannot point to all first-level intermediate pages, a new intermediate level is created.
          • Pages on the same level are organized as a doubly linked list
            • SQL server can find the previous and next page in logical order for any specific page.
          • SQL server uses the IAM pages to track the physical allocation of the balanced tree pages.
        • Columns with unique or non-unique values can be the clustering key(key of a clustered index).
          • Uniqueness is maintained by adding a sequential integer to the clustering key’s value if it occurs again.
        • To seek for a row, SQL only needs to read 3 pages instead of the whole heap(root and intermediate pages).
        • Inserting a new row into a full page causes SQL server to split the page into 2 pages and move half of the rows to the second page as it needs to maintain the logical order of the row. 
          • This leads to internal fragmentation.
        • New pages can be reserved anywhere in a data file as physical order of pages needn’t correspond to the logical order.
          • This means the pages are out of order and are logically/ externally fragmented.
        • You can control internal fragmentation with the FillFactor option for leaf_level pages and Pad_index for the higher level pages of the create index statement. 
        • You can rebuild or reorganize an index to get rid of the external fragmentation using alter index…reorganize and alter index…rebuild.
        • A short clustering key means more rows can fit on pages above leaf level and fewer lower levels are needed. 
          • Fewer levels mean a more efficient index as SQL server needs to read fewer pages to find a row. Hence shorter keys are better for seeks
          • Good for OLTP environments
        • A large clustering key is useful where queries read huge amount of ordered data.
          • This is better for scans.
          • Good for warehousing environments
    • Non-clustered indexes
      • Indexes that do not contain all of the data and serve as pointers to table rows for quick seeks are called non-Clustered indexes.
      • Similar in structure to clustered indexes.
        • Root and intermediate levels look the same as a clustered index.
      • The leaf level is different because it doesn’t hold all the data.
        • What is stored depends on the underlying table organization(either balanced tree or heap).
        • It contains the index keys and row locators.
          • Row locators point to a row in the underlying table. 
            • If the table is a heap, the row locator is called a row identifier(RID) 
              • A 8-byte pointer containing DB file ID and Page Id for the target row, and the target row ID on that page
              • To seek for a row, SQL server needs to traverse the index to the leaf level, and read the appropriate page from the heap , retrieving the row from the page with a RID lookup.
                • RID lookups are efficient when your query searches for a small amount of rows only. 
              • If a row moves in a heap, all non-clustered indexes need to be updated to reflect their new physical location. However, SQL server is optimized to leave a forwarding pointer if a row has to move to another page.
            • If the table is a balanced tree, then the row locator is the clustering key.
              • To seek for a row, SQL server has to traverse all levels on a non-clustered index and also all levels of a clustered index in a key lookup
              • If a row moves , non-clustered indexes don’t need to be updated to reflect their new physical location as the row-locator pints to a logical location.
      • If the clustering key is narrow, SQL server can accommodate more rows on a leaf-level page than when RID(8-bytes) is used as a row locator.
      • You can have up to 999 non-clustered indexes on a table.
      • Filtered non-clustered indexes
        • This span a subset of column values only. and apply only to a subset of table rows. 
        • Useful when some values in a column occur rarely, whereas others occur frequently. You could just create a filtered index over the rare values only
        • SQL server only updates this index for a change in rare values.
        • Can be used to maintain uniqueness of values for a row with values and lots of nulls.
        • Created by adding a where to a create index statement. 
      • Columnstore Indexes
        • a method of storing index data column by column.
        • You need to create a column store index on the table.
        • It is stored compressed. when a query references a single column that is part of a columnstore index, SQL server fetches only that column from disk, not the entire row like in row storage.
        • However, to return rows, they must be reconstructed
        • Useful for queries that aren’t very selective. especially in data warehouse queries.
        • Tables with columnstore indexes are read only because rows must be reconstructed. To update a table, you must drop the columnstore index.
        • Divided into segments which are stored as LOB
          • it is the unit of transfer from disk to memory
          • Segments consist of multiple pages
          • Has metadata that stores the min and max value for each column for that segment. Only segments requested by a query are loaded into memory.
    • Indexed views
      • An index view is a materialized view. 
      • The view must be created with schemabinding
      • You must use count_big() aggregate function
    • The clustering key’s uniqueness is maintained using a uniquifier(which is added to repeating values). this ensures lookups don’t return wrong values.
    • If you update the clustering key or change the table from a balanced tree to an heap, all non-clustered indexes must be rebuilt.
Using search arguments
  • This is a way of writing predicates to enable SQL server use indexes.
    • Indexes are used only when the arguments in the predicate are searchable. The column name is alone on one side of the expression and the calculated value appears on the other side.
    • A column that has an index on it appears in the phrase alone, not as a function’s parameter. 
  • Efficient queries use the WHERE clause to filter rows.
    • The WHERE clause benefits from indexes. 
    • Where doesn’t guarantee that indexes will be used.
    • The clause has to be supported by an appropriate index and it must be selective enough. 
      • If it’s not very selective, it is less expensive to do a table/clustered index scan than to do a non-clustered index seek and then RID or key lookups. 
  • Join can benefit from indexes as well.
  • For queries that aggregate data and use group by, you should support group by with a index. 
    • Aggregating data is done wither with a hash or a stream physical aggregate operator.
      • Stream is faster, however, it needs a sorted input.  
  • Index usage can be tracked using sys.dm_db_index_usage_stats.
  • If SQL server finds all data in nonclustered indexes, then the query is covered by the nonclustered indexes and the indexes are covering indexes.
    • covering more queries by adding more columns causes the key to be longer and the index less efficient. 
      • You can include a column in a nonclustered index on a leaf level only not as part of the key.
      • You do this by using the include clause of the create index statement.
  • SQL server maintains statistics of the distribution of key values in system statistical pages.
    • SQL optimizer uses this pages for cardinality estimation.
  • Auto-created statistics
    • SQL server maintains statistics automatically by default.
    • Statistics are created for indexes, and for columns used as search arguments.
    • There are 3 database options that influence auto creation of statistics
      • Auto_create_statistics: When On, SQL server creates statistics automatically.
        • On by default
      • Auto_update_statistics: When on, SQL server auto update statistics when there are enough changes in the underlying tables and indexes. SQL server updates out of date statistics during optimization. On by default.
      • Auto-update_statistics_async: determines whether SQL server uses synchronous or asynchronous statistics update during query optimization. If statistics are update asynchronously, SQL server cannot use them for the optimization of the query that triggered the update and doesn’t wait for the statistics to be updated. 
    • Statistics are stored in a statistics binary LOB and is created on one or more column. It contains a histogram with the distribution of values in the first column. 
  • Manual maintenance of statistics is done using the 
    • Create statistics command
    • Drop statistics command
    • update statistics command
    • sys.sp_updatestats: to update stats for all tables in a DB
    • Useful after a bulk import to maintain statistics, or after an upgrade of the DBMS.
  • To see the histogram of a statistics, use the DBCC show_statistics command
    • e.g. dbcc show_statistics(N’@TableName’, N’@Index_Name’) with histogram

Coursera: Getting and cleaning data (My Course Notes)

Goal of this course is the first 3 of this sequence

Raw Data-> Processing script-> tidy data -> data analysis -> data communication
Raw and processed data(1)
  • Data are values of qualitative or quantitative variables, belonging to a set of items.
    • variables are a measurement or characteristic of an item
  • Raw data
    • The original source of data
    • Has not been subjected to processing or any other manipulation.
    • No software has been run on the data
    • the numbers on the data have not been manipulated
    • Data has not been removed from the data set
    • The data hasn’t been summarized in any way.
  • Processed/Tidy data
    • Data that is ready for analysis
    • processing can include merging, subsetting, transforming
    • All steps should be recorded.
    • Each variable must be in one column
    • each different observation of a variable should be in a different row
    • There should be one table for each kind of variable
    • If you have multiple tables, they should include a column in the table that allows them to be linked.
Components of Tidy data(1)
  • When you go from raw to tidy data, you should have
    • The raw data
    • A tidy data set
    • A code book describing each variable and its values in the tidy data set
    • An explicit and exact recipe used to go from raw data to tidy data
  • Code book
    • Includes information about the variables(including units) in the data set not contained in the tidy data
    • Includes information about the summary choices you made
    • Information about the experimental study design you used
    • Should have a section called "Study Design" that has a thorough description of how you collected the data
    • There should be a section called "Code book" that describes each variable and its units
  • The instruction list
    • A script in R or python
    • The input for the script is the raw data
    • the output for the script is the processed, tidy data
    • there are no parameters in the script.
    • It might be impossible to script every step. Instead provide instructions like this
      • Step 1= Take the raw file, run version @3.1.2 of @summarize software with parameters a=1, b=2, c=3
      • Step 2= run the software separately for each sample
      • Step 3= take column three of outfile.out for each sample and that is the corresponding row in the output data set.
Downloading Files
  • R can be used to download files
  • This allows the downloading process to be put in the R script.
    • Getting/Setting your working directory
      • The 2 main commands are getwd() and setwd()
        • They both work with absolute and relative paths
          • Absolute paths: "C:\\Users\\Andrew\\Downloads"
          • Relative Paths: ".\\data" or "..\\"
            • \\ is used in windows instead of / in other OSs
    • Checking for and creating directories
      • file.exists("@directoryName") checks if the directory exists
      • dir.create("@directoryName") creates a directory if it doesn’t exist.
  • To download from the internet
    • usedownload.file()
      • parameters are url, destfile, method.

                         e.g. download.file(fileurl, destfile="./@Directory/@File.@Format". [method = "@method"])

      • @method is curl for Https websites, not necessary
    • helps with reproducibility
    • Note the time the file was downloaded
      • use date()
Reading local Data Files
  • read.table() is the main function for reading data int R
  • Read the data into RAM
  • parameters are file, header, sep, row.names, nrows
    • quote: Used to tell R whether there are any quoted values. quote="" means no quotes
      • Sometime ‘ or " are placed in data values and cause issues when reading files, setting quote="" resolves this.
    • na.strings: sets the character that represents a missing value
    • nrows: how many rows to read from the files
    • skip: number of lines to skip before starting to read
  • Related functions are read.csv and read.csv1
    • read.csv sets sep=, and header = TRUE by default
Reading Fixed Width files
  • use read.fwf().
    • the widths option specifies an integer vector that is used to specify the width of each column.
Reading Excel files
  • requires the xlsx package
    • Install using install.packages("xlsx")
  • read.xlsx() 
    • parameters are files, sheetIndex, header
      • Sheetindex is used to state which Excel sheet the data is stored on.
    • Allows you read specific rows and columns
      • Use parameters colIndex and rowIndex
        • They accept vectors showing the columns and rows to read.
  • Write.xlsx() writes out an Excel file with similar arguments
  • XLConnect package has more options for writing and manipulating excel files.
    • XLConnect.pdf
Reading XML
  • XML consists of markup and content
    • markup are labels that give the text structure.
    • content are the actual text of the document.
  • Tags, elements and attributes
    • Tags correspond to labels. Applied to particular parts of the text so it’s structured.
      • start tags <section>
      • end tags </section>
      • empty tags <line-break /> 
    • Elements are specific examples of tags
      • <Greeting> Hello, world </Greeting>
    • Attributes are components of the label
      • <img src="jeff.jpg" alt="instructor"/>
        • src and alt are attributes
      • <step number = "3"> connect A to B </step>
  • Requires the xml package
    • xmlTreeParse(@FileURL, useInternal=TRUE/FALSE) to parse out the XML file.
      • Loads the document into memory for parsing
      • If you’re reading an html file, use htmlTreeparse() instead
      • useInternal = TRUE if you want to get all the nodes inside the document.
    • xmlRoot() is used to remove the XML version header of the XML document. output is a parsed xmlObject
    • xmlName() is used to get the tag of the document’s root.
    • use names(@RootTag) to get the tags that are just below @RootTag
  • You can directly access parts of an XML document
    • Accessed the same way a list is accessed
      • using [[@ElementNumber]]
      • To access subelements, 
        • [[@elementNumber]][[@subelementNumber]]
  • Can be done programatically
    • use xmlSApply(@ParsedXMLObject, @FunctionToApply)
      • @FunctionToApply can be xmlValue if you want to return all the contents of each children elements as a single string.
    • Xpath programming can be used to specify further what element to return.
      • XML.pdf
      • Use xpathSApply(@ParsedXMLObject, "@XpathNotationOfNodeToExtract", @FunctionToApply)
        • @FunctionToApply usually is xmlValue
        • @XpathNotationOfNodeToExtract
          • /node: Used to extract top level node
          • //@node: used to extract a node at any level
            • e.g. to extract the node named house use //house.
          • @nodePath[@attr-name]: @Node with an attribute name
          • @nodePath[@attr-name=’@Value’] @Node with tag @attr-name that has a value @Value.
            • @nodePath is the path to the node that contains the tag @attr_name. usually in the format //@OutputOfxmlName()/@ParentTag/@ParentTag
Reading JSON
  • JavaScript Object Notation(JSON) is a lightweight data storage format often used for APIs.
  • It’s similar to XML but has a different syntax/format.
  • requires the jsonlite package
  • use fromJSON(@FileToRead) to read a JSON file.
    • returns a data frame
  • To access a variable
    • @dataFrame$VariableName
  • To write data frames to JSON
    • use toJSON()
      • the pretty parameter indents the JSON document
Using data.table
  • inherits from data.frame.
    • All functions that accept data.frame accept data.table
  • Faster than data.frame at subsetting, grouping and updating
  • requires data.table package.
  • use tables() to see all the data tables in memory
  • data tables allow subsetting rows
    • @DataTable[@RowNum,]
      • To return just the @rowNum row 
      • The same as @DataTable[@RowNum]
    • @DataTable[@DataTable$@columnName=="@value",]
      • To return all rows where the column @columnName has a value @Value
  • To subset columns in Data tables,
    • you cannot subset data tables the same way you subset data frames
      • @DataTable[,c(@columnIndex1, @columnIndex2)] doesn’t work.
    • requires expressions 
      • They can be used to summarize datasets
        • @DataTable[,@OutputDataClass(@function1(@ColumnName),@function2(@columnName2))]
          • @OutputDataClass can be list, table, vector e.t.c
          • as
  • To add a new column to a Data table, use := operator

                    @DataTable[,@NewColumnName:=@ExistingColumnName @Operator @Value]

    • This allows you carry out multi step operations as expressions are involved.

                         @DataTable[, @NewColumnName:= { tmp<- (@ColumnName1+ @ColumnName2); @Function(tmp @Operator @value)}]

    • Allows you carry out plyr like operations

                    @DataTable[,@NewColumnName:=  @ColumnName1 > Value]

      • Creates a new column that has boolean values
      • also allows you use the by keyword

                                   @DataTable[,b:= mean(x+w), by=a]

        • Where a is a column with boolean values.
        • hence rows where a = TRUE will have the same value in b i.e. the calculation for the mean will only considers x and w values from those rows. also rows where a=FALSE will have the same values in b
  • Data tables must be copied explicitly using copy(). 
  • Special variables
    • .N
      • Means count the number of times something occurs
      • E.g.

                                   @dataTable[, .N, by=x]

        • this counts the number of times the distinct elements in the x column occurs.
  • Keys
    • Allows rapid subsetting and sorting.



      • The above automatically subsets the datatable based on the key.
        • returns all rows where the value of x is a
  • Joins
    • Joins are based on keys
    • Data tables can be joined based on a similar element being set as the key.
  • For fast reading
    • done using fread()
Reading from MySQL

                         @ConnectionName<- dbConnect(MySQL(),user="@userName", host="@Hostname")

    • To connect specifically to a database on an instance

                         @DBName<-dbConnect(MySQL(), user="@UserName", db="@DBName", host="@HostName")

    • To run a command

                         dbGetQuery(@ConnectionName, "@DBQuery")

    • To disconnect from an instance


    • use dbListTables() to show the tables in a Database


    • To see the column names of a table, use dbListFields()

                         dbListFields(@DBName, "@TableName")

    • To run a query on a table

                         dbGetQuery(@DBName, "@SQLQuery")

    • To read a whole table to a R object

                         @RObjectToStoreTable<- dbReadTable(@DBName, "@TableName")

      • To read a subset of the Table

                                   @RObjectToSendQuery<- dbSendQuery(@DBName, " select * from @TableName @whereConditions")

                                   @RObjectToContainResult<- fetch(@RObjectToSendQuery[,[n=@NoOfRowsToRead] );

        • To clear the result of a query on the instance’s memory.


Reading data from HDF5
  • Hierarchical Data Format is used for storing large data sets.
    • Data is stored in groups containing zero or more data sets and their metadata.
      • Each group has a group header with a group name and a list of its attributes.
      • Has a group symbol table with a list of objects in the group.
    • Data sets are multidimensional arrays of data elements with metadata
      • They have a header with name, datatype, dataspace and storage layout.
      • Have a data array with the data.
  • To interface with HDF5, you need to install packages from bioconductor


    • Install the bio base packages


  • Use h5createFile() to create a hdf5 file.


    • To create group

                         @RObjectForH5 = h5createGroup("@Filename.h5", "@GroupName[/@SubGroupName]")

  • To list the contents of an hdf5 file use h5ls()


  • to write to a group, use h5write()

               h5write(@Robject, "@FileName.h5", "@GroupName[/@SubGroupName]")

    • If the destination Group(@GroupName[/@SubGroupName]") doesn’t exist, then it is created.
    • You can write different data classes e.g. dataframes, arrays, matrices.
    • You can direct data to be written to specific places using the index option

                         h5write(@ObjectToWrite, "@FileName.h5", "@GroupName[/@SubGroupName], index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo)

  • To read data, use h5read()

               readA=h5read("@FileName.h5", "@GroupName[/@SubGroupName]")

    • You can direct data specific data to be read using the index option

                         h5read("@FileName.h5", "@GroupName[/@SubGroupName]", index=list(@FirstRowNo:@LastRowNo, @FirstColumnNo:@LastColumnNo) )

  • The HDF5 tutorial rhdf5.pdf
Reading data from the Web
Scope is data scraping, authentication and interfacing with APIs
  • Web scraping is programmatically extracting data from the HTML code of websites
    • Read the data into R using
      • readLines()

                                   @ConnectionName<- url("@WebAddress")

                                   @RObject<- readLines(@ConnectionName)
        • Close the connection after you’re done


      • htmlTreeParse()

                                   @HTMLobject <- htmlTreeParse(@url, useInternalNodes=T)

        • htmlTreeParse in in XML package
        • You can specify the nodes to read using xpathSApply()
      • GET() in the httr package

                                   @RObjectToStoreTheHtml <- get(@URL)

                                   @RobjectToStoreTheContent <- (@RObjectToStoreTheHtml , as="text")
                                   @RobjectToStoreParsedHTML<- htmlParse(@RobjectToStoreTheContent ,asText=TRUE)

        • GET() is in httr package
  • Accessing websites with passwords
    • GET allows you authenticate yourself for websites.

                         @DestinationRObject<- GET("@URL",authenticate("@UserName","@Password"))

    • Handles allow you save authentication to a website.


                         @DestinationRObject= GET(handle=@RobjectForURL , authenticate("@UserName","@Password")

      • Handle() is in the httr package
      • To access the website, simple refer to it using the handle and it’ll use the previous authentication via cookies.

                                   GET(handle=@RobjectForURL , path="/")

Reading data from APIs
  • Done using the httr package
  • You need to create an account with the API organization.
  • To access twitter form R

               myapp=oauth_app("Twitter", key= "@yourConsumerKeyHere", secret="@YourConsumerSecretHere")

               sig=sign_oauth1.0(myapp, token="@yourTokenHere", token_secret="@yourTokenSecretHere")

Reading from other sources
  • Interacting with files
    • file(): to open a connection to a text file
    • url(): to open a connection to a url
      • To read a specific line from a url, use readLines()

                                   @connection<- url(@URL)

                                   @Robject<- readLines(@connection)

    • gzfile(): to open a connection to a .gz file
    • see ?connections for information on other connection mechanisms
    • always 
  • The foreign package allows you interact with other statistical programming languages
    • Loads data from minitab(read.mtp), SAS(read.xport), SPSS(read.spss), stata(read.dta)
    • library(foreign)
  • RODBC provides interface amultiple DBMSs
  • You can read images using jpeg(), readbitmap(), png() directly into R and manipulate them
  • Reading Geographic Information Systems(GIS) data using rdgal(), rgeos(), raster()
  • You can read MP3 files using tuneR packages or seewave packages
Subsetting and sorting
  • Subsetting can be done based on rows and columns
    • Based on Columns

                         @Data[, @ColumnOrVectorOfColumns]

      • @columnOrVectorOfColumns specifies the columns to select from the dataset
    • Based on rows


      • @rowOrVectorOfRows specifies the rows to select from the dataset
      • You can also subset on rows based on conditions

                                   @data[@Data$Column @ConditionalOperator @VectorOfValues,]

    • Based on rows and columns

                         @data[(@Column1 @Operator @value @CombinationOperator @column2 @operator @value),@Column1]

      • @CombinationOperator can be & for and or | for or.
    • Dealing with NA’s
      • Using which is the most efficient way of subsetting without including NAs

                                   Data[which(@Column1 @Operator @Value),]

  • Sorting 
    • To sort a column, use sort()


      • To sort in decreasing order, use the decreasing=TRUE option
      • To sort with NA’s at the end, use na.last=TRUE
    • To sort a data frame use sort()
      • This sorts the dataframe based on the values in the column. 


        • You can specify a tie breaker
    • You can also sort based on arrange() in the plyr package
  • Adding columns to a dataframe

                    @Data$@NewColumnName<- @VectorOfValueToInsertIntoColumn

    • you can use cbind

                         cbind(@Data, @VectorOfValuesToInsert)

  • Adding rows to a dataframe
    • Use rbind()
Summarizing Data
  • summary() can be used to summarize dataFrames.
  • str() can also be used for summarization. It shows the structure of the data.
  • quantile() can be used to summarize a column
    • To view the default of 5 quantiles(0,25, 50,75 & 100)

                         quantile(@Data$@Column, na.rm=true)

    • To view specific quantiles

                         quantile(@Data$@Column, probs=@vectorOfQuantiles

      • sample of @vectorOfQualities is c(0.5, 0.75, 0.9)
        • 0.5 is median or 50% percentile
  • Table can be used to make tables to summarize data


      • This returns a count of the distinct elements in @column.
      • Use the useNA="ifany" option if you want to see a count of NA values.
    • To view the relationship between the columns of @Data, input 2 values into table

                         table(@Data$@Column1, @Data$Column2)

      • Column1 becomes the row and @Column2 becomes the column. 
    • To view a count of all elements of a column that meet a particular condition

                         table(@Data$@Column @ConditionalOperator @VectorOfValues)

      • @conditionalOperator can be %in% to check if @vectorOfValues exists in @data
  • To count the number of NAs in a column


  • To check if missing values exist in a column.


    • It returns true is an NA exists
  • To check if every value in a column satisfy a condition, use all()

               all(@Data$@Column @Condition)

  • Row and column sums
    • You can calculate the sum of all the columns in a dataset using colSums()


      • To check if any of the columns in a dataset contains an NA value,
        •  apply colSums to the is.na of the dataset



  • Crosstabs
    • used to identify where the relationships in a dataset exists.

                         xtabs(@VariableToDisplayInResult ~ @Variable1ToGroupBy [+ @Variable2ToGroupBy], data=@Data)

      • To group by all other variables, set @Variable1ToGroupBy to .
  • FlatTables
    • Often used to summarize data output of xtabs() into a compact form.


  • To see the size of a dataset


    • to specify the units of the output, use the unit=@Units command

                         object.size(@Data, unit="Mb")

Creating New variables
  • Sometimes variables are missing in a dataset. 
    • You can transform the dataset to get the values you want
    • Common missing variables are
      • Missingness indicators
      • cutting up quantitative variables
      • applying transforms
  • Creating sequences
    • You use seq()

                         seq(@FirstValue, @LastValue)

      • use the by = @value option to indicate the increment value to use in creating the sequence
      • use the length =@value option to indicate how many values should be created for the sequence.
    • To create an index of the values in a column/vector

                         seq(along = @vector)

      • The output is a sequence from 1 to n(@vector)
  • Subsetting variables

               @Data$@NewColumn= @Data$@Column %in% @vectorOfValues


    • This returns  a count of true and false observations of the conditions compared to create @NewColumn
  • Creating binary variables
    • use ifelse() to apply a value when a condition is true, and another when it is false

                         @Data$@NewColumn= ifelse(@Data$@Column @ComparisonOperator @Value, True, False)

      • To view a count of how many observations do not meet the condition,  use table() to compare @NewColumn to @Data$@Column @ComparisonOperator @Value

                                   table(@Data$NewColumn, @Data$@Column @ComparisonOperator @Value)

  • Creating categorical variables
    • Done using cut()
    • To split a variable based on it’s value, e.g. into quantiles 

                         @Data$NewColumn = cut(@Data$@Column, breaks=@MeasureToSplitData)


      • e.g. @MeasureToSplitData can be quantile(@Data$Column) to split the data into 5 quantiles.
      •  table(@Data$NewColumn) is Used to view the count of the groups @Column is split into.
      • You can compare @NewColumn to @Column using table() to view the relationship between them.
    • To split a variable into group based on quantiles
      • use cut2() in the Hmisc package

                                   @Data$@NewColumn = cut2(@Data$@Column, g=@NoOfQuantilesToSplitInto)

    • The result of a cut/cut2 is a factor variable.
  • Creating Factor variables
    • You can convert a variable into a factor

                         @Data$@ColumnToRepresentFactors<- factor(@Data$@Column)

    • To convert factor variable to numeric, use as.numeric(@Data$Column)
    • Alternatively, use transform to convert the variable type to factor

                         @Data<- transform(@data, @VariableToConvertToFactor= factor(@VariableToConvertToFactor))

      • You can use labels to change the value of the observations to a character that represents the factor

                                   @Data$Variable<- factor(@Data$variable, label=c("LabelForValue1", "LabelForValue2"…))

        • Sometimes, the number of variables are continuous(e.g decimal numbers with a variable of 0.1 and a range of 10.). You can split this into a limited number of factors using cut(). This involves creating tertiles.
          • First split the data into quantiles

                                                       @variableToStoreCutpoints<- quantiles(@Data$variable, seq(0,1, length=@NumberOfFactors), na.rm=TRUE)

          • Cut the data and define a new variable using the quantiles

                                                       data$newVariable <- cut(@Data$variable, @variableToStoreCutpoints)

Reshaping Data
  • melt is used on a data frame to select which columns are id and which are measure.

               melt(@Data, id=@VectorOfColumnsThatAreID, measure.vars= @VectorOfColumnsThatAreMeasures)

    • The output is an object with the @VectorOfColumnsThatAreMeasures where 2 new columns are made named variable and value. For each of the rows, the value of the column is put as the variable and it’s value set to the value of that column. 
  • A melted dataframe can be reshaped to a dataframe using dcast() or an array using acast()

               dcast(@Data, @Column ~variable[,@AggregationFunction])

    • This compares distinct(@column) to all the columns from the melt that are used as measure.vars and a summarized comparison showing just the count for each variable.
    • If @AggregationFunction is set to mean, it returns the mean of the variable for each distinct(@Column).
  • Averaging values
    • You can use tapply()

                         tapply(@Data$@ColumnToApplyFunctionTo, @Data$@ColumnToGroupBy, @functionToApply)

      • The result contains the sum of all ColumnToApplyFunctionTo grouped by ColumnToGroupBy if functionToApply is sum
    • You can use split()

                         split(@Data$@ColumnWithDataToMeasure, @Data$ColumnToGroupBy )

                         %>% lapply(sum)
                         %>% unlist

      • This returns a list with @ColumnWithDataToMeasure  grouped by @ColumnToGroupBy 
      • you can use lapply(sum) to get the sum of the groups
      • You can use unlist to convert the list to a vector.
    • You can use ddply
      • Part of the plyr package

                                   ddply(@Data, .(@columnToGroupBy), summarize, sum= sum(ColumnWithDataToMeasure ))


Managing data frames in R using dplyr
  • dplyr is a package developed to make it easy to work with data frames
  • dplyr expects
    • There is one observation per row
    • Each column represents a variable or measure or characteristic
    • An optimized version of plyr
  • The dplyr functions have a similar format
    • The first argument is a data frame
    • the subsequent arguments describe what to do with the first argument. 
    • You can refer to columns in the data frame directly without using $ operator
    • the result is a new data frame
    • Da
  • The dplyr verbs
    • select()
      • returns a subset of the columns of a data frame
    • filter
      • extracts a subset of rows from a data frame based on logical conditions
    • summarize
      • generates summary statistics of different variables in the data frame.
    • mutate
      • add new variables/columns or transform existing columns
    • arrange 
      • reorders rows of a data frame
      • preserves the order of the columns
    • rename
      • rename variables in a data frame
Merging data in R
  • Similar to linking as done in RDBMSs
  • merge() can be used to merge 2 dataframes x and y.
    • merge() has the following parameters which are used to specify which variable to use for the merge
      • x/y/by/by.x/by.y/all
    • By default it merges by the columns with a common name

                         merge(@DataFrameX, @DataFrameY, [by.x="@ColumnInXThatMatchesY", by.y="@ColumnInYThatMatchesX"]) 

      • if by.x and by.y aren’t specified, it’ll merge all columns that have the same name. If it’s possible for the value’s in both datasets to be different for the same named variable, by.x and by.y should be used to specify the column to use for joining.
      • all=TRUE means if a value appears in one of the datasets but not the other, it should include the value with NA for the variables in the other dataset
  • join()
    • in the plyr package
    • default is a left join.

                         join(@DataFrame1, @DataFrame2)

      • It joins columns with the same name.
    • join_all allows you join multiple dataframes.

                         list(@DataFrame1, @DataFrame2, @DataFrame3) %>% join_all

      • First combine the dataframes into a list
      • Then perform a join_all on the dataframe
      • They all need a column with the same name
Editing text variables
  • Manipulating text programmatically 
  • To see the column names of a dataframe,


    • use tolower() to convert it to small characters.
  • To split a string when a character is encountered, 

               strsplit(@Data, "@CharacterToSplitOn")

    • @CharacterToSplitOn is a special character, precede it with \\ e.g. "\\.".
    • strsplit() returns a list
    • You can use sapply to get the first elements from the list
      • Create a function that gets the first character

                                   firstElement<- function(x){x[1])

      • Use sapply to get the first elements from the list by applying a function to every element of the list.

                                   sapply(@list, firstElement)

  • To substitute out characters in a vector for others
    • To replace the first appearance of the character
      • use sub()

                                   sub("@CharacterToRemove", "@CharacterToPut", @Data)

    • To replace multiple appearances of the character
      • use gsub()

                                   gsub("@CharacterToRemove", "@CharacterToPut", @Data)

  • To search for specific values in variables
    • use grep()


      • It returns the row numbers of the observations with the string
      • to return the actual observation, use the option value=TRUE


      • to see the number of elements that satisfy the function, use length(@grepCommand)
    • use grepl()


      • This returns a logical vector with true for the observations with the string and false for observations without the string
      • You can use this to subset only cols where a string appears.

                                   @Data[grepl("@StringToSearchFor","@Data") ,]

  • The stringr package contains a lot of string manipulation functions
    • nchar(@Data)
      • To get the number of characters in a variable
    • substr(@Data, @firstPosition, @LastPostition)
      • This returns a string of the characters from the first to the last position
    • paste(@firstString, @lastString)
      • This joins 2 strings together and puts a space in between
    • paste0(@firstString, @lastString)
      • Joins the 2 strings together without a separating space
    • str_trim(@Data)
      • Removes the space characters at the end of a string
  • Characteristics of text in data sets
    • Names of variables should be
      • All lower case when possible
      • Descriptive
      • Not duplicated
      • Not have underscores, dots or white spaces
    • variables with character values
      • should be made into factor variables(depends on application)
      • should be descriptive(use TRUE/FALSE instead of 0/1)
Regular Expressions
  • They are often used with grep, grepl, sub,gsub and other string manipulation functions
  • Regular expressions can be viewed as a combination of literals and metacharacters
    • Metacharacters provide a way to represent
      • whitespace word boundaries
      • sets of literals
      • the beginning and end of a line
        • ^ represents the start of a line. 

                                             ^@String checks for observations with @String at the beginning of the line

        • $ represents the end of a line

                                             @String$ checks for observations with @String at the end of the line

      • alternatives("war" or "peace")
  • Character classes with []
    • To match a sequence of characters regardless of their cases, enclose the upper and lower cases in [] for each letter
      • [Bb][Uu][Ss][Hh] matches bush regardless of which of it’s alphabets being capital or small.
    • You can specify a range of characters i.e numbers or letters
      • to specify numbers 0 to 9, use [0-9]
      • to specify small case a to z, use [a-z] 
      • to specify capital case a to z, use [A-Z]
        • to specify either small or caps [a-zA-Z]
    • To match characters that do not satisfy a condition, use [^@C]
      • e.g [^?.]$ matches characters that do not end with ? or .
    • You can combine e.g. 
      • ^[Ii] matches only observations that the first character is I regardless of it’s case.
      • ^[0-9][a-zA-Z] matches only observation that the first character is a number and te second is an alphabet regardless of it’s case.
  • MetaCharacters
    • "."
      • The . means any character/ wild character
        • e.g. 9.11 matches 9A11, 9-11, 9.11, 9911.
    • "|"
      • | means or. It can be used to combine expressions that serve as alternatives to each other
        • e.g. food|water matches either food or water.
        • flood|earthquake|hurricane|wildfire matches either of any of the elements.
      • ^[Gg]ood|[Bbad] matches either Good, good at the beginning of the observation or Bad and bad somewhere in the observation
      • ^([Gg]ood|[Bbad]) matches either Good, good or Bad and bad at the beginning of the observation.
    • "?"
      • This indicates that the expression is optional
        • e.g. [Gg]eorge( [Ww]\.)? [Bb]ush matches "george bush", "george W. Bush"
    • "*" and "+"
      • They’re used to indicate repetition
      • "*"
        • This means the character can repeat any number of times including not appearing at all
          • e.g. (.*) matches both elements that don’t have any characters between () as well as those that have characters there
            • it matches () or (a) or (asdfas)
        • It matches the longest possible string that satisfies the regular expression.
          • e.g. ^s(.*)s returns a string containing the characters between the first and last s in an observation. 
          • To prevent greediness, use *? instead.
      • "+"
        • This means the character can repeat any number of times but must appear once.
          • e.g. [0-9]+(.*)[0-9]+ matches at lease one number,followed by any number of characters followed by at least one number.
    • "{@minimumNumberOfRepetition[, @maximumNumberOfRepetition]}"
      • They are interval qualifiers and are used to specify the minimum and maximum number of matches of an expression.
      • specifying {@minimumNumberOfRepetition} means exactly @minimumNumberOfRepitition time
      • specifying {@minimumNumberOfRepetition,} means at least @minimumNumberOfRepitition times
      • e.g. ( +[^ ]+ +){1,5} means space, not space, space between 1 to five times. 
        • e.g. " Apple Camera food " 
      • e.g. [Bb]( +[^ ]+ +){1,5} debate matches bush with up to 5 words till debate
    • "()"
      • Used to remember text matched by the subexpression enclosed
        • we refer to the matched text as \1,\2
        • E.g. +([a-zA-Z]+) +\1 + to refer to a repeat of the same word matched with the expression.
  • To search for a meta character, escape it with "\"
    • e.g. to search for ".", use "\."
Working with dates
  • date() is used to return the date and time. It’s of class character
  • Sys.Date() returns the date. It’s of date class
  • formatting dates
    • use format(@dateObject , "@FormatString")
      • The Format string values are 
        • %d is day as number(0-31)
        • %a is abbreviated weekday(mon – sun)
        • %A is unabbreviated weekday(Monday to Sunday)
        • %m is month(00-12)
        • %b is abbreviated month(jan-dec)
        • %B is unabbreviated month(January to December)
        • %y is 2 digit year
        • %Y is four digit year
    • e.g. format(@DateObject, "%d %m %y")
  • as.Date()
    • used to convert a character string to a date.

                         as.Date(@dateObject, "@DateFormatOfCharString")

      • @DateFormatOfCharString tells as.Date what format to expect. e.g. for "1Feb1999", @DateFormatOfCharString is "%d%B%Y"
  • You can find details about a day
    • weekdays(@dateObject) shows you what day of the week it is (Monday – Sunday)
    • months(@dateObject) shows you what month it is (January – December)
  • you can find difference between days if they’re of date class

               @DateObject1 – @DateObject2 

  • Lubridate
    • A package for handling date objects
    • You can use it to convert a number to a date. 
      • it reads the number in the format of the function
      • use the format of the number variable as a function.
        • ymd(@Number)
        • mdy(@Number)
        • dmy(@Number)
        • ymd_hms(@Number)
      • You can specify timezone using the tz option

                                   ymd_hms(@Number, tz="@timeZone")

        • @timeZones can be checked using ?Sys.timezone
    • Use wday(@Number) to get weekday
      • This returns the number of the day in the week.
      • use option label = TRUE to return the abbreviated weekday.
      • Date and time are preferred in the Date or POSIXct or POSIXly classes.
  • Data resources
    • Open Government sites



    • Some APIs with R interfaces


  • Manipulating Data with Dplyr
    • You need to install and load the dplyr package
    • In Dplyr, when selecting columns, you don’t need the Data$@Column format to select @column. Just use @column
    • Dplyr allows you work with tabular data from a variety of sources
    • To work with data in dplyr, load the data into a tbl_df(data frame tbl)


      • Data of class tbl_df prints better than data of other classes.
    • dplyr supplies 5 verbs that cover most data manipulation tasks
      • select()
        • Select keeps only the columns you specify
        • To select Columns @Column1, @Column2 from tbl_df @data

                                             select(@Data, @Column1, @Column2…)

          • To select a sequence of columns, use the : operator to separate the first and last columns

                                                       select(@Data, @FirstColumn, @DataColumn)

          • To remove specific columns from @data, use -@ColumnToRemove

                                                       select(@Data, -@ColumnToRemove)

          • To remove a sequence of columns 

                                                       select(@Data, -(@FirstColumn:@LastColumn)

      • filter()
        • filter keeps only rows you specify
        • To filter columns where @column = @value

                                             filter(@Data, @column1 == @value[,@column2 ==@value…])

          • To filter on multiple conditions, separate the columns with ,
          • To filter with an OR condition, separate the columns with |.

                                                       filter(@Data,@column1==@value| @column2==@value)

        • You can filter on the value of a column being NA using the is.na().
          • To filter only where the column is not NA, use !is.na(@column).
      • arrange()
        • This allows you sort a tbl_df based on the values in a column or multiple columns in ascending order

                                             arrange(@Data, @Column1ToSortBy[,@Column2ToSortBy])

          • To sort in descending order, use desc(@ColumnToSortBy)

                                                       arrange(@Data, desc(@Column1ToSortBy))

      • mutate()
        • Allows you create a computed column based on the value of an existing column in a tbl_df and inputs it as the last column.

                                             mutate(@Data, @NewColumn1 = @ExistingColumn @Operator @value)

        • You can compute multiple columns in a single statements and reference a column computed in the same statement

                                             mutate(@Data, @NewColumn1=@ExistingColumn1+1, @NewColumn2 = @NewColumn1+8)

      • summarize()
        • Collapses the dataset into a single row.
        • Can be used to give you a value based on groups in a dataset.
  • Grouping and chaining data with dplyr
    • Grouping allows the breaking up of datasets based on the values of one or more variables.
      • Done using group_by()
        • Used to group a tbl_df by one or more variables
                                   group_by(@Data, @GroupingColumn)
          • In the result set, all operations are applied on a per group basis.
            • Performing summarize() with mean() gives the mean per group                                                       
        • n() is used to count the number of observations in a group
        • n_distinct() is used to count the number of unique observations in a group
                                   summarize(@GroupedTbl, @count=n(), @Unique=n_distinct(@column))
      • To find the value that can be used to split data into a particular percentage, use quantile().
        • The value is called @percentage quantile.
          • For the 99% quantile
                                             quantile(@Data$@Column, probs= 0.99)
      • To show all the rows in a tbl_df, use View(@Tbl_df)
    • Chaining allows you string multiple function calls in a way that is compact and readable.
      • %>% is the operator used to chain functions.
        • The value before the operator is used as the input for the function on the right. 
      • This allows us work without saving the results of all the operations for each step.
      • E.g. To select columns from a tbl_df and print,

                                   @tbl_df %>% select(@Column1, @Column2, @Column3)%>%print

  • Tidying Data with tidyr
    • Tidy data is formatted in a standard way that facilitates exploration and analysis.
    • Tidy data must satisfy 3 conditions
      • Each variable forms a column
      • Each observation forms a row
      • Each type of observational unit forms a table
    • Messy data has one or more of the following characteristics
      • data-wrangling-cheatsheet.pdfColumn headers are values, not variable names
        • This problem is solved using gather()
          • Gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed
          • Used when you notice you have columns that are not variables.

                                                       gather(@Data, @Key, @Value, -@ColumnToExclude)

            • @key is the name of the output column that will store a variable that can have multiple values. e.g. sex variable with values Male and Female
            • @value is the Value that stores the measure of the key in the observation.
            • @column to exclude is the column in @Data that is already stored properly
            • If their are NA values, use option na.rm=true to remove them while doing gather()
            • to gather a sequence of columns, use : between the first and last columns
      • multiple variables are stored in one column
        • This problem is solved using separate()
          • This uses a regex or a vector of character positions to turn a single char column into multiple columns

                                                       separate(@Data, col=@ColumnToSplit, into=@vectorofColsToSplitInto, sep=@valueToUseToSplit)

            • @vectorofColsToSplitInto is a vector that shows the names of variable to split @ColumnToSplit into.
            • @valueToUseToSplit contains the value in @ColumnToSplit to use to split it into the destination number of variables. By default, it splits by any character that is not alphanumeric.
      • Variables are stored in both rows and columns
        • This requires use of gather() and separate()
        • Also involves the use of spread()
          • Use to spread a key-value pair across multiple columns.


            • distinct(@ColumnWithVariable ) is used to create new columns with each one becoming a column. It is the key in this relationship.
            • @columnWithValue is used to assign a value to each of the column created. It is the value in this relationship.
      • a single observational unit is stored in multiple tables
        • To solve this problem, you need to merge both datasets.
          • First add an additional column to the datasets to signify their difference.

                                                       @Dataset1<- mutate(@Dataset1, @ColumnToDistinguish = @value)

                                                       @Dataset2<- mutate(@Dataset2, @ColumnToDistinguish = @value)

          • Use bind_rows() to join the 2 dataset

                                                       bind_rows(Dataset1, Dataset2)

      • multiple types of observational units are stored in the same table
        • We can select out values that occur multiple times like a relational table and lookup the the other values using a unique identifier.
  • To extract a number from a character string, use extract_numeric()

               extract_numeric(Value5) returns 5

  • To aggregate a data frame

               ag <- aggregate(. ~@groupingColumn, @Data, function(x) c(mean=mean(x),sd=sd(x))

    • this calculate the mean and sd for all variables in a data frame