Book Review – SQL Server Internals: In-Memory OLTP – Detailed Notes

Here are my unedited chapter notes:

Chapter 1

Chapter 2

Collation – current version requires BIN2 on character index columns. Best to do so at the column level. Supports only sorting, comparison and grouping. Will remove the need for case sensitive code on tables and columns but not data.
Collate database_default to handle tempdb collation of temp objects. Research more.
Interpreted SQL via interop useful for ad hoc or migration of code. Doesn’t perform as well as compiled.
Key point restrictions really show that the Hk tables need to be isolated bc of the interface between the engines. P 46

Chapter 3

The restrictions of chapter 2 become clearer in this chapter with the description of the row structure. Traditional CS data structure
All indexes can be thought of as covering, each row has the actual data.
Also clear why tables structures can’t be modified. P51
Good example highlighting the time stamp usage in MVCC

Chapter 4

Hash Indexes

Indexes must be defined at table creation – part of the header
Indexes are in memory structures only and operations on them are not logged
Hash index – hashing function and collisions explained. Linked lists in buckets
Index pointer – array in the row header. Null when row points to no other rows. Points to the “next” row in the index at that hashed index entry. Every row had at least one pointer to it, either directly from the hash index bucket or from another row
After reading a second time this was much clearer
Hash indexes useful for equality search with high cardinality
For seeking on a hash index you must provide all key columns. For example last and first with a search on last will not use the index
Buckets – number of elements in the has array. Should be equal or greater than the cardinality of the index key columns. So that a bucket points to a unique values chain. Too few you get collisions on different values. Too many is wasteful. dm_db_xtp_ hash_index_stats
Large average chain length could mean

1. Empty Bucket count low. Avg and max chain lengths are similar. Many different index keys to the same bucket.
2. Empty bucket count high. Max length is high in relation to average chain length. Many row with duplicate keys or skew in key values. Not evenly distributed

Short chains along with high empty bucket cube could indicate bucket count is set too high

Range Indexes

Range indexes – If cardinality is unknown or will be range searching hash indexes aren’t helpful.
Range index connects ties at the leaf level using a bw-tree. Latch free variant of a b-tree. Index pages are variable sized. Index page contains an ordered set of key values. For each value there’s a pointer to either another internal index page or to a leaf page pointing to a data row. At the data row there could be a list of rows linked together on the key value
A key distinction for bw-trees is the page pointer is a logical PID Rather than a physical memory address. PID is the position in the mapping table which contains the physical location. Pages are never updated they are replaced and the mapping table is updated so the PID uses the new physical address
At the leaf level still uses the PID which points to the data row
Delta record – a page containing an insert or update. The mapping table is updated with the physical address of the delta record. Searching/range scanning may have to traverse this chain that is until consolidation occurs. Combining pages into a new base page. Benefit is not having to update tree during each update
**It’s not clearly explained what the benefit of the mapping table is. **

Chapter 5

Pessimistic acquiring a shared lock or range lock. Opportunistic checking that rows haven’t changed before commit of transaction
In MVCC SQL assumes that concurrent transactions won’t interfere and performs validation checks once a transaction commits to ensure it obeys the required isolation level
Isolation levels used in in memory OLTP.

Snapshot
Repeatable read
Serializable
There should be a description of the read phenomenon

Cross container transactions are really two sub transactions. one disk based and one in memory transaction
If executing an explicit transaction it will run in read committed. If accessing a in memory OLTP then this will error as this is an unsupported isolation level. Do specify the isolation level as a table hint. Read committed is supported during auto-commit or implicit transactions
Technically for a single statement transaction there is no difference between read committed and snapshot on a memory optimized table
Snapshot cannot be used in cross container transactions because each is considered/implemented as a sub transaction
The default isolation level is snapshot. There’s a new DB option to set it default for memory optimized tables. Alternatively set the isolation level with a table hint
Monitoring transactions sys.dm_db_xtp_transactions xtp_transaction_id is the sequential global transaction ID used to begint and endt
Read write conflict a transaction tries to update a row that has been updated by an active transaction. But since there are no locks anomalies can still occur.

Validation phase

Check for violations of isolation levels see read committed example on pg 107
Wait for commit dependencies to goto 0
Log the changes

This is the only waiting that can occur in memory optimized tables.
Waits could happen for commit dependencies, transaction log writes. Both which can be very short.

Validation

Validate no changes violate isolation levels. Basically is the transaction globally transactionally consistent.
Commit dependencies. A CD is when a transaction reads data from another transaction that has finished phase 1 of validation with valid end time stamps (transaction Id or actual time stamps?). In practice very rare. There is an EE and wait type. Waiting_for_dependencytx_event.
Logging. If any of the modified tables were created with SCHEMA_AND_DATA then SQL must log. Sql reads the write set to determine what to log. transactions track all there changes in the write set. A list of delete/insert ops with pointers to the version associated with each op

Once a log record is hardened to storage the state of the transaction is changed to committed. Then reduce the dependency counters by one.

Post processing. Updates time stamps of each row inserted or deleted by this transaction. Switches time stamps from transaction IDs to real time stamps.

Garage collection – Find rows who’s time stamps are earlier than the oldest active transaction on the system. It is marked stale
The actual collection process is completed in cooperation with user threads. During a regular user access. If a stale thread is encountered it will be makes expired or removed.

Chapter 6

The transaction log consists of

Log streams – changes made by committed transactions. Insertion and deletions of row versions.
Checkpoint streams

Data streams – contains all versions inserted during a timestamp interval
Delta streams – associated w data streams, contain a list of integers which versions its corresponding data steam have been deleted

No longer uses LSNs for sequencing uses end Ts for serialization order
Still only supports one log per database but due to more efficient logging the potential for log bottleneck is reduced.

Reduced logging – no need to log index operations. All indexes are rebuilt during recovery
Each atomic change isn’t logged to a single log record rather InMemory OLTP will combine into a single record
InMemory OLTP does not need to log uncommitted transactions – opportunistic remember?
For InMemory OLTP no logging is used for info. Just redo
100 inserts into a disk based table generates 200 log records. Into a InMemory OLTP table only 3.

Checkpoint

Continuous check pointing – checkpoints are continuous and incremental as activity accumulates. Rather than periodic like in disk based tables
Streaming IO – ie. sequential
Checkpoints are manual or every 512mb
Checkpoint files are stored in file stream files
Two types of checkpoint files data and delta. Called a checkpoint file pair

Data file – new versions of rows in a time interval. Append only files. Once closed only read
Delta file – info about about which rows have been deleted from the corresponding data file. Once the data file is closed so is the delta file. At recover used to filter out rows from the data file

CFP pairs can be recovered from concurrently
On insert a continuous checkpoint will append the row to a checkpoint file
A checkpoint event will close a set of files state goes from under construction to active. No more writes to these files but can be referred to by other insert update operations from other CFPs
Every checkpoint generates a new set of files. As such older files may no longer be needed and are merged
If no open transactions refer to the CFPs they transition to non-active. After a log backup they are no longer needed and can be removed by the garbage collection process
See page 135 for states. db_xtp_checkpoint_files
Pre-created – pre built as an optimization
The number of pre-created CFPs is equal to the number of schedulers
Under construction – open CFP being with continuous checkpoints being written to. A checkpoint event will close and change to active
Active – CFPs containing inserted/deleted rows for the last checkpoint event. Continuous checkpoint is no longer writing new data but deleted are added to delta files. During recovery active CFPs contain all inserted/deleted rows needed to restore the data before applying the tail of the log backup
Storage array – 8192 entry array holding CFP entries. Max 256GB of CFP data
CFPs in the storage array along with the tail of the log backup represent all the needed data to recover a InMemory OLTP table in a db
Before there are any active CFPs we can use the transaction log for recovery
Calling checkpoint is really closing a continuous checkpoint
It changes the under construction CFPs to active. Adds them to the storage array. All log entries on InMemory OLTP tables not covered by a previous checkpoint are converted into active CFPs.
Once finished:

all buffered InMemory OLTP writes are flushed to data and delta files
checkpoint inventory (file descriptors from the previous checkpoint interval and any files added by the current checkpoint) are hardened to the log
the location of the inventory is hardened to the log

A checkpoint event includes a timestamp which indicates that the effects of all transactions are before the checkpoint timestamp are in checkpoint files rather than the log thus the log is not needed to recover them. Still need a log backup to truncate log.

Merging

Each checkpoint generates new files. More files, the longer crash recover will take
We merge files with adjacent timestamp ranges when their active content drops below a percentage. See page 145 for table. Also when adjacent files are < 50% full. Can happen during manual checkpoints
Due to this process the largest storage space CFPs can use is two times larger than the corresponding memory optimized size
You can merge manually if needed. Very rare
Do the demos from this section
Merged source files are removed from the storage array and are eligible for garbage collection after checkpoint and log backups.

Garbage Collection of checkpoint files

Checkpoint – guarantees that data in checkpoint files is no longer needed.
Log backup – guarantee that the log truncation point is safely beyond where the CFPs will be needed.
Unused files garbage collected – uses the same mechanism as regular file stream data.

We can force GC and file stream GC via a SP Page 152-3
If you do a manual GC if files you’ll need to make sure you account for the additional log backups.

Recovery

Starts after the most recent checkpoint inventory has been discovered during a scan of the tail of the log. Sql server tells InMemory OLTP this location and can begin a parallel recovery. The global transaction timestamp is initialized with the highest value found in the transactions recovered
Each delta file is used as a filter to eliminate rows from the corresponding data file. Each CFPs can be processed in parallel. One thread per core
As data loads rows are linked into their indexes. Hash indexes – rows are linked into their buckets. Range indexes – added to an existing chain or created if the key doesn’t duplicate one already encountered during recovery
CFP load is complete the tail of their is replayed front the time of the last checkpoint. Finished!

Chapter 7

A table or stored proc can be natively compiled
The performance benefits of native compilation are best in complex stored procedures. Use for most performance critical parts
TSQL for tables and stored procedures generate C code. Recompiled at instance startup
The execution plan is part of the DLL. There is no re compilation. You must drop and recreate if you want a new plan. Compiled ok first execution and after instance restart and AG failover
Intended for short basic operations many complex constructs are not allowed
No parameter sniffing at compilation. Compiled with parameters UNKNOWN
When using hash indexes the query must specify all key columns in the where clause or else scan. Also based on equality. Must specify whole value not a partial string
Range indexes are only forward. There are no previous pointers. You’d need an index ascending and descending
No Halloween protection in plan. On disk based tabled this is via spool operators. In InMemory OLTP it’s done via statement id in the row header
No parallel plans
No auto update stats. Because there’s no row mod counter
When creating stored procedures be sure data is loaded at compilation time
Need native compilation to achieve the performance gains of InMemory OLTP. Needed to reduce CPU instructions so couldn’t do interpreted stored procedures
Query stats are not captured by default. Need to enable with a stored procedure. Page 178. Can be used to collect on one object or all

Chapter 8

Memory allocation is critical. Running out of memory can cause operations to fail. If under memory pressure InMemory OLTP will more aggressively clean up old row versions
256GB max for all InMemory OLTP tables in a DB. This is because of the number of checkpoint files
Rule of thumb is plan for 2x size of data for memory allocation. More of write heavy. Less if read heavy. See pg 187 for more details
Use resource governor to control memory allocations. Many Dbs can be in the same pool. Db must be taken offline then on-lined to use pool
In sys.indexes range indexes are type 2 same as non clustered. Hash indexes are 7

Best practices

Use collate bin2 at the column level rather than db level
Be accurate in you bucket count. Should be close to the number of unique values
For low cardinality use range instead of hash
Stats are not automatically updated on InMemory OLTP tables
Memory optimized table variables behave the same as table variables. But don’t use tempdb. Could help relieve contention

Add summary of 199

Good summary of where InMemory OLTP fits in.