Book Review – SQL Server Internals: In-Memory OLTP – Detailed Notes

Here are my unedited chapter notes:

Chapter 1
Chapter 2
  • Collation – current version requires BIN2 on character index columns. Best to do so at the column level. Supports only sorting, comparison and grouping. Will remove the need for case sensitive code on tables and columns but not data.
  • Collate database_default to handle tempdb collation of temp objects. Research more.
  • Interpreted SQL via interop useful for ad hoc or migration of code.  Doesn’t perform as well as compiled.
  • Key point restrictions really show that the Hk tables need to be isolated bc of the interface between the engines. P 46
Chapter 3
  • The restrictions of chapter 2 become clearer in this chapter with the description of the row structure. Traditional CS data structure
  • All indexes can be thought of as covering, each row has the actual data.
  • Also clear why tables structures can’t be modified. P51
  • Good example highlighting the time stamp usage in MVCC
Chapter 4
Hash Indexes
  • Indexes must be defined at table creation – part of the header
  • Indexes are in memory structures only and operations on them are not logged
  • Hash index – hashing function and collisions explained. Linked lists in buckets
  • Index pointer – array in the row header. Null when row points to no other rows. Points to the “next” row in the index at that hashed index entry. Every row had at least one pointer to it, either directly from the hash index bucket or from another row
  • After reading a second time this was much clearer
  • Hash indexes useful for equality search with high cardinality
  • For seeking on a hash index you must provide all key columns. For example last and first with a search on last will not use the index
  • Buckets – number of elements in the has array. Should be equal or greater than the cardinality of the index key columns. So that a bucket points to a unique values chain. Too few you get collisions on different values. Too many is wasteful. dm_db_xtp_ hash_index_stats
  • Large average chain length could mean
    • 1. Empty Bucket count low. Avg and max chain lengths are similar. Many different index keys to the same bucket.
    • 2. Empty bucket count high. Max length is high in relation to average chain length. Many row with duplicate keys or skew in key values. Not evenly distributed
    • Short chains along with high empty bucket cube could indicate bucket count is set too high
    Range Indexes
    • Range indexes – If cardinality is unknown or will be range searching hash indexes aren’t helpful.
    • Range index connects ties at the leaf level using a bw-tree. Latch free variant of a b-tree. Index pages are variable sized. Index page contains an ordered set of key values. For each value there’s a pointer to either another internal index page or to a leaf page pointing to a data row. At the data row there could be a list of rows linked together on the key value
    • A key distinction for bw-trees is the page pointer is a logical PID Rather than a physical memory address.  PID is the position in the mapping table which contains the physical location. Pages are never updated they are replaced and the mapping table is updated so the PID uses the new physical address
    • At the leaf level still uses the PID which points to the data row
    • Delta record – a page containing an insert or update. The mapping table is updated with the physical address of the delta record. Searching/range scanning may have to traverse this chain that is until consolidation occurs. Combining pages into a new base page. Benefit is not having to update tree during each update
    • **It’s not clearly explained what the benefit of the mapping table is. **
    Chapter 5
    • Pessimistic acquiring a shared lock or range lock. Opportunistic checking that rows haven’t changed before commit of transaction
    • In MVCC SQL assumes that concurrent transactions won’t interfere and performs validation checks once a transaction commits to ensure it obeys the required isolation level
    • Isolation levels used in in memory OLTP.
    • Snapshot
    • Repeatable read
    • Serializable
    • There should be a description of the read phenomenon
    • Cross container transactions are really two sub transactions. one disk based and one in memory transaction
    • If executing an explicit transaction it will run in read committed. If accessing a in memory OLTP then this will error as this is an unsupported isolation level.  Do specify the isolation level as a table hint.  Read committed is supported during auto-commit or implicit transactions
    • Technically for a single statement transaction there is no difference between read committed and snapshot on a memory optimized table
    • Snapshot cannot be used in cross container transactions because each is considered/implemented as a sub transaction
    • The default isolation level is snapshot. There’s a new DB option to set it default for memory optimized tables. Alternatively set the isolation level with a table hint
    • Monitoring transactions sys.dm_db_xtp_transactions xtp_transaction_id is the sequential global transaction ID used to begint and endt
    • Read write conflict a transaction tries to update a row that has been updated by an active transaction. But since there are no locks anomalies can still occur.
    Validation phase
    1. Check for violations of isolation levels see read committed example on pg 107
    2. Wait for commit dependencies to goto 0
    3. Log the changes
    • This is the only waiting that can occur in memory optimized tables.
    • Waits could happen for commit dependencies, transaction log writes. Both which can be very short.
    1. Validate no changes violate isolation levels. Basically is the transaction globally transactionally consistent.
    2. Commit dependencies. A CD is when a transaction reads data from another transaction that has finished phase 1 of validation with valid end time stamps (transaction Id or actual time stamps?).  In practice very rare. There is an EE and wait type. Waiting_for_dependencytx_event.
    3. Logging. If any of the modified tables were created with SCHEMA_AND_DATA then SQL must log. Sql reads the write set to determine what to log. transactions track all there changes in the write set. A list of delete/insert ops with pointers to the version associated with each op
    1. Once a log record is hardened to storage the state of the transaction is changed to committed. Then reduce the dependency counters by one.
    1. Post processing. Updates time stamps of each row inserted or deleted by this transaction. Switches time stamps from transaction IDs to real time stamps.
    • Garage collection – Find rows who’s time stamps are earlier than the oldest active transaction on the system. It is marked stale
    • The actual collection process is completed in cooperation with user threads. During a regular user access. If a stale thread is encountered it will be makes expired or removed.
    Chapter 6
    The transaction log consists of
    • Log streams – changes made by committed transactions. Insertion and deletions of row versions.
    • Checkpoint streams
    • Data streams – contains all versions inserted during a timestamp interval
    • Delta streams – associated w data streams, contain a list of integers which versions its corresponding data steam have been deleted
    • No longer uses LSNs for sequencing uses end Ts for serialization order
    • Still only supports one log per database but due to more efficient logging the potential for log bottleneck is reduced.
    • Reduced logging – no need to log index operations. All indexes are rebuilt during recovery
    • Each atomic change isn’t logged to a single log record rather InMemory OLTP will combine into a single record
    • InMemory OLTP does not need to log uncommitted transactions – opportunistic remember?
    • For InMemory OLTP no logging is used for info. Just redo
    • 100 inserts into a disk based table generates 200 log records. Into a InMemory OLTP table only 3.
    • Continuous check pointing – checkpoints are continuous and incremental as activity accumulates. Rather than periodic like in disk based tables
    • Streaming IO – ie. sequential
    • Checkpoints are manual or every 512mb
    • Checkpoint files are stored in file stream files
    • Two types of checkpoint files data and delta. Called a checkpoint file pair
    • Data file – new versions of rows in a time interval. Append only files. Once closed only read
    • Delta file – info about about which rows have been deleted from the corresponding data file. Once the data file is closed so is the delta file. At recover used to filter out rows from the data file
    • CFP pairs can be recovered from concurrently
    • On insert a continuous checkpoint will append the row to a checkpoint file
    • A checkpoint event will close a set of files state goes from under construction to active. No more writes to these files but can be referred to by other insert update operations from other CFPs
    • Every checkpoint generates a new set of files. As such older files may no longer be needed and are merged
    • If no open transactions refer to the CFPs they transition to non-active. After a log backup they are no longer needed and can be removed by the garbage collection process
    • See page 135 for states. db_xtp_checkpoint_files
    • Pre-created – pre built as an optimization
    • The number of pre-created CFPs is equal to the number of schedulers
    • Under construction – open CFP being with continuous checkpoints being written to. A checkpoint event will close and change to active
    • Active – CFPs containing inserted/deleted rows for the last checkpoint event. Continuous checkpoint is no longer writing new data but deleted are added to delta files. During recovery active CFPs contain all inserted/deleted rows needed to restore the data before applying the tail of the log backup
    • Storage array – 8192 entry array holding CFP entries. Max 256GB of CFP data
    • CFPs in the storage array along with the tail of the log backup represent all the needed data to recover a InMemory OLTP table in a db
    • Before there are any active CFPs we can use the transaction log for recovery
    • Calling checkpoint is really closing a continuous checkpoint
    • It changes the under construction CFPs to active. Adds them to the storage array. All log entries on InMemory OLTP tables  not covered by a previous checkpoint are converted into active CFPs.
    • Once finished:
    1. all buffered InMemory OLTP writes are flushed to data and delta files
    2. checkpoint inventory (file descriptors from the previous checkpoint interval and any files added by the current checkpoint) are hardened to the log
    3. the location of the inventory is hardened to the log
    • A checkpoint event includes a timestamp which indicates that the effects of all transactions are before the checkpoint timestamp are in checkpoint files rather than the log thus the log is not needed to recover them. Still need a log backup to truncate log.
    • Each checkpoint generates new files. More files, the longer crash recover will take
    • We merge files with adjacent timestamp ranges when their active content drops below a percentage. See page 145 for table. Also when adjacent files are < 50% full. Can happen during manual checkpoints
    • Due to this process the largest storage space CFPs can use is two times larger than the corresponding memory optimized size
    • You can merge manually if needed. Very rare
    • Do the demos from this section
    • Merged source files are removed from the storage array and are eligible for garbage collection after checkpoint and log backups.
    Garbage Collection of checkpoint files
    1. Checkpoint – guarantees that data in checkpoint files is no longer needed.
    2. Log backup – guarantee that the log truncation point is safely beyond where the CFPs will be needed.
    3. Unused files garbage collected – uses the same mechanism as regular file stream data.
    • We can force GC and file stream GC via a SP Page 152-3
    • If you do a manual GC if files you’ll need to make sure you account for the additional log backups.
    • Starts after the most recent checkpoint inventory has been discovered during a scan of the tail of the log. Sql server tells InMemory OLTP this location and can begin a parallel recovery. The global transaction timestamp is initialized with the highest value found in the transactions recovered
    • Each delta file is used as a filter to eliminate rows from the corresponding data file.  Each CFPs can be processed in parallel. One thread per core
    • As data loads rows are linked into their indexes. Hash indexes – rows are linked into their buckets. Range indexes – added to an existing chain or created if the key doesn’t duplicate one already encountered during recovery
    • CFP load is complete the tail of their is replayed front the time of the last checkpoint. Finished!
    Chapter 7
    • A table or stored proc can be natively compiled
    • The performance benefits of native compilation are best in complex stored procedures. Use for most performance critical parts
    • TSQL for tables and stored procedures generate C code. Recompiled at instance startup
    • The execution plan is part of the DLL. There is no re compilation. You must drop and recreate if you want a new plan. Compiled ok first execution and after instance restart and AG failover
    • Intended for short basic operations many complex constructs are not allowed
    • No parameter sniffing at compilation. Compiled with parameters UNKNOWN
    • When using hash indexes the query must specify all key columns in the where clause or else scan. Also based on equality. Must specify whole value not a partial string
    • Range indexes are only forward. There are no previous pointers. You’d need an index ascending and descending
    • No Halloween protection in plan. On disk based tabled this is via spool operators. In InMemory OLTP it’s done via statement id in the row header
    • No parallel plans
    • No auto update stats. Because there’s no row mod counter
    • When creating stored procedures be sure data is loaded at compilation time
    • Need native compilation to achieve the performance gains of InMemory OLTP. Needed to reduce CPU instructions so couldn’t do interpreted stored procedures
    • Query stats are not captured by default. Need to enable with a stored procedure. Page 178. Can be used to collect on one object or all
    Chapter 8
    • Memory allocation is critical. Running out of memory can cause operations to fail. If under memory pressure InMemory OLTP will more aggressively clean up old row versions
    • 256GB max for all InMemory OLTP tables in a DB. This is because of the number of checkpoint files
    • Rule of thumb is plan for 2x size of data for memory allocation. More of write heavy. Less if read heavy.  See pg 187 for more details
    • Use resource governor to control memory allocations. Many Dbs can be in the same pool. Db must be taken offline then on-lined to use pool
    • In sys.indexes range indexes are type 2 same as non clustered. Hash indexes are 7
    Best practices 
    • Use collate bin2 at the column level rather than db level
    • Be accurate in you bucket count. Should be close to the number of unique values
    • For low cardinality use range instead of hash
    • Stats are not automatically updated on InMemory OLTP tables
    • Memory optimized table variables behave the same as table variables. But don’t use tempdb. Could help relieve contention
    Add summary of 199
    Good summary of where InMemory OLTP fits in.