LSM

LSM

http://www.benstopford.com/2015/02/14/log-structured-merge-trees/

  1. idea: collect and batch updates in memory. 放满buffer再写入。删除的时候先放到tombstone里,再删除。
  2. merge 的想法:使用bloom filter在query的时候可以知道这个文件会不会在里面。在merge update可以prune tree。在merge的时候删tombstone。
  3. LSMT,memory里的数据结构是C0,disk里的是C1
    1. fill memory
    2. spill to disk. use SSTables(连续key value pair) and SSIndexes(key index pair) support random access
    3. Index and bloom filter in memory
    4. merge perform comaction
    5. write-ahead logs to aid recovery
    6. C0 is smaller and entirely resident in memory, whereas C1 is resident on disk. New records are inserted into the memory-resident C0 component. If the insertion causes the C0 component to exceed a certain size threshold, a contiguous segment of entries is removed from C0 and merged into C1 on disk.

http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html)

Bloom Filter

  • what: one bit per hash function consumed per key. ignore collision. use bit to reprensent exist or not.
  • why: secondary storage is slow to search and access
  • whow: use k hash function for the item and get k array, set these all in 1 and when we query this item, we will check all k array result