
Rethinking Database Algorithms for Phase Change Memory

Table 1: Comparison of memory technologies

                      DRAM            PCM              NAND Flash       HDD
Read energy           0.8 J/GB        1 J/GB           1.5 J/GB         28-65 J/GB
Write energy          1.2 J/GB        6 J/GB           17.5 J/GB        28-65 J/GB
Idle power            ~100 mW/GB      ~1 mW/GB         1-10 mW/GB       ~10 W/TB
Endurance             (unlimited)     10^6 - 10^8      10^4 - 10^5      (unlimited)
Page size             64B             64B              4KB              512B
Page read latency     20-50 ns        ~50 ns           ~25 us           ~5 ms
Page write latency    20-50 ns        ~1 us            ~500 us          ~5 ms
Write bandwidth       ~GB/s per die   50-100 MB/s      5-40 MB/s        ~200 MB/s
                                      per die          per die          per drive
Erase latency         N/A             N/A              ~2 ms            N/A
Density               1x              2-4x             4x               N/A

Note: The table contents are based mainly on [10, 15, 22].
Figure 1: Currents and timings (not to scale) for SET, RESET, and READ operations on a PCM cell. The plot shows the RESET pulse raising the cell temperature above Tmelt and the SET pulse holding it above Tcryst. For the phase change material Ge2Sb2Te5, Tmelt is about 610°C and Tcryst is about 350°C.

Figure 2: Candidate main memory organizations with PCM: (a) PCM replaces DRAM as the main memory below the CPU caches; (b) both DRAM and PCM serve as main memory, with the DRAM buffer explicitly controlled by software; (c) PCM main memory with DRAM managed as another level of transparent hardware cache. In all three, SSD and hard disk form the storage layer.

Section 2 provides background on PCM, the proposals for using it as the primary main memory, and the key challenge of overcoming its write limitations.

2.1 PCM Technology

Phase change memory (PCM) is a byte-addressable non-volatile memory that exploits the large resistance contrast between amorphous and crystalline states in so-called phase change materials, such as chalcogenide glass. The difference in resistance between the high-resistance amorphous state and the low-resistance crystalline state is typically about five orders of magnitude and can be used to infer logical states of binary data (high represents 0, low represents 1).

Programming a PCM device involves the application of electric current, which produces the temperature changes that either SET or RESET the cell, as shown schematically in Figure 1. To SET a PCM cell to its low-resistance state, an electrical pulse is applied to heat the cell above the crystallization temperature Tcryst but below the melting temperature Tmelt of the phase change material. The pulse is sustained for a sufficiently long period for the cell to transition to the crystalline state. To RESET the cell to its high-resistance amorphous state, a much larger electrical current is applied in order to raise the temperature above Tmelt; after the cell has melted, the pulse is abruptly cut off, causing the melted material to quench into the amorphous state. To READ the current state of a cell, a small current that does not perturb the cell state is applied to measure the resistance. At normal operating temperatures (below 120°C, well below Tcryst), PCM offers many years of data retention.

2.2 Using PCM in the Memory Hierarchy

To see where PCM may fit in the memory hierarchy, we need to know its properties. Table 1 compares PCM with DRAM (the technology for today's main memory), NAND Flash (the technology for today's solid state drives), and HDD (hard disk drives), and shows the following points:

- Compared to DRAM, PCM's read latency is close to that of DRAM, while its write latency is about an order of magnitude slower. PCM offers a density advantage over DRAM, which means more memory capacity for the same chip area, or potentially a lower price per capacity. PCM is also more energy efficient than DRAM in idle mode.

- Compared to NAND Flash, PCM can be programmed in place regardless of the initial cell states, i.e., without Flash's expensive erase operation. Therefore, its sequential and random accesses show similar (and far superior) performance. Moreover, PCM has orders of magnitude higher write endurance than Flash.

Because of these attractive properties, PCM is being incorporated in mobile handsets [24], and recent computer architecture and systems studies have argued that PCM is a promising candidate for main memory in future mainstream computer systems [9, 15, 22].

Figure 2 shows three alternative proposals from recent studies for using PCM in the main memory system [9, 15, 22]. Proposal (a) replaces DRAM with PCM to achieve larger main memory capacity. Even though PCM is slower than DRAM, clever optimizations have been shown to reduce application execution time on PCM to within a factor of 1.2 of that on DRAM [15]. Proposals (b) and (c) both include a small amount of DRAM in addition to PCM, so that frequently accessed data can be kept in the DRAM buffer to improve performance and reduce PCM wear. Their difference is that proposal (b) gives software explicit control of the DRAM buffer [9], while proposal (c) manages the DRAM buffer as another level of transparent hardware cache [22]. It has been shown that a relatively small DRAM buffer (3% of the size of the PCM storage) can bridge most of the latency gap between DRAM and PCM [22].

Table 2: General terms

Term  Description                                         Example
Erb   Energy consumed for reading a PCM bit               2 pJ
Ewb   Energy consumed for writing a PCM bit               16 pJ
Tl    Latency of accessing a cache line from PCM          230 cycles
Tw    Additional latency of writing a word to PCM         450 cycles
C     Size in bytes of the largest CPU cache              8 MB
L     Cache line size in bytes                            64 B
W     Word size used in PCM writes                        8 B
Nl    Number of cache line fetches from PCM
Nlw   Number of cache line write backs to PCM
Nw    Number of words written to PCM
∆     Average number of modified bits per modified word

2.3 Challenge: Writes to PCM Main Memory

One major challenge in effectively using PCM is overcoming the relative limitations of its write operations. Compared to its read operations, PCM writes incur higher energy consumption, higher latency, lower bandwidth, and limited endurance, as discussed next.

High energy consumption. Compared to reading a PCM cell, a write operation that SETs or RESETs a PCM cell draws higher current, uses higher voltage, and takes longer (Figure 1). A PCM write often consumes 6-10X more energy than a read [15].

High latency and low bandwidth. In a PCM device, the write latency of a PCM cell is determined by the longer SET time, which is about 3X slower than a read operation [15]. Moreover, many PCM prototypes support iterative writing of only a limited number of bits per iteration in order to limit the instantaneous current level; several prototypes support x2, x4, and x8 write modes in addition to the fastest x16 mode [3]. This limitation is likely to hold in the future as well, especially for PCM designed for power-constrained platforms. Because of the limited write bandwidth, writing 64B of data often requires several rounds of writing, leading to the roughly 1 us write latency in Table 1.

Limited endurance. Existing PCM prototypes have a write endurance ranging from 10^6 to 10^8 writes per cell. With a good wear-leveling algorithm, a PCM main memory can last for several years under realistic workloads [21]. However, because such wear leveling must be done at the memory controller, the wear-leveling algorithms need to have small memory footprints and be very fast. Therefore, practical algorithms are simple, and in many cases their effectiveness significantly decreases in the presence of extreme hot spots in the memory. For example, even with the wear-leveling algorithms in [21], continuously updating a counter in PCM in a 4GHz machine with 16GB of PCM could wear a PCM cell out in about 4 months; without wear leveling, the cell could wear out in less than a minute.

Recent studies have proposed various hardware optimizations to reduce the number of PCM bits written [8, 15, 31, 32]. For example, the PCM controller can perform data comparison writes, where a write operation is replaced with a read-modify-write operation in order to skip programming unchanged bits [31]. Another proposal is partial writes for only the dirty words [15]. In both optimizations, when writing a chunk of data in multiple iterations, the set of bits to write in every iteration is often hard-wired for simplicity; if all the hard-wired bits of an iteration are unchanged, the iteration can be skipped [7]. However, these architectural optimizations reduce the volume of writes by only a factor of about 3. We believe that applications such as databases can play an important role in complementing such architectural optimizations by carefully choosing their algorithms and data structures in order to reduce the number of writes, even at the expense of additional reads.

3. PCM-FRIENDLY DB ALGORITHMS

In this section, we consider the impact of PCM on database algorithm design. Specifically, reducing PCM writes becomes an important design goal. We discuss general considerations in Section 3.1. Then we re-examine two core database techniques, B-tree index and hash joins, in Sections 3.2 and 3.3, respectively.

3.1 Algorithm Design Considerations

Section 2 described three candidate organizations of future PCM main memory, as shown in Figure 2. Their main difference is whether or not to include a transparent or software-controlled DRAM cache. For algorithm design purposes, we consider an abstract framework that captures all three candidate organizations: we focus on a PCM main memory and view any additional DRAM as just another transparent or software-controlled cache in the hierarchy. This enables us to focus on PCM-specific issues.

Because PCM is the primary main memory, we consider algorithm designs in main memory. There are two traditional design goals for main memory algorithms: (i) low computation complexity, and (ii) good CPU cache performance. Power efficiency has recently emerged as a third design goal. Compared to DRAM, one major challenge in designing PCM-friendly algorithms is to cope with the asymmetry between PCM reads and PCM writes: PCM writes consume much higher energy, take much longer to complete, and wear out PCM cells (recall Section 2). Therefore, one important design goal of PCM-friendly algorithms is to minimize PCM writes.

What granularity of writes should we use in algorithm analysis with PCM main memory: (a) bits, (b) words, or (c) cache lines? All three granularities are important for computing PCM metrics. Choice (a) impacts PCM endurance, choices (a) and (c) affect PCM energy consumption, and choice (b) influences PCM write latency. The drawback of choice (a) is that the relevant metric is the number of modified bits (recall that unmodified bits are skipped); this is difficult to estimate because it is often affected not only by the structure of an algorithm but also by the input data. Fortunately, there is often a simple relationship between (a) and (b). Denote by ∆ the average number of modified bits per modified word; ∆ can be estimated for a given input. Therefore, we focus on choices (b) and (c).

Let Nl (Tl) be the number (latency, respectively) of cache line fetches (a.k.a. cache misses) from PCM, Nlw be the number of cache line write backs to PCM, and Nw (Tw) be the number (latency, respectively) of modified words written. Let Erb (Ewb) be the energy for reading (writing, respectively) a PCM bit, and let L be the number of bytes in a cache line. Table 2 summarizes the notation used in this paper. We model the key PCM metrics as follows:

TotalWear = NumBitsModified ≈ ∆ · Nw
Energy ≈ 8L · (Nl + Nlw) · Erb + ∆ · Nw · Ewb
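The remaining metric, the total PCM access latency, adds line-fetch latency and word-write latency, Nl · Tl + Nw · Tw, as explained next. As an illustration only, here is a minimal C++ sketch of how an analysis (or a simulator) could accumulate these three metrics for one algorithm run, using the example constants from Table 2; the struct and function names are ours, not the paper's.

#include <cstdint>
#include <cstdio>

// Parameters from Table 2 (example values).
struct PcmParams {
    double erb = 2e-12;   // Erb: energy per bit read (J)
    double ewb = 16e-12;  // Ewb: energy per bit written (J)
    int    tl  = 230;     // Tl: cache line fetch latency (cycles)
    int    tw  = 450;     // Tw: extra latency per modified word written (cycles)
    int    L   = 64;      // cache line size (bytes)
};

// Counts produced by analyzing or simulating an algorithm.
struct PcmCounts {
    uint64_t nl    = 0;   // Nl: cache line fetches from PCM
    uint64_t nlw   = 0;   // Nlw: cache line write backs to PCM
    uint64_t nw    = 0;   // Nw: modified words written to PCM
    double   delta = 16;  // average modified bits per modified word (input dependent)
};

// TotalWear ~= delta * Nw: bits actually programmed, assuming unchanged bits are skipped.
double total_wear(const PcmCounts& c) { return c.delta * c.nw; }

// Energy ~= 8L*(Nl + Nlw)*Erb + delta*Nw*Ewb: line reads plus modified-bit programming.
double energy(const PcmCounts& c, const PcmParams& p) {
    return 8.0 * p.L * (c.nl + c.nlw) * p.erb + c.delta * c.nw * p.ewb;
}

// TotalPCMAccessLatency ~= Nl*Tl + Nw*Tw: the write-back line read itself is assumed hidden.
double access_latency(const PcmCounts& c, const PcmParams& p) {
    return double(c.nl) * p.tl + double(c.nw) * p.tw;
}

int main() {
    PcmParams p;
    PcmCounts c;
    c.nl = 1000000; c.nlw = 200000; c.nw = 600000; c.delta = 16;
    std::printf("wear=%.3g bits, energy=%.3g J, latency=%.3g cycles\n",
                total_wear(c), energy(c, p), access_latency(c, p));
    return 0;
}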
a Sorted b Unsorted c Unsorted w bitmap,T otalP CM AccessLatency Nl Tl Nw Tw. Figure 3 B tree node organizations, The total wear and energy computations are straightfor. ward The latency computation requires explanations The Table 3 Terms used in analyzing hash joins. first part Nl Tl is the total latency of cache line fetches Term Description. from PCM The second part Nw Tw is the estimated im MR MS Number of records in relation R and S respectively. pact of cache line write backs to PCM on the total time LR LS Record sizes in relation R and S respectively. In a traditional system the cache line write backs are per Number of cache line accesses per hash table visit. formed asynchronously in the background and often com when building the hash table on R records. pletely hidden Therefore algorithm analysis typically ig Number of cache line accesses per hash table visit. NhS when probing the hash table for S records, nores the write backs However we find that because of the. HashT ablelw Number of line write backs per hash table insertion. asymmetry of writes and reads PCM write latency can keep HashT ablew Number of words modified per hash table insertion. PCM busy for a sufficiently long time to stall front end cache M atchP erR Number of matches per R record. line fetches significantly A PCM write consists of i a read M atchP erS Number of matches per S record. of the cache line from PCM to identify modified words then. ii writing modified words in possibly multiple rounds The the end of the array then increment the counter For. above computation includes ii as Nw Tw while the latency a deletion one can overwrite the entry to delete with. of i Nlw Tl is ignored because it is similar to a traditional the last entry in the array then decrement the counter. cache line write back and thus likely to be hidden Therefore an insertion deletion incurs 3 word writes. 3 2 B Tree Index Unsorted with bitmap We further improve the un. sorted organization by allowing the key array to have. As case studies we consider two core database techniques. holes The counter field is replaced with a bitmap, for memory resident data B trees in the current subsec. recording valid locations An insertion writes the new. tion and hash joins in the next subsection where the main. entry to an empty location and updates the bitmap,memory is PCM instead of DRAM.
using 3 word writes while a deletion updates only the. B trees are preferred index structures for memory resident. bit in the bitmap using 1 word write A search incurs. data because they optimize for CPU cache performance. the instruction overhead of a more complicated search. Previous studies recommend that B tree nodes be one or a. process For 8 byte keys and pointers a 64 bit bitmap. few cache lines large and aligned at cache line boundaries 5. can support nodes up to 1024 bytes large which is,6 12 23 For DRAM based main memory the costs of. more than enough for supporting typical cache friendly. search insertion deletion are similar except in those cases. B tree nodes, where insertions deletions incur node splits merges in the. tree In contrast for PCM based main memory even a nor Given the pros and cons of the three node organizations. mal insertion deletion that modifies a single leaf node can we study the following four variants of B trees. be more costly than a search in terms of total wear energy Sorted a normal cache friendly B tree All the non. and elapsed time because of the writes involved leaf and leaf nodes are sorted. We would like to preserve the good CPU cache perfor Unsorted a cache friendly B tree with all the non. mance of B trees while reducing the number of writes leaf and leaf nodes unsorted. A cache friendly B tree node is typically implemented as Unsorted leaf a cache friendly B tree with sorted. shown in Figure 3 a where all the keys in the node are non leaf nodes but unsorted leaf nodes Because most. sorted and packed and a counter keeps track of the number insertions deletions do not modify non leaf nodes the. of valid keys in the array The sorted key array is maintained unsorted leaf nodes may capture most of the benefits. upon insertions and deletions In this way binary search can Unsorted leaf with bitmap This variant is the. be applied to locate a search key However on average half same as unsorted leaf except that leaf nodes are orga. of the array must be moved to make space for insertions nized as unsorted nodes with bitmaps. and deletions resulting in a large number of writes Sup Our experimental results in Section 4 show that the unsorted. pose that there are K keys and K pointers in the node and schemes can significantly improve total wear energy con. every key every pointer and the counter have size equal sumption and run time for insertions and deletions Among. to the word size W used in PCM writes Then an inser the three unsorted schemes unsorted leaf is the best for in. tion deletion in the sorted node incurs 2 K 2 1 K 1 dex insertions and it incurs negligible index search overhead. word writes on average while unsorted leaf with bitmap achieves the best index dele. To reduce writes we propose two simple unsorted node tion performance. organizations as shown in Figures 3 b and 3 c, Unsorted As shown in Figure 3 b the key array 3 3 Hash Joins. is still packed but can be out of order A search has One of the most efficient join algorithms hash joins are. to scan the array sequentially in order to look for a widely used in data management systems Several cache. match or the next smaller bigger key On the other friendly variants of hash joins are proposed in the litera. hand an insertion can simply append the new entry to ture 1 4 27 Most of these algorithms are based on the. Algorithm 1 Existing algorithm simple hash join Algorithm 3 Our proposal virtual partitioning 2. 
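To make the node-level write counts above concrete, the following is a small C++ sketch of the unsorted leaf node with bitmap (Figure 3(c)): an insertion writes the new key and pointer into a free slot and flips one bitmap bit (about 3 word writes), a deletion clears a single bitmap bit (1 word write), and a search scans only the valid slots. The layout, slot count, and names are illustrative assumptions, not the paper's implementation.

#include <cstdint>
#include <optional>

// Unsorted leaf node with a validity bitmap (Figure 3(c)).
// 8-byte keys and pointers; a 64-bit bitmap supports up to 64 slots,
// i.e. nodes up to several cache lines large.
template <int SLOTS = 28>              // e.g. roughly a 512B node
struct UnsortedBitmapLeaf {
    uint64_t bitmap = 0;               // bit i set => slot i holds a valid entry
    uint64_t keys[SLOTS] = {};
    uint64_t ptrs[SLOTS] = {};

    // Insertion: write key + pointer into a free slot, then set one bitmap bit.
    // PCM cost: ~3 modified words, independent of node size.
    // (A conventional sorted node would instead shift about half of the entry
    //  array, on the order of K+1 word writes for a node with K keys.)
    bool insert(uint64_t key, uint64_t ptr) {
        for (int i = 0; i < SLOTS; i++) {
            if (!(bitmap & (1ULL << i))) {
                keys[i] = key;              // 1 word write
                ptrs[i] = ptr;              // 1 word write
                bitmap |= (1ULL << i);      // 1 word write
                return true;
            }
        }
        return false;                       // node full: caller splits the node
    }

    // Deletion: clear one bitmap bit.  PCM cost: ~1 modified word.
    bool erase(uint64_t key) {
        for (int i = 0; i < SLOTS; i++) {
            if ((bitmap & (1ULL << i)) && keys[i] == key) {
                bitmap &= ~(1ULL << i);     // 1 word write
                return true;
            }
        }
        return false;
    }

    // Search: linear scan of valid slots (entries are unsorted, so no binary search).
    std::optional<uint64_t> find(uint64_t key) const {
        for (int i = 0; i < SLOTS; i++)
            if ((bitmap & (1ULL << i)) && keys[i] == key) return ptrs[i];
        return std::nullopt;
    }
};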
Build phase Partition phase, 1 for i 0 i MR i do 1 htsize MR hash table per entry metadata size. 2 r record i in Relation R 2 P MR MS 2 MR LR 1 L MS LS 1. 3 insert r into hash table L htsize C, Probe phase 3 initiate ID lists RList 0 P 1 and SList 0 P 1.
1 for j 0 j MS j do 4 for i 0 i MR i do virtually partition R. 2 s record j in Relation S 5 r record i in Relation R. 3 probe s in the hash table 6 p hash r modulo P, 4 if there are match es then 7 append ID i into RList p. 5 generate join result s from the matching records 8 for j 0 j MS j do virtually partition S. 6 send join result s to the upper level operator 9 s record j in Relation S. 10 p hash s modulo P,11 append ID j into SList p, Algorithm 2 Existing algorithm cache partitioning 2 Join phase. Partition phase 1 for p 0 p P p do join Rp and Sp, 1 htsize MR hash table per entry metadata size 2 for each i in RList p do. 2 P MR LR MS LS htsize C 3 r record i in Relation R. 3 for i 0 i MR i do partition R 4 insert r into hash table. 4 r record i in Relation R 5 for each j in SList p do. 5 p hash r modulo P 6 s record j in Relation S, 6 copy r to partition Rp 7 probe s in the hash table. 7 for j 0 j MS j do partition S 8 if there are match es then. 8 s record j in Relation S 9 generate join result s from the matching records. 9 p hash s modulo P 10 send join result s to the upper level operator. 10 copy s to partition Sp,Join phase,L x 1 cases,1 for p 0 p P p do.
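For concreteness, here is a compact C++ rendering of Algorithm 1 (simple hash join), using a chained hash table of the kind described in Section 4.3, which stores a hash code and a pointer to the R record per entry. The record layout, hash function, and result callback are placeholder choices for this sketch, not the paper's implementation.

#include <cstdint>
#include <functional>
#include <vector>

struct Record { uint64_t join_key; /* ... payload ... */ };

// Chained hash table storing (hash code, pointer to R record) per entry.
struct HashTable {
    struct Entry { uint64_t hash; const Record* rec; int next; };
    std::vector<int>   slots;    // head index per bucket, -1 if empty
    std::vector<Entry> entries;

    explicit HashTable(size_t nslots) : slots(nslots, -1) {}

    void insert(uint64_t hash, const Record* rec) {
        size_t s = hash % slots.size();
        entries.push_back({hash, rec, slots[s]});
        slots[s] = (int)entries.size() - 1;       // writes: new entry + bucket head
    }

    template <class Fn>
    void probe(uint64_t hash, uint64_t key, Fn&& on_match) const {
        for (int i = slots[hash % slots.size()]; i != -1; i = entries[i].next)
            if (entries[i].hash == hash && entries[i].rec->join_key == key)
                on_match(*entries[i].rec);
    }
};

// Algorithm 1: build a hash table on R, then probe it with every S record.
void simple_hash_join(const std::vector<Record>& R, const std::vector<Record>& S,
                      const std::function<void(const Record&, const Record&)>& emit) {
    HashTable ht(R.empty() ? 1 : R.size());
    for (const Record& r : R)                       // build phase
        ht.insert(std::hash<uint64_t>{}(r.join_key), &r);
    for (const Record& s : S)                       // probe phase
        ht.probe(std::hash<uint64_t>{}(s.join_key), s.join_key,
                 [&](const Record& r) { emit(r, s); });
}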
2 join Rp and Sp using simple hash join,cache line boundaries 1 x 1 x L. following two representative algorithms Table 3 defines the. terms used in describing and analyzing the algorithms Figure 4 Computing average number of cache. Simple Hash Join As shown in Algorithm 1 in the build misses for unaligned records A record of size. phase the algorithm scans the smaller build relation R For yL x bytes y 0 L x 0 has L possible locations. every build record it computes a hash code from the join relative to cache line boundaries Accessing the. key and inserts the record into the hash table In the probe record incurs on average x 1 L. phase the algorithm scans the larger probe relation S For cache misses. every probe record it computes the hash code and probes. the hash table If there are matching build records the al Cache Partitioning When both input relation sizes are. gorithm computes the join results and sends them to upper fixed if we reduce the record sizes LR LS then the num. level query operators bers of records MR MS increase Therefore simple hash. The cost of this algorithm can be analyzed as in Table 4 join incurs a large number of cache misses when record sizes. with the terms defined in Table 3 Here we assume that the are small The cache partitioning algorithm solves this prob. hash table does not fit into CPU cache which is usually the lem As shown in Algorithm 2 in the partition phase the. case We do not include PCM costs for the join results as two input relations R and S are hash partitioned so that. they are often consumed in the CPU cache by higher level every pair of partitions Rp and Sp can fit into the CPU. operators in the query plan tree cache Then in the join phase every pair of Rp and Sp are. The cache misses of the build phase are caused by reading joined using the simple hash join algorithm. all the join keys min MRLLR MR and accessing the hash The cost analysis of cache partitioning is straightforward. table MR NhR When the record size is small the first as shown in Table 4 Note that we assume that modified. term is similar to reading the entire build relation When cache lines during the partition phase are not prematurely. the record size is large it incurs roughly one cache miss evicted because of cache conflicts Observe that the number. per record Note that because multiple entries may share a of cache misses using cache partitioning is constant if the. single hash bucket the lines written back can be a subset of relation sizes are fixed This addresses the above problem. the lines accessed for a hash table visit For the probe phase of simple hash join. the cache misses are caused by scanning the probe relation 2. For simplicity Algorithm 2 and Algorithm 3 assume perfect par. MSLLS accessing the hash table MS NhS and accessing titioning when generating cache sized partitions To cope with. matching build records in a random fashion The latter can data skews one can increase the number of partitions P so that. be computed as shown in Figure 4 The other computations even the largest partition can fit into the CPU cache Note that. are straightforward using a larger P does not change the algorithm analysis. Table 4 Cost analysis for three hash join algorithms. Algorithm Cache Line Accesses from PCM Nl Cache Line Write Backs Nlw Words Written Nw. Simple Build min MRLLR MR MR NhR MR HashT ablelw MR HashT ablew. Hash MS LS,MS NhS MS M atchP erS LRL 1 1 0 0,Cache Partition 2 MRLLR MSLLS MR LR. Partition MR LR,MR LR MS LS 2 2 2,Virtual Partition L. 
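The average cache-miss count that Figure 4 illustrates can also be written out as a formula. This is a standard derivation under the assumption, implicit in the figure, that a record of size yL + x bytes (y >= 0, 0 < x <= L) starts at each of the L byte offsets within a cache line with equal probability:

% (L - x + 1) start offsets touch y + 1 lines; the remaining (x - 1) offsets touch y + 2 lines.
\mathbb{E}[\text{misses}] \;=\; (y+1)\,\frac{L-x+1}{L} \;+\; (y+2)\,\frac{x-1}{L} \;=\; y + 1 + \frac{x-1}{L}

This is the per-record term charged in Table 4 for accessing scattered build and probe records in place.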
The remaining rows of Table 4 cover virtual partitioning: its partition phase scans both relations and, in addition, reads, writes back, and modifies only the compact record-ID lists, on the order of 2(MR + MS)/L extra cache line accesses and write backs and 2(MR + MS)/W words written (about two bytes per record); its join phase then accesses the R and S records in place, charged using the unaligned-record formula of Figure 4. Figure 5 panels: (a) Cache accesses Nl, (b) Total wear, (c) Energy, (d) Total PCM access latency.
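Before the comparison of the three algorithms, here is a small C++ sketch of the record-ID lists that virtual partitioning (Algorithm 3) keeps instead of copying records: deltas between successive record IDs are stored as two-byte integers, with 0xFFFF reserved as an escape meaning that a full record ID follows, as the text below describes. Details beyond what the text specifies (for instance, the 4-byte width of the full ID) are assumptions of this sketch.

#include <cstdint>
#include <vector>

// Delta-encoded record-ID list for one virtual partition.
class IdList {
    std::vector<uint16_t> data_;
    uint32_t last_ = 0;
    bool empty_ = true;
public:
    void append(uint32_t id) {
        uint32_t delta = id - last_;
        if (!empty_ && delta < 0xFFFF) {
            data_.push_back((uint16_t)delta);          // common case: one 2-byte write
        } else {                                       // first ID, or a rare large gap
            data_.push_back(0xFFFF);                   // escape marker
            data_.push_back((uint16_t)(id & 0xFFFF));  // full ID, low half (assumed 4B IDs)
            data_.push_back((uint16_t)(id >> 16));     // full ID, high half
        }
        last_ = id;
        empty_ = false;
    }

    // Decode back into record IDs; the join phase uses these to visit records in place.
    std::vector<uint32_t> decode() const {
        std::vector<uint32_t> ids;
        uint32_t cur = 0;
        for (size_t i = 0; i < data_.size(); i++) {
            if (data_[i] == 0xFFFF) {                  // escape: a full ID follows
                cur = (uint32_t)data_[i + 1] | ((uint32_t)data_[i + 2] << 16);
                i += 2;
            } else {
                cur += data_[i];                       // apply the stored delta
            }
            ids.push_back(cur);
        }
        return ids;
    }
};

With such lists, the partition phase of Algorithm 3 reduces to one append(i) call per record, and the join phase decodes each pair of lists and runs an in-place hash join over the referenced records.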
simple hash join cache partitioning virtual partitioning. Figure 5 Comparing three hash join algorithms analytically LS LR M atchP erS 1 0 5 the hash. table in simple hash join does not fit into cache hash table access parameters are based on experimental. measurements NhR NhS 1 8 HashT ablelw 1 5 HashT ablew 5 0 For configurations where virtual. partitioning is the best scheme contour lines show the relative benefits of virtual partitioning compared to. the second best scheme, However cache partitioning introduces a large number phase the records are accessed in place They are essen. of writes compared to simple hash join it is writing the tially scattered in the two input relations Therefore we. amount of data equivalent to the size of the entire input re use the formula for unaligned records in Figure 4 to com. lations As writes are bad for PCM we would like to design pute the number of cache misses for accessing the build and. an algorithm that reduces the writes while still achieving probe records Note that the computation of the number of. similar benefits of cache partitioning We propose the fol partitions P in Algorithm 3 guarantees that the total cache. lowing variant of cache partitioning lines accessed per pair of Rp and Sp fit into the CPU cache. New Virtual Partitioning Instead of physically copy capacity C. ing input records into partitions we perform the partitioning Comparisons of the Three Algorithms Figure 5 com. virtually As shown in Algorithm 3 in the partition phase pares the three algorithms analytically using the formulas. for every partition we compute and remember the record in Table 4 We assume R and S have the same record size. IDs that belong to the partition for both R and S 3 Then and it is a primary foreign key join thus M atchP erS 1. in the join phase we can use the record ID lists to join the From left to right Figures 5 a to d show the comparison. records of a pair of partitions in place thus avoiding the results for four metrics a cache accesses Nl b total. large number of writes in cache partitioning wear c energy and d total PCM access latency In each. We optimize the record ID list implementation by storing figure we vary the record size from 1 to 500 bytes and the. the deltas of two subsequent record IDs to further reduce the number of matches per R record M atchP erR from 1 to. writes As the number of cache partitions is often smaller 50 Every point represents a configuration for the hash join. than a thousand we find using two byte integers can encode The color of a point shows the best scheme for the corre. most deltas For rare cases with larger deltas we reserve sponding configuration blue for simple hash join red for. 0xF F F F to indicate that a full record ID is recorded next cache partitioning and green for virtual partitioning For. The costs of the virtual partitioning algorithm is analyzed configurations where virtual partitioning is the best scheme. in Table 4 The costs for the partition phase include scan the contour lines show the relative benefits of virtual parti. ning the two relations as well as generating the record ID tioning compared to the second best scheme. lists The latter writes two bytes per record In the join Figure 5 a focuses on CPU cache performance which is. the main consideration for previous cache friendly hash join. We assume that there is a simple mapping between a record designs We see that as expected simple hash join is the best. ID and the record location in memory For example if fixed. 
length records are stored consecutively in an array then the array. scheme when record size is very large and cache partition. index can be used as the record ID If records always start at 8B ing is the best scheme when record size is small Compared. boundaries then the record ID of a record can be the record to simple hash join virtual partitioning avoids the many. starting address divided by 8 cache misses caused by hash table accesses Compared to. cache partitioning virtual partitioning reduces the number Table 5 Simulation Setup. of cache misses during the partition phase while paying ex. Simulator PTLsim enhanced with PCM support, tra cache misses for accessing scattered records in the join Processor Out of order X86 64 core 3GHz. phase As a result virtual partitioning achieves the smallest Private L1D 32KB 8 way 4 cycle latency. number of cache misses for a large number of configurations CPU private L2 256KB 8 way 11 cycle latency. in the middle between the red and blue points cache shared L3 8MB 16 way 39 cycle latency. Figures 5 b to d show the comparison results for the all caches with 64B lines. three PCM metrics First of all we see that the figures 64 entry DTLB 32 entry write back queue. 4 ranks read latency for a cache line 230 cycles, are significantly different from Figure 5 a This means. PCM write latency per 8B modified word 450 cycles, that introducing PCM main memory can significantly im Erb 2 pJ Ewb 16 pJ. pact the relative benefits of the algorithms Second very. few configurations benefit from cache partitioning because. it incurs a large number of PCM writes in the partition queue in the on chip memory controller which keeps track. phase adversely impacting its PCM performance Third in of dirty cache line evictions and performs the PCM writes. Figure 5 b virtual partitioning achieves the smallest num asynchronously in the background. ber of writes when M atchP erR 18 Virtual partitioning Table 5 describes the simulation parameters The cache. avoids many of the expensive PCM writes in the partition hierarchy is modeled after the recent Intel Nehalem pro. phase of the cache partitioning algorithm Interestingly cessors 14 The PCM latency and energy parameters are. simple hash join achieves the smallest number of writes based on a previous computer architecture study 15 We. when M atchP erR 19 This is because as M atchP erR adjusted the latency in cycles according to the 3GHz pro. increases the number of S records MS increases propor cessor frequency and the DDR3 bus latency The word size. tionally leading to a larger number of PCM writes for virtual of 8 bytes per iteration of write operations is based on 8. partitioning while the number of PCM writes in simple hash. join is not affected The cross over point is 19 here Finally 4 2 B Tree Index. virtual partitioning presents a good balance between cache. We implemented four variants of B trees as described in. line accesses and PCM writes and it excels in energy and. Section 3 2 sorted unsorted unsorted leaf and unsorted. total PCM access latency in most cases, leaf with bitmap Figure 6 compares the four schemes for.
common index operations In every experiment we popu. 4 EXPERIMENTAL EVALUATION late the trees with 50 million entries An entry consists of. We evaluate our proposed B tree and hash join algo an 8 byte integer key and an 8 byte pointer We populate. rithms through cycle accurate simulations in this section the nodes 75 full initially We randomly shuffle the entries. We start by describing the simulator used in the experi in all unsorted nodes so that the nodes represent the sta. ments Then we present the experimental results for B ble situations after updates Note that the total tree size is. trees and hash joins Finally we perform sensitivity analysis over 1GB much larger than the largest CPU cache 8MB. for PCM parameters For the insertion experiments we insert 500 thousand ran. dom new entries into the trees back to back and report. 4 1 Simulation Platform total wear in number of PCM bits modified PCM energy. We extended a cycle accurate out of order X86 64 sim consumption in millijoules and execution time in cycles for. ulator PTLsim 20 with PCM support PTLsim is used the entire operation Similarly we measure the performance. extensively in computer architecture studies and is currently of 500 thousand back to back deletions for the deletion ex. the only publicly available cycle accurate simulator for out periments and 500 thousand back to back searches for the. of order x86 micro architectures The simulator models the search experiments We vary the node size of the trees As. details of a superscalar out of order processor including in suggested by previous studies the best tree node sizes are. struction decoding micro code branch prediction function a few cache lines large 5 12 Since a one line 64B node. units speculation and a three level cache hierarchy PTL can contain only 3 entries which makes the tree very deep. sim has multiple use modes we use PTLsim to simulate we show results for node sizes of 2 4 and 8 cache lines. single 64 bit user space applications in our experiments The sub figures in Figure 6 are arranged as a 3x3 ma. We extended PTLsim in the following ways to model PCM trix Every row corresponds to a node size Every column. First we model data comparison writes for PCM writes corresponds to a performance metric In every sub figure. When writing a cache line to PCM we compare the new there are three groups of bars corresponding to the inser. line with the original line to compute the number of modi tion deletion and search experiments The bars in each. fied bits and the number of modified words The former is group show the performance of the four schemes Note that. used to compute PCM energy consumption while the latter search does not incur any wear We observe the following. impacts PCM write latency Second we model four paral points in Figure 6. lel PCM memory ranks Accesses to different ranks can be First compared to the conventional sorted trees all the. carried out in parallel Third we model the details of cache three unsorted schemes achieve better total wear energy. line write back operations carefully Previously PTLsim as consumption and execution time for insertions and dele. sumes that cache line write backs can be hidden completely tions the two index operations that incur PCM writes The. and does not model the details of this operation Because sorted trees pay the cost of moving the sorted array of en. PCM write latency is significantly longer than its read la tries in a node to accommodate insertions and deletions In. 
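As a small illustration of the data comparison write modeled in the simulator, the following C++ helper compares the old and new contents of a 64B cache line and returns the number of modified 8-byte words and modified bits, the two quantities that drive the write-latency and the energy/wear accounting, respectively. This is a sketch of the bookkeeping only, not the PTLsim extension itself.

#include <bit>
#include <cstdint>

struct LineDiff {
    int modified_words = 0;  // differing 8-byte words: contributes to Nw (write latency)
    int modified_bits  = 0;  // differing bits: contributes to wear and write energy
};

// Emulates the accounting behind a data comparison write: XOR the old and new
// contents of a 64B line and count what would actually have to be programmed.
inline LineDiff diff_line(const uint64_t old_line[8], const uint64_t new_line[8]) {
    LineDiff d;
    for (int w = 0; w < 8; w++) {
        uint64_t changed = old_line[w] ^ new_line[w];
        if (changed != 0) {
            d.modified_words += 1;
            d.modified_bits  += std::popcount(changed);  // C++20 <bit>
        }
    }
    return d;
}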
tency cache line write backs may actually keep the PCM contrast the unsorted schemes all save PCM writes by al. busy for a sufficiently long time to stall front end cache line lowing entries to be unsorted upon insertions and deletions. fetches Therefore we implemented a 32 entry FIFO write This saving increases as the node size increases Therefore. 8E 7 8 3E 9,num bits modified,m bits modified,0E 0 0 0E 0. insert delete search insert delete search insert delete search. a Total wear 2 line nodes b PCM energy 2 line nodes c Execution time 2 line nodes. 1 5E 8 10 3E 9,num bits modified,m bits modified,1 0E 8 2E 9. 5 0E 7 1E 9,0 0E 0 0 0E 0, insert delete search insert delete search insert delete search. d Total wear 4 line nodes e PCM energy 4 line nodes f Execution time 4 line nodes. 3E 8 15 5E 9,num bits modified,m bits modified,0E 0 0 0E 0. insert delete search insert delete search insert delete search. g Total wear 8 line nodes h PCM energy 8 line nodes i Execution time 8 line nodes. sorted unsorted unsorted leaf unsorted leaf bmp, Figure 6 B tree performance 50 million entries in trees 75 full insert inserting 500 thousand.
random keys delete randomly deleting 500 thousand existing keys search searching for 500 thousand. random keys, the performance gaps widen as the node size grows from 2 a factor of 7 7 436X energy consumption by a factor of. cache lines to 8 cache lines 1 7 2 5X and execution time by a factor of 2 0 2 5X for. Second compared to the conventional sorted trees the insertions and deletions while achieving similar search per. scheme with all nodes unsorted suffers from slower search formance If the index workload consists of mainly insertions. time by a factor of 1 13 1 46X because the hot top tree and searches with the tree size growing we recommend the. nodes stay in CPU cache and a search incurs a lot of in normal unsorted leaf If the index workload contains a lot. struction overhead in the unsorted non leaf nodes In con of insertions and a lot of deletions e g the tree size stays. trast the two schemes with only unsorted leaf nodes achieve roughly the same we recommend the unsorted leaf scheme. similar search time as the sorted scheme with bitmap. Third comparing the two unsorted leaf schemes we see. that unsorted leaf with bitmap achieves better total wear 4 3 Hash Joins. energy and time for deletions This is because unsorted. We implemented the three hash join algorithms as dis. leaf with bitmap often only needs to mark one bit in a leaf. cussed in Section 3 3 simple hash join cache partitioning. bitmap for a deletion and the total wear is about 5E5 bits. and virtual partitioning We model in memory join opera. modified while unsorted leaf has to overwrite the deleted. tions where the input relations R and S are in main mem. entry with the last entry in a leaf node and update the. ory The algorithms build in memory hash tables on the R. counter in the node On the other hand the unsorted leaf. relation To hash a R record we compute an integer hash. with bitmap suffers from slightly higher insertion time be. code from its join key field and modulo this hash code by. cause of the instruction overhead of handling the bitmap. the size of the hash table to obtain the hash slot Then we. and the holes in a leaf node, insert hash code pointer to the R record into the hash. Overall we find that the two unsorted leaf schemes achieve. slot Conflicts are resolved through chained hashing To. the best performance Compared to the conventional sorted. probe an S record we compute the hash code from its join. B tree the unsorted leaf schemes improve total wear by. key field and use the hash code to look up the hash ta. 20B 40B 60B 80B 100B 20B 40B 60B 80B 100B,record size record size. a Total wear b PCM energy c Execution time, 2 matches per R record 2 matches per R record 2 matches per R record. 1 2 4 6 8 1 2 4 6 8, Z num matches per R record num matches per R record.
d Total wear 60B records e PCM energy 60B records f Execution time 60B records. simple hash join cache partitioning virtual partitioning. Figure 7 Hash join performance 50MB R table joins S table varying the record size from 20B to 100B. and varying the number of matches per R record from 1 to 8. ble When there is an entry with the matching hash code size and M atchP erR settings in the experiments fall in the. we check the associated R record to make sure that the join region where virtual partitioning wins in Figure 5 There. keys actually match The join results are sent to a high level fore the experimental results confirm our analytical com. operator that consumes the results In our implementation parison in Section 3 3. the high level operator simply increments a counter. Figure 7 compares the three hash join algorithms The 4 4 PCM Parameter Sensitivity Analysis. R relation is 50MB large Both relations have the same In this section we vary the energy and latency parameters. record size We vary the record size from 20B to 100B in of PCM in the simulator and study the impact of the pa. Figures 7 a c We vary the number of matches per R rameter changes on the performance of the B tree and hash. record M atchP erR from 1 to 8 in Figures 7 d f in join algorithms Note that we still assume data comparison. other words the size of S varies from 50MB to 400MB We writes for PCM write. report total wear energy consumption and execution times Figure 8 varies the energy consumed by writing a PCM. for every set of experiments bit Ewb from 2pJ to 64pJ The default value of Ewb is. The results in Figure 7 confirm our analytical comparison 16pJ and 2pJ is the same as the energy consumed by read. in Section 3 3 First cache partitioning performs poorly ing a PCM bit From left to right Figures 8 a c show. in almost all cases because it performs a large number of the impact of varying Ewb on the energy consumptions of. PCM writes in its partition phase This results in much B tree insertions B tree deletions and hash joins First. higher total wear higher energy consumption and longer we see that as Ewb gets smaller the curves become flat the. execution time compared to the other two schemes energy consumption is more and more dominated by the. Second compared to simple hash join when varying record cache line fetches for reads and for data comparison writes. size from 20B to 100B virtual partitioning improves total Second as Ewb gets larger the curves increase upwards be. wear by a factor of 4 7 5 2X energy consumption by a factor cause the larger Ewb contributes significantly to the overall. of 2 3 1 4X and execution time by a factor of 1 24 1 12X energy consumption Third changing Ewb does not quali. When varying M atchP erR from 1 to 8 virtual partitioning tatively change our previous conclusions For B trees the. improves total wear by a factor of 8 6 1 5X energy con two unsorted leaf schemes are still better than sorted B. sumption by a factor of 1 61 1 59X and execution time by trees Among the three hash join algorithms virtual parti. a factor of 1 19 1 11X tioning is still the best, Overall virtual partitioning achieves the best performance Figure 9 varies the latency of writing a word to PCM. among the three schemes in all the experiments Compared Tw from 230 cycles to 690 cycles The default Tw is 450. to cache partitioning virtual partitioning avoids copying cycles and 230 is the same latency as reading a cache line. 
data in the partition phase by remembering record IDs per from PCM From left to right Figures 8 a c show the. partition Compared to simple hash join virtual partition impact of varying Tw on the execution times of B tree in. ing avoids excessive cache misses due to hash table accesses sertions B tree deletions and hash joins We see that as. Therefore virtual partitioning achieves good behaviors for Tw increases the performance gaps among different schemes. both PCM writes and cache accesses Note that the record become larger The performance gap between simple hash. sorted sorted simple hash join, 25 unsorted leaf 20 unsorted leaf 50 cache partitioning. 20 unsorted leaf bmp unsorted leaf bmp 40 virtual partitioning. 2 4 8 16 32 64 2 4 8 16 32 64 2 4 8 16 32 64,Ewb pJ Ewb pJ Ewb pJ. a B tree insertions b B tree deletions c Hash joins 2 matches per R record 60B. Figure 8 Sensitivity analysis varying energy consumed for writing a PCM bit Ewb. sorted sorted simple hash join, 8 unsorted leaf 8 unsorted leaf cache partitioning. cycles x1e9,cycles x1e9,cycles x1e9, unsorted leafbmp unsorted leafbmp 10 virtual partitioning.
230 450 690 230 450 690 230 450 690,Tw cycles Tw cycles Tw cycles. a B tree insertions b B tree deletions c Hash joins 2 matches per R record 60B. Figure 9 Sensitivity analysis varying latency of writing a word to PCM Tw. join and virtual partitioning is 6 when Tw is 230 cycles Rio cache has also been integrated into databases as a per. We find that previous conclusions still hold for B trees and sistent database buffer cache 18 The Conquest file sys. hash joins tem 29 uses BBDRAM to store small files and metadata. eNVy 30 placed flash memory on the memory bus by us. ing a special controller equipped with a BBDRAM buffer. 5 RELATED WORK to hide the block addressable nature of flash WAFL 13. PCM Architecture As discussed in previous sections keeps file system changes in a log in BBDRAM and only. several recent studies from the computer architecture com occasionally flushes them to disk While BBDRAM may. munity have proposed solutions to make PCM a replacement be an alternative to PCM PCM has two main advantages. for or an addition to DRAM main memory These studies over BBDRAM First BBDRAM is vulnerable to correlated. address various issues including improving endurance 15 failures for example the UPS battery will often fail ei. 21 22 32 improving write latency by reducing the number ther before or along with primary power leaving no time. of PCM bits written 8 15 31 32 preventing malicious to copy data out of DRAM Second PCM is expected to. wear outs 26 and supporting error corrections 25 How scale much better that DRAM making it a better long term. ever these studies focus on hardware design issues that are option for persistent storage 3 On the other hand using. orthogonal to our focus on designing efficient algorithms for PCM requires dealing with expensive writes and limited en. software running on PCM durance a challenge not present with BBDRAM Therefore. PCM Based File Systems BPFS 9 a file system de BBDRAM based algorithms do not require addressing the. signed for byte addressable persistent memory exploits both challenges studied in this paper. the byte addressability and non volatility of PCM In addi Main Memory Database Systems and Cache Friendly. tion to being significantly faster than disk based file sys Algorithms Main memory database systems 11 maintain. tems even when they are run on DRAM BPFS provides necessary data structures in DRAM and hence can exploit. strong safety and consistency guarantees by using a new DRAM s byte addressable property As discussed in Sec. technique called short circuit shadow paging Unlike tradi tion 3 1 the traditional design goals of main memory algo. tional shadow paging file systems BPFS uses copy on write rithms are low computation complexity and good CPU cache. at fine granularity to atomically commit small changes at performance Like BBDRAM based systems main mem. any level of the file system tree This avoids updates to the ory database systems do not need to address PCM specific. file system triggering a cascade of copy on write operations challenges In this paper we found that for PCM friendly. from the modified location up to the root of the file system algorithms one important design goal is to minimize PCM. tree BPFS is a file system and hence it does not consider writes Compared to previous cache friendly B trees and. the database algorithms we consider Moreover BPFS has hash joins our new algorithms achieve significantly better. 
been designed for the general class of byte addressable per performance in terms of PCM total wear energy consump. sistent memory and it does not consider PCM specific issues tion and execution time. such as read write asymmetry or limited endurance, Battery Backed DRAM Battery backed DRAM BB 6 CONCLUSION. DRAM has been studied as a byte addressable persistent A promising non volatile memory technology PCM is ex. memory The Rio file cache 16 uses BBDRAM as the buffer pected to play an important role in the memory hierarchy. cache eliminating any need to flush dirty data to disk The in the near future This paper focuses on exploiting PCM. as main memory for database systems Based on the unique the performance of cache conscious B trees In. characteristics of PCM as opposed to DRAM and NAND SIGMETRICS 2003. flash we identified the importance of reducing PCM writes 13 D Hitz J Lau and M Malcolm File system design. for optimizing PCM endurance energy and performance for an NFS file server appliance In USENIX Winter. Specifically we applied this observation to database algo Technical Conference 1994. rithm design and proposed new B tree and hash join de 14 Intel Corp First the tick now the tock Intel. signs that significantly improve the state of the art micro architecture Nehalem http www intel com. As future work in the PCM DB project we are interested technology architecture silicon next gen 319724 pdf. in optimizing PCM writes for different aspects of database 15 B C Lee E Ipek O Mutlu and D Burger. system designs including important data structures query Architecting phase change memory as a scalable. processing algorithms and transaction logging and recovery DRAM alternative In ISCA 2009. The latter is important for achieving transaction atomicity 16 D E Lowell and P M Chen Free transactions with. and durability BPFS proposed a different solution based Rio Vista Operating Systems Review 31 1997. on shadow copying and atomic writes 9 It is interesting 17 S Nath and P B Gibbons Online maintenance of. to compare this proposal with conventional database trans very large random samples on flash storage The. action logging given the goal of reducing PCM writes VLDB Journal 19 1 2010. Moreover another interesting aspect to study is the fine 18 W T Ng and P M Chen Integrating reliable. grain non volatility of PCM Challenges may arise in hierar memory in databases The VLDB Journal 7 3 1998. chies where DRAM is explicitly controlled by software Be. 19 PCM DB http www pittsburgh intel, cause DRAM contents are lost upon restart the relationship. research net projects hi spade pcm db, between DRAM and PCM must be managed carefully for. example pointers to DRAM objects should not be stored in 20 PTLsim http www ptlsim org. PCM On the other hand the fine grain non volatility may 21 M K Qureshi J P Karidis M Franceschini. enable new features such as instant reboot that resumes V Srinivasan L Lastras and B Abali Enhancing. the execution states of long running queries upon crash re lifetime and security of PCM based main memory. covery so that useful work is not lost with start gap wear leveling In MICRO 2009. 22 M K Qureshi V Srinivasan and J A Rivers, Scalable high performance main memory system using.
7 REFERENCES phase change memory technology In ISCA 2009. 1 P A Boncz S Manegold and M L Kersten,23 J Rao and K A Ross Making B trees cache. Database architecture optimized for the new,conscious in main memory In SIGMOD 2000. bottleneck Memory access In VLDB 1999, 24 Samsung Samsung ships industry s first multi chip. 2 L Bouganim B Jo nsson and P Bonnet uFLIP,package with a PRAM chip for handsets http. Understanding flash IO patterns In CIDR 2009,www samsung com us business semiconductor.
3 G W Burr M J Breitwisch M Franceschini newsView do news id 1149 April 2010. D Garetto K Gopalakrishnan B Jackson B Kurdi,25 S E Schechter G H Loh K Straus and D Burger. C Lam L A Lastras A Padilla B Rajendran,Use ECP not ECC for hard failures in resistive. S Raoux and R S Shenoy Phase change memory,memories In ISCA 2010. technology J Vacuum Science 28 2 2010,26 N H Seong D H Woo and H H S Lee Security. 4 S Chen A Ailamaki P B Gibbons and T C,refresh Prevent malicious wear out and increase.
Mowry Improving hash join performance through, durability for phase change memory with dynamically. prefetching In ICDE 2004,randomized address mapping In ISCA 2010. 5 S Chen P B Gibbons and T C Mowry Improving,27 A Shatdal C Kant and J F Naughton Cache. index performance through prefetching In SIGMOD, conscious algorithms for relational query processing. In VLDB 1994,6 S Chen P B Gibbons T C Mowry and,28 H W Tseng H L Li and C L Yang An.
G Valentin Fractal prefetching B trees Optimizing,energy efficient virtual memory system with flash. both cache and disk performance In SIGMOD 2002,memory as the secondary storage In Int l Symp on. 7 S Cho Personal communication 2010 Low Power Electronics and Design ISPLED 2006. 8 S Cho and H Lee Flip N Write A simple 29 A I Wang P L Reiher G J Popek and G H. deterministic technique to improve PRAM write Kuenning Conquest Better performance through a. performance energy and endurance In MICRO 2009 disk persistent RAM hybrid file system In USENIX. 9 J Condit E B Nightingale C Frost E Ipek B C Annual Technical Conference 2002. Lee D Burger and D Coetzee Better I O through 30 M Wu and W Zwaenepoel eNVy a non volatile. byte addressable persistent memory In SOSP 2009 main memory storage system In ASPLOS 1994. 10 E Doller Phase change memory and its impacts on 31 B D Yang J E Lee J S Kim J Cho S Y Lee. memory hierarchy http www pdl cmu edu SDI and B G Yu A low power phase change random. 2009 slides Numonyx pdf 2009 access memory using a data comparison write scheme. 11 H Garcia Molina and K Salem Main memory In IEEE ISCAS 2007. database systems An overview IEEE TKDE 4 6 32 P Zhou B Zhao J Yang and Y Zhang A durable. 1992 and energy efficient main memory using phase change. 12 R A Hankins and J M Patel Effect of node size on memory technology In ISCA 2009.

