Datenbank Spektrum DOI 10.1007/s13222-017-0262-9
SCHWERPUNKTBEITRAG
Energy Efficiency in Main-Memory Databases

Stefan Noll¹ · Henning Funke¹ · Jens Teubner¹
Received: 30 May 2017 / Accepted: 11 July 2017 © Springer-Verlag GmbH Deutschland 2017
Abstract As the operating costs of today's data centres continue to increase and processor manufacturers are forced to meet thermal design power constraints when designing new hardware, the energy efficiency of a main-memory database management system becomes more and more important. Moreover, many database workloads are more memory-intensive than compute-intensive, which leaves computing power unused and wasted. This can become a problem because wasting computing power also means wasting electrical power. In this paper, we experimentally study the impact of reducing the clock frequency of the processor and the impact of using fewer processor cores on the energy efficiency of common database algorithms such as scans, simple aggregations, simple hash joins, and state-of-the-art join algorithms. We stress the fundamental trade-off between peak performance and energy efficiency, as opposed to the established race-to-idle strategy. Ultimately, we show that reducing unused computing power significantly improves the energy efficiency of memory-bound database algorithms.
Stefan Noll
[email protected]

Henning Funke
[email protected]

Jens Teubner
[email protected]

¹ Databases and Information Systems Group, TU Dortmund University, Otto-Hahn-Straße 14, 44227 Dortmund, Germany
1 Introduction

While a lot of current research in the domain of main-memory database systems (MMDBs) focuses on improving performance, only a small part of the research community focuses on improving energy efficiency. However, the energy consumption of a main-memory database system makes for an interesting research objective for several reasons. First, reducing the energy consumption also means reducing the operating costs, because the system needs less power and cooling. As electricity costs increase due to the growing demand for computing power, it is expected that the trend towards increased operating and cooling costs will intensify in the near future [8]. In addition, energy efficiency is a major aspect of chip design. In fact, semiconductor engineers have to meet fixed thermal design power (TDP) constraints. This means, for example, that not all cores of a processor can be fully powered at the same time without damaging parts of the integrated circuit. As a result, underutilised areas of the chip must be kept powered off (dark) or have to be operated at a low voltage and clock frequency, which is referred to as the Dark Silicon problem [4].

The aim of this work is to experimentally study as well as to understand the relationship between energy consumption and software in order to improve energy efficiency. We show that most algorithms used in an MMDB differ from algorithms used in other software. In fact, most database algorithms are memory-bound instead of compute-bound. As a result, database algorithms often have to wait for memory. The resulting stall cycles lead to energy inefficiency. We hypothesize that energy efficiency can be improved by artificially slowing down the CPU, thereby bringing compute and memory demand into balance and avoiding inefficient stalls. Therefore, we propose to reduce the unused computing power of a processor, for example by lowering the clock frequency and voltage, which decreases the power consumption of the processor.

We conduct a series of experiments with a compute-bound microbenchmark, with different memory-bound microbenchmarks simulating database algorithms, as well as with two state-of-the-art join algorithms. We analyse the impact of a limited clock frequency on the energy efficiency of the different workloads. In particular, we stress the fundamental trade-off between peak performance and energy efficiency, as opposed to the established race-to-idle strategy, which is the state-of-the-art mechanism to maximise energy efficiency. Ultimately, we show that reducing unused computing power significantly improves the energy efficiency of memory-bound database algorithms.

In summary, we make the following contributions:

● We develop microbenchmarks that simulate a compute-bound algorithm and different memory-bound algorithms.
● We experimentally study the impact of a limited clock frequency on the energy efficiency of the microbenchmarks and two state-of-the-art join algorithms.
● We experimentally study the impact of using fewer processor cores on the energy efficiency of the microbenchmarks.
● We use the results of the experiments to propose concepts for energy-aware software intended to improve the energy efficiency of an MMDB.
2 Basics

In this section, we highlight characteristics of database workloads and introduce power management features of modern processors. These insights guide the experimental design in the following section. In addition, they allow us to reason about the observed effects in the evaluation of the experiments.

2.1 Characteristics of Database Workloads
Most database operations of an MMDB such as scans, aggregations or joins heavily depend on the memory bandwidth of the computer system. Thus, database algorithms are usually more memory-intensive than compute-intensive. In fact, most database algorithms are typically memory-bound, or more specifically bandwidth-bound or latency-bound, because they perform lots of (random) memory accesses but only a few computations.

If an algorithm is memory-bound, the CPU stalls: it has to wait for data from main memory until it can proceed with the execution of the algorithm. As a consequence, a lot of clock cycles are wasted. However, wasting computing power also means wasting electric power. Consequently, we argue that a high amount of computing power is not needed if an algorithm is memory-bound. To improve the energy efficiency of a database algorithm we propose to carefully reduce unused computing power.

2.2 Power Management Features
The Advanced Configuration and Power Interface (ACPI) [13] defines several power management features of a modern computer system. This includes power and performance states a computer system can enter to manage power consumption. In fact, the standard specifies two groups of states for a processor: the processor power states (C-States) and the processor performance states (P-States).

C-States describe the capability of an idle processor to save power by powering down unused parts of the processor. When individual processor cores or entire packages are idle, the operating system can put the hardware in a low-power state. Therefore, C-States are used to implement the race-to-idle strategy. Race-to-idle means executing a task as fast as possible and subsequently putting the processor in a sleep mode to save power.

P-States define the capability of an active processor to save power by lowering the clock frequency and voltage. A more common term used to describe this concept is dynamic voltage and frequency scaling (DVFS). In fact, a P-State can be interpreted as a pair consisting of a clock frequency and a voltage. Higher (numerical) P-States result in slower processing speeds but also in lower power consumption. The lowest P-State refers to the maximum clock frequency using turbo mode (e.g. "Turbo Boost" by Intel).
3 Experimental Design

The key contribution of this work is a thorough evaluation of the performance/energy characteristics of different (database) algorithms. Before we delve into the evaluation and its results, we first detail the design of our experiments, including the benchmarks that we prepared. First, we introduce the database algorithms used in the experiments: one compute-bound microbenchmark, four memory-bound microbenchmarks and two state-of-the-art join algorithms. Afterwards, we introduce the hardware platform used to execute the experiments. Following this, we describe the measurement metrics and the experiments.

3.1 Benchmarks
The microbenchmark compute is designed to simulate a compute-bound algorithm. It induces a heavy load situation on all processor cores but avoids any accesses to main memory. The benchmark creates several worker threads that spin on calculating the square root of a random number and on performing a multiplication.

The other four microbenchmarks scan, local_scan, dira and local_dira are designed to simulate memory-bound, read-intensive database algorithms. The benchmarks scan and local_scan feature a sequential memory access pattern and resemble a scan or a simple aggregation. The microbenchmarks create worker threads that each read a different section of a shared data array containing randomly generated integers (500 MB per thread). Each thread adds up all the integers of its section and prints the sum. In addition, we let each thread execute its work ten times to prolong the execution time in order to reduce variations and confounding factors.

The microbenchmarks dira and local_dira feature a data-independent random memory access (DIRA) [2] pattern. They are designed to simulate a simple hash join. The memory access pattern is called data-independent because the processor can perform a random access without knowing the result of the previous random access. Each thread reads a different section of a shared data array containing randomly generated integers (500 MB per thread). These integers are used as indices to randomly access a second shared data array (8 GB). Thus, the microbenchmark simulates accesses to a look-up or hash table.

The microbenchmarks process integers with a size of 64 bits. If not stated otherwise, they are executed with 32 worker threads. Moreover, we use the program library libnuma to allocate the input data on specific NUMA nodes. In case of the microbenchmarks scan and dira, the data is allocated on one NUMA node, resulting in both local and remote memory accesses. In case of the microbenchmarks local_scan and local_dira, the data is equally distributed across both NUMA nodes, resulting in local memory accesses only.
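The actual benchmarks are multi-threaded native code; as a single-threaded, scaled-down sketch (array sizes shrunk drastically, function names ours), the sequential pattern of scan and the data-independent random pattern of dira look as follows:

```python
import random

def scan(section: list[int]) -> int:
    """Sequential pattern (scan/local_scan): stream over a section and add up all integers."""
    total = 0
    for value in section:  # sequential, prefetch-friendly; bandwidth-bound at scale
        total += value
    return total

def dira(section: list[int], lookup: list[int]) -> int:
    """Data-independent random access (dira/local_dira): each input integer is used
    as an index into a lookup table, simulating probes into a hash table. Each
    access is independent of the result of the previous one."""
    n = len(lookup)
    return sum(lookup[key % n] for key in section)

# tiny stand-ins for the 500 MB per-thread sections and the 8 GB lookup array
section = [random.randrange(1 << 16) for _ in range(10_000)]
lookup = list(range(1_000))
```

At realistic sizes, the first function saturates memory bandwidth while the second is dominated by memory latency, which is exactly the bandwidth-bound/latency-bound distinction drawn in Sect. 2.1.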
In addition, we implement two benchmarks featuring two state-of-the-art join algorithms by using the program code of Balkesen et al. [1]. We use the implementation of the optimised parallel radix hash join and the implementation of the NUMA-aware sort-merge join with either the multi-way or the multi-pass merging strategy. We slightly modify the source code to include the measurement metrics described in Section 3.3. We execute the algorithms using 16 threads and join two relations with a size of 4 GB each.

3.2 Hardware Platform
The hardware platform is configured as a dual-socket machine. It has two Intel Xeon E5-2690 processors featuring 8 processor cores per socket, which are clocked at a base frequency of 2.9 GHz. Using turbo mode, the processor can theoretically achieve a maximum clock frequency of up to 3.8 GHz. Furthermore, the system has 32 GB of DDR3 main memory per socket. The maximum memory bandwidth amounts to 51.2 GB/s for one socket or to 102.4 GB/s for both sockets.

3.3 Measurement Metrics
First, we determine the execution time of a benchmark to evaluate its performance. The execution time refers to the time each benchmark needs to process all of the input data. It does not include the time it takes to generate the data.

Second, we measure the total energy consumption of both processors by using the energy estimations provided by the Running Average Power Limit (RAPL) interface [6]. The energy estimations refer to the energy consumption of the entire processor die including the core and uncore parts of the processor but excluding the attached DRAM.

Third, we compute the average power consumption of both processors by dividing the energy consumption by the execution time. In addition, we use an external power meter¹, which we plug between the power supply socket and the power supply unit to measure the power consumption of the entire computer system.

Fourth, we measure the memory read bandwidth by accessing hardware performance counters [14] which provide information about the number of bytes that are read from the memory controllers of both sockets. Note that this includes additional memory traffic caused by cache misses if a new cache line has to be fetched but not all the data of the cache line is used by the program.

We do not use a specific metric to evaluate the energy efficiency. Instead, we identify whether the amount of energy consumed by the processor is associated with a proportionate amount of work done – referred to as energy-proportional computing by Barroso et al. [3]. In fact, we argue that we improve the energy efficiency of a benchmark if the percentage decrease of the energy consumption is more significant than the percentage decrease of the performance. The point of comparison is the measurement data obtained from the execution without limiting the clock frequency, which is the default setting of the operating system. In theory, the processor can thereby reach a clock frequency of up to 3.8 GHz using turbo mode.

3.4 Experiments
Due to their high memory demand, database algorithms often have to wait for memory. The resulting stall cycles lead to energy inefficiency because of unused computing power. We hypothesize that energy efficiency can be improved by artificially slowing down the CPU, thereby bringing compute and memory demand into balance and avoiding inefficient stalls. We conduct experiments to study the impact of reduced computing power by lowering the clock frequency of the processor and by using fewer processor cores.

In the first series of experiments, we limit the maximum clock frequency that the Intel P-State driver of the Linux kernel (intel_pstate) is able to select. The driver can still select a clock frequency below the limit but not above it. We execute every benchmark at different clock frequency limits from 1.2 GHz to 3.8 GHz. In the second series of experiments, we limit the number of worker threads and control which thread is executed on which processor core. Thus, we limit the number of physical processor cores used by the microbenchmarks. During the execution of the benchmarks, we measure the metrics described in Section 3.3.

¹ ELV Energy Master Basic 2 energy meter

Fig. 1 Comparison of the energy estimations provided by the RAPL interface with the data obtained from an external power meter. The dashed line marks the start of the turbo mode

Fig. 2 Percentage decrease of the energy consumption and the runtime performance of the benchmark compute at different clock frequency limits
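The improvement criterion from Section 3.3 can be made concrete as a small helper (function names are ours): energy efficiency improves if the percentage decrease of the energy consumption outweighs the percentage decrease of the performance, where performance is the reciprocal of the execution time.

```python
def perf_loss_pct(time_base: float, time_limited: float) -> float:
    """Performance decrease in percent (performance ~ 1 / execution time)."""
    return 100.0 * (1.0 - time_base / time_limited)

def energy_saving_pct(energy_base: float, energy_limited: float) -> float:
    """Decrease of the energy consumption in percent."""
    return 100.0 * (1.0 - energy_limited / energy_base)

def efficiency_improved(time_base: float, time_limited: float,
                        energy_base: float, energy_limited: float) -> bool:
    """Efficiency improves if the energy saving outweighs the performance loss."""
    return (energy_saving_pct(energy_base, energy_limited)
            > perf_loss_pct(time_base, time_limited))

# Table 1, benchmark scan: 3.8 GHz baseline vs. 1.2 GHz limit -> improved
assert efficiency_improved(5.4, 6.6, 973, 455)
# Table 1, benchmark compute: large slowdown, modest saving -> not improved
assert not efficiency_improved(5.9, 16.1, 1202, 956)
```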
4 Evaluation

In this section, we present the measurement results of the experiments. First, we study the impact of reducing the clock frequency on the energy efficiency of the benchmarks. Then, we analyse the impact of using fewer processor cores on the energy efficiency of the microbenchmarks.

4.1 Impact of Reducing the Clock Frequency

Before we start evaluating the energy efficiency of the benchmarks, we first take a look at the accuracy of the energy estimations provided by the RAPL interface. Using the example of the microbenchmark compute, we compare the power estimations of the processor with the measurement data obtained from an external power meter while we vary the clock frequency of the processor. Fig. 1 illustrates the experimental results.

We observe that the two curves differ by approximately 65 W. Only in the frequency range of the turbo mode do the two curves differ by up to 85 W. Note that the external power meter measures the power consumption of the entire computer system. This includes the power consumption of components such as the motherboard, the disks, the main memory, the cooling system and the power supply unit. Thus, the difference between the two measurements can be explained by the power consumption of these components. In addition, the largely constant difference indicates that the energy estimations are reliable enough to evaluate the energy efficiency of the benchmarks. We verify that reducing the power consumption of both processors significantly improves the energy balance of the entire computer system. Furthermore, we notice that both curves flatten out at a clock frequency limit of 3.4 GHz: the processor is not able to achieve a higher clock frequency using all cores due to thermal constraints.

Table 1 Execution time, energy consumption and memory read bandwidth of every benchmark at the lowest and highest clock frequency limit (1.2 GHz / 3.8 GHz)

Benchmark                      Execution Time (s)   Energy Consumption (J)   Memory Read Bandwidth (GB/s)
compute                        16.1 / 5.9           956 / 1202               0.0 / 0.0
scan                           6.6 / 5.4            455 / 973                24.3 / 30.2
local_scan                     3.5 / 1.7            268 / 452                45.0 / 90.4
dira                           5.5 / 4.5            387 / 837                25.5 / 31.2
local_dira                     3.3 / 1.9            248 / 465                42.0 / 71.5
Parallel Radix Hash Join       4.3 / 2.5            295 / 448                15.4 / 24.8
Sort-Merge Join (multi-way)    8.5 / 3.7            611 / 842                9.9 / 21.9
Sort-Merge Join (multi-pass)   10.0 / 5.6           744 / 1233               18.5 / 32.3

Next, we evaluate the impact of a limited clock frequency on the energy efficiency of the benchmarks. The experimental results of the microbenchmark compute depicted in Fig. 2 reveal that the performance of the benchmark heavily depends on the clock frequency. We verify that the benchmark is purely compute-intensive because the measured memory read bandwidth is approximately 0 GB/s (Table 1). Limiting the clock frequency to, e.g., 1.9 GHz decreases the performance by more than 42%. At the same time, the energy consumption only decreases by 26%. Thus, we observe that limiting the clock frequency does not improve the energy efficiency. Instead, it seems best to clock the processor at the highest frequency, finish the task as fast as possible and then put the processor in a sleep mode to save power (race-to-idle). This behaviour is consistent with folklore: ever-decreasing gate lengths have resulted in significant leakage currents in modern chip designs. Such currents imply energy wastage that does not benefit computation. Race-to-idle minimizes the overall execution time of the program, thereby minimizing the power loss through leakage.

Fig. 3 Percentage decrease of the energy consumption and the runtime performance of the benchmarks scan (a) and dira (b) at different clock frequency limits

Fig. 3a and Fig. 3b illustrate the experimental results of the microbenchmarks scan and dira. In contrast to the results of the microbenchmark compute, we notice that limiting the clock frequency to, e.g., 1.9 GHz only results in performance degradations of up to 4%. However, the energy consumption is reduced by 47%. Thus, limiting the clock frequency improves the energy efficiency of both benchmarks. Moreover, we notice that the microbenchmarks scan and dira achieve a memory read bandwidth of only up to 30 GB/s. This value lies below the hardware limit of 51.2 GB/s per socket. In fact, we determine that both benchmarks are limited by the bandwidth of the Quick Path Interconnect (QPI) links which connect the two sockets. That is because all threads executed on the second socket perform remote memory accesses to the input data, which is located on the first socket. As a result, the execution slows down considerably.

Fig. 4 Percentage decrease of the energy consumption and the runtime performance of the benchmarks local_scan (a) and local_dira (b) at different clock frequency limits

Fig. 4a and Fig. 4b illustrate the results of the microbenchmarks local_scan and local_dira. We observe that limiting the clock frequency to, e.g., 1.9 GHz saves more than 40% of the energy. At the same time, we lose 20% of the performance. Thus, limiting the clock frequency improves the energy efficiency of both benchmarks. Furthermore, we learn that the microbenchmark local_scan reaches a memory read bandwidth of up to 90 GB/s (Table 1), which is very close to the hardware limit of 102.4 GB/s for both sockets. In fact, the benchmark is bandwidth-bound. The microbenchmark local_dira, on the other hand, only reaches a memory read bandwidth of up to 72 GB/s (Table 1). We conclude that the benchmark is latency-bound, i.e. limited by the time it takes to transfer data from main memory to the processor.

Fig. 5 Percentage decrease of the energy consumption and the runtime performance of the parallel radix hash join at different clock frequency limits
Fig. 6 Percentage decrease of the energy consumption and the runtime performance of the sort-merge join using either the multi-way merging strategy (a) or the multi-pass merging strategy (b) while limiting the clock frequency

The experimental results support our hypothesis. We observe that we can improve the energy efficiency by artificially slowing down the CPU, thereby bringing compute and memory demand into balance and avoiding inefficient stalls.

Fig. 5 depicts the measurement results of the parallel radix hash join. We learn that limiting the clock frequency to, e.g., 1.9 GHz degrades the performance by 20%. At the same time, the energy consumption is reduced by 31%. Thus, limiting the clock frequency improves the energy efficiency of the parallel radix hash join, too.

Fig. 6a and Fig. 6b illustrate the results of the sort-merge join algorithm using the multi-way and the multi-pass merging strategy. We observe that in case of multi-way merging, limiting the clock frequency to, e.g., 1.9 GHz deteriorates the performance by 34% while the energy consumption is reduced by only 28%. In case of multi-pass merging, limiting the clock frequency to 1.9 GHz degrades the performance by 21%. At the same time, we save 36% of the energy. Thus, limiting the clock frequency significantly improves the energy efficiency of the sort-merge join algorithm if multi-pass merging is used.

Fig. 7 Breakdown of the execution times of the different phases of the sort-merge join algorithm using either multi-way merging or multi-pass merging [1]

In order to explain the results, we have to visualise the difference between the two algorithms. Fig. 7 illustrates the execution times of every phase of the sort-merge join algorithm using either multi-way or multi-pass merging. We learn that the partitioning phase, the sorting phase and the joining phase all take approximately the same amount of time. However, we observe that the merging phase lasts significantly longer if the algorithm uses the multi-pass merging technique. Considering that the in-cache sorting is compute-intensive while merging is memory-intensive, it is plausible that the energy efficiency of the sort-merge join algorithm using multi-way merging does not improve: the algorithm is mainly compute-intensive and thus heavily depends on the clock frequency of the processor. The sort-merge join using multi-pass merging, on the other hand, is mainly memory-intensive. Therefore, we can reduce the unused computing power of the processor to improve energy efficiency.

4.2 Impact of Using Fewer Processor Cores

Fig. 8 Percentage decrease of the energy consumption and the runtime performance of the benchmarks scan (a) and dira (b) executed with a varying number of physical processor cores. The dashed line marks the separation of the cores between the two sockets. The left side represents the first and the right side the second socket
In this section we evaluate the impact of using fewer processor cores on the energy efficiency of the microbenchmarks.

Table 2 Execution time, energy consumption and memory read bandwidth of the microbenchmarks executed on a reduced number of processor cores and on all 16 cores (reduced / 16)

Microbenchmark   Execution Time (s)   Energy Consumption (J)   Memory Read Bandwidth (GB/s)   # Active Processor Cores
scan             3.8 / 5.0            463 / 902                42.2 / 32.2                    4 / 16
dira             4.0 / 4.3            580 / 780                35.1 / 32.5                    8 / 16
local_scan       1.8 / 1.7            348 / 452                85.9 / 90.0                    8 / 16
local_dira       2.2 / 1.9            440 / 469                63.7 / 70.4                    12 / 16

Fig. 8a illustrates the experimental results of the microbenchmark scan. We observe that using 4 cores is enough to get close to the best performance, reaching a memory read bandwidth of 42 GB/s (Table 2); in fact, the performance even improves by 31% compared to using all 16 physical cores. At the same time, we are able to reduce the energy consumption by 49%. Thus, we are able to improve the energy efficiency of the microbenchmark scan by using fewer processor cores. In addition, we learn that using more than 12 cores reduces the performance because an increasing number of memory accesses have to be performed remotely using the QPI links.

The experimental results of the microbenchmark dira are depicted in Fig. 8b. The data shows that the benchmark needs all cores of the first socket to get close to the best performance, reaching a memory read bandwidth of 35 GB/s. We notice that by utilising only the first socket, we can improve the performance by 6% while we save 26% of the energy and thus improve the energy efficiency. Furthermore, we observe that using more than 12 cores results in performance degradations – similar to the observations made for the microbenchmark scan. Again, we explain the slowdown by the increasing number of remote memory accesses causing more and more data transfers over the comparatively slow interconnect between the two sockets.

Fig. 9 Percentage decrease of the energy consumption and the runtime performance of the benchmarks local_scan (a) and local_dira (b) executed with a varying number of physical processor cores. The number of physical cores is incremented by one for each socket

Fig. 9a illustrates the experimental results of the NUMA-local microbenchmark local_scan. The results reveal that using 8 physical cores – 4 cores per socket – suffices to get good performance. We notice that the benchmark reaches a memory read bandwidth of 86 GB/s. The performance degrades by 8% compared to using all cores of both sockets. At the same time, we can reduce the energy consumption by 23%. We learn that using fewer processor cores results in improved energy efficiency.

Fig. 9b depicts the experimental results of the microbenchmark local_dira. We observe that limiting the number of physical cores used to execute the benchmark reduces the performance significantly. At the same time, the energy consumption increases. Thus, we notice that we are unable to improve the energy efficiency of the microbenchmark local_dira by limiting the number of physical cores. In contrast, we were able to improve the energy efficiency of the other microbenchmarks by using fewer physical cores.

5 Discussion

In this section, we discuss the measurement results of the evaluation. First, we review the impact of reducing the clock frequency of the processor on energy efficiency. Second, we review the impact of using fewer processor cores on energy efficiency. Finally, we discuss how the results may help to write energy-efficient software.

5.1 Impact of Reducing the Clock Frequency
The evaluation of the first series of experiments reveals that the energy efficiency of the microbenchmark compute does not improve if we limit the clock frequency. We conclude that the most energy-efficient execution strategy for a compute-bound algorithm is the race-to-idle strategy.

In contrast, we learn that limiting the clock frequency improves the energy efficiency of the memory-bound microbenchmarks. If the input data is allocated on only one NUMA node, reducing the clock frequency hardly degrades the performance. If the input data is equally distributed across all NUMA nodes, limiting the clock frequency has a bigger impact on the performance. However, the energy savings still prevail. Moreover, we observe that the impact of a limited clock frequency does not differ between a sequential memory access pattern (bandwidth-bound) and a random memory access pattern (latency-bound).

In addition, we learn that reducing the clock frequency improves the energy efficiency of the parallel radix hash join algorithm, too. In case of the sort-merge join algorithm, we observe that the impact of a limited clock frequency depends on the merging strategy. If we use multi-way merging, the algorithm is mostly compute-intensive, and reducing the clock frequency does not improve the energy efficiency. If we use multi-pass merging, the algorithm is mostly memory-intensive, and limiting the clock frequency improves the energy efficiency.

Therefore, we conclude that we can improve the energy efficiency of a database algorithm by reducing the computing power if the algorithm is more memory-intensive than compute-intensive. We showed that we can, e.g., limit the clock frequency to improve the energy efficiency of database algorithms in the following way: if we execute a compute-bound algorithm, we should increase the clock frequency to the maximum (race-to-idle); if we execute a memory-bound algorithm, we should decrease the clock frequency.

5.2 Impact of Using Fewer Processor Cores
The evaluation of the second series of experiments demonstrates that using all available physical cores of both sockets deteriorates both the performance and the energy consumption for the microbenchmarks scan, dira and local_scan; only the microbenchmark local_dira benefits from using all cores.

For the microbenchmarks scan and local_scan, which feature sequential memory accesses, it suffices to use only half of the physical cores of the socket(s) where the input data is located. Using four cores of only the first socket for the benchmark scan, or using four cores of both sockets for the benchmark local_scan, improves the performance as well as saves energy compared to using all cores.

The microbenchmark dira, which features a random memory access pattern, needs all cores of the first socket for the best performance. Using only the first socket causes the benchmark to consume the lowest amount of energy compared to using a different number of physical cores. In case of the microbenchmark local_dira, we learn that the benchmark finishes fastest if executed on all cores of both sockets. At the same time, the microbenchmark also consumes the least amount of energy if all cores of both sockets are used. Thus, we observe a clear difference between the two microbenchmarks featuring random memory accesses and the two microbenchmarks featuring sequential memory accesses.

5.3 Towards Energy-Efficient Software
The results demonstrate that we can reduce unused computing power to improve the energy efficiency of database algorithms. This allows us to develop ideas for energy-efficient software used in an MMDB.

First, the results can influence the design of database algorithms. If an algorithm consists of a compute-bound phase and a memory-bound phase, we propose to balance both phases in order to avoid energy waste. In case of, e.g., the sort-merge join algorithm, we could overlap the compute-bound sorting phase with the memory-bound merging phase. A possible implementation would depend on the merging strategy of the algorithm. If, for example, a simple merging strategy is used, two sorted partitions are enough to start merging and to overlap both phases.

Second, we propose to extend the optimiser of an MMDB to avoid energy waste. When the optimiser decides upon the physical database operations of an execution plan, it should balance compute-bound operations and memory-bound operations. Thus, the optimiser needs specific cost models. These can be built based on the ratio between the compute intensity and the memory intensity of a database operation.

Third, we propose to extend the execution engine of an MMDB by a live monitoring component. The engine could use hardware performance counters to periodically calculate the ratio between compute intensity and memory intensity of database operations at runtime. This can be done on multiple levels: for the entire system, individual sockets or individual cores. The collected statistics can be used to dynamically change settings such as the clock frequency, thread-to-core mappings or data placement at runtime.

Fourth, we argue that our results can influence the design of new hardware. We propose to develop an over-designed processor with special hardware features for various use cases such as different memory-intensive or compute-intensive tasks. It does not matter if it is impossible to operate every part of the processor at full power without damaging the chip. Instead, we propose to use energy-aware software which is able to control the power dissipation of different parts of the processor to avoid overheating. At the same time, the software should be able to dynamically control the hardware to optimise both performance and energy efficiency depending on the current task.
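The first idea, overlapping the compute-bound sorting phase with the memory-bound merging phase, can be sketched as follows. This is a deliberately simplified, single-threaded illustration (function names are ours, not from the join implementation of [1]): as soon as a new sorted run is produced, it is merged into the running result, so a threaded variant could merge earlier runs while later partitions are still being sorted.

```python
import heapq
from typing import Iterator

def sorted_runs(partitions: list[list[int]]) -> Iterator[list[int]]:
    """Compute-bound phase: sort one partition at a time (in-cache sorting)."""
    for part in partitions:
        yield sorted(part)

def overlapped_sort_merge(partitions: list[list[int]]) -> list[int]:
    """Memory-bound phase: merge each sorted run into the result as soon as it
    is available, instead of waiting for all partitions to be sorted first."""
    merged: list[int] = []
    for run in sorted_runs(partitions):
        merged = list(heapq.merge(merged, run))  # simple two-way (multi-pass-style) merge
    return merged
```

In a real implementation, the sorting and merging steps would run in separate threads (or on separate cores, possibly at different clock frequencies), so that the compute-heavy and memory-heavy work proceed concurrently rather than in strict alternation.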
6 Related Work

In this section we present related work from the research community specialising in energy efficiency for DBMSs.

Poess et al. [8] highlight the importance of energy efficiency by analysing the power consumption of transaction processing systems with the TPC-C benchmark. They develop a power consumption model and apply it to a large set of experimental results from the TPC-C benchmark. Poess et al. conclude their study by arguing that the performance of new hardware generations keeps increasing and that the price-performance ratio keeps decreasing (as predicted by Moore), whereas the performance-per-watt is increasing at a much slower rate. Thus, energy efficiency will be a key challenge in a modern data centre. Furthermore, they argue that hardware enhancements alone are not enough to satisfy the demand for energy efficiency in the near future.
Software needs to be adapted to be more energy-efficient.

Harizopoulos et al. [5] argue that hardware optimisations are only part of the solution towards improving energy efficiency. They predict that software will be the key to improving energy efficiency and propose to focus on system-wide configuration parameters and query optimisations. In fact, they propose to consolidate resource utilisation and power in order to facilitate powering down unused hardware, and to redesign data structures and algorithms. Thus, their observations align with our own results.

Xu et al. [15] explore the power consumption of a DBMS by performing experiments with PostgreSQL using workloads generated from the TPC benchmarks. They develop a power-aware query optimiser by equipping a modified PostgreSQL optimiser with a power-aware query evaluation model.

Tu et al. [12] propose a storage manager which puts unused disk drives into a low-power mode and employs energy-aware data placement. In addition, they present a prototype DBMS which uses a DVFS-based feedback loop driven by the current performance and a performance threshold. They claim to achieve significant energy savings.

Tsirogiannis et al. [11] present power profiles for core database operations such as scans, sorts and hash joins and evaluate the impact of DVFS as well as of the number of used cores. Moreover, they perform experiments with an open-source DBMS and a commercial DBMS. They conclude that the CPU is responsible for up to 85% of the power increase in their system and that the best-performing configuration is also the most energy-efficient. In fact, they argue that lowering the clock frequency does not improve energy efficiency. However, they only perform their experiments at two different clock frequencies: at a maximum frequency of 2.66 GHz and at a reduced frequency of 2.0 GHz.
Similar to our observations, they conclude that resource consolidation across underutilised nodes leads to power savings without performance losses.

The work of Psaroudakis et al. [9] further supports our results. They propose a fine-granular approach for improving energy efficiency by dynamically scheduling tasks as well as using DVFS. They conclude that using simultaneous multithreading, a moderate clock frequency and a thread placement that fills both sockets can significantly improve the energy efficiency of memory-intensive workloads. In addition, they propose to use a memory allocation policy that uses the memory bandwidth of all available sockets.

Others propose to use a distributed cluster to improve energy efficiency. Schall et al. [10] present a distributed DBMS called WattDB built on clustered commodity hardware. Their system automatically powers off individual nodes of the cluster based on the current workload. They argue that lightweight nodes are more flexible than oversized servers, allowing the energy consumption to be proportional to the current load. Lang et al. [7] conclude that if queries do not scale, reducing the cluster size can improve energy efficiency. In addition, they claim that a heterogeneous cluster design can improve both the performance and the energy efficiency compared to a homogeneous cluster design.
7 Conclusion

The characteristics of typical database workloads matter. In fact, most database algorithms are memory-bound instead of compute-bound. As a result, computing power as well as electrical power is wasted if the processor has to wait for data from main memory. We experimentally explored the impact of reducing the computing power on the energy efficiency of several benchmarks representing common database algorithms such as scans, simple aggregations and joins. Our evaluation reveals that limiting the clock frequency deteriorates the energy efficiency of the compute-bound microbenchmark, while it improves the energy efficiency of the memory-bound benchmarks, including state-of-the-art joins. We show that database workloads behave considerably differently than other workload types and that reducing the computing power, e.g. by limiting the clock frequency, can significantly improve energy efficiency.

Acknowledgment This work was supported by the DFG, Collaborative Research Center SFB 876, A2.
References

1. Balkesen C, Alonso G, Teubner J, Özsu MT (2013) Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. Proc VLDB Endow 7(1):85–96
2. Barber R, Lohman G, Pandis I, Raman V, Sidle R, Attaluri G, Chainani N, Lightstone S, Sharpe D (2014) Memory-Efficient Hash Joins. Proc VLDB Endow 8(4):353–364
3. Barroso LA, Hölzle U (2007) The Case for Energy-Proportional Computing. Computer 40(12):33–37
4. Esmaeilzadeh H, Blem E, St. Amant R, Sankaralingam K, Burger D (2011) Dark Silicon and the End of Multicore Scaling. SIGARCH Comput Archit News 39(3):365–376
5. Harizopoulos S, Shah MA, Meza J, Ranganathan P (2009) Energy Efficiency: The New Holy Grail of Data Management Systems Research. Biennial Conference on Innovative Data Systems Research (CIDR), January 4–7, 2009, Asilomar, California (arXiv preprint arXiv:0909.1784)
6. Intel Corporation (2016) Intel 64 and IA-32 Architectures Software Developer's Manual – Volume 3B
7. Lang W, Harizopoulos S, Patel JM, Shah MA, Tsirogiannis D (2012) Towards Energy-Efficient Database Cluster Design. Proc VLDB Endow 5(11):1684–1695
8. Poess M, Nambiar RO (2008) Energy Cost, the Key Challenge of Today's Data Centers: A Power Consumption Analysis of TPC-C Results. Proc VLDB Endow 1(2):1229–1240
9. Psaroudakis I, Kissinger T, Porobic D, Ilsche T, Liarou E, Tözün P, Ailamaki A, Lehner W (2014) Dynamic Fine-grained Scheduling for Energy-Efficient Main-memory Queries. Proceedings of the International Workshop on Data Management on New Hardware. ACM, DaMoN '14, pp 1:1–1:7
10. Schall D, Härder T (2013) Energy-proportional Query Execution Using a Cluster of Wimpy Nodes. Proceedings of the International Workshop on Data Management on New Hardware. ACM, DaMoN '13, pp 1:1–1:6
11. Tsirogiannis D, Harizopoulos S, Shah MA (2010) Analyzing the Energy Efficiency of a Database Server. Proceedings of the SIGMOD International Conference on Management of Data. ACM, SIGMOD '10, pp 231–242
12. Tu Y, Wang X, Zeng B, Xu Z (2014) A System for Energy-Efficient Data Management. SIGMOD Rec 43(1):21–26
13. Unified EFI, Inc (2016) Advanced Configuration and Power Interface Specification 6.1. http://www.uefi.org/specifications
14. Willhalm T, Dementiev R, Fay P (2016) Intel Performance Counter Monitor – A better way to measure CPU utilization. http://www.intel.com/software/pcm
15. Xu Z, Tu Y, Wang X (2010) Exploring Power-Performance Tradeoffs in Database Systems. Proceedings of the International Conference on Data Engineering (ICDE), pp 485–496