A large data processing algorithm for energy efficiency in a heterogeneous cluster

https://www.itm-conferences.org/articles/itmconf/pdf/2018/02/itmconf_wcsn2018_03023.pdf

ITM Web of Conferences

Lei Wang, Weichun Ge, Zhao Li, Zhenjiang Lei, Shuo Chen
ICT Department, State Grid Liaoning Electric Power Co., Shenyang, China

Abstract. It is reported that the electricity cost of operating a cluster may well exceed its acquisition cost, and the processing of big data requires large-scale clusters and long running periods. Therefore, energy-efficient processing of big data is essential for data owners and users. In this paper, we propose a novel algorithm, MinBalance, to process I/O-intensive big data tasks energy-efficiently in a heterogeneous cluster. MinBalance works in two steps. In the first step, four greedy policies are used to select the proper nodes, taking the heterogeneity of the cluster into account. In the second step, the workloads of the selected nodes are balanced to avoid the energy wasted by waiting. MinBalance is a universal algorithm and is not affected by the data storage strategy. Experimental results indicate that MinBalance can achieve over 60% energy reduction for large data sets compared with the traditional methods of powering down part of the nodes.

1 Introduction

With the development and application of information technology, the volume of data produced is growing explosively. How to store, manage, and apply these data has become a general concern of the business community and academia. Big data holds great value, so research based on it is also abundant, and many scholars call it the fourth paradigm of scientific research [1]-[2]. Cloud computing, as an emerging model based on economies of scale, has become the platform of first choice for big data storage and processing. Open-source cloud computing platforms such as Hadoop, HBase, and HadoopDB have been widely studied and applied. More and more businesses are building their own big data platforms to handle growing business data and even to offer various services based on big data [3].
Handling big data requires a lot of hardware resources, including servers, PCs, and even mobile devices. These devices consume a great deal of energy, mainly electricity, and since electricity worldwide is generated mainly by thermal power, big data also poses great challenges for energy and the environment [4]. A 2005 study showed that the cost of the electricity a server consumes over its service lifetime already exceeds its purchase cost. Research also shows that in 2008 the world's 4400 servers accounted for 0.8% of electricity consumption, and that at this rate the proportion will reach 3.2% by 2020. The EPA (US Environmental Protection Agency) reported that in 2006 the total electricity consumption of American IT agencies was 61 billion kWh, with an electricity bill of $4.5 billion [5]-[7]. Therefore, alongside the performance of big data storage and processing, energy consumption must also be given sufficient attention [8].

This paper mainly discusses I/O-intensive big data processing tasks. Computationally intensive tasks are strongly affected by the real-time running state of the processor, and the processor control mechanisms provided by different hardware and operating systems differ, so this paper does not consider computationally intensive big data tasks. Because data-intensive tasks depend little on the processor, the time and power that a given server consumes to process each data block can be regarded as basically the same. Suppose a cluster consisting of n heterogeneous nodes processes a Map-Reduce task, and let C be the set of nodes involved in the task processing. The total energy consumed during task processing is

$$\mathrm{Total\ cost} = \max_{n_i \in C} T_i \times \sum_{n_i \in C} P_i \qquad (1)$$

where $T_i$ and $P_i$ denote the processing time and the power consumption of node $n_i$, respectively.
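To make the cost expression concrete, the following minimal Python sketch reads it as the longest processing time among the participating nodes multiplied by the sum of their power draws. The function name `total_energy`, the node names, and all per-block times and wattages are hypothetical values chosen for illustration; they do not come from the paper. The sketch also assumes, as stated above for I/O-intensive tasks, that a node's time per data block is constant.

```python
from typing import Dict, Tuple

# Hypothetical node profile: name -> (seconds per data block, power draw in watts).
NodeProfile = Dict[str, Tuple[float, float]]


def total_energy(nodes: NodeProfile, blocks: Dict[str, int]) -> float:
    """Energy in joules: (max processing time) * (sum of power) over used nodes.

    Only the nodes listed in `blocks` take part in the task; all of them are
    assumed to stay powered on until the slowest one finishes.
    """
    if not blocks:
        return 0.0
    # Processing time of each participating node: time per block * block count.
    times = [nodes[name][0] * count for name, count in blocks.items()]
    # Every participating node draws its power for the whole task duration.
    powers = [nodes[name][1] for name in blocks]
    return max(times) * sum(powers)


# Illustrative figures only (assumptions, not measurements from the paper).
nodes: NodeProfile = {"a": (2.0, 100.0), "b": (4.0, 60.0)}

print(total_energy(nodes, {"a": 6, "b": 6}))  # even split of 12 blocks
print(total_energy(nodes, {"a": 8, "b": 4}))  # split balanced by node speed
print(total_energy(nodes, {"a": 12}))         # fast node only
```

With these made-up figures, balancing the 12 blocks by node speed lowers the energy relative to the even split, and using only the faster node lowers it further. Both the choice of nodes and the maximum processing time therefore affect the total.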
By Eq. (1), the total energy is mainly affected by two factors: which nodes are used to perform the task, and the maximum processing time among those nodes. Eq. (1) therefore suggests two approaches to energy-efficient data processing: 1) select suitable nodes to perform the task, which reduces the total power consumption; 2) balance the load across the nodes, which reduces the maximum execution time. According to the actual condition of each node, the method determines which tasks each node performs, that is, it equalizes the load of the nodes, reduces the task execution time, and thereby further reduces the total energy consumption of the system. The method has three distinct advantages: 1) it fully considers the heterogeneity of the nodes; 2) it is independent of the replica storage strategy; 3) it jointly considers the number of nodes used and load balancing as factors in energy consumption.

2 Problem description

The energy-efficient processing of I/O-intensive big data tasks on a heterogeneous cluster can be formalized as follows: given a cluster composed of h heterogeneous nodes, $N = \{n_1, n_2, n_3, \ldots, n_h\}$, of which a node $n_i$ ($1 \le i \le h$) takes the time and work required to process a block of d (...truncated)