# PERFORMANCE ANALYSIS OF VARIOUS ALU ARCHITECTURES FOR IMPLEMENTATION ON RECONFIGURABLE LOGIC

Priyanka Thakre<sup>1</sup>, Nitesh Dodkey<sup>2</sup>, Siddarth Singh Parihar<sup>3</sup> <sup>1</sup>M.Tech Scholar, <sup>2</sup>HOD ECE, <sup>3</sup>Assistant Professor Department of Electronics and Communication Engineering, Surabhi Group of Institutions, Bhopal (M.P.) India

Abstract: In this work we have implemented an arithmetic and logic unit (ALU) using clock gating and hardware sharing. Clock gating reduces the dynamic power consumption by reducing the switching activity inside the design. Compared with other designs available in the literature, the dynamic power consumption is decreased by 21% for the 8 operation design and by 14.9% for the 15 operation design. We have also used hardware sharing for four instructions, which reduces the hardware usage of the FPGA. The resource usage is reduced by 29% for the 8 operation design and by 2.2% for the 15 operation design.

Keywords: Clock gating, dynamic power consumption, FPGA, hardware sharing, arithmetic and logic unit

#### I. INTRODUCTION

This is an era of handheld devices and equipment. Most of these devices run on batteries, which constrains standby time; increasing standby time requires either more battery capacity or lower power consumption of the device. These days almost every device is intelligent. This intelligence comes from processors, and the trend is likely to grow in the coming years. These processors, however, consume a large share of the device power because of the switching activity inside them. The ALU (Arithmetic and Logic Unit) is the heart of any processor and consumes most of the processor power. In this work we aim to reduce the power consumption of the ALU. We have designed a sixty-four bit optimized ALU with two levels of optimization. First, we reduce the FPGA resource consumption by reusing hardware for different operations, as detailed in the forthcoming sections; this cuts both the FPGA resource consumption and the power consumption of the design. Several blocks are designed to implement the specified operations. Second, we enable only the currently selected block and disable all other blocks using clock gating; this reduces the dynamic power consumption of the device and makes our design greener. Four ALU architectures are presented in this work, as listed below:

- Conventional 8 operations design
- Area and Power optimized 8 operations design
- Conventional 15 operations design
- Area and Power optimized 15 operation design

The architecture of different designs and their performance analysis is discussed in forthcoming sections.

#### II. CONVENTIONAL EIGHT OPERATION DESIGN

The operations supported by the reference design are listed in table 1, and its block diagram is shown in figure 1. As table 1 shows, the reference design performs only eight logical and arithmetic operations; shift, rotate and BCD operations are not supported. Eight units implement the eight operations. At any given instant only one of the eight units is in use, but the clock signal CLK is supplied to all eight units at all times, which increases the dynamic power consumption of the design.

Table 1: Operations Supported by 8 Operations Conventional Design



Figure 1: Conventional 8 Operation Design - Internal Architecture

Figure 2 shows the internal structure of the logical AND operation. The AND operation is implemented by an array of AND gates followed by a D flip-flop.



Figure 2: Internal Structure - Logical AND

Similarly, the internal structures of all the units are shown in figures 2 to 9.



#### III. AREA AND POWER OPTIMIZED EIGHT OPERATION DESIGN

In the conventional design all eight units are connected to the clock at all times, which increases the dynamic power consumption. To reduce the dynamic power consumption, the clock gating technique is adopted. The clock gating logic shown in figure 10 divides the input clock into 5 different clock signals, and only one of the 5 clock signals is active at any given instant. This reduces the dynamic power consumption of the design.

The second level of optimization is the reduction in the number of units from 8 to 5. In this design the addition, subtraction, increment and decrement operations are implemented using a single unit called the adder + 2's complement unit.



Figure 10: Area and Power Optimized 8 Operation Design - Internal Architecture

In the earlier designs, 8 separate blocks are implemented to perform the 8 operations. However, addition, subtraction, increment and decrement are similar to each other.

Addition is needed in all four operations. Subtraction can be implemented by first taking the 2's complement of one number and then adding the other number to it; increment and decrement can be implemented similarly. So to implement these 4 operations we have implemented a single block called the addition and 2's complement block. Reusing the adder hardware reduces the hardware cost of the design.



Figure 11 shows the architecture of the adder and 2's complement block. The architecture consists of two 8 bit adders, adder 1 and adder 2. Adder 1 is employed in the 2's complement unit along with an 8 bit inverter unit. Adder 2 is the main adder, which performs the supported operations in conjunction with the 2's complement block. Three customized multiplexers are also used to implement the adder and 2's complement block. Table 2 shows the input lines selected by two of the multiplexers for different values of the operation selection signal SEL.

Table 2: Operation of Adder + 2's complement Unit

| SEL  | Operation | MUX 1 | MUX 2 |
|------|-----------|-------|-------|
| 0111 | DEC A     | 1     | 1     |
| 0110 | INC A     | 2     | 0     |
| 0100 | A + B     | 0     | 0     |
| 0101 | A – B     | 1     | 0     |
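
The hardware sharing idea can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the operand selection below is a hypothetical wiring chosen for clarity and does not reproduce the multiplexer input numbering of figure 11 or table 2, and the names `twos_complement` and `adder_unit` are introduced here for illustration. The width follows the 8 bit adders described for figure 11.

```python
WIDTH = 8                      # adder width, as described for figure 11
MASK = (1 << WIDTH) - 1


def twos_complement(value):
    """Inverter followed by adder 1 (adds 1), truncated to WIDTH bits."""
    return ((value ^ MASK) + 1) & MASK


def adder_unit(sel, a, b):
    """Model of one shared adder serving four operations.

    The second operand is chosen by SEL (hypothetical mux wiring),
    then the single main adder (adder 2) produces the result.
    """
    if sel == 0b0100:          # A + B
        operand = b
    elif sel == 0b0101:        # A - B  ->  A + 2's complement of B
        operand = twos_complement(b)
    elif sel == 0b0110:        # INC A  ->  A + 1
        operand = 1
    elif sel == 0b0111:        # DEC A  ->  A + 2's complement of 1
        operand = twos_complement(1)
    else:
        raise ValueError("SEL does not address the adder unit")
    return (a + operand) & MASK
```

All four operations pass through the same final addition, which is why one adder block can replace four separate arithmetic units.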

The clock gating logic shown in figure 10 splits the input CLK signal into five different clock signals. Table 3 shows the active clock signal for the currently selected operation. The first four selection values activate the respective logical units, and the next four select the adder + 2's complement unit to perform addition, subtraction, increment and decrement.

Table 3: Clock Gating

| S.No | SEL | Selected CLK IN               |
|------|-----|-------------------------------|
| 1    | 000 | CLK_Logical_AND               |
| 2    | 001 | CLK_Logical_XNOR              |
| 3    | 010 | CLK_Logical_XOR               |
| 4    | 011 | CLK_Logical_OR                |
| 5    | 100 | CLK_Adder_2's_Complement_Unit |
| 6    | 101 | CLK_Adder_2's_Complement_Unit |
| 7    | 110 | CLK_Adder_2's_Complement_Unit |
| 8    | 111 | CLK_Adder_2's_Complement_Unit |
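
The selection in table 3 can be modelled as a decoder that passes CLK to exactly one unit. The sketch below is a behavioural illustration only: it models gating as the clock value ANDed with a decoded enable, which is an assumption, since the gate-level structure of the clock gating logic in figure 10 is not reproduced here. The function name `gated_clocks` is hypothetical.

```python
ADDER_CLK = "CLK_Adder_2's_Complement_Unit"

# SEL value -> clock signal that stays active (from table 3)
CLOCK_FOR_SEL = {
    0b000: "CLK_Logical_AND",
    0b001: "CLK_Logical_XNOR",
    0b010: "CLK_Logical_XOR",
    0b011: "CLK_Logical_OR",
    0b100: ADDER_CLK,
    0b101: ADDER_CLK,
    0b110: ADDER_CLK,
    0b111: ADDER_CLK,
}

# The five distinct gated clocks, in first-seen order
CLOCK_NAMES = list(dict.fromkeys(CLOCK_FOR_SEL.values()))


def gated_clocks(clk, sel):
    """Return the level of each gated clock for the given CLK level and SEL."""
    active = CLOCK_FOR_SEL[sel & 0b111]
    return {name: (clk if name == active else 0) for name in CLOCK_NAMES}
```

For example, `gated_clocks(1, 0b101)` drives only the adder + 2's complement unit; the four logical units see a constant 0 and therefore do not toggle.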

#### IV. CONVENTIONAL FIFTEEN OPERATION DESIGN

This section discusses the conventional ALU supporting fifteen operations. This ALU is based on the same design as the reference ALU supporting eight operations, i.e. the clock is available to all the units at all times, and separate units are used for addition, subtraction, increment and decrement. In this design we have incorporated 7 more instructions to perform shift, rotate and BCD operations. Table 4 shows the operations supported by this design.

Table 4: Reference ALU – 15 Instructions

| Serial No. | Selection | Operation          |
|------------|-----------|--------------------|
| 1          | 0000      | Logical ANDing     |
| 2          | 0001      | Logical XNORing    |
| 3          | 0010      | Logical XORing     |
| 4          | 0011      | Logical ORing      |
| 5          | 0100      | Binary Addition    |
| 6          | 0101      | Binary Subtraction |
| 7          | 0110      | Binary Increment   |
| 8          | 0111      | Binary Decrement   |
| 9          | 1000      | Rotate Right       |
| 10         | 1001      | Rotate Left        |
| 11         | 1010      | Shift Right        |
| 12         | 1011      | Shift Left         |
| 13         | 1100      | BCD Addition       |
| 14         | 1101      | BCD Subtraction    |
| 15         | 1110      | BCD Multiplication |
| 16         | 1111      | NOP                |

Figure 12 shows the internal architecture of the reference ALU with 15 operations. No clock gating logic is used, and separate blocks are used to perform addition, subtraction, increment and decrement.



The rotate right operation is shown in figure 13. It consists of connections between the input and output ports and 64 D flip-flops (FD), as shown in the figure. The first input signal I0 is connected to output signal O63, I1 to O0, I2 to O1 and so on.



Figure 13: Rotate Right - Internal Structure

Similarly, the architectures of the rotate and shift operations are shown in figures 13 to 16.
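
The rotate right wiring (I0 to O63, I1 to O0, I2 to O1 and so on) amounts to a rotation by one bit position. A minimal sketch, assuming bit 0 is the least significant bit; the function names are introduced here for illustration and the flip-flops that register the result are not modelled:

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def rotate_right(word):
    """Rotate a 64 bit word right by one position using shifts."""
    word &= MASK
    return (word >> 1) | ((word & 1) << (WIDTH - 1))


def rotate_right_by_wiring(word):
    """Same operation, written as the explicit wire map I[i] -> O[(i - 1) mod 64]."""
    result = 0
    for i in range(WIDTH):
        bit = (word >> i) & 1
        result |= bit << ((i - 1) % WIDTH)
    return result
```

Because the operation is pure wiring, it needs no logic gates; only the output flip-flops consume resources.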





The BCD adder/subtractor unit shown in figure 17 consists of a binary adder with a BCD correction unit to perform the addition. To perform the subtraction, a 9's complement unit, shown in figure 18, is also incorporated. A multiplexer in figure 17 selects between BCD addition and BCD subtraction. For BCD addition the data path is simply the multiplexer, then the binary adder and then the BCD correction unit. For subtraction, the 9's complement of the second number is first calculated by the 9's complement unit, and the multiplexer passes this complemented number to the binary adder; the final result of the BCD subtraction is obtained after the BCD correction has been performed.



The 9's complement is computed using 4 XOR gates, which XOR the input number with 1, and a binary adder that adds 0110 to the XORed number.
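
The data path described above (binary adder, BCD correction, and 9's complement for subtraction) can be sketched behaviourally as follows. This is an illustration under assumptions, not the gate-level circuit of figures 17 and 18: the 9's complement is computed arithmetically as 9 minus the digit, the end-around carry used to finish the subtraction is the standard 9's complement method and is assumed here, and numbers are lists of decimal digits with the least significant digit first. All function names are hypothetical.

```python
def bcd_add_digit(a, b, carry_in=0):
    """Add two BCD digits: binary addition followed by BCD correction."""
    total = a + b + carry_in          # binary adder output, 0..19
    if total > 9:                     # invalid BCD code or decimal carry
        total += 6                    # correction step: add 0110
    return total & 0xF, total >> 4    # (sum digit, carry out)


def bcd_add(a_digits, b_digits, carry_in=0):
    """Ripple the digit adder across equal-length digit lists (LSD first)."""
    result, carry = [], carry_in
    for a, b in zip(a_digits, b_digits):
        digit, carry = bcd_add_digit(a, b, carry)
        result.append(digit)
    return result, carry


def bcd_subtract(a_digits, b_digits):
    """Compute A - B as A + 9's complement of B; returns (magnitude, is_negative)."""
    complemented = [9 - d for d in b_digits]
    total, carry = bcd_add(a_digits, complemented)
    if carry:                         # end-around carry: result is positive
        total, _ = bcd_add(total, [0] * len(total), carry_in=1)
        return total, False
    magnitude = [9 - d for d in total]
    return magnitude, any(magnitude)  # a zero result is not reported as negative
```

For example, `bcd_subtract([2, 5], [7, 1])` computes 52 - 17 and `bcd_subtract([7, 1], [2, 5])` computes 17 - 52; both yield the digits of 35, with the sign flag distinguishing them.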

Figure 19 represents an area optimized BCD digit multiplier. This multiplier produces the result of the multiplication in binary, so a binary to BCD converter, shown in figure 20, is needed. B is the higher nibble of the product and C is the lower nibble.

Figure 21 shows the parallel multiplication process of a 4 x 4 BCD multiplier. xi and yi are single digit BCD numbers. These numbers are multiplied using the single digit BCD multiplier shown in figures 19 and 20. pyixiH and pyixiL are the higher and lower nibbles of each product respectively.

Figure 22 depicts the 4 x 4 multiplier architecture that implements the algorithm shown in figure 21. In the process of floating point multiplication this 4 x 4 multiplier is extended to implement a 16 x 16 multiplier.



|        |        |        |        | x3     | x2     | x1     | x0     |
|--------|--------|--------|--------|--------|--------|--------|--------|
|        |        |        |        | y3     | y2     | y1     | y0     |
|        |        |        |        | P0003L | P0002L | P0001L | P0000L |
|        |        |        | P0003H | P0002H | P0001H | P0000H |        |
|        |        |        | P0103L | P0102L | P0101L | P0100L |        |
|        |        | P0103H | P0102H | P0101H | P0100H |        |        |
|        |        | P0203L | P0202L | P0201L | P0200L |        |        |
|        | P0203H | P0202H | P0201H | P0200H |        |        |        |
|        | P0303L | P0302L | P0301L | P0300L |        |        |        |
| P0303H | P0302H | P0301H | P0300H |        |        |        |        |
| P7     | P6     | P5     | P4     | P3     | P2     | P1     | P0     |

Figure 21: Array BCD Multiplication Process
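
The array process of figure 21 can be sketched in a few lines. This is a behavioural illustration under assumptions: each single digit product is split into a high and a low decimal digit (standing in for the binary to BCD converter), the partial products are accumulated per column, and carries are then propagated in decimal. The column accumulation order and all function names are choices made here for illustration, not the adder arrangement of figure 22. Digits are listed least significant first.

```python
def bcd_digit_multiply(x, y):
    """Multiply two BCD digits; return (high nibble, low nibble) of the product."""
    product = x * y                    # binary product, 0..81
    return product // 10, product % 10  # binary to BCD conversion


def bcd_array_multiply(x_digits, y_digits):
    """Multiply two n digit BCD numbers (LSD first); returns 2n result digits."""
    n = len(x_digits)
    columns = [0] * (2 * n)
    for i, y in enumerate(y_digits):
        for j, x in enumerate(x_digits):
            high, low = bcd_digit_multiply(x, y)
            columns[i + j] += low      # low nibble lands in column i + j
            columns[i + j + 1] += high  # high nibble is shifted one column left
    result, carry = [], 0
    for total in columns:              # decimal carry propagation, P0 upward
        total += carry
        result.append(total % 10)
        carry = total // 10
    return result


def to_digits(value, n):
    """Helper: integer -> n decimal digits, least significant first."""
    return [(value // 10 ** k) % 10 for k in range(n)]


def from_digits(digits):
    """Helper: decimal digits (LSD first) -> integer."""
    return sum(d * 10 ** k for k, d in enumerate(digits))
```

With four digit operands the result occupies eight digits, matching P0 to P7 in figure 21.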



#### V. AREA AND POWER OPTIMIZED FIFTEEN OPERATION DESIGN

In the reference design all the units of the fifteen operation ALU are connected to the clock at all times, which increases the dynamic power consumption. To reduce the dynamic power consumption, the clock gating technique is adopted. The clock gating logic shown in figure 23 divides the input clock into 11 different clock signals, and only one of the 11 clock signals is active at any given instant. This reduces the dynamic power consumption of the design.

The second level of optimization is the reduction in the number of units from 14 to 11. In this design the addition, subtraction, increment and decrement operations are implemented using a single unit called the adder + 2's complement unit.



Figure 23: Area and Power Optimized 15 operation ALU – Internal Architecture

The clock gating technique is used to reduce the dynamic power consumption of the system. Table 5 lists the active clock signal for each selected operation.

Table 5: Clock Gating - 15 operation design

| Serial No. | Selection | CLK signal selected           |
|------------|-----------|-------------------------------|
| 1          | 0000      | CLK_Logical_AND               |
| 2          | 0001      | CLK_Logical_XNOR              |
| 3          | 0010      | CLK_Logical_XOR               |
| 4          | 0011      | CLK_Logical_OR                |
| 5          | 0100      | CLK_Adder_2's_Complement_Unit |
| 6          | 0101      | CLK_Adder_2's_Complement_Unit |
| 7          | 0110      | CLK_Adder_2's_Complement_Unit |
| 8          | 0111      | CLK_Adder_2's_Complement_Unit |
| 9          | 1000      | CLK_Rotate_Right              |
| 10         | 1001      | CLK_Rotate_Left               |
| 11         | 1010      | CLK_Shift_Right               |
| 12         | 1011      | CLK_Shift_Left                |
| 13         | 1100      | CLK_BCD_Adder_Subtractor      |
| 14         | 1101      | CLK_BCD_Adder_Subtractor      |
| 15         | 1110      | CLK_BCD_Multiplier            |

#### VI. PERFORMANCE ANALYSIS

Table 6 shows the resource and power consumption of the reference ALU and the area and power efficient ALU. The base paper design and the conventional design are the same: we implemented the base paper design [1] and named it the conventional design, then optimized it to reduce the resource usage and dynamic power consumption and named the result the low power and low area design. As table 6 shows, the logic used by the optimized low power and low area design is the lowest among the designs available in the literature; the reduction is 29%. The dynamic power consumption of the base paper design is 43.53mW, whereas that of the proposed design is only 34mW, a reduction of 21.8%. It can therefore be concluded from the table that the proposed low power and low area design has lower resource usage and lower power consumption than the other designs available in the literature.

| On Chip       | Base Paper Design: Power (mW) | Base Paper Design: Resource | Conventional: Power (mW) | Conventional: Resource | Low Power & Low Area: Power (mW) | Low Power & Low Area: Resource |
|---------------|-------------------------------|-----------------------------|--------------------------|------------------------|----------------------------------|--------------------------------|
| Clock         | 0.16                          | -                           | 02                       | -                      | 01                               | -                              |
| Logic         | 0.76                          | 496                         | 02                       | 511                    | 01                               | 355                            |
| Signal        | 6.63                          | 752                         | 04                       | 764                    | 04                               | 802                            |
| IOS           | 35.98                         | 197                         | 37                       | 196                    | 29                               | 196                            |
| Static Power  | 45.32                         | -                           | 45                       | -                      | 45                               | -                              |
| Dynamic Power | 43.53                         | -                           | 44                       | -                      | 34                               | -                              |
| Total         | 88.85                         | -                           | 90                       | -                      | 79                               | -                              |

Table 6: Performance analysis – 8 Operation designs

Table 7: Performance analysis – 15 Operation designs

| On Chip       | Conventional: Power (mW) | Conventional: Resource | Low Power & Low Area: Power (mW) | Low Power & Low Area: Resource |
|---------------|--------------------------|------------------------|----------------------------------|--------------------------------|
| Clock         | 06                       | -                      | 0                                | -                              |
| Logic         | 75                       | 15253                  | 74                               | 14909                          |
| Signal        | 102                      | 17618                  | 95                               | 17502                          |
| IOS           | 44                       | 197                    | 25                               | 197                            |
| Static Power  | 46                       | -                      | 46                               | -                              |
| Dynamic Power | 228                      | -                      | 194                              | -                              |
| Total         | 273                      | -                      | 239                              | -                              |

Table 7 shows the resource and power consumption of the conventional ALU and the area and power efficient ALU. The base paper design and the conventional design are the same: we implemented the base paper design and named it the conventional design, then optimized it to reduce the resource usage and dynamic power consumption and named the result the low power and low area design. As table 7 shows, the logic used by the optimized low power and low area design is the lowest among the designs available in the literature; the reduction is 2.2%. The dynamic power consumption of the conventional design is 228mW, whereas that of the proposed design is only 194mW, a reduction of 14.9%. It can therefore be concluded from the table that the proposed low power and low area design has lower resource usage and lower power consumption than the other designs available in the literature.
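
The dynamic power figure for the 15 operation design can be checked with the usual relative reduction formula. Here $P_{\text{conv}}$ and $P_{\text{opt}}$ are symbols introduced for illustration, denoting the dynamic power of the conventional and optimized designs from table 7:

$$
\text{Reduction} = \frac{P_{\text{conv}} - P_{\text{opt}}}{P_{\text{conv}}} \times 100\% = \frac{228 - 194}{228} \times 100\% \approx 14.9\%
$$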

#### VII. CONCLUSION

In this work we have successfully implemented an arithmetic and logic unit with clock gating and hardware sharing. Four designs are implemented and compared. The first design is the conventional ALU supporting 8 operations, with neither clock gating nor hardware sharing. In the second design we use clock gating to reduce the dynamic power consumption: the clock is inactive for the units that are not in use for the currently selected operation and active only for the unit currently being used. This reduces the switching activity inside the ALU and hence the dynamic power consumption. The decrease in dynamic power consumption is around 21% for the 8 instruction design. A further level of optimization in this design is hardware sharing. Addition, subtraction, increment and decrement can all be implemented using an adder and a 2's complement unit, so we use a single unit, called the adder + 2's complement unit, to implement all four operations. This reduces the hardware resource usage of the optimized design and makes it more cost effective. The third design is the conventional design supporting 15 instructions. Here we have extended the functionality of the base paper design with 7 more instructions, for a total of 15 operations. The 7 new operations are rotate right, rotate left, shift right, shift left, BCD addition, BCD subtraction and BCD multiplication. This design uses neither clock gating nor hardware sharing. The fourth design is the proposed low power and low area design, which employs clock gating and hardware sharing to reduce the dynamic power consumption and resource usage. For this design the decrease is 2.2% in resource usage and 14.9% in dynamic power consumption.

#### REFERENCES

- [1]. Shruti Murgai, "Energy Efficient And High Performance 64-bit Arithmetic Logic Unit Using 28nm Technology", IEEE 2015
- [2]. J. Shinde, and S. S. Salankar, "Clock gating-A power optimizing technique for VLSI circuits" *Annual IEEE India Conference (INDICON)*, pp. 1-4, 2011.
- [3]. J. Castro, P. Parra, and A. J. Acosta, "Optimization of clock-gating structures for low-leakage high-performance applications", Proceedings of IEEE International Symposium on Efficient Embedded Computing, pp. 3220-3223, 2010.
- [4]. V. Khorasani, B. V. Vahdat, and M. Mortazavi, "Design and implementation of floating point ALU on a FPGA processor", IEEE International Conference on Computing, Electronics and Electrical Technologies (ICCEET), pp.772-776, 2012.

- [5]. S. Cisneros, J. J. Panduro, J. Muro, and E. Boemo, "Rapid prototyping of a self-timed ALU with FPGAs", International Conference on Reconfigurable Computing and FPGAs, pp. 26-33, 2012.
- [6]. B. S. Ryu, J. S. Yi, K. Y. Lee and T. W. Cho, "A design of low power 16-bit ALU", Proceedings of the IEEE TENCON Conference, pp.868- 871, 1999.
- [7]. T. Lam, X. Yang, W. C. Tang and Y. L. Wu; , "On applying erroneous clock gating conditions to further cut down power," Design Automation Conference (ASP-DAC), 2011 16th Asia and South Pacific, vol., no., pp.509-514, 25-28 Jan. 2011.
- [8]. B. Pandey and M. Pattanaik, "Clock Gating Aware Low Power ALU Design and Implementation on FPGA", 2nd International Conference on Network and Computer Science (ICNCS), Singapore, April 1-2, 2013
- [9]. E. Arbel, C. Eisner and O. Rokhlenko, "Resurrecting infeasible clock-gating functions," Design Automation Conference, 2009. DAC '09. 46<sup>th</sup> ACM/IEEE, pp.160-165, 26-31 July 2009.
- [10]. Thomas D. Burd, "Energy-Efficient Processor System Design", Ph.D Thesis, University of California, Berkeley, 2001.
- [11]. Thomas D. Burd and Robert W. Brodersen, "Design Issues for Dynamic Voltage Scaling", ISLPED 2000, Rapallo, Italy.
- [12]. Pouwelse, J., Langendoen, K., and Sips, H., "Energy priority scheduling for variable voltage processors", ISLPED 2001, Huntington Beach, CA, USA.
- [13]. C. Lee, J. Lee, T. Hwang, and S. Tsai., "Compiler Optimization on Instruction Scheduling for Low Power", 13th International Symposium on System Synthesis, ACM, September 2000.
- [14]. Parik A, Kandemir M, Vijaykrishnan N and Irwin M.J, "Instruction Scheduling Base on Energy and Performance Constraints", Proceedings IEEE Computer Society Workshop VLSI, 27-28 April 2000.
- [15]. S. Cisneros, J. J. Panduro, J. Muro, and E. Boemo, "Rapid prototyping of a self-timed ALU with FPGAs," in *Proc. International Conference on Reconfigurable Computing and FPGAs*, pp. 26-33, 2012.
- [16]. B. S. Ryu, J. S. Yi, K. Y. Lee, and T. W. Cho, "A design of low power 16-bit ALU," in *Proceedings of* the IEEE TENCON Conference, pp.868-871, 1999.
- [17]. J. Monteiro, J. Rinderknecht, S. Devadas and A. Ghosh, "Optimization of combinational and sequential logic circuits for low power using precomputation," Advanced Research in VLSI, 1995. Proceedings., Sixteenth Conference on , vol., no., pp.430-444, 27-29 Mar 1995.

- [18]. Frank Emnett, Mark Biegel, Power Reduction Through RTL Clock Gating, SNUG San Jose, 2000.
- [19]. Gary K. Yeap, Practical Low-Power Digital VLSI Design, Power, EE Times India, January 2008.
- [20]. John F. Wakerly, Digital Design Principles and Practices, Prentice Hall, 2005.
- [21]. Hubert Kaeslin, ETH Zurich, Digital Integrated Circuit Design from VLSI Architectures to CMOS Fabrication, Cambridge University Press, 2008.
- [22]. P.J. Shoenmakers, J.F.M. Theeuwen, Clock Gating on RT- Level VHDL, Proc. of the int. Workshop on logic synthesis, Tahoe City, CA, pp. 387-391, June 7-10,1998.
- [23]. L. Benini, G. De Micheli, E. Macii, M. Poncino, and R. Scarsi, Symbolic Synthesis of Clock-Gating Logic for Power Optimization of Synchronous Controllers, ACM Trans. Des. Autom. Electron, Oct. 1999.
- [24]. Safeen Huda, Muntasir Mallick, Jason H. Anderson, Clock Gating Architectures For FPGA Power Reduction, FPL 2009.
- [25]. Vojin G. Oklobdzija, Vladimir M. Stojanovic, Dejan M. Markovic, Nikola M. Nedovic, Digital System Clocking: High-Performance and Low-Power Aspects, Wiley Interscience, U.S., 2003.
- [26]. Vishwanadh Tirumalashetty, Hamid Mahmoodi, Clock Gating and Negative Edge Triggering for Energy Recovery Clock, ISCAS 2007, New Orleans, LA, pp. 1141-1144, 2007.
- [27]. Bishwajeet Pandey, Jyotsana Yadav, M Pattanaik, Nitish Rajoria "Clock Gating Based Energy Efficient ALU Design and Implementation on FPGA" 2014 IEEE.