By Lennart Yseboodt, Michael De Nil, Jos Huisken, Mladen Berekovic, Qin Zhao, Frank Bouwens, and Jef Van Meerbergen
Many emerging applications require extremely low-power DSPs. This article shows how to design such a DSP, using an electrocardiogram application as an example. We show how to achieve low power by tuning the algorithm, processor architecture, and memory system, as well as through clock gating. Throughout the article we present detailed power results to demonstrate the impact of each optimization.
Introduction
A new generation of biomedical monitoring devices is emerging. These applications are typically powered by a tiny battery or an energy scavenger, and have extremely low power budgets. Typical power budgets are around 100 µW for the whole system, including radio processing, data processing and memories.
To reduce power dissipation of the radio transmitter, system designers often employ feature extraction and/or data compression to reduce the number of bits transmitted. This shifts the power bottleneck from the radio to the data processor, which is the focus of our article. The goal of our work is to create a C-programmable, application-specific DSP optimized for low power. We use a reconfigurable processor from Philips' technology incubator Silicon Hive [4] as a starting point.
The Technology
The processor used in our work is programmed in C using a retargetable compiler. Programmability, as opposed to fixed-function hardware, is important because the digital subsystem must be able to run different algorithms, such as switching between ECG and EEG analysis. The system may also need to run new algorithms from the biomedical domain. Programmable platforms also require less development effort and result in more portable code, i.e., code that can be recompiled for other hardware platforms with minimal effort.
In this article we differentiate between dynamic and static power consumption. Dynamic power is the power consumed by switching nodes; the power used inside cells due to short-circuits; and all power consumed by internal nets. This includes the functional units, memories, controller and clock.
Static power is the leakage power lost whether the circuit is active or idle. Current CMOS technology trends indicate that leakage is becoming more dominant with every new process generation. In our experiments, leakage power was a critical factor: we measured up to 100 µW of leakage.
Our work has focused on reducing both static and dynamic power by minimizing the time the processor is active. As a case study we examine an ECG algorithm running on the proposed platform. From this example we have been able to make more general system level conclusions.
System level architecture
A generic sensor node consists of several subsystems, as depicted in Figure 1. This node consists of:
- A digital processing subsystem with level 1 local memory (L1)
- A level 2 (L2) memory subsystem (including RAM and non-volatile memories)
- An array of sensors and possibly actuators
- A radio system
- A power subsystem including a source and power manager, which is responsible for waking up various parts of the node when needed
This conceptual model can be applied to a multi-die implementation, leaving open several packaging technologies. If L2 memories are kept off-die, for example, then the size of L2 memory can be varied without creating a new sensor chip.
 Figure 1. Wireless sensor node.
In current systems the power is supplied by a small battery or from energy scavengers. Battery-powered nodes have the disadvantage of requiring maintenance. Different forms of energy scavenging are possible, but in this article we assume a power budget of around 100 µW [5]. This number includes the power consumed by the radio and sensors; in other words, it is the power budget of the entire sensor node.
From a power point of view, the biggest consumers are the radio, the memory, and the digital subsystem. Commercially available radios consume 150 nJ/bit [7], and as a consequence the transmission of raw data can be expensive. An algorithm that reduces the amount of data via compression or feature extraction is usually a good compromise. In addition to reducing the radio power consumption, most subsystems exploit duty cycling and sleep modes to reduce dissipation.
Duty cycling and sleep modes are most effective when the DSP spends a minimal number of cycles executing our algorithm, so this is an obvious area for optimization. We must also minimize the power consumed by the memory subsystem. Specifically, we needed a hierarchical memory subsystem which reduces the size of the lowest-level memories, as fetching data from these memories consumes the most power.
Application
One of the basic features extracted from an ECG signal is the ventricular contraction, when the heart pumps blood to the lungs and the body. In an ECG we call this event the R peak, situated in the QRS complex (Figure 2).
 Figure 2. QRS complex.
In our test case, we detect the R peak using code based on the open-source ECG detection program from EP Limited [3]. The algorithm uses the Pan-Tompkins [1] method for R peak detection. The Pan-Tompkins method uses filtering to isolate the frequency content that is characteristic of the steep R peak.
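As a rough illustration of this kind of filtering, the sketch below chains a five-point derivative, squaring, and a moving-window integrator, which are stages described in the Pan and Tompkins paper. It is not the EP Limited code and it omits the band-pass filter and the adaptive thresholds. The function name, the coefficients as written here, and the 30-sample window (150 ms at 200 Hz) are assumptions chosen for illustration.

```python
def integrate_qrs_energy(samples, window=30):
    """Derivative -> square -> moving-window integration over integer samples.

    Hypothetical helper for illustration; not taken from the EP Limited code.
    """
    history = [0] * 5          # last five input samples, newest first
    squares = [0] * window     # circular buffer of squared derivative values
    running_sum = 0            # sum of everything currently in `squares`
    slot = 0                   # next position to overwrite in `squares`
    output = []
    for x in samples:
        history = [x] + history[:4]
        # Five-point derivative; the divide-by-8 is done with a shift,
        # which suits a core without a hardware divider.
        d = (history[0] + 2 * history[1] - 2 * history[3] - history[4]) >> 3
        sq = d * d
        running_sum += sq - squares[slot]
        squares[slot] = sq
        slot = (slot + 1) % window
        output.append(running_sum // window)
    return output


# A flat baseline with one steep, narrow spike standing in for an R peak.
signal = [0] * 100 + [200, 800, 200] + [0] * 100
energy = integrate_qrs_energy(signal)
```

The output stays at zero on the flat baseline, rises while the steep spike is inside the integration window, and returns to zero afterwards. A detector would then compare this signal against a threshold.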
If we were to detect the R peak offline by transmitting the raw ECG data, the power dissipation of the radio would be 480 µW, well above our power budget. This assumes a minimum sampling frequency for ECG analysis of 200 Hz, a 16-bit sample width, and 150 nJ/bit dissipation. The Pan-Tompkins method reduces this to 4 B/s or 4.8 µW, a 100x reduction. The 4 bytes transmitted hold all the information extracted by this algorithm: the time between R peaks, the height of the R peak and the baseline drift.
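The radio figures follow directly from the bit rate and the energy per bit. The short sketch below reproduces that arithmetic using only the numbers quoted in the text; the function and variable names are illustrative.

```python
JOULES_PER_BIT = 150e-9  # radio energy per transmitted bit, from the text [7]


def radio_power_uw(bits_per_second, joules_per_bit=JOULES_PER_BIT):
    """Average radio power in microwatts for a given bit rate."""
    return bits_per_second * joules_per_bit * 1e6


raw_uw = radio_power_uw(200 * 16)   # 200 samples/s, 16 bits per sample
feature_uw = radio_power_uw(4 * 8)  # 4 bytes/s of extracted features

print(f"raw ECG:  {raw_uw:.1f} uW")
print(f"features: {feature_uw:.1f} uW")
print(f"reduction: {raw_uw / feature_uw:.0f}x")
```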
DSP Optimization
After removing the power bottleneck in the radio, the problem shifts to the DSP. We have chosen an ASIP (Application Specific Instruction set Processor [12]) approach that allows us to tune the core to the application domain. First we describe the reference core; later we will reveal the power optimizations.
Reference core
Because flexibility is an important aspect of the design, we chose an easy-to-modify PearlRay processor from Silicon Hive [6]. The processor is configurable, i.e., it is generated from a parameterized description. The same description is also used to generate a C compiler for the processor.
The top-level configuration file controls certain aspects of the processor such as data widths, functional unit placement, and configuration of the issue slots. We generated a standard configuration with 32 kB of data memory and 32 kB of program memory. The processor is a VLIW architecture with three issue slots and 128-bit-wide instructions, and is synthesized for a speed of 100 MHz. We found this speed to be the 'sweet spot' for this design. Synthesizing the core for several clock frequencies has shown us that speeds above 100 MHz make the design grow exponentially in area and leakage, as depicted in Figure 3.
 Figure 3. Clock frequency vs. area and leakage.
To optimize the filters in the C code, several expensive divisions were replaced by shifts and multiplies. (The PearlRay does not have a hardware divider and relies on a software divider taking 25 cycles per division.) After these optimizations, the cost of analyzing one sample of ECG data at a 200 Hz sampling frequency was 250 cycles, assuming no beat is detected. When a beat is detected this number rises to 1200 cycles. A beat typically occurs once or twice every second, so with two beats per second the algorithm takes 198 * 250 + 2 * 1200 = 51900 cycles per second. If the PearlRay is running at 100 MHz, the resulting duty cycle is 51900 / (100 * 10^6) = 0.05%.
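The duty-cycle estimate can be checked with a few lines of arithmetic. The sketch below uses the cycle counts from the text and assumes the worst case of two beats per second; the function name is illustrative.

```python
def cycles_per_second(sample_rate_hz, beats_per_second, quiet_cycles, beat_cycles):
    """Cycles needed per second when some samples trigger beat processing."""
    quiet_samples = sample_rate_hz - beats_per_second
    return quiet_samples * quiet_cycles + beats_per_second * beat_cycles


CLOCK_HZ = 100e6  # synthesis target of the reference core

cycles = cycles_per_second(200, 2, 250, 1200)
duty_cycle = cycles / CLOCK_HZ

print(f"{cycles} cycles/s, duty cycle {duty_cycle * 100:.3f}%")
```

With a duty cycle this small, the processor spends almost all of its time idle, which is why the idle and leakage power discussed next matter so much.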
Using this data, we analyzed power dissipation of the PearlRay. We obtained power figures for the processor using Synopsys PrimePower with layout-extracted capacitances. The input was a vector file from a netlist simulation, which was generated using Cadence Ncsim. Simulations were based on the processor netlist after layout on a 90 nm CMOS process.
Three power modes are identified: active, idle and sleep. In active mode, the processor is running a program and processing samples. In idle mode, no program is executing but the clock is still running. In sleep mode the clock is off and the only dissipation is due to leakage.
Figure 4 shows a graphical breakdown of the dissipation. In this diagram the x axis represents the elapsed time and the y axis represents power consumption. The area of the bars represents the energy consumed. The lightest bars represent active energy, which can vary depending on the input sample; we observed this behavior in our ECG software. The middle bar represents idle energy and the darkest block is the ever-present leakage energy. As shown in Figure 4, the ECG application is an example of an algorithm that does not use very much of the DSP's processing power.
 Figure 4. Causes of power consumption.
Table 1 shows the power characteristics of the standard version of the PearlRay, which is used as a reference. At first glance the active power is dominant, but the processor is only "active" for a small fraction of the time. The actual energy usage attributed to active mode constitutes only 0.4% of the total energy consumption. The power used in idle mode is the dominant factor due to the length of time spent idling.
 Table 1. Standard version of the PearlRay used as a reference. The last column shows the energy for one input sample and one ECG computation.
To lower idle energy we used coarse-grained clock gating. The PearlRay reference core already used fine-grained, low-level clock gates, but the top-level clock gate was not implemented. The top-level clock gate disconnects the clock from the entire clock tree. When this gate is open, no switching occurs in the processor. As a consequence, an external piece of circuitry must revive the processor when it needs to begin processing.
Such a clock gate was very important, as shown by the results in Table 2. After this optimization the dominant energy component is leakage, at 96% of the total.
 Table 2. Results of top-level clock gating.
Reducing leakage
Now we are faced with dominant leakage power. Our total leakage is 100 µW, of which 50 µW is caused by the data memory, 40 µW by the program memory, and 10 µW by the processor itself, as shown in Table 3. Clearly, the large majority of our leakage comes from the memories.
 Table 3. PrimePower output results for reference PearlRay while active. The coreio contains the data memory.
To reduce leakage we tried four techniques:
- Reducing the data memory size to 2 kB. This was possible since the ECG program requires only 1.2 kB of data memory and 120 bytes of stack. This change reduced the leakage to 65.6 µW, a 34.5% improvement.
- Removing one of the three issue slots in the PearlRay processor and reducing the size of the immediates. This allowed the instruction width to be reduced from 128 bits to 64 bits. This increased the instruction count by 27% due to decreased parallelism in each instruction, but this was more than offset by the 50% reduction in instruction width. We were able to reduce the program memory from 32 kB to 16 kB, resulting in a reduced leakage power of 82 µW, an 18% improvement.
- Using memory modules designed in a technology with a high threshold option (High Vt). This drastically reduced the leakage of the memories. The maximum clock rate of the memories did decrease, but they were still able to operate at 100 MHz. Using these memories reduced leakage to 16.2 µW, an 84% improvement.
- Reducing the datapath width from 32 bits to 16 bits. Since the samples are only 16 bits wide, it made sense to scale the core to this width. This gave a moderate improvement in leakage to 94.7 µW, or 5.3%.
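The improvement figures in this list can be reproduced from the quoted leakage values. The sketch below assumes the rounded 100 µW baseline given earlier, so the computed percentages may differ slightly from the quoted ones, which were presumably derived from unrounded measurements. It also shows why a 27% larger instruction count still yields a smaller program once the instruction width is halved. All names are illustrative.

```python
BASELINE_LEAKAGE_UW = 100.0  # total leakage of the reference core (rounded)

# Leakage after applying each technique on its own, as quoted in the list.
leakage_after_uw = {
    "2 kB data memory": 65.6,
    "two issue slots, 16 kB program memory": 82.0,
    "high-Vt memories": 16.2,
    "16-bit datapath": 94.7,
}


def improvement_percent(before_uw, after_uw):
    """Relative leakage reduction in percent."""
    return (before_uw - after_uw) / before_uw * 100.0


improvements = {
    name: improvement_percent(BASELINE_LEAKAGE_UW, after)
    for name, after in leakage_after_uw.items()
}

# 27% more instructions, each half as wide as before.
relative_code_size = 1.27 * 0.5

for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}% less leakage")
print(f"program size relative to reference: {relative_code_size:.3f}")
```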
Combining these techniques with floorplan optimizations gave the results shown in Table 4. Leakage was reduced by 94.5% to 5.45 µW. Furthermore, scaling down the datapath to 16 bits also reduced the active-mode power dissipation.
 Table 4. Combination of anti-leakage techniques.
System level optimization
In this section we describe system level optimizations that are currently in progress. We are currently experimenting with power gating and L2 memories that can be used to save the state when the core is shut down.
Power down the core
From Table 4 we can see that the leakage power is still dominant. To solve this, we can power down the core, save the system state in L2 memory, and restore it when the next batch of samples needs to be processed. This is accomplished via a hierarchical memory subsystem with a small, frequently accessed L1 memory and a larger, rarely accessed L2 memory. This architecture is similar to the memory hierarchy found in computer architectures, but optimized for power dissipation instead of performance. L2 memory (or part of it) can also be used for other purposes. For example, it can be used to collect the samples that arrive while the core is powered down, or to store multiple applications that are not active simultaneously.
Now let's apply this to the ECG example. The state of the ECG application includes not only data (1.2 kB) but also the program (16 kB). The data that needs to be saved includes the state of the filters and several other variables such as the baseline drift.
We need to decide when to power down the processor. If we do so on a sample basis, this can become quite expensive. Assuming an L2 memory size of 32 kB and a low-power 90 nm process, the cost of an access is 0.875 pJ/B, and the leakage equals 2.5 µW. If the processor is powered down after every sample, the cost is 28.8 µW. The calculation of these numbers is given in Table 5.
 Table 5. Level 1 to level 2 state save calculation.
As shown in Table 5, we can reduce dissipation further by grouping the samples in groups of 50. This reduces the cost of saving and restoring data by a factor of 50, which translates into an acceptable cost of 3.0 µW. This can be further improved to 0.5 µW by using non-volatile memory such as flash.
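The effect of grouping can be sketched from the figures quoted above. The breakdown below is an assumption, since it treats the 28.8 µW per-sample cost as the 2.5 µW L2 leakage plus a save-and-restore component that scales inversely with the group size. It also assumes that a non-volatile L2 contributes no leakage. Under those assumptions the quoted 3.0 µW and 0.5 µW values fall out directly. All names are illustrative.

```python
L2_LEAKAGE_UW = 2.5         # SRAM L2 leakage, from the text
PER_SAMPLE_TOTAL_UW = 28.8  # cost when powering down after every sample

# Assumed split: whatever is not leakage is save/restore traffic.
PER_SAMPLE_TRANSFER_UW = PER_SAMPLE_TOTAL_UW - L2_LEAKAGE_UW


def l2_cost_uw(group_size, leakage_uw=L2_LEAKAGE_UW):
    """L2 cost when state is saved and restored once per group of samples."""
    return PER_SAMPLE_TRANSFER_UW / group_size + leakage_uw


print(f"every sample:        {l2_cost_uw(1):.1f} uW")
print(f"groups of 50 (SRAM): {l2_cost_uw(50):.1f} uW")
print(f"groups of 50 (NVM):  {l2_cost_uw(50, leakage_uw=0.0):.1f} uW")
```

Larger groups reduce the transfer cost further, but the leakage term sets a floor for SRAM and the samples must be buffered somewhere while the core is off.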
The swapping between L2 and L1 memories can be done not only for complete applications, but also for parts of an application. The Pan-Tompkins algorithm for ECG is a good example. It consists of two parts: filtering and feature extraction. Both parts have similar code size. The filtering is executed for every sample, but the feature extraction is executed only when a beat is detected. Therefore it is possible to reduce the L1 code memory by a factor of two, which reduces its power consumption. However, the programmer or the compiler must be aware of this swapping scheme and must insert statements for code swapping.
Results
Table 6 shows a system level overview of power consumption components. A 90 nm process is assumed. The first row shows the baseline ECG case with 1 channel as discussed above. The second row shows a 3-channel ECG. The next row is again a 1-channel ECG, but uses a more complex algorithm that analyzes parameters such as Q and S peaks and average beat rate. The fourth row also uses the complex algorithm, but with 3 channels. The last two rows show FFT analysis on 1 and 10 channels, respectively.
The columns represent the different contributors to the power dissipation. The first column shows the radio power assuming 150 nJ/bit. Columns two and three show the processor power when it is active, and its leakage when idle. The next column shows the dissipation due to saving and restoring state using a 32 kB L2 SRAM memory. The last two columns show the total dissipation for two different scenarios. The last column assumes the processor is in power-down mode and that the L2 memory is used. The second-to-last column assumes the opposite: the processor is not powered down and the L2 memory is not used.
 Table 6. Power consumption with different assumptions. All numbers are in microwatts (µW).
We conclude that for various use scenarios, different components can have the largest impact on power consumption. Therefore power consumption is difficult to predict, and a careful analysis is needed for each situation.
Conclusion
Power dissipation is the most important constraint for wireless sensor nodes applied to healthcare applications. This article describes the development of an architecture using a single-channel ECG application as an example. It shows that a 100 µW solution is feasible.
For minimum power dissipation, there is an optimum balance between computation and communication. Transmitting raw data is usually not optimal. A significant reduction in the amount of transmitted bits is obtained via compression or feature extraction. As a consequence the bottleneck shifts towards the DSP.
Static as well as dynamic dissipation must be tackled. Both are reduced by tuning the core to the target domain through use of application specific instructions, proper memory sizes, etc. In an optimized architecture, the L1 memories have a limited size due to the high number of accesses in active mode. When the processor is inactive, it can be powered down while the state is saved in L2 memory. However, the granularity of this power-down scheme must be chosen carefully.
In analyzing different ECG applications, we have shown that optimizing the digital processing technology is important. This is why we have chosen to focus on digital processing to reduce power consumption. Using ECG as a test case, and adopting a bottleneck-driven step-by-step approach, a factor of 100 reduction of power dissipation of the DSP core was measured via simulations. This is a result of the following actions:
- Algorithm level optimization and simplification of the code
- Architecture level optimizations, e.g., reducing L1 memory size by a factor of two for instructions and a factor of 16 for data
- Gate level optimizations, e.g., clock gating
- Technology with high Vt
References
1. Rangaraj Rangayyan: Biomedical Signal Analysis. USA: Wiley 2002. ISBN 0-471-20811-6.
2. J. Pan and W.J. Tompkins: A Real-Time QRS Detection Algorithm. IEEE Transactions Biomedical Engineering 1985, BME-32(3): 230-236
3. EP Limited http://www.eplimited.com
4. Silicon Hive http://www.siliconhive.com
5. Bert Gyselinckx. Human++: emerging technology for body area networks.
6. T.R. Halfhill. Silicon Hive Breaks Out. Dec. 1st 2003, Microprocessor Report, www.MPRonline.com
7. True System-on-Chip with Low Power RF Transceiver and 8051 MCU, TI Datasheet CC1110, SWRS033A
8. Low power DSP, TI MSP430F149, www.ti.com
9. Coolflux DSP, www.coolfluxdsp.com
10. Virantha N. Ekanayake, Clinton Kelly IV, and Rajit Manohar. BitSNAP: Dynamic Significance Compression For a Low-Energy Sensor Network Asynchronous Processor, Proc. ASYNC, pp. 144-154, Mar. 2005.
11. Brett A. Warneke and Kristofer S.J. Pister. An Ultra-Low Energy Microcontroller for SmartDust Wireless Sensor Networks, Proc. ISSCC, Feb. 2004
12. H. Meyr, System-on-chip for communications: The dawn of ASIPs and the dusk of ASICs, Proc. IEEE Workshop on Signal Processing Systems (SIPS'03), Seoul, Korea, Aug. 2003