From presence detection for smart doorbells and security cameras in home control, to object counting for retail inventory, to object and presence detection in industrial settings, a growing number of edge applications are driving new AI solutions to market. According to IHS Markit (now Omdia), the number of IoT devices will grow to 40 billion between 2018 and 2025, and by 2022 nearly 50% of enterprise-generated data will be processed outside traditional data centers and the cloud.
The market demands solutions with higher performance than ever before, yet latency, bandwidth, privacy, power, and cost concerns limit how much designers can rely on cloud computing resources for analysis. How can systems meet increasingly stringent power (milliwatt-level) and size (5 mm² to 100 mm²) requirements? How can designers quickly obtain the right hardware and software tools, reference designs, demo examples, and design services? Lattice has made a useful attempt to answer these questions.
Lattice sensAI gets another major update
As the industry’s first complete set of solutions from Lattice for device-side AI processing at the network edge, sensAI™ provides all the resources developers need to evaluate, develop, and deploy FPGA-based machine learning/AI solutions, including modular hardware platforms, demo examples, reference designs, neural network IP cores, software development tools, and custom design services.
Figure 1: Block diagram of the sensAI architecture
In the first half of 2019, an update brought sensAI a 10x performance improvement, the result of multiple optimizations: updated CNN IP and neural network compilers, new 8-bit activation quantization, intelligent layer merging, and dual DSP engines. Most notably, the update added and optimized reference designs for quickly implementing common AI applications at the network edge, with more powerful features for keyword detection, face recognition, people detection, people counting, and more.
Figure 2: Supporting 8-bit quantization during training enables higher accuracy during neural network model training
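The idea behind quantization-aware training is to expose the model to 8-bit rounding during the forward pass so that accuracy survives deployment at low precision. The sketch below is a minimal illustration of symmetric 8-bit activation quantization; the function name and the details of the scheme are assumptions for demonstration, not the actual sensAI training flow.

```python
import numpy as np

def quantize_activations(x, num_bits=8):
    """Symmetric uniform quantization of an activation tensor.

    Illustrative only: a real quantization-aware training flow inserts
    this rounding into the forward pass so the network learns to
    tolerate 8-bit precision.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for signed 8-bit
    scale = float(np.max(np.abs(x))) / qmax   # map the largest value to qmax
    if scale == 0.0:
        scale = 1.0                           # avoid divide-by-zero on all-zero input
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Quantize a small activation vector and check the round-trip error
acts = np.array([0.52, -1.30, 0.07, 0.98], dtype=np.float32)
q, scale = quantize_activations(acts)
dequantized = q.astype(np.float32) * scale    # error stays within one step
```

Dequantizing (`q * scale`) recovers each value to within half a quantization step, which is why 8-bit activations can retain most of the full-precision accuracy.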
To demonstrate the capabilities of the keyword detection system, engineers used a Himax HM01B0 UPduino Shield development board with an iCE40 UltraPlus FPGA. The board has two I2S microphones connected directly to the FPGA, external flash memory that stores the FPGA design and the network weights and activations, and an LED indicator that signals when a keyword is detected. Users can speak directly into the microphones; once a keyword is detected, the LED lights up.
Figure 3: Keyword Detection Demonstration System
The left side of Figure 4 shows a people detection demonstration using a CMOS image sensor optimized for low-power operation, feeding 64 x 64 x 3 input into a VGG8 network; running at 5 frames per second on the iCE40 UltraPlus FPGA, the system consumes only 7mW. On the right is a performance-optimized people counting demonstration, also using a CMOS image sensor, feeding 128 x 128 x 3 input into the VGG8 network; this demo runs at 30 frames per second on the ECP5-85K FPGA and consumes 850mW.
Figure 4: These reference designs demonstrate the power and performance options offered by sensAI
The Lattice person recognition reference design is also used in vending machines to detect the presence of a person and wake the machine’s core. Power consumption is reduced by filtering out the false triggers caused by people who merely pass by without approaching the machine.
In May 2020, sensAI was successfully upgraded to version 3.0.
Building on the existing support for the ECP5/ECP5-5G and iCE40 UltraPlus modular hardware platforms, sensAI version 3.0 adds support for the CrossLink-NX™ FPGA family. CrossLink-NX FPGAs running sensAI software cut power consumption in half compared to previous versions while doubling performance, bringing another breakthrough in power and performance for intelligent vision applications in surveillance/security, robotics, automotive, and computing. The release also includes customizable convolutional neural network (CNN) accelerator IP and adds support for the MobileNet v2, SSD, and ResNet models. These flexible accelerator IPs simplify the implementation of common CNN networks and can be optimized to take full advantage of the parallel processing capabilities of FPGAs; developers can easily compile trained neural network models and download them into CrossLink-NX FPGAs.
Figure 6: sensAI supports multiple AI algorithm models
CrossLink-NX FPGAs are manufactured on a 28nm FD-SOI process, reducing power consumption by up to 75% compared with similar competing FPGAs. When running the solution on a CrossLink-NX FPGA, sensAI has up to 2.5Mb of distributed memory plus RAM blocks and additional DSP resources at its disposal, and the device’s MIPI I/O offers instant-on performance: the I/O configures itself in less than 3ms, while configuration of the entire device takes only 8ms. A VGG-based object counting demo on CrossLink-NX achieves 10 frames per second while consuming only 200mW.
When AI meets ultra-low power FPGA
The iCE40 UltraPlus FPGA, with its 5K LUTs (5280 4-input LUTs), enables the neural network pattern matching required by real-time intelligent applications at the network edge. It offers custom I/O, embedded memory blocks of 80Kb and 1Mb, sleep current as low as 75uA, operating current of 1-10mA, power consumption as low as 1mW, and a hardware platform footprint as small as 5.5 mm². To serve a wide variety of applications, packaging options range from an ultra-small 2.15 mm x 2.50 mm x 0.45 mm WLCSP optimized for consumer electronics and IoT devices to a 0.5mm-pitch 7 x 7 mm QFN for low-cost applications.
Its leading power optimization stems from its Distributed Heterogeneous Processing (DHP) architecture. Instead of sending algorithms to the cloud, the built-in digital signal processor (DSP) performs repetitive signal processing tasks, greatly reducing the computational load on the power-hungry application processor (AP) and allowing it to stay in sleep mode longer to extend battery life. In addition, the built-in neural network soft IP and compiler enable flexible machine learning/AI applications, eliminating the latency of cloud-based intelligence and reducing the cost of the overall system solution.
Figure 7: Distributed Heterogeneous Processing (DHP) architecture adopted by iCE40 UltraPlus
Figures 8 and 9 show how resource differences between FPGAs affect the performance and power consumption of face detection and people detection applications. In the 32×32 input example on the left side of Figure 8, the orange segments represent the cycles spent in the convolutional layers. Of the four examples, the UltraPlus has the fewest multipliers, and the three ECP5 FPGAs have progressively more. As the number of multipliers increases, the number of cycles required by the convolutional layers decreases. The 90×90 input example on the right has a large blue area at the bottom of each bar: the more complex design requires external DRAM, which compromises performance.
Figure 8: Performance, power, and footprint for entry-level and advanced face detection on UltraPlus and ECP5 FPGAs
The people detection application behaves similarly, with the two groups using 64×64 and 128×128 inputs respectively. Again, more multipliers lighten the load of the convolutional layers, while relying on external DRAM hurts performance.
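The inverse relationship between multiplier count and convolution cycles can be sketched with a back-of-the-envelope estimate. All numbers below are illustrative assumptions, not Lattice's measured figures, and the ideal one-MAC-per-multiplier-per-cycle model ignores memory stalls (which is exactly what the external DRAM in the larger designs introduces).

```python
def conv_macs(h, w, cin, cout, k=3):
    """Multiply-accumulate count for one k x k conv layer over an h x w map."""
    return h * w * cin * cout * k * k

def conv_cycles(macs, multipliers):
    """Ideal cycles if every multiplier performs one MAC per cycle."""
    return macs // multipliers

# Hypothetical first conv layer of a 64x64x3 person-detection network
macs = conv_macs(64, 64, cin=3, cout=16)     # 1,769,472 MACs
for mults in (8, 28, 156):                   # illustrative multiplier counts
    print(f"{mults:>3} multipliers -> {conv_cycles(macs, mults):>7} cycles")
```

Tripling or twentyfold-increasing the multiplier count shrinks the convolution cycle count proportionally, which is the trend the orange bar segments in Figures 8 and 9 illustrate.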
Figure 9: Performance, Power and Footprint for Simple and Complex Person Detection on UltraPlus and ECP5 FPGAs
In practice, the most common way to run an AI model is on a processor: a GPU, a DSP, or a microcontroller (MCU). However, a low-end MCU may be unable to handle even simple AI models, while a high-performance processor may exceed the device’s power and cost budget. This is where low-power FPGAs come in. Rather than upgrading the processor to handle the algorithms, the Lattice iCE40 UltraPlus FPGA can act as a co-processor to the MCU, taking on the complex tasks the MCU cannot handle while keeping power consumption within the required range.
Another approach is to use the low-power FPGA as a standalone, complete AI engine, with the FPGA’s DSP resources playing the key role. Even when a network edge device has no other computing resources, AI capabilities can be added without exceeding power, cost, or board-size budgets, not to mention the flexibility and scalability needed to support rapidly evolving algorithms.
Either approach lets designers use Lattice sensAI with a low-power iCE40 UltraPlus FPGA to preprocess sensor data, minimizing the cost of sending data to an SoC or the cloud for analysis. In a smart doorbell, for example, sensAI first reads data from the image sensor. If the object is judged not to be a person (a cat, say), the system does not wake the SoC or connect to the cloud for further processing, minimizing data transfer cost and power consumption. If the preprocessing determines that the object at the door is a person, it wakes the SoC for further processing. This greatly reduces the amount of data the system must process while lowering power requirements, both of which are critical for real-time network edge applications.
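The smart-doorbell flow above is, at its core, a gating loop: a tiny on-FPGA detector scores each frame, and only high-confidence frames ever reach the SoC. The sketch below models that control flow in Python; the function names, the threshold, and the callback interfaces are all hypothetical, chosen only to make the gating logic concrete.

```python
def run_gating_loop(frames, detect_person, wake_soc, threshold=0.8):
    """Score each sensor frame; wake the SoC only on a likely person.

    Hypothetical sketch of the sensAI preprocessing pattern: detect_person
    stands in for the tiny CNN running on the FPGA, wake_soc for the
    hand-off to the application processor.
    """
    wakes = 0
    for frame in frames:
        score = detect_person(frame)       # on-FPGA inference
        if score >= threshold:
            wake_soc(frame)                # hand off only confirmed detections
            wakes += 1
        # otherwise drop the frame: no SoC wake-up, no cloud round-trip
    return wakes

# Toy simulation: scores stand in for the person detector's output
scores = [0.1, 0.3, 0.95, 0.2, 0.85]       # two frames contain a person
woken = run_gating_loop(scores, detect_person=lambda s: s,
                        wake_soc=lambda f: None)   # woken == 2
```

In the toy run, only two of five frames cross the threshold, so the SoC is woken twice instead of processing every frame, which is where the power saving comes from.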
Figure 10: sensAI based on iCE40 UltraPlus FPGA preprocesses sensor data to determine if the data needs to be sent to the SoC for further processing
Lattice’s FPGAs are uniquely positioned to meet the rapidly changing market demands of network edge devices. One of the ways that designers can quickly provide more computing resources to devices at the edge of the network without relying on the cloud is to use the native parallel processing capabilities in FPGAs to accelerate neural network performance. Additionally, by using low-density, small-footprint FPGAs optimized for low-power operation, designers can meet the stringent power and size constraints of new consumer and industrial applications.