Deep learning has developed rapidly during the past decade, thanks to the introduction of open-source neural networks and the availability of large annotated datasets. Both were made possible by the efforts of academics and large technology platforms. The concept of a perceptron neural network was first described in the 1950s, but only recently did adequate training data, neural-network frameworks, and the requisite processing power come together to help launch the artificial intelligence (AI) revolution.
Developers make many decisions when designing cameras with onboard intelligence at the edge. The most impactful is the selection of the vision processor. Until recently, this choice was largely limited to NVIDIA, whose superior graphics processing technology was leveraged for the highly parallel computational demands of neural networks. Most of the open-source intellectual property (IP) for network training and runtime deployment was developed on and for the NVIDIA ecosystem. While NVIDIA continues to be a powerful platform, its cost, size, and power consumption can limit the viability of compact smart cameras.
Leading vision processor suppliers including Qualcomm, Intel, Ambarella, Xilinx, Altera, and MediaTek have developed chip architectures in recent years featuring neural network cores or computational fabrics designed to process neural network computational loads at significantly lower power and cost.
A growing number of suppliers offer powerful vision processors, but it is often not feasible for smaller developers to source cutting-edge processors directly from manufacturers, who typically direct smaller-volume customers to partner firms offering system-on-module (SOM) solutions and technical support. While this is a good option for some developers, it is often advantageous to work directly with the vision processor supplier because the integration of complex multi-threaded runtime routines requires close support from that supplier.
There are several important considerations when evaluating AI compute platforms. The first is the effective number of arithmetic logic units (ALUs) available to perform AI workloads. It is common to use neural accelerators, GPUs, DSPs, and CPUs for portions of the workload. When making a hardware selection, it is important to understand the strengths and weaknesses of each one, and to budget resources accordingly. The ability to run multiple software routines simultaneously is critical for automatic target recognition software. If there is not adequate processing power, the software must fit the various routines into the time slice available, which results in dropped frames. The second consideration is the type and amount of memory the processor can access. Sufficient fast memory is important to achieve inference at high frame rates while running software routines, such as warp perspective, optical flow, object tracking, and the object detectors.
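The time-slice constraint described above can be illustrated with a small budget calculation. This is only a sketch: the routine names echo the ones listed in the text, but the per-frame latencies are hypothetical placeholders rather than measurements, and the model assumes the routines run back to back on a single compute unit. A processor that runs routines in parallel on separate cores would need a different model.

```python
from dataclasses import dataclass


@dataclass
class Routine:
    name: str
    latency_ms: float  # hypothetical per-frame cost on the target processor


def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps


def sustainable_fps(routines: list[Routine]) -> float:
    """Highest frame rate at which the routines, run serially, still fit."""
    total_ms = sum(r.latency_ms for r in routines)
    return 1000.0 / total_ms


def dropped_fraction(routines: list[Routine], camera_fps: float) -> float:
    """Fraction of camera frames that cannot be processed (0.0 if all fit)."""
    achievable = min(camera_fps, sustainable_fps(routines))
    return 1.0 - achievable / camera_fps


# Hypothetical latencies, for illustration only.
pipeline = [
    Routine("warp_perspective", 4.0),
    Routine("optical_flow", 10.0),
    Routine("object_detector", 22.0),
    Routine("object_tracker", 4.0),
]

if __name__ == "__main__":
    print(f"budget at 30 fps: {frame_budget_ms(30.0):.1f} ms per frame")
    print(f"sustainable rate: {sustainable_fps(pipeline):.1f} fps")
    print(f"dropped at 30 fps: {dropped_fraction(pipeline, 30.0):.1%}")
```

With these placeholder numbers the routines need more time than a 30 fps camera provides, so some frames are dropped. Lowering the camera frame rate or selecting a lighter detector brings the pipeline back inside the budget.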
Teledyne FLIR selected processors with at least 8 GB of integrated LPDDR5 memory to be designed into several intelligent cameras, including the new Triton security camera. The Teledyne FLIR AI stack software requires between 3 W and 10 W when running on Ambarella CV-2 or Qualcomm RB5165. Power consumption is managed by selecting networks and proposal routines to fit the power budget specified by the integrator. Object-detection performance is impacted by these configurations, but performance gains continue to be made with more efficient neural networks and new node-generation vision-processor hardware.
Convolutional neural network considerations, performance
While the number of processor choices for running models at the edge is increasing, model training typically occurs on NVIDIA hardware due to the very mature deep learning development environment built on and for NVIDIA GPUs. Neural network training is very computationally demanding, and when training a model from scratch, a developer can expect training times of up to five days on a high-end multi-GPU machine. Network training can also be done on popular cloud service platforms. However, compute costs on these platforms are high, and long data upload and download times are also a consideration.
The second decision a developer must make is the type of neural network architecture. Within the context of computer vision, a neural network is typically defined by its input resolution, operation types, and configuration/number of layers. These factors all translate to the number of trainable parameters that influence computational demand. Computational demands translate directly to power consumption and the thermal loads that must be accounted for during the design of products.
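To make the link between input resolution, layer configuration, parameter count, and computational demand concrete, the sketch below applies the standard formulas for a single 2D convolution layer. The layer shape, the 640 × 512 input, and the 30 fps frame rate are hypothetical values chosen for illustration; a real network sums these figures over all of its layers.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Trainable parameters in a 2D convolution with a square kernel."""
    return c_out * (c_in * kernel * kernel + (1 if bias else 0))


def conv2d_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output height or width for a given input height or width."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_macs(c_in: int, c_out: int, kernel: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate operations needed to evaluate the layer once."""
    return c_out * h_out * w_out * c_in * kernel * kernel


if __name__ == "__main__":
    # Hypothetical first layer: 3x3 kernel, 3 -> 32 channels, stride 2, padding 1.
    width, height, fps = 640, 512, 30
    h_out = conv2d_output_size(height, kernel=3, stride=2, padding=1)
    w_out = conv2d_output_size(width, kernel=3, stride=2, padding=1)
    params = conv2d_params(3, 32, 3)
    macs = conv2d_macs(3, 32, 3, h_out, w_out)
    print(f"parameters: {params}")
    print(f"MACs per frame: {macs}")
    print(f"MACs per second at {fps} fps: {macs * fps}")
```

Note that the parameter count does not depend on input resolution, while the per-frame operation count scales with the output area. This is why a larger input resolution raises compute and power demand even when the model size is unchanged.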
For a given vision processor’s computational bandwidth, developers must trade object-detection accuracy against frame rate. Video camera users typically demand fast and accurate object detection that enables both human response and automatic response by motion-control systems or alarms. A good example is automatic emergency braking (AEB) for passenger vehicles, in which a vision-based system can detect a pedestrian or other objects within milliseconds and initiate braking to stop the vehicle. Another example is a counter-drone system that must track objects of interest and provide feedback to motion-control systems to direct countermeasures to disable drones.
Mean average precision (mAP) is the most common scoring metric in object detection. Intersection over union (IoU) is used to determine if an object detection is a “match” or a “miss” (see Fig. 1). High matches, low misses, and low false positives correlate with higher mAP scores.
Along with understanding the performance of models, data scientists need to analyze the cause of false positives and false negatives, or “misses.” Teledyne FLIR developed Conservator, a subscription-based dataset management cloud software that includes a local tool capable of visualizing model performance. It can be used to interactively explore and identify areas where the model performs poorly, enabling data scientists to investigate the specific training dataset images that cause the missed detections. Developers can quickly modify or augment the training data, retrain, retest, and iterate until the model converges on the performance required. Figure 2 includes an example output illustrating how the model performs at each point within the video sequence. Bounding box area from ground truth, object matches, object misses, and false positives are plotted side by side to help pinpoint areas of concern.
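The IoU test and the match, miss, and false-positive counts discussed above can be sketched in a few lines. This is an illustrative simplification, not the scoring code used by Conservator or any particular benchmark: the 0.5 threshold is a commonly used convention assumed here, the greedy one-to-one matching assumes detections arrive sorted by descending confidence, and a full mAP calculation would additionally sweep confidence thresholds and average precision across classes.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def score_frame(ground_truth: list[Box], detections: list[Box],
                threshold: float = 0.5) -> tuple[int, int, int]:
    """Return (matches, misses, false_positives) for one frame.

    Each ground-truth box can be matched by at most one detection.
    """
    unmatched = list(ground_truth)
    matches = 0
    false_positives = 0
    for det in detections:
        best = max(unmatched, key=lambda gt: iou(gt, det), default=None)
        if best is not None and iou(best, det) >= threshold:
            unmatched.remove(best)
            matches += 1
        else:
            false_positives += 1
    return matches, len(unmatched), false_positives


if __name__ == "__main__":
    truth = [Box(0, 0, 10, 10), Box(20, 20, 30, 30)]
    found = [Box(1, 1, 10, 10), Box(50, 50, 60, 60)]
    print(score_frame(truth, found))
```

Counting these three quantities per frame is also what makes a per-frame plot of matches, misses, and false positives across a video sequence possible.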
It is instructive to understand the number of calculations neural networks perform. For video applications such as automotive safety systems, computations are performed on every video frame, which makes rapid object detection critical to eliminating response delay. For other applications, including counter-unmanned aircraft systems (C-UAS) or counter-drone systems, quick detection and object location metadata are critical inputs to a video tracker that controls the camera and countermeasure pointing actuators. Estimates of computational load do not account for how well the architecture utilizes the specific hardware, so the most reliable way to benchmark a model is to run it on the actual device.
At the end of a training process, the model typically needs to be converted to run on the target vision processor’s specific execution fabric. The translation and fit process is extremely complex and requires a skilled software engineer. This has been a significant point of friction in deploying AI within cameras more quickly. In response, an industry consortium created ONNX AI, an open-source project that established a model file format standard and tools to facilitate runtime on a wide range of processor targets (see Fig. 3). As ONNX becomes fully supported by the vision processor suppliers and the developer community, the effort required to deploy models on different hardware will decrease significantly, reducing a pain point for developers.
The Teledyne FLIR AI stack uses a combination of computer vision technologies to manage computation demands while maximizing confidence in the target’s state. The framework houses a wide range of application-optimized networks and configurations that seamlessly combine routines, including motion detection, stochastic search, fine-grain classification, multi-object tracking, and sensor fusion with external inputs such as a radar. This allows any combination of networks and routines to be selected at runtime using application-specific configurations.
Current models offer pixel inputs as large as 1024 × 768, which reduces the need to decimate an image to fit into a model when running inference on megapixel cameras. This can translate directly to long-range object detection performance by maximizing the number of pixels on target. It also ensures maximum pixel input into fine-grain classifiers that can output object features such as a specific vehicle model or friend-or-foe detection.
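The effect of decimation on pixels on target can be shown with simple arithmetic. In the sketch below, the 1280 × 1024 sensor, the 512 × 384 comparison input, and the 24 × 48 pixel target are hypothetical values chosen for illustration; only the 1024 × 768 input comes from the text. The calculation assumes an aspect-preserving downscale and no upscaling.

```python
def decimation_scale(sensor_w: int, sensor_h: int, model_w: int, model_h: int) -> float:
    """Uniform scale factor needed to fit the sensor image into the model input.

    Returns 1.0 when the image already fits, so the image is never upscaled.
    """
    return min(1.0, model_w / sensor_w, model_h / sensor_h)


def pixels_on_target(obj_w_px: float, obj_h_px: float, scale: float) -> tuple[float, float]:
    """Object width and height in pixels after the image is scaled."""
    return obj_w_px * scale, obj_h_px * scale


if __name__ == "__main__":
    sensor = (1280, 1024)  # hypothetical megapixel sensor
    target = (24, 48)      # hypothetical distant person, in sensor pixels
    for model_input in [(1024, 768), (512, 384)]:
        scale = decimation_scale(*sensor, *model_input)
        w, h = pixels_on_target(*target, scale)
        print(f"input {model_input}: scale {scale:.3f}, target {w:.0f} x {h:.0f} px")
```

Under these assumptions the larger model input keeps twice as many pixels across the target in each dimension, which is the margin that matters for long-range detection and fine-grain classification.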
AI stack development
Figure 4 includes examples of the frameworks, datasets, libraries, neural networks, and hardware that make up the typical AI stack. Teledyne FLIR develops and manufactures longwave-infrared (LWIR), midwave-infrared (MWIR), and visible light cameras well suited for use with AI. Given the requirements and lack of mature tools associated with multispectral sensing, unique software, datasets, and more have been developed to support AI using Teledyne FLIR sensors.
Teledyne FLIR uses the PyTorch framework, which is tightly integrated with Python, one of the most popular languages for data science and machine learning. PyTorch supports dynamic computational graphs allowing the network behavior to be changed programmatically at runtime. In addition, the data parallelism feature allows PyTorch to distribute computational work among multiple GPUs as well as multiple machines to decrease training time and improve accuracy.
Datasets for object detection are large collections of images annotated and curated for class balance and characteristics such as contrast, focus, and perspective. It is industry best practice to manage a dataset like software source code and to use revision control to track changes. This ensures that machine learning models maintain consistent and reproducible performance. If a developer encounters performance issues with a model, data scientists can quickly identify where to augment the dataset to create a continuous improvement lifecycle. Once a verified improvement has been made, the data change is recorded with a commit entry that can then be reviewed and audited. For example, Teledyne FLIR Conservator dataset-management software facilitates revision control on terabyte-scale data lakes along with strong data protection and access features to enable distributed teams.
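One way to picture revision control for a dataset is a manifest of content hashes, one per annotated image, that can be compared between revisions. The sketch below is a generic illustration and does not describe how Conservator works internally; the record layout, the image identifiers, and the function names are all hypothetical.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Stable SHA-256 hash of one image's annotation record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def manifest(dataset: dict[str, dict]) -> dict[str, str]:
    """Map each image ID to the hash of its annotation record."""
    return {image_id: record_hash(rec) for image_id, rec in dataset.items()}


def diff(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List image IDs added, removed, or changed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


if __name__ == "__main__":
    # Hypothetical dataset revisions.
    rev1 = {"img1": {"labels": ["person"]}, "img2": {"labels": ["car"]}}
    rev2 = {"img1": {"labels": ["person", "bicycle"]}, "img3": {"labels": ["car"]}}
    print(diff(manifest(rev1), manifest(rev2)))
```

A diff like this is the kind of information a commit entry can carry, so that a reviewer can see exactly which annotations changed between two model training runs.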
While open-source datasets such as COCO are available, they are visible-light image collections containing common objects at close range captured from a ground-level perspective. Teledyne FLIR focuses on applications requiring multispectral images taken from air to ground, ground to air, and across water, and of unique objects like military objects. To facilitate the evaluation of thermal imaging by automotive safety and autonomy systems developers, Teledyne FLIR created an open-source dataset featuring over 26,000 matched thermal and visible frames.
In the real world, objects are viewed in near-infinite combinations of distance, perspective, background environments, and weather conditions. The accuracy of machine learning models largely depends on how well training data represents field conditions. A tool developed by Teledyne FLIR analyzes a dataset’s imagery and quantifies the data distribution based on object label (percent of images of person, car, bicycle, etc.), object size, contrast, sharpness, and brightness. The tool is then able to correlate model performance and data characteristics to produce a PDF-based datasheet for each new model release.
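A minimal version of such a distribution analysis, covering only label and object size, might look like the following. This is a sketch and not the Teledyne FLIR tool: the annotation layout is hypothetical, the percentages are computed per annotation rather than per image, and the small, medium, and large area thresholds are borrowed from the COCO evaluation convention as an assumption. Contrast, sharpness, and brightness would require image pixel data and are omitted.

```python
from collections import Counter

# Area thresholds in pixels, assumed here following the COCO size convention.
SMALL_MAX_AREA = 32 * 32
MEDIUM_MAX_AREA = 96 * 96


def label_distribution(annotations: list[dict]) -> dict[str, float]:
    """Percentage of annotations carrying each object label."""
    counts = Counter(a["label"] for a in annotations)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def size_bucket(width: float, height: float) -> str:
    """Classify a bounding box as small, medium, or large by pixel area."""
    box_area = width * height
    if box_area < SMALL_MAX_AREA:
        return "small"
    if box_area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def size_distribution(annotations: list[dict]) -> dict[str, int]:
    """Number of annotations in each size bucket."""
    return dict(Counter(size_bucket(a["width"], a["height"]) for a in annotations))


if __name__ == "__main__":
    # Hypothetical annotations: label plus bounding-box size in pixels.
    sample = [
        {"label": "person", "width": 20, "height": 40},
        {"label": "person", "width": 50, "height": 100},
        {"label": "car", "width": 120, "height": 90},
        {"label": "bicycle", "width": 30, "height": 30},
    ]
    print(label_distribution(sample))
    print(size_distribution(sample))
```

Comparing these distributions with per-bucket detection scores is one way to see whether weak performance on, say, small objects coincides with a shortage of small-object training examples.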
The time and expense needed to build large training datasets are significant: the work requires field data collection, curation of frames, annotation, and quality control over label accuracy. This is a bottleneck in deploying AI. Because training data is a critical component of the AI stack, synthetic data is now a practical option. Teledyne FLIR, for example, works closely with synthetic data technology company CVEDIA to develop the tools and IP necessary to create multispectral data and models using computer-generated imagery (CGI). This powerful tool enables the creation of multispectral imagery of almost any object from any perspective and distance. The result is the ability to create datasets of unique objects such as foreign military vehicles (see Fig. 4) that would be extremely challenging to build from field data collection alone.
AI at the edge in production
A convergence of development effort and technology is opening a clearer path to deploying affordable and functional AI at the edge. Lower-cost hardware is being released with improved processing performance that can be used with more efficient neural networks. Software tools and standards that simplify model creation and deployment are promising and should allow developers to add AI to their cameras with a lower investment. The open-source community and model standards from the ONNX community are also contributing to the acceleration of AI at the edge.
As integrators demand AI at the edge in industrial, automotive, defense, marine, security, and other markets, it is important to recognize the engineering effort required to move a proof-of-concept demonstration of AI at the edge to production. Developing training datasets, addressing performance gaps, updating training data and models, and integrating new processors require a team with diverse skills. Imaging systems developers will need to carefully consider the investment required, whether they build this capability internally or select suppliers to support their AI stack.