Machine Vision and Applications (2009) 20:379–394 DOI 10.1007/s00138-008-0133-3
ORIGINAL PAPER
A hybrid system for embedded machine vision using FPGAs and neural networks Miguel S. Prieto · Alastair R. Allen
Received: 26 February 2007 / Accepted: 25 February 2008 / Published online: 27 March 2008 © Springer-Verlag 2008
Abstract This paper presents a hybrid model for embedded machine vision combining programmable hardware for the image processing tasks and a digital hardware implementation of an artificial neural network for the pattern recognition and classification tasks. A number of possible architectural implementations are compared. A prototype development system of the hybrid model has been created, and hardware details and software tools are discussed. The applicability of the hybrid design is demonstrated with the development of a vision application: real-time detection and recognition of road signs.
Keywords Embedded machine vision · FPGA · ANN · SOM
1 Introduction
The types of vision application which will form the basis for mass implementation over the next few years are those requiring the embedding of intelligent pattern recognition systems into industrial processes, instrumentation and portable systems. These systems will make extreme demands on the hardware, requiring the integration of traditional image processing (IP) with artificial intelligence (AI) techniques, and will be implemented in designs with a small footprint and low power consumption. In these kinds of designs, there is often the need for filtering and preprocessing of the images,
M. S. Prieto · A. R. Allen (B) School of Engineering and Physical Sciences, University of Aberdeen, Aberdeen AB24 3UE, Scotland, UK
preparing them for high-level analysis which frequently involves the recognition or classification of certain features which may be present in the image. This paper investigates possible architectures of a hybrid system combining programmable hardware (for IP tasks) with a neural array (for classification tasks). Additionally, this paper also presents a hybrid prototype system (HPS) that facilitates the development and test of vision applications on the hybrid system. Designing a prototype of the hybrid system has two main purposes. The first purpose is to reduce the cost in time and effort of implementing an embedded vision application. Following the methodology described and using the suggested set of tools, the implementation of an application is greatly simplified. The second purpose of creating the HPS is that the specific requirements of an application can be better studied through testing its performance on the HPS. Once the tests have been analysed, the choice of architecture to be used in the final implementation of the hybrid system, for that particular application, becomes clear. In this way, a new vision application would first be implemented and tested on the HPS, and then a suitable architecture would be chosen according to the results obtained during the tests. This process is illustrated during the development of the demonstrator application included in this paper. The next section presents an overview of previous work in the fields related to this project. Section 3 studies the advantages and disadvantages of several architectures of the hybrid system, and Sect. 4 presents a prototype system for the development of applications based on any of the hybrid system architectures presented in Sect. 3. As a demonstrator of the hybrid system, a road sign recognition (RSR) application has been implemented, and this is outlined in Sect. 5. Finally, Sect. 6 draws some conclusions and proposes some future research directions of this work.
2 Background
The first part of this section reviews the work that has been carried out towards the development of hardware systems for the pre-processing of images. Section 2.2 introduces various techniques of post-processing analysis of the images, and concentrates on the application of artificial neural networks (ANN) to IP, and the advances in the development of hardware systems for ANN. To conclude this section, Sect. 2.3 discusses some of the issues related to combining these two technologies and gives a short introduction to the hybrid model chosen as the main theme of this paper.
2.1 Pre-processing
Most machine vision systems use a combination of IP techniques to perform a complete inspection. Amongst the more common IP techniques are: thresholding, segmentation, measurement, filtering, texture and colour analysis, and edge detection. Some of these techniques require complex and time-consuming calculations. The large amount of computation required by some techniques can easily exceed the capacity of a conventional microprocessor when the system is expected to process images fed by a camera at real-time speed. In the past, the most common solution to this problem was the use of application specific integrated circuits (ASIC), hardware designed specifically for each application. However, this solution is not very practical, especially for low volume applications, since ASICs are expensive to produce in small quantities, require a great amount of experience in microelectronics, and their development is a very time-consuming task. Another common approach for real-time image processing has been the use of parallel computing [59]. In IP, due to the nature of its operations, it is often possible to partition the image across several processors, and to process each partition in parallel. This mechanism requires additional communication between the processors to share the data, which can be a problem for high-latency networks.
As a result, many multiprocessor systems have difficulties with real-time video processing due to communication overheads. Additionally, parallel computing is costly, and the lack of stability and software support for the parallel machines can also be a disadvantage [7]. Fortunately, there have recently been great advances in the development of programmable hardware, which substantially reduces the time spent in the design phase of an application. Field programmable gate arrays. A field programmable gate array (FPGA) consists of an array of logic and routing resources that can be reprogrammed after it is manufactured. FPGAs
are generally slower than purpose-built hardware, and draw more power. On the other hand, the advantages that they offer are of great importance: a shorter time-to-market, lower development costs, and the ability to reprogram the device [12,57]. The greatest problem of FPGAs used to be their very limited size. Complex operations, such as some IP algorithms, required impractical designs implemented across multiple devices, with the communication problems that this implied. Even though by the early 1990s there were already some FPGA-based IP systems [1,27], it was not until the end of the 1990s that the significant improvements in FPGA capacities (and FPGA technology in general) made this technology a serious candidate for hardware, and more applications started to appear [17,26]. Since then, FPGA technology has only improved, with a consequent increase in the number of FPGA implementations of IP. Various researchers have developed FPGA-based systems for IP applications. Salcic et al. [53] and Bouridane et al. [7] produced real-time co-processors based on FPGA. Bouridane et al. also created a library of common IP algorithms that can be called from the PC hosting the co-processor. This original work evolved into a different kind of library in [5], which is a collection of reusable hardware skeletons of common IP operations, which can be combined through high-level algorithmic descriptions to obtain hardware solutions. Another processing system for machine vision applications was developed by Dunn et al. [18]. An important part of their work is the development of a high-level language that could be compiled into hardware by automated techniques. More details on these techniques follow in the next section. Similar to these works, there is a growing number of other kinds of architectures based on FPGAs designed for real-time IP [4,34,42].
Recent technological advances have provided new functionalities for the FPGAs of later generations, such as dynamic reconfiguration. Dynamic reconfiguration is the ability to reconfigure a part or the whole of the FPGA during execution time. Tanougast et al. [60] have introduced methods to optimise the use of the FPGA, exploiting this new functionality, and created a dynamically reconfigurable embedded real-time system. Nevertheless, dynamic reconfiguration is a very new area of research that still requires some improvement to facilitate its use. Despite the great advantages of the new functionalities of FPGAs, such as dynamic reconfiguration, possibly the most important area of research around FPGA technologies at the moment is centred on the programming of the FPGAs. More specifically, the development of high-level languages that can be compiled directly into hardware by automatic techniques can neutralize the main disadvantage of FPGAs, i.e. that the programming model for FPGAs has up to now been at gate level. This issue is examined in more detail in the following section.
Hardware compilation. The development and testing of algorithms at gate level is a very time-consuming task and requires considerable expertise in microelectronics. Hardware compilation [14,57] is a recent approach to this problem in which algorithms are first written in a high-level language, and then compiled into hardware by automatic techniques. Some high-level languages used in hardware compilation are: the CCGL designed specifically for Xilinx FPGAs by Dunn et al. [18], the SA-C developed by Draper et al. [16], and Handel-C [10], which has been extended considerably more than other languages. The familiarity of writing in a high-level language, combined with the simple compilation and simulation environment, makes hardware compilation very suitable for the adaptation of IP algorithms into hardware implementations [44,57,58]. Handel-C is a programming language that includes most of the capabilities of conventional C. In addition, Handel-C also includes some parallel constructs, which enable the exploitation of the inherent parallelism of algorithms. In this way, the implementation of an algorithm in hardware follows three steps: translation of the C/C++ program into Handel-C, parallelisation of the code, and adaptation of the code to the hardware platform that it is going to be running on (by fixing variable widths, high-level interfacing of devices, and so on). After these steps, all that is needed is to use the Handel-C hardware compiler to automatically generate the gate-level hardware.
2.2 Post-processing
The computational complexity of recent machine vision applications has demanded new AI techniques to be added to the traditional methods of IP. Two common approaches to analysing the data are from the point of view of statistics or ANN. In general, there are many parallels between the two fields, and there is a direct correspondence between many of the techniques.
An excellent summary of the similarities, differences, and the ideal situations in which different techniques should be used, can be found in the Neural Network FAQ by Sarle [54]. This section is concerned mainly with the use of ANNs in the context of IP, and the development of hardware implementations of ANN. One of the reasons why ANNs are appealing to this project is that they are biologically inspired systems. Since one of the global aims of this project is an attempt to produce a suitable system for the simulation of cognitive models of the human mind, it seems appropriate to use methods that are biologically inspired. Concerning the use of ANNs in IP applications, there has been a large amount of work to evaluate the performance of different networks for each kind of IP task. A very comprehensive review of these studies can be found in [19]. In this study, Egmont-Petersen et al. point out that the most frequently applied architectures are: feed-forward ANNs, self-organizing maps (SOM) and Hopfield ANNs. Of these three, feed-forward ANNs seem to be less useful for some very important IP tasks, and Hopfield ANNs are sometimes difficult to apply to a particular problem and convergence to a global optimum cannot be guaranteed [19]. The major problem of using ANNs in real-time IP is that most of them are very costly computationally and therefore unsuitable for real-time processing, unless implemented in hardware. For this reason, there has been a great effort to produce suitable architectures for ANN [43]. Neural architectures are commonly classified into the following main categories [28,39]: accelerator boards, multiprocessor systems, and digital, analogue and hybrid (digital and analogue) neurochips. Heemskerk [28] mentions that the successive order of these categories corresponds to increasing speed-up but decreasing maturity of the techniques used. In general, hybrid neurochips [6,23] have only been used in laboratories, while accelerator boards have been commercially available for some time. At the moment, digital techniques seem to offer the best possibilities for implementing flexible, general-purpose neurocomputers. There are many reviews of neural hardware in the literature. Some very complete reviews are [28,31,39]. In this last work, Ienne et al. argue that the quest for generality has influenced the design of many new processing elements, so that new architectures tend to be less attached to a particular algorithm or ANN and aim for a more general design. Other reviews can be found in [13,55,56]. An interesting comment by Dias et al. in their recent work [13] is that many neurochips are being removed from the market, and very few new ones are appearing.
They suggest that this might be because neurochips are expensive to build and there is still little known about the real commercial prospects for working implementations, but that the appearance of new hardware solutions in the coming years may change the present state of the ANN hardware market. With those considerations in mind, and in order to produce a system with real hardware, the ANN chosen for the system developed in this project is the SOM, for various reasons. There seems to be a significant amount of research on applications of the SOM to IP, with apparently very positive results [3,8,9,19,38]. Another reason is that hardware implementations of the SOM are available (more details in the next sections). The following section presents the basic theory of the SOM and introduces the VindAX processor, a digital implementation of this kind of network. Self-organising maps. In 1982, Teuvo Kohonen presented a new model of unsupervised competitive ANN called the Self-Organising Map (SOM) [24,35]. The model consists of an array of neurons, each of which is connected to all the inputs and to some neighbourhood of the surrounding neurons.
This architecture resembles the way biological nets are organised. In the brain, the number of connections within each group is much greater than the connections to outside of the group. Moreover, the physical proximity of two biological neurons reflects some kind of similarity between the impulses that activate them. In order to implement this feature, while classical competitive learning only updates the weights of the winning neuron, the Kohonen learning algorithm also extends the competition over spatial neighbourhoods. This extension is achieved by updating the weights of the neurons within the proximity of the winner neuron, thus allowing the formation of clusters of nodes within the array. As in the biological system, the neurons grouped in a cluster share some sort of similarity between the features of the inputs that activate them. In this way, it is possible to perceive the underlying structure of multidimensional data by projecting it over the array of neurons and observing the array auto-organise in clusters. The SOM algorithm. In the SOM algorithm, the multidimensional Euclidean input space $\mathbb{R}^n$ is mapped into a two-dimensional output space $\mathbb{R}^2$. The reference vector of each neuron in the network is $m_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}] \in \mathbb{R}^n$, where $\mu_{ij}$ are scalar weights, $i$ is the neuron index and $j$ the weight index. An input vector $x = [\xi_1, \xi_2, \ldots, \xi_n] \in \mathbb{R}^n$ is presented to all neurons in the network, and the neuron with the closest matching (i.e. greatest similarity) vector, $c$, becomes the active neuron, i.e.
$$c = \arg\min_i \{\|x - m_i\|\}, \qquad (1)$$
which means the same as
$$\|x - m_c\| = \min_i \{\|x - m_i\|\},$$
where $\|x - m_i\|$ is the distance between the input vector $x$ and the reference vector $m_i$, using a similarity metric such as the Euclidean distance. During the training of the ANN, after the active neuron has been identified, the reference vectors of the neurons in the neighbourhood of the active neuron need to be updated to bring them closer to the input vector $x$. The amount of change is determined by the distance of the neuron from the active neuron. The updating rule is:
$$m_i(t+1) = m_i(t) + \alpha(t)\,[x(t) - m_i(t)] \quad \text{if } i \in N_c(t),$$
$$m_i(t+1) = m_i(t) \quad \text{if } i \notin N_c(t),$$
where $N_c(t)$ is the current neighbourhood, and $t$ is discrete time (i.e. $t = 0, 1, 2, \ldots$). The training is considered successful if the neurons form clusters of similar reference vectors. Each cluster of neurons
would correspond to a different class of the data. This clustering is a product of the design of the network and occurs without any supervision. Once the network is successfully trained it can be used to classify new data. The new data would be codified into a new input vector, which would then be passed to the network. The neuron that becomes active by Equation 1 determines the class of the input data. Hardware implementations of the SOM. Over the years, there have been many hardware implementations of the SOM. In most cases, the SOM algorithm is implemented on a neural chip designed to support different kinds of networks. An example of this is the work by Ienne et al. [32] where they used a chip with a Single Instruction stream Multiple Data stream (SIMD) architecture called MANTRA I [61] to implement the SOM algorithm. The authors suggested a few differences in the algorithm to improve the parallelisation and its implementation in hardware. More recent examples of SOM implementation on neural hardware can be found in [6,21] and [48]. The last work by Porrmann et al. presents a dynamically reconfigurable hardware acceleration based on FPGA technology for the simulation of SOM. The system, equipped with 5 FPGA modules, achieved a respectable maximum performance of more than 50 GCPS (giga connections per second) during recall and 5 GCUPS (giga connection updates per second) during learning. Other publications from the same research group giving more details of the system and results can be found in [49] and [52]. Another fully digital hardware implementation of the SOM algorithm is the Modular Map [40,41]. It is composed of a neural array and a module controller that provides the interface between the host and the array which is a SIMD array of processors configured to provide a highly parallel processing system. 
Each neuron of the array is implemented separately as a simple RISC processor, and they interact with each other creating a network of 16 × 16 neurons with the topology equivalent to that of a SOM. The design adopts a modular strategy, which permits the neural array to work as a fully functional self-contained network or as a part of a bigger network by interconnecting modules. The commercialised version of the hardware implementation of the Modular Map design is known as the VindAX processor, and is developed by AXEON Ltd [2]. The manufacturing company provides a PCI development board, which contains one VindAX processor and a software package to run on a PC for the development of applications. Through the Vindax Development Board, the VindAX processor can be accessed as one 16 × 16 network or as various partitionings of the neural array. The partitionings can either be in the dimension of the neural map (e.g. 4 networks of 4 × 4 neurons with reference vectors of 16 elements) or in the dimension of the reference vectors (e.g. 2 networks of
16 × 16 neurons with reference vectors of 8 elements). A Register Transfer Level synthesisable VHDL description of the VindAX processor is available for Intellectual Property ware applications [29]. In the case of embedded IP applications, the Modular Map soft core could be included as part of the hybrid system designed in this project, and optimised for the use of each particular application. In the VindAX processor implementation of the SOM algorithm, $\mathbb{R}$ covers the range (0, 255) and $0 < n \le 16$ in the multidimensional Euclidean input space $\mathbb{R}^n$, which means that the vectors have up to 16 elements with values up to 255. Regarding Eq. 1, a variety of distance metrics can be used as a measure of similarity. Since the equation is implemented in hardware, the Manhattan distance metric has been found to be a valid alternative to the more widely used Euclidean distance [40,51]. The Manhattan distance is less expensive in terms of computational resources. The hardware implementation of the SOM used in the present work is the VindAX. Some of the advantages that are directly relevant are: the availability as an Intellectual Property core, the ability to partition the neural array into various sub-maps, and the user-friendly development system.
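The SOM algorithm of this section (Eq. 1 and the updating rule) can be sketched in a few lines of Python. This is a minimal software illustration under stated assumptions, not the VindAX implementation: the function names, the square neighbourhood, the fixed learning rate `alpha` and radius, and the floating-point weights are all choices made for the example. The Manhattan distance and the 16-element vectors with values in the range 0 to 255 follow the description above.

```python
import random


def manhattan(x, m):
    """Manhattan (L1) distance between input vector x and reference vector m."""
    return sum(abs(a - b) for a, b in zip(x, m))


def find_active_neuron(x, refs):
    """Eq. 1: index c of the neuron whose reference vector is closest to x."""
    return min(range(len(refs)), key=lambda i: manhattan(x, refs[i]))


def neighbourhood(c, rows, cols, radius):
    """Indices within `radius` grid steps of neuron c (square N_c, an assumption)."""
    cr, cc = divmod(c, cols)
    return [r * cols + k
            for r in range(max(0, cr - radius), min(rows, cr + radius + 1))
            for k in range(max(0, cc - radius), min(cols, cc + radius + 1))]


def train_step(x, refs, rows, cols, alpha, radius):
    """Apply m_i <- m_i + alpha * (x - m_i) for i in N_c; other neurons are unchanged."""
    c = find_active_neuron(x, refs)
    for i in neighbourhood(c, rows, cols, radius):
        refs[i] = [m + alpha * (xi - m) for xi, m in zip(x, refs[i])]
    return c


if __name__ == "__main__":
    random.seed(0)
    rows, cols, n = 4, 4, 16  # small map chosen for the example only
    refs = [[random.randint(0, 255) for _ in range(n)] for _ in range(rows * cols)]
    x = [random.randint(0, 255) for _ in range(n)]
    c = train_step(x, refs, rows, cols, alpha=0.5, radius=1)
    print("active neuron:", c)
```

After training, classification of a new vector only needs `find_active_neuron`, which is the part of the algorithm that does not modify the reference vectors.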
2.3 Combination of the pre-processing and post-processing
Because of the requirements for increasingly complex vision processing, there is a clear motivation towards creating systems that allow the merging of the two aspects of image analysis presented in the previous sections: the pre-processing and the post-processing. The pre-processing of the images comprises all IP tasks which deal with the image at a very low level, whereas the post-processing tasks include the high-level analysis of information with AI techniques, such as ANN. There appears to have been relatively little research invested in the development of systems combining traditional IP techniques with ANN. One company that seems active in this line of research is General Vision [25]. Their main product is an image recognition engine named CogniSight™. CogniSight™ includes a Xilinx Virtex FPGA of 50K gates together with a parallel neural network based on the Zero Instruction Set Chip (ZISC) [20] from IBM. This system has been designed to analyse the colour, shape and texture of visual objects, learn these signatures with the ANN, and then recognize identical or similar objects to produce a response. One of the main problems of the CogniSight engine is that the kind of pre-processing is limited to a pre-defined set of operations, which are not very complex in nature, due to the limited size of the FPGA. The usage of the system has been greatly simplified so as to reduce the technical knowledge required to design the pre-processing of the images or the training of the ANN. Even though this may be useful for a limited number of applications, the design is in general too rigid for a broader use in machine vision applications.
This paper proposes a hybrid system consisting of a number of image processing (IP) stages followed by a neural classifier. The pre-processing stages of the system would be implemented on programmable hardware, in the form of an FPGA, and the post-processing tasks of classification and pattern recognition would be performed by a digital implementation of the SOM, namely the VindAX processor. More details regarding this hybrid system can be found in Sect. 3. In order to facilitate the application development process for the hybrid system being designed, and following the lines of Benkrid et al. in [5], a library of IP algorithms has been developed. The main idea is that this library would provide a set of common IP algorithms that would avoid a great amount of work in the development of applications on the hybrid system. In contrast with the work by Benkrid, the algorithms contained in the library have been created by using hardware compilation techniques in order to facilitate their development and to increase the re-usability of their code.
Fig. 1 FPGA used in both stages of the hybrid system, the pre-processing and classification
3 Possible architectures of the hybrid system
This section discusses three possible architectures of a hybrid system consisting of a number of IP stages followed by a neural classifier. The IP stages are performed by an FPGA, and the neural classifier corresponds to an implementation of the SOM algorithm either in the FPGA or in dedicated hardware (the VindAX processor). The advantages and disadvantages of each architecture are analysed in detail.
3.1 FPGA standalone
In the simplest case, the FPGA would perform both stages of the hybrid system, i.e. the pre-processing and the classification (see Fig. 1).
In this model, the FPGA would contain all the necessary algorithms for the pre-processing of the images, as well as a full implementation of a SOM. It is clear that because of its simplicity this design is ideal for embedded applications. The problem with this method is that having the full implementation of a SOM on the chip would require the use of a very large FPGA, or otherwise very little space for the IP algorithms would be left available. The Intellectual Property core of the VindAX processor could be used for the full implementation of a SOM in this architecture.
Advantages of this architecture:
− The interconnection and communication between the pre-processing and the SOM is well established, does not need to be analysed in particular for each new application, and is likely to be fast.
− Having the solution as just an FPGA makes it ideal for embedded applications, i.e. one single chip means lower power consumption and easier integration within embedded systems.
− Including the full implementation of the SOM allows the system to have on-line learning capabilities, making it capable of adapting to new situations during the operation of the device.
Disadvantages of this architecture:
− A full implementation of the SOM requires many resources, causing a limitation in the maximum size of the neural array and in the complexity of the pre-processing of the images.
− In applications with very intensive use of the neural array, the speed of the system at processing vectors might not be fast enough for the speed requirements of a real-time application.
3.2 FPGA standalone with off-line learning
One of the problems with an implementation on a standalone FPGA, as mentioned in Sect. 3.1, is that a full SOM implementation may use too many resources on the chip, limiting the complexity of the pre-processing of the images. Figure 2 shows a solution to the problem, in which the working mode of the FPGA is differentiated from the learning mode. More specifically, in applications where the learning of new vectors does not need to be done in real-time, which is the case for many applications, the implementation of the SOM in working mode would only include those resources needed to find the active neuron. This process requires very simple processing (see Sect. 2.2), which means that very high classification speeds can be reached using this model, at the same time that a smaller amount of resources than the full implementation are being used. When the application required the learning of new vectors, the FPGA could be set to learning mode, and a full implementation of the SOM would be included in the chip. In the case that there were not enough resources for the pre-processing and the full SOM in the same chip, then the pre-processing could be performed first, storing the obtained vectors for the training of the network, followed by the reprogramming of the FPGA with a full SOM implementation (without the pre-processing of the images) which could then be trained with the vectors stored in the previous step. This model solves the problem of resources to a certain degree. The classification stage of the SOM would still require a large area of the FPGA, unless only a subset of neurons was implemented, instead of the full neural array. In this case, the neural array would have to be partitioned, and the new vector would have to be classified with each partition in order. This method would reduce the speed of the classification by a factor proportional to the number of partitions. In cases where the classification speed is critical, the achievement of real-time performance by this design will not be possible.
Fig. 2 The two modes of the FPGA used for applications in which the learning of new vectors by the SOM is performed off-line
Advantages of this architecture:
− The interconnection and communication between the pre-processing and the SOM is well established and fast.
− Solution integrated in a single FPGA, ideal for embedded applications.
− By limiting the on-line capabilities of the chip to only classification, a large amount of resources are freed for the implementation of more complex pre-processing algorithms compared to the design in Sect. 3.1.
Disadvantages of this architecture:
− Due to the time taken switching between the Working Mode and the Learning Mode, on-line learning is greatly reduced, and in many cases it might prove necessary to come off-line during the time that it takes for the learning of new vectors.
− The implementation of only the classification step of the SOM does free a large amount of resources for the pre-processing, although this might not be enough for applications with a heavy load of pre-processing.
− In applications with very intensive use of the neural array, the speed of the system at processing vectors might still not be fast enough for the speed requirements of a real-time application.
Fig. 3 FPGA and VindAX processor working in parallel
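The partitioned classification mentioned in Sect. 3.2, where only a subset of neurons is implemented and a new vector is classified with each partition in turn, can be illustrated with the following sketch. It is a software model under assumptions: the function names and the `partition_size` parameter are hypothetical, and the inner loop stands in for comparisons that the hardware would perform in parallel within one partition. The number of passes it reports is what makes the classification time grow with the number of partitions.

```python
def manhattan(x, m):
    """Manhattan (L1) distance between input vector x and reference vector m."""
    return sum(abs(a - b) for a, b in zip(x, m))


def classify_partitioned(x, refs, partition_size):
    """Find the active neuron by scanning the neural array one partition at a time.

    Returns (index of active neuron, its distance to x, number of passes).
    Each pass models loading one partition of reference vectors into the
    neurons that are physically implemented.
    """
    best_index, best_distance, passes = None, None, 0
    for start in range(0, len(refs), partition_size):
        passes += 1
        # In hardware, all neurons of the current partition would be compared at once.
        for offset, m in enumerate(refs[start:start + partition_size]):
            d = manhattan(x, m)
            if best_distance is None or d < best_distance:
                best_index, best_distance = start + offset, d
    return best_index, best_distance, passes


if __name__ == "__main__":
    # 256 reference vectors of 16 elements (toy data for the example).
    refs = [[i] * 16 for i in range(256)]
    x = [100] * 16
    print(classify_partitioned(x, refs, partition_size=256))  # full array, one pass
    print(classify_partitioned(x, refs, partition_size=64))   # four partitions
```

Both calls find the same active neuron; only the number of passes differs, which is the trade-off between FPGA area and classification speed discussed in the text.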
3.3 FPGA and dedicated ANN processor
The third solution proposed in this project combines the advantages of using FPGAs for the pre-processing of the images with the computational power of dedicated hardware (the VindAX processor) for the classification. By using dedicated hardware, the classification stage can meet real-time requirements even when the network is continuously adapting its reference vectors (on-line learning). Furthermore, by dedicating the FPGA solely to the pre-processing of the images, a large amount of resources are freed which can in turn be used to implement more complex, and faster, IP algorithms. Figure 3 shows the design of the hybrid system combining the two chips, the FPGA and the VindAX processor, to work in parallel. The simplest mode of interaction between the two chips is that of a pipeline in which there is only one stage of pre-processing in the FPGA followed by one stage of classification in the VindAX processor. A more complex interaction between the two chips can be found in the Road Sign Recognition demonstrator application, in Sect. 5. In this last example, there is a continuous dialogue between the two chips analysing the data at different levels, moving down a classification tree. The design shown in Fig. 3 offers the flexibility of creating any sort of interaction between the two chips. Another example of how the system is useful in embedded applications could be to address one of the problems of the architecture presented in Sect. 3.2. One of the disadvantages of that architecture was that the system had to be stopped in order to adapt it to new vectors and therefore on-line learning was not possible. This problem could be solved by connecting the system as in Fig. 2, with the difference that the learning mode of the FPGA would be substituted by the VindAX processor working in parallel with the FPGA. The design is illustrated in Fig. 4.
Fig. 4 An FPGA containing a SOM classifier and the VindAX processor learning in parallel
During the normal operation of the system presented in Fig. 4, the FPGA would do both tasks of pre-processing and classification of the input images. At certain intervals, whenever the network needed to be adapted (on-line learning), the same vectors that were being classified by the SOM classifier inside the FPGA would also be passed (in parallel) to the VindAX processor. The VindAX processor would then use these vectors to re-train the ANN and adapt the values of the models stored in the individual neurons of the neural array. At the end of each adaptation interval, the new reference vectors of the VindAX processor neural array would then be transferred to the FPGA in order to update the values of its SOM classifier. This adaptation could be useful for increasing the accuracy of the classification. During the periods when the network did not require on-line learning, the VindAX processor could be deactivated in order to decrease the power consumption of the device. A very important factor to keep in consideration with this kind of system is the choice of communication channel between the two chips. If the speed of the channel is too low, the channel is likely to become the bottleneck of the design and even restrict its ability to meet the speed requirements of real-time applications. If the two chips are contained within the same board, this problem does not arise since the latency between the two chips would be insignificant.
In many cases, however, a different approach which is probably easier to implement could also be considered. In these cases, the communication between the two chips could be performed by any of the following channels:
1. PCI: A 64-bit PCI bus with a 133 MHz clock (PCI-X) enables data transfer speeds of up to about 1 Gbyte/s (a 32-bit PCI bus works with a 33 MHz clock, giving 132 Mbyte/s).
2. Ethernet 100: In theory, the communication speed of this bus is 100 Mbit/s. However, this could not be achieved if the data were passed through a computer with an operating system that did not offer real-time capabilities. If that were the case, when the bits were being transferred up through the protocol layers handling the communication between the two chips, the operating system might not grant total CPU usage to the operation, interrupting the flow of the transfer.
3. Parallel port: In Enhanced Parallel Port (EPP) mode, data transfer takes place as a single software instruction, and the rest of the transfer is handled by hardware. This allows an EPP port to function as a 16- or 32-bit data transfer interface using 8-bit I/O hardware, in effect enabling EPP peripherals to achieve the same speed and efficiency as their ISA bus counterparts: 8 Mbit/s.
4. Serial port: The typical data transfer speed of the serial port is 115,200 bit/s, although other speeds can also be used.
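To make the comparison of channels concrete, the following minimal sketch estimates the ideal time needed to move a payload over each of them. The Ethernet, parallel port and serial port rates are the nominal figures quoted in the list; the PCI rate is derived here as bus width times clock for a 32-bit, 33 MHz bus. The function names and the 1024-byte payload are hypothetical, and latency, protocol overhead and operating system effects are deliberately ignored, so the results are lower bounds rather than measurements.

```python
# Nominal channel rates in bit/s. The Ethernet, EPP and serial figures are the
# ones quoted in the text; the PCI figure is bus width x clock for a 32-bit,
# 33 MHz bus (an assumption of this sketch, roughly 132 Mbyte/s).
NOMINAL_RATE_BPS = {
    "pci_32bit_33mhz": 32 * 33_000_000,
    "ethernet_100": 100_000_000,
    "parallel_epp": 8_000_000,
    "serial_rs232": 115_200,
}


def transfer_time_ms(payload_bytes, rate_bps):
    """Ideal time to move a payload, ignoring latency and protocol overhead."""
    return payload_bytes * 8 / rate_bps * 1000.0


def channels_within_budget(payload_bytes, budget_ms):
    """Names of the channels whose ideal transfer time fits in the budget."""
    return [
        name
        for name, rate in NOMINAL_RATE_BPS.items()
        if transfer_time_ms(payload_bytes, rate) <= budget_ms
    ]


if __name__ == "__main__":
    payload = 1024  # hypothetical amount of data exchanged per frame, in bytes
    for name, rate in NOMINAL_RATE_BPS.items():
        print(f"{name:16s} {transfer_time_ms(payload, rate):8.3f} ms")
```

A check of this kind only shows whether a channel is clearly too slow for a given payload; the real behaviour of a channel still has to be measured on the target hardware.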
In any case, a detailed study of the communication channel to be used is necessary in order to prevent the channel from becoming the bottleneck of the design.
Advantages of this architecture:
− The FPGA can be fully dedicated to the pre-processing of the images, allowing the implementation of much more complex IP algorithms than the designs presented in previous sections.
− The use of dedicated hardware for the classification and learning of new vectors assures a performance compatible with real-time applications, as well as providing the ability to perform on-line learning.
Disadvantages of this architecture:
− Detailed analysis of the communication channel between the two chips is required in order to prevent it from becoming the bottleneck of the design.
− The implementation of the design becomes more complex, since it involves the management of two chips and the communication between them.
4 The hybrid prototype system The purpose of the work described in this section is to provide a useful tool for the development of embedded machine vision applications based on the hybrid model presented in the previous section.
4.1 Introduction Section 3 has introduced three possible designs for the development of the hybrid system, each with its own advantages and disadvantages. The differences between the implementation of these three designs can be quite significant. For this reason, an initial study of which is the most suitable architecture for the application is strictly necessary. The problem, though, is that this decision is often far from trivial. The difficulty of this decision lies in the fact that, especially in the early stages of the development of the application, there is great uncertainty about vital points such as: the speed at which the ANN will be required to work, what resources will be needed for the implementation of the pre-processing of the images, and what kinds of interaction between the two chips will be involved. The choice of a specific architecture for the system can be a determining factor as to whether or not its performance will be able to meet the specifications imposed by the application. This is, however, not the only determining factor since many other implementation decisions have great effect on the performance. For example, there are on the one hand critical decisions regarding the pre-processing: the algorithms to use, the method of optimisation used on them (i.e. optimise for speed or for silicon area), the method of vector extraction for the ANN, etc. And on the other hand, there are also some critical decisions regarding the neural classifier: the optimal size of the network, the size of the reference vectors, the quantity of maps used in the classification, and so on. All these decisions cannot be taken a priori and they should be investigated as the application is being developed. 
The problem with this method is that since the application is going to be based on hardware, any changes in the architecture or in any of the decisions mentioned above are likely to be quite time-consuming, and probably going to affect all other areas of the application as well. For example, if the decision has been taken to base the system on an architecture of the type FPGA standalone, presented in Sect. 3.1, an increase of the size of the ANN is likely to have an impact on the complexity and speed of the algorithms that can be used in the pre-processing of the images. The circularity of the problem is then evident, since this could in turn affect the way in which the vectors are extracted from the images, and therefore have some impact on the data to be passed on to the ANN. From the previous example, it is clear that, instead of deciding the architecture of the system from the start, it is preferable to first develop the application and then to perform the necessary adjustments to the system so that it can be implemented in one of the three architectures available. For that reason, it would be useful to have the tools to develop the application in a way that allowed each part of the hybrid system to be developed individually, and at the same time be
possible to abstract the view from the different parts of the system and see its operation as a whole. In conclusion, the main problem in the development of a hybrid system is that the tools for the development of applications based on such a system are limited. In general, it is difficult to design each part of the system individually (i.e. the IP or the ANN) and still be able to see the operation of the system as a whole. At the same time, a methodology for the design, development and test of applications based on the hybrid system architecture is also required. The following sections discuss some available tools for the development and test of applications based on a hybrid system, and combine these tools into a holistic prototype system able to support embedded machine vision applications during their design phase, before they are ported to any of the architectures introduced in Sect. 3.
4.2 Tools for the development of applications based on the hybrid system This section presents the two development systems considered for the development of applications based on the hybrid system.
4.2.1 The RC200 development system The RC200 development system is composed of the RC200 board and the Celoxica DK Design Suite, from Celoxica Ltd. [11]. DK is a development environment for software-compiled system design, which is a methodology for designing electronic systems for programmable hardware. In other words, DK facilitates the development of applications for FPGAs. In DK, the programs are first written in a programming language called Handel-C [10], which has a similar structure to C with the addition of certain commands oriented towards a hardware implementation of the algorithms. The gate-level description logic of the hardware can be obtained from the Handel-C code using automated techniques called Hardware Compilation [57]. The main advantages of software-compiled system design relevant to this project are:
1. High-level system design using a C-based language which can be directly compiled into optimised FPGA logic.
2. Accurate simulator/debugger of the source code at hardware-level.
3. Area and delay profiling of source code.
Using a C-based language (Handel-C) is very useful to develop and test parts of the application first in software (standard C) before adapting them to Handel-C. Once the code has been transferred to Handel-C, it is easy to use the internal simulator/debugger of DK to verify the functionality of the code and to transfer it to the actual hardware. Finally, the area and timing analysis of the code can be used to identify parts of the algorithms which could be potential bottlenecks and to learn how they could be optimised. The area and timing analysis can also be useful to determine whether the pre-processing of the images will be able to share the resources in the FPGA with an implementation of the SOM in the same chip, corresponding to the architectures described in Sects. 3.1 and 3.2.
The RC200 board is a standalone FPGA-based prototyping board, which provides an ideal tool for the testing of Handel-C code implemented with DK. The RC200 board is particularly useful for the development of machine vision applications, since it can be used as a standalone system containing the main resources needed for this kind of application, namely: a high performance direct connection to the video decoder and video output, a powerful Xilinx FPGA capable of supporting complex IP operations, large memory banks to store the images, and a diverse array of methods to communicate with external devices. More specifically, the main components of the RC200 board are:
− FPGA: Xilinx Virtex II XC2V1000 (1M system gates), 720 Kbits SelectRAM Blocks and 160 Kbits distributed RAM.
− Memory: Two independent banks of 2Mb SRAM each, and a SmartMedia Socket.
− Video decoder: 24 bit RGB or YCrCb signal.
− Video output: DAC with a 24-bit colour map for a VGA monitor or projector.
− Communication: 10/100 Ethernet, Parallel Port, RS232 port.
− Connectors: keyboard, mouse, touch-screen.
One of the advantages of using the RC200 board is that there exists a Platform Abstraction Layer (PAL) for DK, which makes the driving of the components of the board much easier. The PAL provides the tools for programming the various components of the board from a high level of abstraction, hiding most of the hardware interfacing details from the user. The RC200 comes with a fully-developed simulator of the board which includes all of its components. In this way, the behaviour of the board as a whole, and not only the FPGA, can be fully tested during the design and development of the application without requiring the actual hardware.
4.2.2 The VindAX development system The VindAX development system (VDS) comprises a combination of hardware and software for the development of
solutions to problems involving non-linear systems. At the core of the VDS lies the VindAX processor, which is a fully digital implementation of the SOM algorithm following the Modular Map Design (see Sect. 2.2). The VindAX processor is mounted on the VindAX Evolution 2 Board, which connects to a PC through the PCI port. The VDS provides all the necessary tools to interact with the VindAX processor. These tools come in the form of steps, where each step performs a specific function. These functions range from the formatting of the input vectors to the displaying of the reference vectors in 3D graphs. Even though the standard steps are flexible enough to be sufficient for most applications, the VDS also allows the incorporation of user-created custom functions. Perhaps the most important feature of the VDS is the ease with which various parameters of the system (such as the size of the input vectors and of the neural array) can be modified and the learning capabilities of the new network specification tested. In conclusion, the VDS provides all the necessary tools for the creation of complex systems based on SOMs.
In relation to hybrid systems, the VDS provides many useful tools for the creation of applications based on such systems. For example, it is possible to take the algorithms created for the pre-processing of the images and compile them into a user-created custom step, which could then be used in the VDS. In this way, the whole hybrid system could first be simulated in software, before it was actually adapted for hardware. To be precise, the step corresponding to the VindAX processor would not be a simulation, since the VDS uses the VindAX processor mounted on a PCI card in the PC. Summarising, the main benefits of using the VDS relevant to this project are:
1. Rapid application development.
2. Cost-effectiveness.
3. Easy integration of user-created code as custom function steps.
4. Wide range of tools for investigating different specifications of the neural array.
5. Shorter training times and increased reliability due to training performed on actual hardware.
For these reasons, the VDS is an ideal tool for the development and test of the SOM stage of any application based on the hybrid system developed in this project.
4.3 The prototype system In the previous sections we have described the RC200 Development System and the VindAX Development System. A prototype of a hybrid system combining these two development systems has been developed, the hybrid prototype system (HPS). The objective of the HPS is to provide the user with all the required tools for the development of embedded machine vision applications based on the hybrid system presented in Sect. 3. This section presents how the two development systems are combined to form the HPS, and how the HPS can be used to develop embedded applications.
First of all, it is important to note that the HPS has been designed to facilitate the development and testing of embedded applications, although the applications running on it may not necessarily achieve real-time performance. The HPS has been designed, however, to provide the main framework on which embedded solutions with real-time performance could be based. The idea is to provide the necessary tools for the design of the application in such a way that possible shortcomings can be detected and prevented at a very early stage, so that the final solution can deliver the required performance.
The HPS designed to this end is shown in Fig. 5. In the HPS, the RC200 would process the images received by a camera, and find the objects or features that required further analysis. The RC200 would then extract some vectors of characteristics from these objects, and transfer them to the VindAX processor through the PC. The VindAX processor would return the class to which these objects belong back to the RC200, which would finally display the results on a monitor connected to the board.
In the approach presented in Fig. 5, the VDS and the RC200 development system have been connected to a PC. This PC would be running the software of the systems: the DK and the VDS software. Through these software packages, the user would be able to design and implement the embedded application in progressive stages. The tests would first be performed in software, and finally in hardware. The final hardware tests would be performed on each of the individual hardware systems communicating through the PC. This method would facilitate the monitoring of the traffic between the two systems, allowing the user to easily identify the source of any problems that appeared. The complete life cycle in the development of an application on the HPS would be:
1. Initial design of the application. Partitioning of the tasks to be implemented on the FPGA and specification of the kind of analysis to be performed by the SOM.
2. Implementation of the pre-processing algorithms of the application using a standard tool such as Matlab. Extraction of initial set of vectors from a database of example cases of the application.
3. Training of the neural array with the VDS using the vectors obtained from a database of examples. Adjustment of the parameters of the network in an effort to improve the classification accuracy.
4. Initial testing of the system and analysis of the results. If the result is satisfactory, proceed to step 5, otherwise modify the algorithms used in the pre-processing and the methods employed in the vector extraction, and go back to step 2.
5. Implementation and testing of the IP algorithms in C.
6. Adaptation of the algorithms to Handel-C and compilation into hardware using the DK software.
7. Loading of the netlist on the FPGA of the RC200 board and testing of the algorithms. From the area and delay profiling of the code, improvement of the key areas of the algorithm causing the greatest delay in time, if the application is more concerned about execution speed, or reduction of the silicon area used by the code, if the concern is more centred on the use of FPGA resources.
8. Interconnection of the whole system through the PC and global testing of the application using the HPS.
9. Analysis of the overall performance of the system and of the silicon area used by the pre-processing. Choice of architecture for the final embedded hybrid system from the designs proposed in Sect. 3.
10. Migration of the solution into the architecture of choice and final testing of the embedded hardware.
Fig. 5 The hybrid prototype system being used in a road sign recognition application
By the time Step 5 in the life cycle of the application is reached, the algorithms to be used in the pre-processing of the images have already been decided. In order to get a simulation of the system working with the actual FPGA, it is necessary to code the algorithms in C, test them, adapt them to Handel-C, optimise them for hardware, and test them once more. This process can become rather tedious and it requires
a great amount of knowledge of programming techniques at hardware level. In order to facilitate this process of preparing the algorithms for hardware, a library of IP algorithms has been created. The basis for the creation of the library is twofold: IP algorithms which (1) are commonly used, and/or (2) exhibit widely useful design patterns. (1) Some IP algorithms are common to many machine vision applications, and so, it would be very helpful to have them already implemented and prepared for hardware. For this reason, a library has been compiled with these standard algorithms implemented in Handel-C, and optimised for their use in an FPGA. Regarding the Step 5 mentioned above, the job of the developer of a new application is now reduced to determining which of the standard algorithms from the library can be used for the application, and implementing the other more specific algorithms not present in the library. Furthermore, not only does the library save the time of having to implement and prepare these algorithms for hardware, but it also provides a fundamental core of standard techniques to process the image. (2) As mentioned above, very specialised knowledge is required for the programming of algorithms for hardware. This task, however, can be greatly aided by pre-existing methods and procedures to deal with similar kind of problems. For example, if a special kind of image filtering is required by the application, the developer can explore the solutions given in the IP library to filtering algorithms (such as median or Gaussian filtering), and base the new algorithm on the code provided for those solutions. By providing useful examples, the programming task of Step 5 and Step 6 can be greatly
reduced, and the work of the developer is directed from technical and time-consuming tasks towards more creative aspects of the process.
As mentioned previously, there are various possible ways of connecting the RC200 to a PC (i.e. ethernet, parallel or serial port). In principle, one would normally choose the fastest; however, there may also be other, practical considerations. The method used for many of the experiments with the HPS is the serial port, even though it is not the fastest available option. The reason why this method is employed is its simplicity of use: on the RC200 side, the PAL provides a series of functions for the transmission of bytes through the serial port, and on the PC side there also exist various libraries for serial communication. It is also important to bear in mind that many machine vision applications involve a considerable amount of data reduction in the preprocessing to extract a feature vector. Thus only a small amount of data is actually sent to the ANN classifier. Therefore, in many applications, the speed of the connection between FPGA and ANN is not a critical factor.
Due to the speed limitations of the HPS, it is necessary to bear in mind that the overall performance of a solution being tested on the system will be significantly improved after its migration into the final embedded hardware. More specifically, the three points which limit the performance of any system being developed and tested on the HPS are:
− The use of the development systems slows down the communication to and from the chips, while a direct approach with both chips on the same board would be significantly faster.
− Since DK and the VDS run on Windows, the complete dedication of the processor time to the system cannot be assured, due to the background processes always present.
− The serial cable, although easy to program, is the slowest of the available methods for connecting the RC200 to the PC.
These points need to be taken into careful consideration when the architecture for the embedded solution is chosen.
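As an illustration of the kind of routine that the IP library discussed earlier in this section is meant to provide, the sketch below gives a software reference version of a 3 × 3 median filter, one of the filtering algorithms named as an example. It is written in Python purely for illustration; it is not the Handel-C code of the library, which is not reproduced in this paper, and the function name and the choice of leaving border pixels unchanged are assumptions of this sketch. A reference model of this sort corresponds to the software testing stage of the life cycle, before the algorithm is adapted to hardware.

```python
def median_filter_3x3(image):
    """Apply a 3x3 median filter to a greyscale image given as a list of rows.

    Border pixels are copied unchanged (an assumption of this sketch); a
    hardware version would more likely stream pixels through line buffers.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]  # copy so the input is not modified
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Gather the nine pixels of the neighbourhood centred on (x, y).
            window = [
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            window.sort()
            result[y][x] = window[4]  # middle element of nine sorted values
    return result


if __name__ == "__main__":
    noisy = [
        [10, 10, 10],
        [10, 255, 10],  # a single impulse-noise pixel in the centre
        [10, 10, 10],
    ]
    print(median_filter_3x3(noisy))
```

The output of such a reference model can be compared pixel by pixel with the output of the hardware simulator when the same test image is supplied to both.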
5 The road sign recognition application As a proof of concept, a real-time machine vision application has been developed and tested using the HPS. Road sign recognition is a part of driver support systems. Its main aim is the increase of traffic safety by calling the driver’s attention to the presence of key traffic signs such as: stop signs, yield signs, speed limits, etc. Additionally, a vision-based system able to detect and classify traffic signs from road images in real-time would also be useful as a support tool for guidance and navigation of intelligent vehicles.
The problem of RS detection and classification might seem simple and well-defined, since the colour, shape, dimensions and placement of the RS are tightly regulated. In reality, the problem is very complex for several reasons:
− Since the images are acquired from a moving vehicle, they suffer from vibrations, blurred scenes, and varying illumination of the captured scenes.
− The signs can be damaged, partly occluded or clustered, or found in other kinds of situations which make their detection more difficult.
− There are variations in the width of the sign borders and in the actual pictograms on the signs, in spite of the regulations.
This section presents a new method using self-organising maps for the detection and recognition of road signs. The RSR application is implemented and tested using the HPS presented in Sect. 4, and is used as a demonstrator of the hybrid system. The detection of traffic signs from outdoor images is the most complex step in a RSR system. There are many ways in which the characteristics of road signs (RS) (e.g. well established shapes and colours of the signs) could be exploited. However, the necessity to analyse the images in real-time is a limiting factor as to how much information available within the image should be extracted and analysed. Most approaches to the problem of RS detection use either colour information or shape information. A review of various approaches can be found in [33] and [37]. In most applications, once a region of interest (ROI) from the image has been detected, i.e. an area of the image that might contain a traffic sign, this ROI is passed to the recognition module in charge of identifying the sign. Most of the RSs contain a pictogram, a string of characters, or both. The recognition of RSs is therefore very often implemented with ANN [15,22,36], because of their pattern recognition abilities. Nevertheless, other approaches such as template matching [30,47] or Laplace kernel classifier [45,46] have also been successfully used in the recognition step. The approach chosen for the RSR application developed in this paper uses SOM in both the detection and recognition. In each task, a vector characterizing the ROI is extracted. The classification of the vector by the SOM will determine whether it corresponds to a potential RS or not, at detection level, and what kind of RS it is, at classification level. More details of the implementation of this application can be found in [50]. 5.1 The HPS in the RSR application The RSR application has been chosen as a demonstrator of the HPS presented in Sect. 4. 
RSR is an ideal problem for testing the use of the HPS in the development of embedded
machine vision applications. The real-time recognition of RSs requires a complex interaction between the FPGA and the ANN while working under strict speed constraints. In exploring the design space for this application, it is necessary to take into account the requirements of both the preprocessing and the learning/classification. In this case, by step 7 of the life-cycle, it is clear that the preprocessing will take up a considerable FPGA area. It is the object detection and extraction, and construction of the feature vector, which is the processing bottleneck, and therefore requires as much parallel hardware as possible. This rules out implementing both the preprocessor and the SOM classifier in the FPGA (Sects. 3.1 and 3.2). Moreover, the application also requires dynamic adaptation and on-line learning, and the reconfiguration time of the FPGA (O(10^2) ms) would render the method of Sect. 3.2 infeasible.
The other consideration is the nature of the connection between the FPGA and the ANN. In the RSR application, the preprocessing produces a feature vector for each ROI in the image [50]. The VindAX processor requires a feature vector of 16 bytes, so a typical source image would result in only O(10^2) bits to be sent. There is even less data (O(10) bits) in the reverse direction (from ANN to FPGA). The communication time would be dominated by the latency of the RS232 or ethernet (O(1), O(10^-1) ms respectively), and in either case would be insignificant compared with the FPGA preprocessing time. As mentioned in Sect. 4.3, an RS232 connection was chosen for pragmatic reasons.
Figure 5 shows the HPS supporting the development of the RSR application, in the final test of the system corresponding to step 8 of the life cycle of an embedded application for the HPS proposed in Sect. 4.3. In Fig. 5, the image to be analysed corresponds to a frame of a road scene captured by a camera connected to the RC200.
From this image, the RC200 analyses the different ROIs that could potentially be a RS and generates vectors of characteristics describing them. Each vector is sent to the PC, and then gets transferred to the VDS connected to a PCI port of the PC. The VindAX processor analyses each vector and returns the class that it belongs to. The PC transfers the information back to the RC200, and if the analysis seems to indicate that the object is a potential RS, then the RC200 extracts more information and sends it to the VindAX processor again, in order to determine exactly which kind of RS it is. Given the large number of RSs in some classes, the training of an ANN with all the RSs of a class at the same time is nearly impossible. A typical solution is to group the RSs into subclasses, according to the similarity between their pictograms, and to assign a neural map to each of the subclasses. In this way, the process of transferring vectors to the VindAX processor and receiving their class back is done a few times, as the RS is further analysed. Once the exact RS has been identified, the original frame captured by the camera is displayed in the Video Output, with the RS that has been detected marked appropriately.
The Partitioning of the SOM. One of the features of the VindAX processor is its capacity to partition the neural net into various independent maps working in parallel. This feature can be used to great advantage in the RSR application. Dividing the VindAX processor into four 8 × 8 networks gives it the capacity to analyse each vector with four sub-networks at the same time. Each of the sub-networks could be trained with four or five different RSs, giving the processor the capacity to examine up to 20 individual RSs for classification at the same time.
Dynamic adaptation. Another very important feature of the hybrid system is its ability to adapt the neuron values of the SOM map while processing vectors for classification. In order to best exploit this feature, the RSR system has been implemented in such a way that it is capable of adapting its SOM maps without having to stop its normal execution. By using the dynamic adaptation of the neural array at specific times, the system is able to adapt itself to small changes in the RSs being analysed. The objective here is to keep the system learning from experience and evolving through time so as to always obtain the most accurate classification possible, even when changes in the appearance of RSs take place. These changes could be observed, for example, when the border between two countries is crossed. For this reason, this very same example is recreated in the experiments, where the system is trained with images from Spain and the Czech Republic and tested with images from the United Kingdom.
5.2 Results This section presents the results of the experiments. These experiments study the performance of the RSR application running on the HPS. The two most important points being tested by the experiments are the accuracy with which RSs are detected and classified, and the processing speed of the system.
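The partitioning and dynamic adaptation described in Sect. 5.1 can be illustrated with a small software model. The sketch below is a generic SOM written for illustration only: it is not the VindAX implementation, and the Manhattan distance, the learning rate, the neighbourhood radius and all names are assumptions. Only the four 8 × 8 maps and the 16-element feature vector are taken from the text.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two vectors (assumed distance measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


class SomMap:
    """A rows x cols map of reference vectors, stored row by row."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        self.weights = [
            [rng.randint(0, 255) for _ in range(dim)]
            for _ in range(rows * cols)
        ]

    def winner(self, vector):
        """Return (neuron index, distance) of the best-matching neuron."""
        distances = [manhattan(w, vector) for w in self.weights]
        best = min(range(len(distances)), key=distances.__getitem__)
        return best, distances[best]

    def adapt(self, vector, rate=0.25, radius=1):
        """Move the winner and its grid neighbours towards the vector."""
        best, _ = self.winner(vector)
        best_row, best_col = divmod(best, self.cols)
        for index, weights in enumerate(self.weights):
            row, col = divmod(index, self.cols)
            if max(abs(row - best_row), abs(col - best_col)) <= radius:
                for i, value in enumerate(vector):
                    weights[i] += rate * (value - weights[i])
        return best


def classify_partitioned(maps, vector):
    """Present the vector to every sub-map; return (map index, neuron index)
    of the closest reference vector over all sub-maps."""
    best = None
    for map_index, som in enumerate(maps):
        neuron, distance = som.winner(vector)
        if best is None or distance < best[0]:
            best = (distance, map_index, neuron)
    return best[1], best[2]


if __name__ == "__main__":
    maps = [SomMap(8, 8, 16, seed=k) for k in range(4)]  # four 8 x 8 maps
    sample = [100] * 16                                  # hypothetical vector
    for _ in range(20):
        maps[0].adapt(sample)        # adaptation interleaved with normal use
    print(classify_partitioned(maps, sample))
```

In this model, classification and adaptation are separate calls on the same data structure, which mirrors the idea that the maps can be adapted at chosen times without stopping the classification of incoming vectors.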
5.2.1 Detection and classification accuracy With appropriate lighting, the detection algorithm has shown great reliability in detecting RSs of the four main classes (stop, give way, warning and prohibition signs). In general, the RSs were perfectly detected and identified under variations of scale (as small as 25 pixels wide), rotation (up to 20°) and occlusion (up to 15–20% of vertical and 5–10% of horizontal occlusion, depending on the area being occluded) of the signs. It has to be said that all those results were obtained after the network was adapted to work with the British standards of RSs. As was mentioned in the previous section, the RSs used in the training of the SOM and in the experiments are slightly different.
Using the initial training of the maps, all RSs except cross-roads and traffic-light ahead were perfectly recognised by the system. In the case of these two, the SOM seemed to confuse them when the RSs were shown with the slightest rotation of the pictograms. To help the system to differentiate the signs, the system was asked to adapt the neural array values for about 20–40 frames for each sign. Through this procedure, the system learned the two classes and proceeded to classify them correctly, as well as the other signs, under the same kind of rotations tested on them.
5.2.2 Speed of the system Initially, the main concern of the experiment was that the HPS might not be able to achieve a high processing speed. After all, the HPS gathers a great amount of statistical data, communicates through a serial port, and part of the system runs under Windows, which is not a real-time operating system. In spite of all these limiting factors, the application ran at a surprising speed of 19–21 frames per second (fps), which can be considered as almost real-time (assumed to be approximately 25 fps). Furthermore, some experiments revealed that simply disabling the logging and displaying of the data gathered by the interface program running on the PC, without any further modification of the system, was enough to boost the application speed to the desired 24–25 fps. Considering the aforementioned speed limitations of the HPS, an implementation of the RSR application on a suitable architecture should be able to perform a much more complex pre-processing of the images, as well as incorporating the classification of more RSs, and all this without compromising the performance of the application.
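The frame rates quoted above can be related to the time available per frame with some simple arithmetic. The sketch below converts a frame rate into a frame period and estimates the ideal serial transfer time of one 16-byte feature vector at 115,200 bit/s. The 10 bits per byte (8N1 framing), the number of vectors per frame and all function names are assumptions of this sketch; latency and operating system delays are ignored.

```python
def frame_period_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def serial_transfer_ms(payload_bytes, baud=115_200, bits_per_byte=10):
    """Ideal serial transfer time; 10 bits per byte assumes 8N1 framing."""
    return payload_bytes * bits_per_byte / baud * 1000.0


def link_share(fps, vectors_per_frame, vector_bytes=16):
    """Fraction of the frame period spent sending feature vectors."""
    busy_ms = vectors_per_frame * serial_transfer_ms(vector_bytes)
    return busy_ms / frame_period_ms(fps)


if __name__ == "__main__":
    for fps in (19, 21, 25):
        print(f"{fps} fps -> {frame_period_ms(fps):.1f} ms per frame")
    print(f"one 16-byte vector: {serial_transfer_ms(16):.2f} ms")
    # Hypothetical case: a single feature vector sent per frame at 25 fps.
    print(f"share of frame period: {link_share(25, 1):.1%}")
```

Under these assumptions a single vector takes about 1.4 ms on the serial link, a small part of the 40 ms frame period at 25 fps, which is in line with the observation that the connection was not the limiting factor in the experiments.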
6 Conclusions

In the past few years, machine vision applications have become more demanding on the hardware. These applications often require the integration of traditional IP with AI techniques, implemented in designs with a small footprint and working at high speeds. The main aim of this paper was to present a system able to meet the requirements of embedded vision applications, and to provide the necessary tools for the development and testing of such applications on this system. The system introduced in this paper combines programmable hardware for the IP tasks and a digital implementation of an artificial neural network (ANN) for the pattern recognition and classification tasks. The different architectures that this hybrid design might adopt
have been compared in Sect. 3, where the advantages and disadvantages of each architecture have been analysed in detail. A prototype of the hybrid system, called the HPS, has been created to aid in the development and testing of new embedded vision applications before the final hardware of these applications is produced. The general methodology for the development of applications for the HPS has been outlined, and the capacities and limitations of this new prototype have been discussed.

The demonstrator application developed in this project, described in Sect. 5, is an RSR application. Its aim was the real-time detection and classification of RSs in road scene images provided by a camera. This application was chosen because of its high-speed requirements, and because of the complex interaction involved between the FPGA and the neural array. The RSR application has shown great accuracy in the detection and recognition of RSs under changes in position, scale, rotation and partial occlusion. Further, the system has demonstrated the ability to recognise non-standardised RSs and signs from different countries. The system was able to produce results in real time, in spite of the limitations of its laboratory implementation.

Finally, the hybrid prototype system has demonstrated its worth in the development of the RSR application. A number of preprocessing algorithms were tried in order to optimise the feature extraction. The ability to use components from the IP library in various combinations facilitated this experimentation, and permitted a rapid estimation of FPGA area usage. In this application, it quickly became apparent that the preprocessing was the potential bottleneck, and the IP algorithms had to be implemented using the parallelism of Handel-C and the FPGA. This in turn meant that the FPGA being used (XC2V1000) would not have space for a SOM implementation in addition to the IP.
The resulting need to use the ANN processor for learning and classification meant that the HPS could be used to explore various SOM partitionings (permitted by the VindAX system) to optimise the recognition process. Another feature of the HPS tested by the RSR application was the ability to adapt the neural array parameters at execution time. The results were more than satisfactory, illustrating the capacity of the system to learn from new examples without having to be re-trained in the laboratory. The flexible logging facilities of the HPS were used to good effect: full logging of feature vectors and other run-time information was used during development, and then reduced or switched off when it was desired to emulate the final system more closely. In all these ways, the HPS allowed the design space of the RSR system to be explored intelligently and optimised before committing to hardware. In conclusion, this demonstrator application has shown the great advantage of the HPS in aiding the design
and implementation of embedded vision applications for the hybrid system.

Future research directions. One future direction would be to extend the RSR demonstrator into a full commercial application. Such an application would not be restricted to intelligent vehicles; it could also be of great use for robot navigation. Another clear area for continuing this work is the development of other embedded vision applications using the HPS. Areas that could make immediate use of the HPS include navigation, security and surveillance applications (object tracking, luggage inspection), and medical applications (analysis of brain scans, medical X-rays, electron microscope images). This list is by no means complete, and it is worth remarking that these are not only research areas, but also areas where useful commercial applications could be developed. The HPS has been designed to aid in the implementation and testing of an application for the hybrid system, not to be used as a final embedded solution. The development of a hardware implementation of the hybrid system that could be used as the final product would clearly be another area of research following this work. Of the architectures discussed in Sect. 3, the most useful designs to develop would be the FPGA standalone (see Sect. 3.1) and the FPGA and VindAX processor (see Sect. 3.3). More specifically, the first design would incorporate both modules of the hybrid system in one standalone FPGA, and the second would place the FPGA on one chip and the VindAX processor on another, connected together on one board.

Acknowledgments The authors gratefully acknowledge the support of AXEON Limited for this work.
References

1. Arnold, J., Buell, D.A., Davis, E.G.: Splash-2. In: ACM Symposium on Parallel Algorithms and Architectures (ACM’92), pp. 316–324 (1992)
2. AXEON Ltd: URL http://www.axeon.com
3. Batista, L.B., Gomes, H.M., Herbster, R.F.: Application of growing hierarchical self-organizing map in handwritten digit recognition. In: Proceedings of the Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI’03) (2003)
4. Batlle, J., Martí, J., Ridao, P., Amat, J.: A new FPGA/DSP-based parallel architecture for real-time image processing. Real-Time Imaging 8, 345–356 (2002)
5. Benkrid, K., Crookes, D., Smith, J., Benkrid, A.: High level programming for FPGA based image and video processing using hardware skeletons. In: Proceedings of the IEEE Symposium Field-Programmable Custom Computing Machines (FCCM’01) (2001)
6. Bode, M., Freyd, O., Fischer, J., Niedernostheide, E.J., Schulze, H.J.: Hybrid hardware for a highly parallel search in the context of learning classifiers. Artif. Intell. 130, 75–84 (2001)
7. Bouridane, A., Crookes, D., Donachy, P., Alotaibi, K., Benkrid, K.: A high level FPGA-based abstract machine for image processing. J. Syst. Archit. 45, 809–824 (1999)
8. Brown, D., Craw, I., Lewthwaite, J.: A SOM based approach to skin detection with application in real time systems. In: Proceedings of the British Machine Vision Conference (BMVC’01) (2001)
9. Campbell, N.W., Thomas, B.T., Troscianko, T.: Automatic segmentation and classification of outdoor images using neural networks. Neural Syst. 8(1), 137–144 (1997)
10. Celoxica: Handel-C Language reference manual, Celoxica (2004)
11. Celoxica Ltd: URL http://www.celoxica.com
12. Crookes, D.: Architectures for high performance image processing: the future. J. Syst. Archit. 45, 739–748 (1999)
13. Dias, F.M., Antunes, A., Mota, A.M.: Artificial neural networks: a review of commercial hardware. Eng. App. Artif. Intell. 17, 945–952 (2004)
14. Diniz, P., Hall, M., Park, J., So, B., Ziegler, H.: Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis system. Microprocess. Microsyst. 29, 51–62 (2005)
15. Douville, P.: Real-time classification of traffic signs. Real-Time Imaging 6, 185–193 (2000)
16. Draper, B., Najjar, W., Bohm, W., Hammers, J., Rinker, B., Ross, C., Chawathe, M., Bins, J.: Compiling and optimizing image processing algorithms for FPGA’s. In: Proceedings of the IEEE Computer Architectures for Machine Perception (CAMP’00) (2000)
17. Drayer, T.H., IV, W.E.K., Tront, J.G., Conners, R.W.: A modular and reprogrammable real-time processing hardware. In: Proceedings of the IEEE FPGA-Based Custom Computing Machines (FCCM’95) (1995)
18. Dunn, P.A., Kearney, P.D., Jensen, M.J., Davey, P.I.: A modular configurable logic processor system. CSIRO Manufacturing Science and Technology (2002)
19. Egmont-Petersen, M., de Ridder, D., Handels, H.: Image processing with neural networks - a review. Pattern Recognit. 35(10), 2279–2301 (2002)
20. Eide, A., Lindblad, T., Lindsey, C.S., Minerskjold, M., Sekhniaidze, G., Székely, G.: An implementation of the Zero Instruction Set Computer (ZISC036) on a PC/ISA-bus card. In: Proceedings of the Workshop on Neural Networks (WNN/FNN’94), pp. 319–330 (1994)
21. Eppler, W., Fischer, T., Gemmeke, H., Chilingarian, A., Vardanyan, A.: Neural chip SAND in online data processing of extensive air showers. Comput. Phys. Commun. 126, 63–66 (2000)
22. Estable, S., Schick, J., Stein, F., Janssen, R., Ott, R., Ritter, W.: Real-time traffic sign recognition system. In: Proceedings of Intelligent Vehicles’94 Symposium (1994)
23. Fiesler, E., Duong, T., Trunov, A.: Design of neural network-based microchip for color segmentation. In: Proceedings of SPIE, vol. 4055 (2000)
24. Fu, L.: Neural Networks in Computer Intelligence. McGraw-Hill, Inc, New York (1994)
25. General Vision: URL http://www.general-vision.com
26. Greenbaum, J., Baxter, M.: Increased FPGA capacity enables scalable, flexible CCMs: An example from image processing. In: Proceedings of the IEEE FPGA-Based Custom Computing Machines (FCCM’97) (1997)
27. Hamid, G.: An FPGA-based coprocessor for image processing. IEE Colloquium Integrated Imaging Sensors and Processing, pp. 6/1–6/4 (1994)
28. Heemskerk, J.N.H.: Overview of neural hardware. Neurocomputers for brain-style processing: Design, implementation and application. Ph.D. thesis, Unit of Experimental and Theoretical Psychology, Leiden University, Leiden (1995)
29. Hendry, D.C., Duncan, A.A., Lightowler, N.: IP core implementation of a self-organizing neural network. IEEE Trans. Neural Netw. 14(5), 1085–1096 (2003)
30. Hsu, S.H., Huang, C.L.: Road sign detection and recognition using matching pursuit method. Image Vis. Comput. 19, 119–129 (2001)
31. Ienne, P., Kuhn, G.: Digital systems for neural networks. SPIE Opt. Eng. Crit. Rev. Ser. CR57, 314–345 (1995)
32. Ienne, P., Viredaz, M.A.: Implementation of Kohonen’s self-organizing maps on MANTRA I. In: Proceedings of the International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pp. 273–279 (1994)
33. Johansson, B.: Road sign recognition from a moving vehicle. citeseer.ist.psu.edu/570294.html
34. Kessal, L., Abel, N., Demigny, D.: Real-time image processing with dynamically reconfigurable architecture. Real-Time Imaging 9, 297–313 (2003)
35. Kohonen, T.: Analysis of a simple self-organizing process. Biol. Cybern. 44(2), 135–140 (1982)
36. Krumbiegel, D., Kraiss, K.F., Schrieber, S.: A connectionist traffic sign recognition system for onboard driver information. In: 5th IFAC/IFIP/IFORS/IEA Symposium on Analysis, Design and Evaluation of Man–Machine Systems, pp. 201–206 (1992)
37. Lalonde, M., Li, Y.: Road sign recognition, survey of the state of the art. Tech. rep., Centre de recherche informatique de Montreal CRIM/IIT (1995)
38. Le, D.X., Thoma, G.R., Wechsler, H.: Document image analysis using integrated image and neural processing. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’95) (1995)
39. Liao, Y.: Neural networks in hardware: A survey. Tech. rep. bit.csc.lsu.edu/~jianhua/shiv2.pdf (2001)
40. Lightowler, N.: Modular maps: an implementation strategy for the self-organising map. Ph.D. thesis, University of Aberdeen, Aberdeen (1997)
41. Lightowler, N., Allen, A.R., Grant, H., Hendry, D.C., Spracklen, C.T.: The modular map. IJCNN (1999)
42. McBader, S., Lee, P.: An FPGA implementation of a flexible, parallel image processing architecture suitable for embedded vision systems. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS’03) (2003)
43. Moerland, P.D., Fiesler, E.: Hardware-friendly learning algorithms for neural networks: an overview. In: Proceedings of the International Conference on Microelectronics for Neural Networks and Fuzzy Systems (MicroNeuro’96) (1996)
44. Muthukumar, V., Rao, D.V.: Image processing algorithms on reconfigurable architecture using HandelC. In: Proceedings of the IEEE Euromicro Systems on Digital System Design (DSD’04), pp. 218–226 (2004)
45. Paclik, P.: The automatical classification of road signs. Master’s thesis, Faculty of transportation science, Czech Technical University, Prague (1998)
46. Paclik, P., Novovicova, J., Pudil, P., Somol, P.: Road sign classification using laplace kernel classifier. Pattern Recogn. Lett. 21, 1165–1173 (2000)
47. Piccioli, G., Micheli, E.D., Parodi, P., Campani, M.: Robust method for road sign detection and recognition. Image Vis. Comput. 14, 209–223 (1996)
48. Porrmann, M., Franzmeier, M., Kalte, H., Witkowski, U., Ruckert, U.: A reconfigurable SOM hardware accelerator. In: Proceedings of the European Symposium on Artificial Neural Networks ESANN’02, pp. 337–342 (2002)
49. Porrmann, M., Witkowski, U., Kalte, H., Ruckert, U.: Implementation of artificial neural networks. In: Proceedings of the Euromicro Workshop on Parallel, Distributed and Network-based Processing (EUROMICRO-PSP’02), pp. 337–342 (2002)
50. Prieto, M.S., Allen, A.R.: Using self-organizing maps in the detection and recognition of road signs. Image Vis. Comput. (2005, submitted)
51. Rueping, S., Goser, K., Rueckert, U.: A chip for self-organizing feature maps. In: Proceedings of the International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pp. 26–33 (1994)
52. Ruping, S., Porrmann, M., Ruckert, U.: SOM accelerator system. Neurocomputing 21, 31–50 (1998)
53. Salcic, Z., Sivaswamy, J.: Imeco: A reconfigurable FPGA-based image enhancement co-processor framework. Real-Time Imaging 5, 385–395 (1999)
54. Sarle, W.S.: Neural Network FAQ. URL ftp://ftp.sas.com/pub/neural/FAQ.html
55. Schoenauer, T., Jahnke, A., Roth, U., Klar, H.: Digital neurohardware: principles and perspectives. In: Proceedings of the Neuronal Networks in Applications (NN’98), pp. 101–106 (1998)
56. Seiffert, U.: Artificial neural networks on massively parallel computer hardware. In: Proceedings of the European Symposium on Artificial Neural Networks (ESANN’02), pp. 319–330 (2002)
57. Sheen, T.M.: Tools for portable parallel image processing. Ph.D. thesis, University of Aberdeen (1999)
58. Sheen, T.M., Allen, A.R., Lawrence, A.E., Page, I.: Hardware compilation technology for embedded image processing. In: High Performance Architectures for Real-Time Image Processing: IEE Colloquium Digest 1998/197, pp. 9/1–9/6 (1998)
59. Siegel, H.J., Armstrong, J.B., Watson, D.W.: Mapping computer vision-related tasks onto reconfigurable parallel-processing systems. IEEE Comput. 25(2), 54–63 (1992)
60. Tanougast, C., Berviller, Y., Brunet, P., Weber, S., Rabah, H.: Temporal partitioning methodology optimizing FPGA resources for dynamically reconfigurable embedded real-time system. Microprocess. Microsyst. 27, 115–130 (2003)
61. Viredaz, M.A.: MANTRA I: an SIMD processor array for neural computation. In: Proceedings of the Euro-ARCH’93 Conference, pp. 99–110 (1993)