J Real-Time Image Proc DOI 10.1007/s11554-014-0469-z
ORIGINAL RESEARCH PAPER
GPU-based segmentation of retinal blood vessels

Francisco Argüello · David L. Vilariño · Dora B. Heras · Alejandro Nieto
Received: 16 May 2014 / Accepted: 24 October 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract In this paper, a fast and accurate technique for retinal vessel tree extraction is proposed. It consists of a hybrid strategy based on global image filtering and contour tracing. With the aim of increasing the computation speed, the algorithm has been tailored for efficient execution on commodity graphics processing units, achieving low execution times and high speedups over the CPU execution. The performance of the proposed method was tested on publicly available databases, STARE and DRIVE, using standard measures such as accuracy, sensitivity and specificity. Results reveal an average accuracy comparable to that reported for state-of-the-art techniques. Our method performs the vascular tree segmentation of the images in the DRIVE and STARE databases in an average of 14 ms and 18 ms, respectively. To the best of our knowledge, the proposal features the highest accuracy/performance rate in the retinal blood vessel extraction domain.

Keywords Retinal imaging · Vessel segmentation · Medical image processing · Graphics processing units
F. Argüello · D. L. Vilariño · D. B. Heras (✉) · A. Nieto
CITIUS, Campus Vida, Universidad de Santiago de Compostela, 15782 Santiago de Compostela, La Coruña, Spain
e-mail: [email protected]
F. Argüello e-mail: [email protected]
D. L. Vilariño e-mail: [email protected]
A. Nieto e-mail: [email protected]
1 Introduction

Retinal blood vessel features play an important role in the early detection of serious pathologies such as cardiovascular disease or diabetic retinopathy. Vessel tree characteristics such as vessel narrowing, tortuosity, diameter or branching angles represent a valuable aid for diagnostic purposes or for monitoring the progression of diseases. In addition, the vessel location can serve as a reference for finding other useful structures present in the ocular fundus, such as the optical disk or the fovea, as well as for reducing the number of false positives in microaneurysm or exudate detection. Therefore, the development of automatic methods for accurate retinal vessel segmentation is crucial, as this is the first step before subsequent measurement and analysis stages. On the other hand, fast systems are also desirable to compute or register high volumes of patient images obtained at different times, alleviating the workload of experts during routine analysis and supervision.

To date, many methods for retinal vessel segmentation based on different strategies have been proposed. Most of these methods are combinations of image filtering, vessel tracing and classification techniques. A comprehensive review of these techniques is reported in [7]. The image filtering methods attempt to exploit some of the following properties: the cross section of a vessel usually presents a Gaussian shape, the blood vessel curvature varies smoothly and the diameter of the vessels decreases with the distance to the optical disk. Among these techniques, one of the most widely used is matched filtering [5]. In this technique, the retinal images are convolved with a set of kernels, and the maximum output is retained for each pixel. The set of kernels is obtained by rotating a Gaussian-shaped filter to cover different vessel orientations. In the original approach, the reference kernel
is optimized for vessels with a piecewise shape and constant width. As a consequence, the filtering may fail to detect vessels with a different profile, such as thin vessels. Furthermore, background pixels located close to the optical disk or the boundaries of the field of view are frequently labeled as vessels. A number of methods have been proposed to improve the accuracy through extensions of the original approach, including the derivative of Gaussian filters [28], adaptive thresholds [9] or multi-stage strategies [2]. However, these methods do not work well on retinal images with pathologies, so they are usually combined with other image processing techniques. Related to these methods are those based on morphological operations, as they try to exploit similar vessel features [27].

The vessel tracing approaches take advantage of the network topology of the retinal vessel tree [11, 13, 22]. These methods usually start from initial seed points situated inside the vessels. The neighboring pixels are then labeled as vessel/background based on local information. This propagation along the vessels leads to a low rate of false positives, as only the surroundings of vessel locations are processed instead of the whole retinal image. However, the complexity of the intensity profile at crossing and bifurcation points may lead to complete sub-trees being missed. These methods include pre-processing stages for automatic selection of the initial contours. Related to these techniques are those based on deformable models [1, 4]. Other approaches, such as that reported in [14], use morphological operations to fill the blood vessels.

The pattern recognition and classification techniques associate a feature vector to the image pixels and label them as vessel or non-vessel based on local or global information. Usually, the classifier is trained by supervised learning from ground truth data [12, 20, 21, 26].
These methods usually provide higher accuracy than the previously mentioned strategies. However, they are also dependent on the training dataset, so their performance decreases on pathological retinal images that differ from the datasets used in the learning stages. Therefore, these techniques often require re-training to optimize their performance on new datasets.

To integrate the processing of ocular fundus images into computer-aided diagnostic systems, fast and reliable segmentation algorithms are also demanded. With this objective, a parallel implementation of a multiscale vessel segmentation algorithm is proposed in [19]. Another strategy for fast vessel segmentation is reported in [3]. It consists of a set of active contours that fit the external boundaries of the vessels and supports the automatic initialization of the contours, avoiding human interaction throughout the process. This algorithm was designed specifically for execution on fine-grained SIMD
architectures to improve the computation time. Under these conditions, the full computation of a retinal image from the DRIVE database (768 × 584 pixels) on a parallel cellular processor array architecture, in particular the SCAMP-3 vision system, is possible in less than 0.2 s. However, limitations imposed by the host-device technology (integration density, data accuracy and memory volatility) seriously affect the reliability. On the other hand, the low execution time is tied to specific hardware, which makes this solution less accessible and flexible than PC-based solutions.

Graphics processing units (GPUs) are nowadays multithreaded, multicore processors with great computational power that are widely available and match the SIMD computing model. The Compute Unified Device Architecture (CUDA) developed by NVIDIA [17], based on a data-parallel programming model, provides support for general-purpose computing on graphics hardware [18]. A number of GPU implementations of algorithms for vessel extraction can be found in the literature [10, 15, 24]. A GPU 3D algorithm for segmenting vasculature in images obtained by labeling laminae is presented in [15]. The whole four-step segmentation process is developed in CUDA, and the algorithm is applied to images from rat brain cortex. Torres and Taborda [24] present an optic disk detection and segmentation approach based on an evolutionary strategy. In this case, only two functions are computed on the GPU. To date, [10] is the only published retinal vessel segmentation algorithm that is fully developed on GPU. The execution times on an NVIDIA GeForce GTX680 reveal a speedup of 60× with regard to the equivalent sequential implementation.

The aim of this work is to provide an algorithm for reliable blood vessel segmentation suitable for real-time operation on GPU architectures. For this purpose, a vessel tracing algorithm based on the technique reported in [3] has been tailored to be efficiently executed on GPUs.
Our proposal is more accurate and, also on CPU, less costly than the original technique. The proposal has been tested on the DRIVE [22] and STARE [9] databases. These databases are extensively used by the research community, as they provide manual segmentations for performance evaluation. The comparison of the results with those of state-of-the-art methodologies confirms that our proposal outperforms them when accuracy and time performance are considered together.

The rest of the paper is organized as follows: in Sect. 2, the algorithm for vessel segmentation is described. In Sect. 3, the GPU implementation is detailed. The results of the tests on the images from the DRIVE and STARE databases are presented in Sect. 4, together with a comparison with other techniques. Finally, in Sect. 5 the main conclusions are presented.
2 Algorithm description

The algorithm presented herein is a combination of image filtering and contour tracing methodologies. First, a fast pre-estimation of the retinal vessel tree is performed by linear filtering and morphological operations. The resulting data are then used to generate initial contours and guiding information for a set of active contours which fit the external boundaries of the vessels. An outline of the proposed retinal vessel tree extraction algorithm is shown in Fig. 1. The algorithm consists mainly of five stages: a fast pre-estimation of the retinal vessel tree and four stages that carry out the contour tracing based on active contours, including a final post-processing. These four stages are: an initial contour estimation based on the results of the previous stage, an external potential estimation, an active contour evolution and, finally, a small component removal. Below, we describe the main parts of the algorithm.

Fig. 1 Stages of the retinal vessel tree extraction algorithm

2.1 Vessel pre-estimation

This stage works on the green channel of the retinal image, which provides the highest contrast between vessels and background (Fig. 2a). In this monochrome image, the retinal vessels appear darker than the neighboring background. However, there is actually a gradual intensity variation in the image from the periphery of the retina toward the central macular region. This variation can affect the vessel segmentation process, as the intensity of some background pixels may be comparable to or even higher than that of the brighter vessel pixels in the image. Unlike [3], a morphological top-hat transformation is carried out on an inverted version of the monochrome image with the aim of reducing the influence of the intensity variation:

I_TH = I + (I^c ∘ SE),    (1)

where I^c represents the complement of the input image, ∘ denotes the morphological opening and SE is the structuring element. This operation is also very useful for reducing noise close to bright structures such as the optic disk [8, 12]. The vessel pre-estimate image is obtained by an adaptive segmentation of the output of the previous step. To this end, an estimation of the background is subtracted from the image:

I_pre = I_TH − I_B + t,    (2)

where I_B is a background image obtained by heavily diffusing the input image by means of mean filtering, and t is a threshold that has been obtained heuristically and kept constant for all the processed images. The result is then binarized to produce the vessel pre-estimate image (Fig. 2b). Finally, the isolated groups of pixels are removed.

2.2 Contour tracing
Tracing the external boundaries of the vessels with active contours offers a number of advantages: (1) the initial contours are easily generated, since most of the pixels in a retinal image are background (e.g., in the images of the DRIVE database, only around 12 % of the pixels belong to vessels); (2) the local curvature of the active contours is small (which simplifies the control of the contour smoothing); (3) there is no lower limit for the width of the structures to be traced, as the active contours do not flow through the vessels; (4) the central reflex is not a relevant problem, since the regions inside the vessels are ignored.

The pre-estimated vessels from the previous stage are used to generate the set of initial contours. Since the active contours are intended to fit the external boundaries of the vessels, the initial contours must be located completely outside of them. They are automatically obtained by a morphological erosion on the binary image from the vessel pre-estimation (Fig. 2c). Unlike the proposal in [3], the initial contours are situated close to the vessel boundaries. This modification produces important improvements in both accuracy and computation speed:

1. Much fewer iterations are needed to fit the active contours to the vessel boundaries, which increases the computation speed.
2. There is no need to enable topological transformations between active contours to absorb residual active contours due to noise in the active contour path. Therefore, two different contour propagation stages are no longer required.
3. Vessels that are close together can be distinguished more accurately, since the initial contours situated between them are not removed by the morphological erosion.
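As an illustration of this step, the following is a minimal CPU sketch in plain Python (ours, not the paper's CUDA code) of one plausible reading of the initial contour generation: the background of the binary pre-estimate is eroded with a 3 × 3 structuring element, so the resulting seed region, and hence its boundary, lies strictly outside the vessels. The function names and the exact convention for handling the image border are our assumptions.

```python
def erode(mask, k=1):
    """3x3 binary erosion, repeated k times (zero padding outside the image)."""
    h, w = len(mask), len(mask[0])
    for _ in range(k):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                out[y][x] = int(all(
                    0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
        mask = out
    return mask

def initial_contours(pre):
    """Initial contour seeds: erode the background of the binary vessel
    pre-estimate, so the seed region starts strictly outside the vessels."""
    bg = [[1 - v for v in row] for row in pre]
    return erode(bg)
```

On a toy image with a one-pixel-wide vertical vessel, the eroded background pulls away from both the vessel and the image border, leaving seeds on either side of the vessel.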
The active contours evolve from their initial locations (initial seeds) toward the boundaries of the vessels guided by three potential terms:
Fig. 2 Intermediate steps for retinal vessel extraction: (a) input image, (b) vessel pre-estimate, (c) initial contours, (d) guiding information
– The external potential, which provides information related to the vessel locations.
– The internal potential (extracted from the contours themselves), which attempts to keep the contour shape smooth.
– The balloon or inflating potential. Since the contours have to move outward, the balloon potential reinforces this feature.

The external potential integrates information from the retinal image under processing. To generate the external potential, a distance map to the edges in the vessel pre-estimate image is produced (Fig. 2d). In [3] the distance map is combined with the result of a Sobel filtering of the retinal image. We observe that a simpler scheme that uses only the pre-estimated image can produce better results in extracting information about the retinal vessel tree. This scheme consists of two terms:

extpot = −100 (ext1pot + ext2pot) / 2.    (3)

The first one is computed as

ext1pot = Ipre & (Iedge)^c,    (4)

where Iedge is the binary edge detection applied to the pre-estimated image and & denotes the logical AND operation. This term enhances the vessels in the pre-estimated image. The second term computes a distance map as

ext2pot = Σ_{i=1}^{n} F^(i) / i,    (5)

with F^(i) the ith morphological erosion of F^(0) = (ext1pot)^c. The number n in the sum limits the distance map to n pixels from the estimated locations of the vessels. The active contours are initialized within this range of distances. The term ext2pot is then updated by applying a diffusion filter. The resulting matrix is finally normalized by dividing by its maximum entry to obtain values between 0 and 1.

Initially, the active contours are mainly controlled by the balloon potential, which reinforces the contour evolution. As the active contours come closer to the vessel boundaries, the
Fig. 3 Vessel tree segmentation
external potential becomes dominant to fit the vessels. The internal potential keeps the contour shape smooth, in accordance with the small local curvature of the vessels. Topological transformations have to be prevented to avoid the merging of active contours coming from opposite sides of ill-defined vessels (weak external potential).

As in [3], we selected the active contour technique called pixel-level snakes (PLS) [25]. PLS consists of a set of active contours defined as the boundaries of regions which propagate pixel by pixel. This pixel-by-pixel evolution makes it possible to predict and, consequently, prevent possible collisions between different contours, providing control over the contour topology. This active contour technique was designed for massively parallel processing, which makes it suitable for GPU computation. A comprehensive discussion of the functioning and capabilities of this active contour technique can be found in [25].

Finally, a post-processing step is performed to remove small groups of pixels and the pixels outside the FOV and, therefore, to provide the geometry of the retinal vessel tree (Fig. 3).
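The construction of Eqs. (3)–(5) can be made concrete with a small CPU sketch in plain Python (ours, not the authors' CUDA code). Two simplifications are assumed: the binary edge detector is a crude 4-neighbour test rather than the edge operator of the paper, and the diffusion filter applied to ext2pot is omitted; only the structure of the two terms is illustrated.

```python
def erode(mask):
    """One 3x3 binary erosion (zero padding outside the image)."""
    h, w = len(mask), len(mask[0])
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(w)] for y in range(h)]

def external_potential(pre, n=5):
    """Sketch of Eqs. (3)-(5) on a binary vessel pre-estimate `pre`."""
    h, w = len(pre), len(pre[0])
    # crude binary edge map: a vessel pixel with a background 4-neighbour
    edge = [[int(pre[y][x] and any(
                 not (0 <= y + dy < h and 0 <= x + dx < w and pre[y + dy][x + dx])
                 for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))))
             for x in range(w)] for y in range(h)]
    # Eq. (4): ext1pot = Ipre & (Iedge)^c, the vessel interior
    ext1 = [[pre[y][x] & (1 - edge[y][x]) for x in range(w)] for y in range(h)]
    # Eq. (5): weighted accumulation of successive erosions of F(0) = (ext1pot)^c
    F = [[1 - v for v in row] for row in ext1]
    ext2 = [[0.0] * w for _ in range(h)]
    for i in range(1, n + 1):
        F = erode(F)
        for y in range(h):
            for x in range(w):
                ext2[y][x] += F[y][x] / i
    m = max(max(row) for row in ext2) or 1.0   # normalization to [0, 1]
    # Eq. (3): combine both terms (the diffusion of ext2pot is omitted here)
    return [[-100.0 * (ext1[y][x] + ext2[y][x] / m) / 2.0
             for x in range(w)] for y in range(h)]
```

On a toy image with a three-pixel-wide vessel, the potential is most negative on the vessel interior (where ext1pot is 1) and decays back to zero within n pixels of the vessel, which is exactly the range in which the initial contours are placed.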
3 GPU version of the retinal blood vessels detection algorithm The algorithm described has been adapted to achieve efficient execution on GPUs with CUDA architecture in such a
way that the computation speed is increased 60× with respect to the sequential CPU implementation. The accuracy of the vessel segmentation is not only maintained, but even slightly increased. A brief introduction to GPU programming is given below, followed by a detailed description of the GPU implementation of our algorithm.

3.1 CUDA GPU programming fundamentals

GPUs can be considered data-parallel programming architectures consisting of several many-core processors capable of running thousands of threads concurrently. This is also true for commodity GPUs, which are more affordable in terms of cost than multi-GPU computing platforms. NVIDIA's CUDA architecture [17] consists of a large number of cores grouped into a set of streaming multiprocessors (SMXs) with a very high memory bandwidth. The number of cores per SMX depends on the device architecture. CUDA is a hardware/software platform that enables NVIDIA GPUs to execute programs invoking parallel functions called kernels, which execute across many parallel threads [17]. These threads are organized into blocks so that each thread executes an instance of the kernel following a Single Instruction Multiple Data (SIMD) programming model. The blocks are arranged in a grid that is mapped to the hierarchy of CUDA cores in the GPU. Coordination between threads within a kernel is achieved through synchronization barriers; however, as thread blocks run independently from all others, their scope is limited to the threads within the block. The two main memory spaces are the global memory (DRAM, usually known as device memory) and an on-chip shared memory space available per thread block. The latter is an extremely fast read/write memory, but its contents only persist for the lifetime of the block, which prevents data sharing among thread blocks through shared memory.
3.2 GPU-based retinal vessel extraction

In this section, the GPU implementation of the phases involved in the retinal vessel tree extraction algorithm is described. The designed algorithm can be efficiently computed on a CUDA-capable GPU, as its structure in different sequential and independent steps allows their computation in parallel. This way, each step can be computed by independent blocks of pixels. Regarding the data neighborhood required for the computation of each pixel, three different situations have been identified, as shown in Fig. 4. First, there are steps where the computation per pixel does not require the values of the neighbors (Fig. 4a). There are also steps where a pixel requires the pixels in its neighborhood, so an apron of size one is required for each block of pixels (Fig. 4b) and, finally, steps where a larger neighborhood is required. In this last situation, the data loads are optimized using a sliding window (Fig. 4c) to reuse data, thus reducing the data loads, as will be explained below. The performance optimization strategies applied to the code are the following:

1. Organize the algorithm into computational blocks that can be executed independently, thus minimizing communications among them. This strategy is applied throughout the algorithm using 16 × 16 thread blocks.

2. Improve the efficiency in the use of the memory hierarchy. It is essential to perform the maximum number of computations on data already stored in shared memory, thus minimizing the data transfer between global and shared memory.

3. Group the computations to maximize the workload per thread and reduce the data transfer among the levels of the memory hierarchy.

4. Overlap data regions. Using overlapping of data regions, we avoid leaving the threads in the border of a block inactive when there are data dependencies with
Fig. 4 Different cases regarding data loaded and computations performed by each thread block: (a) block of pixels, (b) block of pixels with apron, (c) sliding window
the neighbors. One example of this situation is shown in Fig. 4c. One thread block loads a data region double the size of the block. In this example, the dependencies in the computations are at a distance of 3 in the horizontal direction. In the upper part of the figure, we see the situation for one thread block (data regions 1 and 2). Data in white are updated by the block and data in gray are read, but not updated, by the block. Just below, we see the situation for the same thread block later, when the block needs to load data regions 2 and 3. Data region 2 is reused, as it has previously been loaded. As a consequence of this scheme, to compute all the pixels of one strip of the image, the data regions loaded by one block are later partially reused by the same block. Data regions are overlapped in a portion that depends on the distances of the data dependencies. All the threads in the block perform computations, avoiding the situation of inactive threads produced by the scheme shown in Fig. 4b.

5. Optimize the data load when data regions are bigger than the thread block size. The process of loading data into the shared memory space must be general enough to be easily extended to any possible configuration of data region and overlap region width.

6. Avoid global synchronizations, which would imply the use of global memory. For example, the small component removal stage is an iterative process. Instead of performing each stage synchronously, the process is divided into synchronous intra-block updates (iterations performed by each thread block without communication with the others) followed by asynchronous inter-block updates (during which all data are copied to global memory).
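The apron scheme of Fig. 4b can be illustrated with a small CPU simulation in plain Python (ours, not CUDA): each "thread block" first stages its tile plus a one-pixel apron into a local buffer, mimicking a shared memory load, and then computes only its own tile's outputs from that buffer. The tile size and the 3 × 3 mean filter are toy choices, not the 16 × 16 blocks and filters of the paper; the point is that the tiled result matches a whole-image reference exactly.

```python
def mean3x3(img):
    """Reference: 3x3 box mean over the whole image (zero padding)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = sum(img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if 0 <= y + dy < h and 0 <= x + dx < w)
            out[y][x] = s / 9.0
    return out

def mean3x3_tiled(img, tile=4):
    """Same filter, computed tile by tile: each 'thread block' stages its
    tile plus a 1-pixel apron into a local buffer (the 'shared memory'),
    then computes only its own tile's outputs from that buffer."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            # stage tile + apron, zero-padded outside the image
            local = [[img[y][x] if 0 <= y < h and 0 <= x < w else 0.0
                      for x in range(tx - 1, tx + tile + 1)]
                     for y in range(ty - 1, ty + tile + 1)]
            for y in range(ty, min(ty + tile, h)):
                for x in range(tx, min(tx + tile, w)):
                    ly, lx = y - ty + 1, x - tx + 1
                    s = sum(local[ly + dy][lx + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
                    out[y][x] = s / 9.0
    return out
```

The same idea extends to the sliding window of Fig. 4c, where consecutive loads of one block overlap so that previously staged regions are reused.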
As explained in Sect. 2 (see Fig. 1), the algorithm consists mainly of four stages: a fast pre-estimation of the retinal vessel tree, an initial contour estimation based on the results of the previous stage, an external potential estimation and, finally, an active contour evolution. A pseudocode of the GPU implementation of the retinal vessel tree extraction algorithm is shown in Fig. 5, whereas the active contour evolution is detailed in Fig. 6. The pseudocodes show both the host and device codes, detailing the steps of the different stages. The kernels executed on the GPU are placed between < > symbols. The pseudocode also includes the GM and SM acronyms to indicate kernels executed in global memory and shared memory, respectively. The data types considered in the implementation are float or char, depending on the steps. The thread block size considered is 16 × 16, where each thread processes one pixel of the image. The size of the data region required by each thread block depends on the neighborhood size for each pixel in the particular step of
Retinal vessel tree extraction algorithm
Input: image I. Output: vessel tree segmentation map Vmap.

Vessel pre-estimation. Input: I. Output: Ipre.
 1: <Invert the image obtaining I^c>  GM
 2: <Perform an opening obtaining Iop = I^c ∘ SE>  SM
 3: <Perform ITH = I − Iop^c>  GM
 4: <Apply a mean diffusion filter obtaining IB>  SM
 5: <Obtain the normalized image Ipre = ITH − IB>  GM
 6: <Binarize Ipre by applying threshold t, updating Ipre>  GM
 7: <Remove the isolated groups of less than 4 pixels>  SM
 8: <Remove the pixels outside the FOV>  GM

Initial contour estimation. Input: Ipre. Output: Icont.
 9: <Apply an erosion obtaining Icont>  SM

External potential estimation. Input: Icont. Output: extpot.
10: <Perform edge detection producing Iedge>  SM
11: <Calculate ext1pot = Ipre & (Iedge)^c>  GM
12: for each i (i = 1, ..., n) do
13:   <Apply an erosion>  SM
14:   <Apply an accumulation with a weight factor>  GM
15: end for
16: <Calculate ext2pot by a diffusion filter and scale to [0,1]>  SM
17: <extpot = −100(ext1pot + ext2pot)/2>  GM

Contour evolution stage. Input: contour image Icont, external potential extpot. Output: vessel tree segmentation map Iev.

Small component removal. Input: contour evolution result Iev. Output: final image Imap.

GM: global memory, SM: shared memory
Fig. 5 Pseudocode for the GPU retinal vessel tree extraction algorithm
the algorithm, as will be detailed below. Only one initial copy of the input image I is required by the algorithm, which is performed by a block CPU-to-GPU transfer. Looking at Fig. 5 in more detail:
– The first stage of the algorithm is the vessel pre-estimation (lines 1–8), which begins with the application of an inverted top-hat process (lines 1–3) to normalize the background of the image and reduce the false negative ratio close to the border of the optic disk. The process comprises an inversion in global memory followed by a morphological closing consisting of a dilation followed by an erosion, both with circular structuring elements of radius 8 and both computed in shared memory. For this computation, four 16 × 16 data blocks are loaded, as explained in optimization strategy 5. An update is then processed in global memory (line 3) and the result is saved in global memory as ITH. The pre-estimation continues by calculating the background image IB through a diffusion (line 4) consisting in applying a mean filter with a kernel size of 17 × 17. This way, the operations are compacted and, therefore, the execution time is reduced (see GPU optimization strategy 3 in Sect. 3.2). This filter is separable and can therefore be split into two kernels: one is computed on the rows and the other on the
Contour evolution stage
Input: contour image Icont, external potential extpot. Output: evolved image Iev.
 1: <Iev = Icont | Iinf>  GM
 2: for each inflating weight i (i = 1, −1) do
 3:   Calculate the internal potential from Icont  SM/GM
 4:   for each direction j (j = North, South, East, West) do
 5:     Calculate the inflating potential  GM
 6:     Calculate the collision point detector mask  SM
 7:     Calculate the guiding force  SM
 8:     Calculate the directional patch detector giving Iinf  SM/GM
 9:   end for
10:   <Perform an edge detection obtaining Iout>  SM
11:   <Calculate Iev = Iout | Iinf>  GM
12: end for
GM: global memory, SM: shared memory

Fig. 6 Pseudocode for the contour evolution stage based on PLS
columns of the image. The sliding window technique (GPU strategy 5) is used in both kernels. Finally, the normalized image is obtained in global memory (line 5). Then the vessel pre-estimation Ipre is obtained after binarizing ITH − IB (lines 5–6) with a threshold t = 4.32 in global memory. A final kernel removes the groups of less than four pixels (line 7). This last process is performed by exploring the neighborhood of each pixel up to distance 3 and removing it if the group is determined to be isolated.
– The following stage begins the contour tracing by performing an initial contour estimation (line 9). The operation performed is one morphological erosion with a 3 × 3 structuring element. A kernel executed in shared memory performs the erosion (line 9). During its execution the data blocks are overlapped (optimization strategy 4). As a result, the initial contours Icont are generated.
– The next stage consists in estimating the external potential required for the PLS evolution (lines 10–16). First, an edge detection using a Sobel detector (line 10), which must inspect the 8-neighborhood of each pixel, is calculated over Ipre. It is computed in shared memory and also requires the overlap of borders of size one among data regions, as explained in optimization strategy 4, producing Iedge. After this, the logical AND of the vessel pre-estimation Ipre and the complement of the edge detection result Iedge is calculated (line 11), obtaining the first term of the external potential ext1pot. Then, the distance map is generated (lines 12–15) from ext1pot by applying n = 5 consecutive stages of one erosion in shared memory and one weighted accumulation in global memory, as indicated in Eq. 5. Overlapped data regions are required. A
diffusion filter is then computed in shared memory and the results are normalized to [0,1], giving ext2pot. The size of both the structuring element for the erosions and the diffusion filter is 3 × 3. The contribution of the detected edges and the distance map is averaged in global memory (line 17).
– The following stage is a contour evolution based on active contours (the PLS method). The contour evolution starts from the contour image Icont and is guided by three potentials: the external potential previously calculated, the internal potential and the balloon potential, as explained in Sect. 2. The PLS evolves according to the contour evolution stage described in Fig. 6. A different number of PLS iterations is required depending on the image. Experimentally, we found that only one to three iterations are needed to obtain the best results in terms of accuracy.
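As a CPU reference for the pre-estimation stage just described (Fig. 5, lines 1–6), a simplified sketch in plain Python follows. It is our reconstruction, not the authors' CUDA code: the structuring element and the mean filter are small square toys instead of the radius-8 circular element and 17 × 17 filter of the paper, the top-hat is written as the standard white top-hat of the inverted image, and the binarization rule is a simplification of the adaptive thresholding.

```python
def _morph(img, op, r=1):
    """Grayscale erosion (op='min') or dilation (op='max'), (2r+1)^2 window."""
    h, w = len(img), len(img[0])
    f = min if op == 'min' else max
    return [[f(img[y + dy][x + dx]
               for dy in range(-r, r + 1) for dx in range(-r, r + 1)
               if 0 <= y + dy < h and 0 <= x + dx < w)
             for x in range(w)] for y in range(h)]

def pre_estimate(green, t=4.32, r=1, br=2):
    """Fig. 5, lines 1-6 (simplified): invert, open, top-hat, background
    subtraction by a mean filter and thresholding."""
    h, w = len(green), len(green[0])
    inv = [[255 - v for v in row] for row in green]                  # line 1: I^c
    iop = _morph(_morph(inv, 'min', r), 'max', r)                    # line 2: opening
    ith = [[inv[y][x] - iop[y][x] for x in range(w)] for y in range(h)]  # line 3
    ib = [[sum(ith[y + dy][x + dx]                                   # line 4: mean
               for dy in range(-br, br + 1) for dx in range(-br, br + 1)
               if 0 <= y + dy < h and 0 <= x + dx < w) / (2 * br + 1) ** 2
           for x in range(w)] for y in range(h)]
    # lines 5-6: subtract the background estimate and binarize with t
    return [[int(ith[y][x] - ib[y][x] > t) for x in range(w)] for y in range(h)]
```

On a toy image where vessels are a dark column on a bright background, the top-hat isolates the vessel response and the background subtraction leaves only vessel pixels above the threshold.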
The steps associated with the contour evolution stage follow the PLS algorithm outlined in Sect. 2.2 and described in [6], considering the following parameter values: scpd = 1, kext = 1, kint = 1 and kinf = 1. The PLS algorithm consists of calculating the internal potential, performing four contour evolutions in the four directions (north, south, east and west) and, at the end, applying an edge detection and a pixel-wise OR. In particular (see Fig. 6):
– The same process (lines 3–7) is performed considering (line 2) the inflating potential weight with positive and negative signs, which are associated with positive and negative inflating processes, respectively.
– The internal potential calculation stage (line 3), which is not detailed in the pseudocode, consists first of an edge detection over the input image Icont. This process requires overlapping among data regions (GPU optimization strategy 4). The following step is an average filter computed by a kernel in global memory and, finally, a diffusion consisting of a filter with a 3 × 3 kernel that is computed in shared memory, also performing overlapping between data regions. As a result, intpot is calculated.
– The next step calculates the contour evolution in the four directions (loop in lines 4–9). This step first performs the calculation of the inflating potential in global memory (line 5). Then the collision point detector mask (line 6) is calculated in shared memory, requiring overlapping among data regions (optimization strategy 4). It mainly performs 1-pixel shift operations in one direction at a time, additions or subtractions among images and pixel-wise AND/OR operations. The calculation of the guiding force in shared memory with optimization strategy 4 (line 7) is
the next step. For the computation of the directional patch detector (line 8), a shift operation and a combination of information similar to that described for line 6 of the pseudocode in Fig. 6 need to be computed. The results computed after the contour evolution are stored in Iinf.
– A new edge detection (line 10) is computed in shared memory, obtaining Iout. The stage finishes with a combination of the edge detection result Iout and the evolved image Iinf by means of a pixel-wise OR that produces the result of the evolution, Iev (line 11).
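A heavily simplified, CPU-only illustration of the shift/AND/OR style of these directional updates follows (plain Python, ours). Real PLS also computes the internal and inflating potentials, the collision point detector and the guiding force described above; here a single binary mask stands in for all the guiding information, so only the flavor of one directional evolution step is shown.

```python
def shift(mask, dy, dx):
    """1-pixel shift of a binary mask (zeros shifted in at the border)."""
    h, w = len(mask), len(mask[0])
    return [[mask[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else 0
             for x in range(w)] for y in range(h)]

def evolve_step(region, allowed):
    """One simplified PLS-like step: the region inflates by one pixel in
    each of the four directions, but only into pixels permitted by the
    binary guiding mask `allowed` (standing in for the combined potentials)."""
    h, w = len(region), len(region[0])
    grown = region
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # North, South, West, East
        s = shift(region, dy, dx)
        grown = [[grown[y][x] | (s[y][x] & allowed[y][x]) for x in range(w)]
                 for y in range(h)]
    return grown
```

Iterating such steps grows the seed region outward pixel by pixel, stopping where the guiding mask (in the paper, the dominant external potential at the vessel boundary) forbids further motion.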
The final stage of small component removal (see Fig. 5) is aimed at detecting and eliminating isolated structures in the image, producing I_map. The technique applied consists of using square grids of 50-pixel sides. Two grids, shifted 25 pixels with respect to each other, are applied to the image, and all structures that do not cross at least one line of each grid are removed. This is equivalent to assuming that isolated structures smaller than 50 pixels are artifacts. For this purpose, the union-find algorithm is used. It is an iterative process that runs until no changes are detected in two consecutive iterations. The method is synchronous, since it is necessary to check after each iteration whether all connected elements have the correct labels. Nevertheless, the scheme has been modified in a similar way as in [23] to exploit the shared memory (see GPU optimization strategy 6), splitting the iteration process into intra-block and inter-block updates.
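The grid-crossing criterion and the union-find labelling can be illustrated with a sequential sketch. The Python code below is our own CPU reference, not the paper's CUDA scheme with intra-/inter-block updates after [23]; the function names and the flat union-find structure are assumptions for illustration only.

```python
def _find(parent, a):
    """Root of pixel a in the union-find forest, with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def remove_small_components(img, spacing=50, offset=25):
    """Remove isolated structures that do not cross at least one line of
    each of two square grids (spacing-pixel sides, shifted by `offset`
    pixels with respect to each other). `img` is a binary image given as
    a list of rows of 0/1; a cleaned copy is returned."""
    h, w = len(img), len(img[0])
    parent = {}
    # Union-find labelling of 4-connected foreground pixels.
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            parent[(y, x)] = (y, x)
            for ny, nx in ((y - 1, x), (y, x - 1)):
                if ny >= 0 and nx >= 0 and img[ny][nx]:
                    parent[_find(parent, (ny, nx))] = _find(parent, (y, x))
    # Components touching a line of each grid survive.
    on_grid = lambda y, x, off: y % spacing == off or x % spacing == off
    keep1 = {_find(parent, p) for p in parent if on_grid(*p, 0)}
    keep2 = {_find(parent, p) for p in parent if on_grid(*p, offset)}
    keep = keep1 & keep2
    return [[1 if img[y][x] and _find(parent, (y, x)) in keep else 0
             for x in range(w)] for y in range(h)]
```

A long vessel segment necessarily crosses lines of both grids and is kept, whereas a blob smaller than 50 pixels can fall entirely between the lines and is discarded, which is the behaviour described above.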
4 Experimental results

The proposed algorithm was compiled and executed on a computer platform consisting of a four-core Intel Core i7 860 CPU with 8 GB of RAM, using a 64-bit GNU compiler on Linux and CUDA 5.0 with the -O3 and -arch=sm_20 flags.¹ To validate the segmentation performance of our algorithm, we use the DRIVE and STARE databases. The STARE database [9] contains 20 images of 605 × 700 pixels with 8 bits per RGB channel, captured by a TopCon TRV-50 fundus camera at a 35° FOV. The DRIVE database [22] includes 40 retinal images captured by a Canon CR5 3CCD camera with a 45° FOV, with 565 × 584 pixels and 8 bits per color channel. The images are divided into two sets of 20 images (test and training sets). To facilitate comparison with other methods, performance is evaluated in terms of standard measures: sensitivity (related to the ability to detect vessel pixels), specificity (related to the ability to detect non-vessel pixels) and average accuracy. Table 1
¹ Code available at http://wiki.citius.usc.es/software/gpu-retina.
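For reference, the three measures can be computed from a binary segmentation, its ground truth and the FOV mask as follows. The sketch below is our own; the function name and the flat-list representation are illustrative assumptions, not part of the paper's implementation.

```python
def segmentation_measures(seg, truth, fov):
    """Sensitivity, specificity and accuracy of a binary segmentation
    against a ground truth, counting only pixels inside the FOV mask.
    All three arguments are flat lists of 0/1 values of equal length."""
    tp = fp = tn = fn = 0
    for s, t, m in zip(seg, truth, fov):
        if not m:
            continue          # pixel outside the field of view
        if t and s:
            tp += 1           # vessel pixel correctly detected
        elif t and not s:
            fn += 1           # vessel pixel missed
        elif not t and s:
            fp += 1           # background pixel marked as vessel
        else:
            tn += 1           # background pixel correctly rejected
    sen = tp / (tp + fn)                  # ability to detect vessel pixels
    spe = tn / (tn + fp)                  # ability to detect non-vessel pixels
    acc = (tp + tn) / (tp + fn + tn + fp)
    return sen, spe, acc
```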
Table 1 Performance on DRIVE and STARE databases, where HO denotes a human observer

         DRIVE                      STARE
Image    Sen      Spe      Acc      Sen      Spe      Acc
1        0.8056   0.9655   0.9445   0.6338   0.9636   0.9278
2        0.7927   0.9712   0.9444   0.5458   0.9641   0.9260
3        0.6265   0.9818   0.9301   0.7909   0.9551   0.9417
4        0.7349   0.9794   0.9468   0.2895   0.9978   0.9254
5        0.6603   0.9877   0.9433   0.7922   0.9601   0.9396
6        0.6403   0.9849   0.9363   0.8679   0.9589   0.9511
7        0.7006   0.9755   0.9391   0.9100   0.9446   0.9408
8        0.6307   0.9824   0.9382   0.8835   0.9545   0.9472
9        0.6383   0.9877   0.9467   0.8626   0.9578   0.9475
10       0.6952   0.9812   0.9470   0.7685   0.9684   0.9465
11       0.7506   0.9622   0.9348   0.8957   0.9492   0.9440
12       0.7264   0.9734   0.9425   0.9034   0.9589   0.9531
13       0.6842   0.9799   0.9380   0.8075   0.9681   0.9485
14       0.7726   0.9687   0.9456   0.7909   0.9702   0.9479
15       0.7937   0.9590   0.9418   0.7828   0.9681   0.9464
16       0.7515   0.9752   0.9459   0.5345   0.9889   0.9253
17       0.6938   0.9793   0.9441   0.8294   0.9752   0.9573
18       0.7614   0.9686   0.9448   0.6060   0.9945   0.9677
19       0.8203   0.9765   0.9577   0.6310   0.9910   0.9698
20       0.7391   0.9757   0.9505   0.4835   0.9876   0.9419
Aver.    0.7209   0.9758   0.9431   0.7305   0.9688   0.9448
HO       0.7763   0.9723   0.9470   0.8951   0.9384   0.9348
shows the results for the vessel segmentation of the test images in the DRIVE and STARE databases, considering only those pixels inside the FOV. The performance comparison with other published methods (see a comprehensive review in [7]) and with the measurements of a second observer provided with the DRIVE and STARE databases shows that our proposal is competitive in terms of accuracy.

Table 2 shows the computation time needed to process an image from the DRIVE and STARE databases with our algorithm implemented on an NVIDIA GTX 680 GPU. Data for each stage (vessel pre-estimation, initial contour estimation, external potential estimation, contour evolution and small component removal), the overall computation time, and the overall time including I/O accesses are detailed. This last overall time includes not only the time for computation and for reads/writes from/to the hard disk, but also the time needed for CPU–GPU transfers.

Table 2 GPU execution times

Stage                  DRIVE (s)   STARE (s)   4,288 × 2,848 (s)
Vessel pre-est.        0.0042      0.0052      0.1975
Initial contour est.   0.0001      0.0001      0.0187
External pot. est.     0.0014      0.0018      0.0725
Contour evolution      0.0042      0.0061      0.3084
Component remov.       0.0007      0.0007      0.0510
Overall                0.0101      0.0139      0.6381
Overall inc. I/O       0.0137      0.0178      0.7531

Our technique meets real-time requirements and exhibits much faster computation than state-of-the-art PC-based algorithms for retinal vessel segmentation. The method proposed by Alonso et al. [3] (which inspired our algorithm) computes an image of the DRIVE database in 0.23 s with a mean accuracy of 0.9180. The same algorithm requires 13.7 s when computed on an Intel Core i7 [16]. As can be observed in Tables 1 and 2, our method is 16× faster and obtains considerably higher accuracy values. In addition, it must be considered that the results in [3] correspond to a specialized vision system for which the method was designed, whereas in our case the proposed method was executed on a commodity GPU.

It is important to note that significant efforts have been made to reduce the time of data movement to the GPU memory, as this accounts for an important part of the execution time. The reduction comes from reading each image in one block and taking advantage of the disk cache used by Linux. Thus, for the 4,288 × 2,848 image the execution time is 0.75 s, split between a computation time of 0.64 s (see Table 2) and a communication time of 0.11 s. To the best of our knowledge, the only other vessel extraction method executed on a GPU is [10], a very different technique that achieves a similar average accuracy (Acc = 0.9468 on the DRIVE database). The overall execution time reported in [10] for a 4,288 × 2,848 image is 1.2 s (60 % higher than our time); the authors indicate that data movement accounts for a large part of the execution time, but do not specify which part.

5 Conclusions

In this paper, we propose a fast and accurate technique for the segmentation of the vascular network in retinal images. The algorithm is inspired by that proposed in [3], which has been completely redefined and adapted to be executed on CUDA-based graphics processing units. The results reported show that the values for accuracy, sensitivity and specificity achieved with our algorithm are comparable to recently published results, and close to the performance of human observers. Moreover, our proposal outperforms most techniques when accuracy and time performance are considered together. In this sense, it is possible to segment the vessel tree of 400 high-resolution retinal images (4,288 × 2,848 pixels) in 5 min using affordable and commodity GPUs. This makes our proposal suitable for integration into computer-aided diagnostic systems aimed at processing high volumes of retinal images.

Acknowledgments This work was supported in part by the Ministry of Science and Innovation, Government of Spain, and co-funded by the European Union ERDF, under contract TIN 2010-17541.

References
1. Al-Diri, B., Hunter, A., Steel, D.: An active contour model for segmenting and measuring retinal vessels. IEEE Trans. Med. Imaging 28(9), 1488–1497 (2009)
2. Al-Rawi, M., Qutaishat, M., Arrar, M.: An improved matched filter for blood vessel detection of digital retinal images. Comput. Biol. Med. 37, 262–267 (2007)
3. Alonso, C., Vilarino, D.L., Dudek, P., Penedo, M.G.: Fast retinal vessel tree extraction: a pixel-parallel approach. Int. J. Circuit Theory Appl. 36, 641–651 (2008)
4. Chanwimaluang, T., Fan, G.: An efficient algorithm for extraction of anatomical structures in retinal images. Int. Conf. Image Process. 1, 1093–1096 (2003)
5. Chaudhuri, S., Chatterjee, S., Katz, N., Nelson, M., Goldbaum, M.: Detection of blood vessels in retinal images using two-dimensional matched filters. IEEE Trans. Med. Imaging 8(3), 263–269 (1989)
6. Dudek, P., Vilarino, D.L.: A cellular active contours algorithm based on region evolution. In: Proceedings of the IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA 2006, pp. 269–274 (2006)
7. Fraz, M.M., Remagnino, P., Hoppe, A., Uyyanonvara, B., Rudnicka, A.R., Owen, C.G., Barman, S.A.: Blood vessel segmentation methodologies in retinal images: a survey. Comput. Methods Programs Biomed. 108, 407–433 (2012)
8. Fraz, M.M., Barman, S.A., Remagnino, P., Hoppe, A., Basit, A., Uyyanonvara, B., Rudnicka, A.R., Owen, C.G.: An approach to localize the retinal blood vessels using bit planes and centerline detection. Comput. Methods Programs Biomed. 108, 600–616 (2012)
9. Hoover, A.D., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 19, 203–210 (2000)
10. Krause, M., Alles, R.M., Burgeth, B., Weickert, J.: Fast retinal vessel analysis. J. Real-Time Image Process. (2013)
11. Liu, I., Sun, Y.: Recursive tracking of vascular networks in angiograms based on the detection-deletion scheme. IEEE Trans. Med. Imaging 12(2), 334–341 (1993)
12. Marin, D., Aquino, A., Gegundez-Arias, M.E., Bravo, J.M.: A new supervised method for blood vessel segmentation in retinal images by using gray-level and moment invariants-based features. IEEE Trans. Med. Imaging 30, 146–158 (2011)
13. Martinez-Perez, M., Hughes, A., Stanton, A., Thorn, S., Bharath, A., Parker, K.: Scale-space analysis for the characterisation of retinal blood vessels. In: Proceedings of Medical Image Understanding and Analysis, pp. 57–60 (1999)
14. Mendonca, A.M., Campilho, A.: Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction. IEEE Trans. Med. Imaging 25(9), 1200–1213 (2006)
15. Narayanaswamy, A., Dwarakapuram, S., Bjornsson, C.S., Cutler, B.M., Shain, W., Roysam, B.: Robust adaptive 3D segmentation of vessel laminae from fluorescence confocal microscope images and parallel GPU implementation. IEEE Trans. Med. Imaging 29(3), 583–597 (2010)
16. Nieto, A., Brea, V., Vilarino, D.L., Osorio, R.R.: Performance analysis of massively parallel embedded hardware architectures for retinal image processing. EURASIP J. Image Video Process. 2011(10), 1–17 (2011)
17. NVIDIA Corporation: NVIDIA CUDA C Programming Guide 4.0, Santa Clara (2011)
18. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Comput. Graph. Forum 26(1), 80–113 (2007)
19. Palomera-Perez, M.A., Martinez-Perez, M.E., Benitez-Perez, H., Ortega-Arjona, J.L.: Parallel multiscale feature extraction and region growing: application in retinal blood vessel detection. IEEE Trans. Inf. Technol. Biomed. 14, 500–506 (2010)
20. Ricci, E., Perfetti, R.: Retinal blood vessel segmentation using line operators and support vector classification. IEEE Trans. Med. Imaging 26, 1357–1365 (2007)
21. Soares, J.V.B., Leandro, J.J.G., Cesar, R.M., Jelinek, H.F., Cree, M.J.: Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification. IEEE Trans. Med. Imaging 25, 1214–1222 (2006)
22. Staal, J., Abramoff, M.D., Niemeijer, M., Viergever, M.A., van Ginneken, B.: Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 23(4), 501–509 (2004)
23. Stava, O., Benes, B.: Connected component labeling in CUDA. In: Hwu, W.W. (ed.) GPU Computing Gems, pp. 569–581 (2011)
24. Torres, G.S., Taborda, J.A.: Optic disk detection and segmentation of retinal images using an evolution strategy on GPU. In: Proceedings of the XVIII Symposium of Image, Signal Processing, and Artificial Vision (STSIVA), IEEE, pp. 1–5 (2013)
25. Vilarino, D.L., Rekeczky, C.: Pixel-level snakes on the CNNUM: algorithm design, on-chip implementation and applications. Int. J. Circuit Theory Appl. 33, 17–51 (2005)
26. You, X., Peng, Q., Yuan, Y., Cheung, Y., Lei, J.: Segmentation of retinal blood vessels using the radial projection and semi-supervised approach. Pattern Recognit. 44(10–11), 2314–2324 (2011)
27. Zana, F., Klein, J.: Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation. IEEE Trans. Image Process. 10(7), 1010–1019 (2001)
28. Zhang, B., Zhang, L., Karray, F.: Retinal vessel extraction by matched filter with first-order derivative of Gaussian. Comput. Biol. Med. 40(4), 438–445 (2010)
Francisco Argüello received the B.Sc. and Ph.D. degrees in Physics from the University of Santiago de Compostela, Spain, in 1988 and 1992, respectively. He is currently an Associate Professor in the Department of Electronics and Computer Science at the University of Santiago de Compostela. His current research interests include signal and image processing, computer graphics, parallel and distributed computing, and quantum computing.

David L. Vilariño received the Ph.D. degree in 2001 from the University of Santiago de Compostela, Spain, where he is currently an Associate Professor in Electronics. In 1997 and 1998 he was a visiting scholar at the Department of Electronic Engineering of the University La Sapienza of Rome and the Department of Electrical Engineering and Computer Science of the University of California at Berkeley. In 2003 he also visited, for six months, the Computer and Automation Research Institute of the Hungarian Academy of Sciences in Budapest as a research fellow. He is the author or co-author of around 80 publications on the design and hardware implementation of algorithms for computer vision and medical image processing. His current research is on the design and implementation of image processing algorithms for embedded-processor and FPGA-based architectures, with fast and efficient computation as the main concerns. Dr. Vilariño is co-recipient of Best Paper Awards at the 2003 European Conference on Circuit Theory and Design and the 2008 IEEE International Workshop on Cellular Neural Networks and Applications.

Dora B. Heras received an M.S. degree in Physics in 1994 and a Ph.D. in 2000 from the University of Santiago de Compostela, Spain. She is currently an Associate Professor in the Department of Electronics and Computer Engineering at the same university. Her research interests include parallel and distributed computing, performance prediction and improvement for parallel programming, computer graphics for high-performance computing, and image processing. Her latest work on image processing is in the medical and remote sensing fields.

Alejandro Nieto obtained the Ph.D. degree from the University of Santiago de Compostela (USC). Previously, he obtained the M.S. degree in Information Technology and the B.S. degree in Physics, specializing in Electronics, both from USC. His current research is focused on the development of new reconfigurable architectures optimized for image and video processing, with emphasis on performance. His other research interests include multi-core/many-core and custom parallel architectures, reconfigurable computing, parallel programming, embedded systems, FPGAs and digital ASIC design, as well as computer vision applications.