MORPHOLOGICAL CLASSIFICATION OF GALAXIES USING COMPUTER VISION AND ARTIFICIAL NEURAL NETWORKS: A COMPUTATIONAL SCHEME
Letter to the Editor
SHAUKAT N. GODERYA and SHAWN M. LOLLING
Department of Physics, Illinois State University, Normal, IL 61790-4560, U.S.A.
E-mail: [email protected]
(Received 14 August 2000; accepted 21 July 2001)
Abstract. The morphology of galaxies is an important issue in the large scale study of the Universe. The Hubble Deep Field project has already shown that the Universe contains billions and billions of galaxies. The Sloan Digital Sky Survey is expected to map the sky for one million galaxies. One of the major challenges facing astronomers today is how to automatically identify and classify the large number of galaxies that will begin to show up in the hundreds of thousands of digitized images from sky surveys. Today it is possible to address this problem with the help of advances in computer vision and artificial neural network technology. This paper describes a computational scheme for developing an automatic galaxy classifier. Several different types of automatic galaxy classifier can be envisaged from the scheme. Two types are presented here with prototype models. The first uses geometric shape features as the basis for classification. The second uses direct pixel images of galaxies and an artificial neural network to do the classification. The results show that geometric shape features are very good indicators of the different types of nearby galaxies. Three test cases were presented to the prototype geometric shape classifier, and it classified all three successfully. The direct image based neural network classifier learned 171 of the 176 training patterns presented to it (97%). However, when the network was presented with a test set of 37 independent patterns, it classified only 19 of them (51%) correctly. This study demonstrates that a very robust and efficient automated galaxy classifier based on shape features and artificial neural networks can be developed.
Astrophysics and Space Science 279: 377–387, 2002. © 2002 Kluwer Academic Publishers. Printed in the Netherlands.

1. Introduction

Classification of galaxies plays an important role in the large scale study of the Universe. Astronomers have been classifying galaxies by visual analysis of photographs and CCD images. The methods and techniques used in this process introduce considerable uncertainties in the classification, not to mention outright misclassification of galaxies. Naim et al. (1995) found a variation of 2 subclasses in the revised Hubble system from the RMS difference observed between classifications by different human experts. Furthermore, the human vision system is not well suited to the repetitive identification tasks often required in galaxy classification work, especially when a single large CCD image of the sky may contain hundreds or even thousands of galaxies. The problem becomes even more severe if the human expert has to deal with hundreds of such photographs or images.
The advent of space based astronomy, particularly the Hubble Space Telescope (Parker, 1996), has shown beyond doubt that our Universe potentially contains billions and billions of galaxies. The Hubble Deep Field (HDF) project and the Sloan Digital Sky Survey (SDSS) have already begun scanning the sky to obtain digitized images of galaxies. The SDSS project alone will obtain images of about 1 million galaxies. Thus in the very near future astronomers will have digitized data for thousands of millions of galaxies from a variety of projects. Consider the scenario of having to identify and classify a large number of galaxies with human experts who can introduce uncertainties of the order of two subclasses. The task is not only inefficient, but will also be very slow, inaccurate and almost impossible. To compile a comprehensive and homogeneous database of billions of galaxies, it will be necessary to devise and develop highly robust, fast, efficient and automated classifiers based on computer vision and artificial intelligence tools.

How can computer vision and artificial intelligence help the human experts? Computer vision deals with the detection and extraction of mathematical properties of an object in a digital image. Artificial intelligence deals with using the mathematical properties of an object to draw scientific inferences. Consider the simple case of looking at a drawing of a circle and an ellipse. How can one differentiate the two objects? The most obvious way is to state that a circle has a constant radius and an ellipse has one major and one minor radius. By deriving the radii (a mathematical property) of both the circle and the ellipse one can differentiate the two objects very clearly. Now think about the process that has just been performed to do this differentiation. The human vision system detected the two objects and sent the information to the brain for analysis.
The brain sent back instructions to the vision system to extract radii information for the two objects so that it could recognize them. After the vision system sent back information on the radii of the two objects, the brain immediately reasoned, by way of experience, training or self learning, that since a circle has a constant radius whereas an ellipse does not, it can use this mathematical property to recognize and classify the two objects.

Is there a way to replicate this process by using computers and artificial intelligence? Yes, there is. The part played by the human vision system can be modeled by a computer vision system. The computer vision system has the job of scanning, detecting and extracting the mathematical properties or features. The part played by the brain can be simulated by artificial intelligence, for example by using artificial neural networks. Thus automated classification of galaxies is possible if mathematical features of the galaxies are extracted and used with an artificial neural network classifier.

Mathematical parameters of galaxies have been used from the very beginning for visual classification; however, their use in computer classification was taken up by Sebok (1980). There are three broad groups of features that can be used in galaxy classification work: photometry features, profile features, and shape features. Doi et al. (1993) have used photometric features of concentration index and mean surface brightness derived from an isophote of
fixed brightness level in one color band. With their techniques they achieve a success ratio of greater than 85 percent in classifying galaxies into early and late types. Abraham et al. (1996) use a mixture of photometry and profile parameters to classify galaxies in the HDF image. The photometry feature is the central concentration of light index. The profile feature is the radial distribution of surface brightness, or asymmetry, of the galaxies. Abraham et al. show that these two parameters can be used to classify elliptical and spiral galaxies. Spiekermann (1992) uses shape features to describe the automated morphology of faint galaxies. That study derives parameters that represent symmetry, intensity profiles and the shape of the isophote, and develops the classifier with a fuzzy algebra technique. The classifier is applied to 100,000 galaxies in the 16 ESO/SERC fields near the South Galactic Pole and finds a morphological mixture of E:S0:S/Ir = 14:21:65%. Spiekermann's study demonstrates that shape features are useful parameters for computer classification of galaxies.

Artificial neural networks are computational systems inspired by the brain and the nervous system of the human body. In the human body the neuron accepts electrical inputs and generates appropriate responses to them. The brain perceives these responses by way of genetic programming to learn and self-organize the information. No one understands completely how the brain works; however, it is now possible to simulate some of what the brain does by constructing an artificial neural network using a high level language such as C or FORTRAN. In fact, artificial neural networks can be given the ability to behave, react, self-organize, learn, generalize and forget. These virtues are quite different from conventional algorithms that execute a sequence of instructions contained in a program.
The theory of neural networks is extensively discussed in the literature; see, for example, Fausett (1994) and Masters (1993). There are two types of artificial neural networks, supervised and unsupervised. In a supervised artificial neural network the network is first trained to recognize and identify known samples. After training, the network is tested on unknown samples. In an unsupervised artificial neural network, the network is allowed to self-organize, learn and group samples into classes. These types of networks have no prior information on the classes of the samples. Artificial neural networks are capable of performing complex pattern recognition and classification tasks. Scientific applications of this technique have been investigated by Clark (1999) and Serra-Ricart (1993).

In astronomy, artificial neural networks can be used for galaxy classification tasks. The first classifiers were based on feeding direct pixel images to supervised artificial neural networks. It then became clear that a more objective characterization of galaxies can be obtained by feeding the networks with galaxy parameters (Storrie-Lombardi et al., 1992; Naim et al., 1996; Lahav et al., 1992). Nielsen and Odewahn (1995) describe three different types of supervised artificial neural network galaxy classifiers. In the first type they use the photometric indices of surface brightness, concentration index and color as input to the classifier. In the second type they use the surface brightness
profiles in two bandpasses as input to the classifier. In the third type they use the raw galaxy image in one bandpass. From their study they find that the raw galaxy image classifier is only good for distinguishing ellipticals from spirals, and that the profile classifier produces more accurate classifications than the photometric parameter classifier. Mahonen and Hakala (1995) explore the possibility of using an unsupervised neural network to distinguish stars and galaxies in digitized images of sky surveys. Although galaxy parameters are more useful to present to an artificial neural network, it remains an open question which parameters will retain a large fraction of the full information of the image.

This paper first describes a computational scheme for developing an automated galaxy classifier. Five different types of classifiers can be envisaged from this scheme. An initial investigation of two of these is presented in this paper, while the others will be the subject of future publications. The first type of classifier identifies galaxies using geometric shape features, whereas the second type identifies galaxies using direct pixel images and a supervised neural network.
2. Computational Scheme

The computational scheme is shown as a flow chart in Figure 1. The flow chart is made up of several different modules. Each module is either a data structure or a stage that represents data processing. The first module, 'Galaxy Images', is a data structure that indicates a database of galaxy images. Examples of databases are the Palomar Observatory digitized sky survey, the HDF project, the SDSS and many others. The second module, 'Isolate Galaxies', is a data processing module. The galaxy images from the database are passed to this module, where an image processing algorithm isolates and labels the individual galaxies in an image that may contain background stars and other objects. The output of the second module is the third module, which is again a database, this time of individually labeled galaxies.

The fourth module, 'Feature Extractor', is an algorithm. This algorithm computes the various types of features of the individually labeled, isolated galaxies. In Figure 1 the feature extractor module is shown with the different kinds of features that can be extracted from images of galaxies. There are three broad categories. The shape feature extractor simply computes the geometric shape features of galaxy images. A variety of shape features are used in the identification and pattern recognition of objects. Several of these were used in the present work and are discussed in a later section. A crucial aspect of this approach is to acquire shape features that are invariant not only to the scale and orientation of the galaxy images but also to the hardware related issues of digitized images. The second category is the photometric indices of galaxy images. Examples are color (B-R), concentration index, surface brightness, etc. It is relatively easy to acquire photometric features of galaxies from
Figure 1. Flow chart for the computational scheme.
listed catalogs and other sources. The third category is the profile features. An example is the radial distribution of surface brightness of galaxies.

The fifth module, 'Classifier', is also a data processing unit. There are various routes to building the classifier. Several possibilities are shown in the fifth module. The direct image based classifier uses an artificial neural network. The artificial neural network is either supervised, where the network is trained on a sample of galaxies previously classified by other sources, or unsupervised, where the network learns from the given data to do the classification. The images for both the supervised and unsupervised direct image based classifiers come from the third module and are normally processed for use as inputs to the artificial neural networks. The feature based classifiers can be implemented using either conventional programming techniques or artificial neural networks.
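As an illustration of the 'Isolate Galaxies' module, the following Python sketch labels connected groups of bright pixels in a gray scale image. It is a minimal sketch under stated assumptions, not the algorithm used in this work: detection by a fixed brightness threshold, 8-connected flood filling, and the function names `isolate_galaxies` and `drop_small_regions` are all hypothetical choices made for the example.

```python
from collections import deque


def isolate_galaxies(image, threshold):
    """Label connected regions of pixels brighter than `threshold`.

    `image` is a list of rows of gray levels. Returns a list of regions,
    each a list of (row, col) coordinates, in scan order of discovery.
    """
    n_rows, n_cols = len(image), len(image[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill over the 8-connected neighbourhood.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < n_rows and 0 <= nx < n_cols
                                and not seen[ny][nx]
                                and image[ny][nx] > threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions


def drop_small_regions(regions, min_pixels):
    """Discard regions with fewer than `min_pixels` pixels.

    The size cut is a hypothetical stand-in for rejecting stars and noise.
    """
    return [region for region in regions if len(region) >= min_pixels]


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 7],
    [0, 0, 0, 0, 0, 0],
]
regions = isolate_galaxies(image, threshold=5)
galaxies = drop_small_regions(regions, min_pixels=2)
```

Each surviving region plays the role of one entry in the third module, the database of individually labeled galaxies, and can be passed on to a feature extractor or resampled as a direct image input.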
3. Shape Feature Classifier: A Prototype Model

Shape features have been used in many applications for pattern recognition and identification purposes (Parker, 1994). A galaxy classifier based on the mathematical properties of galaxy shapes is referred to as a shape feature classifier. A domain of digitized images of four elliptical (E0, E3, E5, E7), three simple spiral (Sa, Sb, Sc) and three barred-spiral (SBa, SBb, SBc) galaxies was judiciously selected. The images were then digitally processed to extract shape properties. The shape properties are elongation (e), Form Factor (F), Convexity (Cx), Bounding rectangle to fill factor (Bx) and the area (A), which is the pixel count of the galaxy. The elongation (e) is a measure of the flatness of the object; it is a ratio that involves the semi-major axis (a) and the semi-minor axis (b). It is defined as follows:

$$ e = \frac{a - b}{a + b} \tag{1} $$
The axes are determined from the best-fit ellipse. Elongation is particularly useful in discriminating between the different classes of elliptical galaxies, where each class is found to have a narrow range. Elongation is also useful for spirals, as simple spirals have small values and barred spirals have high values. Form Factor is the ratio of the area (A) to the square of the perimeter (P) of the galaxy:

$$ F = \frac{A}{P^{2}} \tag{2} $$
Elliptical galaxies are found to have somewhat higher values, as their luminosity is more symmetrically distributed. On the other hand, barred spirals show smaller values, as their perimeter per unit area is considerably larger. Simple spirals show intermediate values, with the trend decreasing from Sa to Sc. Convexity is an interesting shape property for galaxies. It is defined as

$$ C_x = \frac{P}{2H + 2W} \tag{3} $$
where H and W are the height and width of the minimum bounding rectangle around the galaxy. For jagged regions like spiral galaxies convexity is very large, whereas for elliptical galaxies it is very small. Another interesting property of convexity is that for elliptical galaxies it decreases from E0 to E7, while for simple spirals it shows an increasing trend from Sa to Sc and a weak decreasing trend from SBa to SBc. Hence convexity is a good shape parameter for classifying elliptical and spiral galaxies. Bounding rectangle to fill factor is the ratio of the area of the galaxy to the area of the bounding rectangle:

$$ B_x = \frac{A}{HW} \tag{4} $$
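As an illustration of Equations (1) to (4), the following Python sketch computes the four shape features from a binary galaxy mask. It is not the implementation used in this work and rests on several assumptions that are not taken from the text: the perimeter P is taken as the number of pixel edges that face the background, the bounding rectangle is axis-aligned, and the best-fit ellipse is obtained from second-order image moments. The function name `shape_features` is hypothetical.

```python
import math


def shape_features(mask):
    """Compute e, F, Cx and Bx (Equations 1-4) for a binary mask.

    `mask` is a list of rows holding 1 for galaxy pixels and 0 for background.
    """
    pixels = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not pixels:
        raise ValueError("mask contains no galaxy pixels")
    on = set(pixels)
    area = len(pixels)  # A: pixel count of the galaxy

    # P (assumption): number of pixel edges that border the background.
    perimeter = sum((r + dr, c + dc) not in on
                    for r, c in pixels
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1  # H (axis-aligned, an assumption)
    width = max(cols) - min(cols) + 1   # W

    # Best-fit ellipse (assumption): axes from second-order central moments.
    r_mean, c_mean = sum(rows) / area, sum(cols) / area
    m_rr = sum((r - r_mean) ** 2 for r in rows) / area
    m_cc = sum((c - c_mean) ** 2 for c in cols) / area
    m_rc = sum((r - r_mean) * (c - c_mean) for r, c in pixels) / area
    spread = math.sqrt((m_rr - m_cc) ** 2 + 4 * m_rc ** 2)
    a = 2 * math.sqrt((m_rr + m_cc + spread) / 2)            # semi-major axis
    b = 2 * math.sqrt(max((m_rr + m_cc - spread) / 2, 0.0))  # semi-minor axis

    return {
        "e": (a - b) / (a + b) if a + b > 0 else 0.0,  # Eq. (1)
        "F": area / perimeter ** 2,                     # Eq. (2)
        "Cx": perimeter / (2 * height + 2 * width),     # Eq. (3)
        "Bx": area / (height * width),                  # Eq. (4)
        "A": area,
    }


square = [[1] * 4 for _ in range(4)]
bar = [[1] * 8 for _ in range(2)]
square_features = shape_features(square)
bar_features = shape_features(bar)
```

For the filled square the elongation is zero, while the 2 × 8 bar gives a clearly larger elongation, which is the behaviour Equation (1) is meant to capture. Because the perimeter and ellipse definitions here are assumptions, the absolute values need not match those obtained with a different image processing package.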
TABLE I
Geometric shape features

Feature   E0    E3    E5    E7    Sa    Sb    Sc    SBa   SBb   SBc
e         0.07  0.25  0.54  0.82  0.13  0.08  0.14  0.63  0.81  0.82
C         0.25  0.27  0.23  0.26  0.24  0.14  0.07  0.10  0.11  0.09
Cx        1.71  1.62  1.66  1.20  1.77  2.18  2.68  2.42  1.83  2.16
Bx        0.73  0.71  0.65  0.38  0.74  0.67  0.48  0.65  0.44  0.45

Note: C denotes the compactness, which appears to correspond to the form factor F of Equation (2).
The bounding rectangle to fill factor shows how many of the pixels in the bounding rectangle belong to the galaxy, relative to the total number of pixels in the bounding rectangle. For elliptical galaxies no real significance is observed, except that there is a decreasing trend from E0 to E7. The simple and barred spirals show strong decreasing trends from Sa to Sc and from SBa to SBc.

Table I summarizes these features for the 10 galaxy images and is the basis for the ten classes (E0, E3, E5, E7, Sa, Sb, Sc, SBa, SBb, and SBc). In general there will be uncertainties in these feature values when a large number of galaxies is considered. However, the effect of these uncertainties can be minimized when several of the features are used together in defining the identification criteria. For example, the classes E0, E3, E5 and E7 are constructed by using three criteria. The first criterion uses the convexity feature value to identify the galaxy as either elliptical or spiral. The second criterion, compactness (listed as C in Table I), serves as a verification test. The third criterion, elongation, is used to further sub-classify the ellipticals into four categories (E0, E3, E5, E7). In a similar way, elongation can be used to separate a simple spiral from a barred spiral, and the bounding rectangle to fill factor is then used to sub-classify the simple and barred spirals.

After the classifier was built, three unknown images were prepared to test the performance of the system. Figure 2 shows the output of the classifier in one such test. The unknown galaxy is at the top left hand corner; the classifier correctly identifies this galaxy as type Sc, shown as shaded regions. The other two unknown images tested for an elliptical and a barred spiral galaxy. The classifier performed well on the three tests that were used.

Two limitations are found in this approach. One arises during the thresholding of the image in the processing stage. Thresholding converts a gray scale image into a binary image for feature extraction.
It is found that an improper threshold value can make a spiral galaxy look like an elliptical. This can be seen in Figure 2. The second and third galaxies in the middle have been so severely clipped during the conversion process that all the arm features are lost in the image. Thus the classification criteria for Sa and Sb are inadequate in this case. The classifier would confuse the E0 galaxy with an Sa or Sb type galaxy. The second limitation is that these shape features do not reveal much information for edge-on galaxies and hence
Figure 2. A test case for the shape feature classifier. The shaded galaxy in the top left hand corner is the unknown galaxy. The shaded galaxy in the middle and top portion shows the classification found by the classifier.
are not very effective for their classification. Investigations are under way to find effective solutions to these problems.
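The identification criteria described in this section can be sketched as a small decision cascade. The Python example below is only an illustration: the text gives no numerical thresholds, so every cut value here is a hypothetical one placed between neighbouring entries of Table I, and the function name `classify_by_shape` is invented for the example. Because the Bx values listed for SBb and SBc in Table I nearly coincide, the sketch does not try to separate those two classes.

```python
# Feature values (e, C, Cx, Bx) copied from Table I.
TABLE_I = {
    "E0": (0.07, 0.25, 1.71, 0.73), "E3": (0.25, 0.27, 1.62, 0.71),
    "E5": (0.54, 0.23, 1.66, 0.65), "E7": (0.82, 0.26, 1.20, 0.38),
    "Sa": (0.13, 0.24, 1.77, 0.74), "Sb": (0.08, 0.14, 2.18, 0.67),
    "Sc": (0.14, 0.07, 2.68, 0.48), "SBa": (0.63, 0.10, 2.42, 0.65),
    "SBb": (0.81, 0.11, 1.83, 0.44), "SBc": (0.82, 0.09, 2.16, 0.45),
}


def classify_by_shape(e, c, cx, bx):
    """Return (label, verified) from the four shape features.

    `verified` is the compactness check described for ellipticals and is
    None for spirals. All thresholds are hypothetical, read off Table I.
    """
    if cx < 1.74:                      # criterion 1: convexity, elliptical
        verified = c >= 0.20           # criterion 2: compactness check
        if e < 0.16:                   # criterion 3: elongation sub-class
            label = "E0"
        elif e < 0.40:
            label = "E3"
        elif e < 0.68:
            label = "E5"
        else:
            label = "E7"
        return label, verified
    if e < 0.40:                       # low elongation: simple spiral
        if bx >= 0.70:
            label = "Sa"
        elif bx >= 0.575:
            label = "Sb"
        else:
            label = "Sc"
        return label, None
    # High elongation: barred spiral, sub-classified by Bx where possible.
    label = "SBa" if bx >= 0.55 else "SBb/SBc"
    return label, None


results = {name: classify_by_shape(*values) for name, values in TABLE_I.items()}
```

Applied to the ten reference rows of Table I, the cascade returns each galaxy's own class, with SBb and SBc reported together. This only shows that the criteria are mutually consistent on the reference set; it says nothing about performance on independent images, where the narrow margins between some table entries would matter.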
4. A Supervised Direct Image Based Neural Network Classifier: A Case Study

In a supervised neural network classifier the system is first trained to recognize and identify known galaxies. After training, the network is tested on unknown galaxies for their classification. Since this is an exploratory investigation, the network was trained to distinguish only three types of galaxies: an elliptical, a simple spiral and a barred spiral. Two hundred and fifteen galaxies were selected from the Digitized Sky Survey, available at the Space Telescope Science Institute web page, for the training and validation files. One hundred and seventy six images were processed and reduced to arrays of 30 × 30 pixels. The galaxy arrays were saved as ASCII data in the training file together with their NGC classification, represented as a vector of 3 elements. The three elements of the vector are either 0 or 1. For example, the vector 100 would mean an elliptical galaxy. In the same way a validation file with 37 patterns was also made. The galaxies in the validation file were completely different from the ones in the training file.

The Stuttgart Neural Network Simulator (Zell et al., 1990; Zell et al., 1991) was used to implement the supervised neural network classifier. A network with 900 neurons in the input layer, 2 neurons in the hidden layer and 3 neurons in the output layer was constructed. The network was fully connected, and the learning algorithm chosen for the network was the standard backpropagation algorithm (Rumelhart et al., 1986). The network was then trained with the training file via the backpropagation algorithm. During the training the root mean square error of the training and validation files was monitored. It took the network 1000 training epochs to show convergence of the RMS errors between the training file and the validation file. The point at which the errors converge is the point where the network is said to be fully trained on the given training file.
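To make the network dimensions concrete, the following Python sketch builds a fully connected 900-2-3 network, counts its adjustable weights and runs one forward pass. It is a sketch under stated assumptions and does not reproduce the Stuttgart Neural Network Simulator: the logistic activation, the presence of bias terms, the random initial weights and all function names are assumptions made for the example. Only the assignment of the vector 100 to ellipticals comes from the text; the other two target vectors are assumed.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 30 * 30, 2, 3

# Only "100 = elliptical" is stated in the text; the rest is assumed.
TARGETS = {
    "elliptical": (1, 0, 0),
    "simple spiral": (0, 1, 0),
    "barred spiral": (0, 0, 1),
}


def count_weights(n_in, n_hid, n_out, biases=True):
    """Number of adjustable parameters in a fully connected network."""
    links = n_in * n_hid + n_hid * n_out
    return links + (n_hid + n_out if biases else 0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """One weight row per neuron; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def layer_output(inputs, layer):
    # zip stops after len(inputs) weights, so the bias is added separately.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
            for row in layer]


def forward(pixels, hidden_layer, output_layer):
    """Propagate a flattened 30 x 30 image through the network."""
    return layer_output(layer_output(pixels, hidden_layer), output_layer)


rng = random.Random(0)
hidden_layer = init_layer(N_INPUT, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_OUTPUT, rng)
outputs = forward([0.5] * N_INPUT, hidden_layer, output_layer)
n_links = count_weights(N_INPUT, N_HIDDEN, N_OUTPUT, biases=False)
```

With these dimensions there are 900 × 2 + 2 × 3 = 1806 connection weights, or 1811 adjustable parameters if bias terms are included, compared with 176 training patterns.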
In general the initial training process requires trial and error with the number of neurons in the hidden layer and with the learning parameters. When the network is fully trained, the weights are frozen and can no longer be adjusted. The network is then presented with a file of patterns of unknown galaxies for their classification. Only the results of the best network configuration and the statistics of the first training session are presented here. After training on 176 patterns, the network correctly learns 171 patterns, incorrectly learns 1 pattern and is unable to learn 4 patterns. This means that the learning probability is 97.16%. When the validation file is presented after the training process, the network classifies only 19 of the 37 patterns correctly, while it classifies 13 patterns incorrectly and is unable to recognize 5 patterns. Thus the network correctly identified 51.35% of the 37 patterns. The success ratio, although not very good in the first training session, nonetheless shows the capability of the artificial neural network. The major issue that
relates to the low success ratio is the large number of input neurons, which leads to over-fitting. With 900 neurons in the input layer and 2 neurons in the hidden layer, there are more than 1800 free parameters to work with. This causes the network to learn the unique characteristics rather than the universal characteristics of the training set. Two approaches are used to avoid over-fitting. One is to reduce the number of neurons in the hidden layer. For this reason, no networks with more than two neurons in the hidden layer were tested. The second approach is to flood the network with so many examples that it cannot possibly learn the idiosyncrasies of the samples in the training set. Unfortunately, the small number of available patterns prevented us from constructing large training, validation and test files; thus some over-fitting is present. This problem will be addressed in future work.

The orientation of the galaxies can also contribute to the low success ratio. It is quite possible that a network will consider the same galaxy in different orientations as different objects. However, in the present work this is not a problem, as none of the galaxies were altered in orientation with respect to the original image during the image processing stages. One way to take the alignment of images into account is to rotate each image by several different angles and add the rotated copies to the training file. This procedure may improve the learning capability of the network. The training process is also greatly affected by the variation of the background scene from pattern to pattern. It is expected that by increasing the number of patterns in the training file as well as the validation file, it will be possible not only to greatly improve the success rate, but also to minimize the effects of over-fitting, orientation and the changing background scene. This procedure is currently under investigation.
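The rotation procedure suggested above can be sketched as follows. This Python example is a minimal illustration that restricts itself to rotations by multiples of 90 degrees, an assumption made because such rotations need no pixel interpolation; arbitrary angles would require resampling, which is not shown. The function names `rotate90` and `augment_with_rotations` are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_rotations(patterns):
    """Return each (image, target) pattern at 0, 90, 180 and 270 degrees.

    The class target is unchanged by rotation, so it is simply repeated.
    """
    augmented = []
    for image, target in patterns:
        current = [list(row) for row in image]
        for _ in range(4):
            augmented.append((current, target))
            current = rotate90(current)
    return augmented


# A toy 2 x 2 "galaxy" labeled as elliptical (target vector 100).
training = [([[1, 2], [3, 4]], (1, 0, 0))]
augmented = augment_with_rotations(training)
```

Each input pattern yields four training patterns, so a training file of 176 images would grow to 704 patterns without any new observations. Whether this actually improves the validation success ratio would have to be tested.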
5. Conclusions

Computer vision and artificial neural network techniques can be used for automated classification of galaxies. Classifiers based on shape features show promising results. The main work to be done is to acquire the galaxy images and extract the shape features. The shape features will have the advantage of being scale and rotation invariant. This work is being done presently and the results will be presented in future publications. Classifiers based on a direct image supervised neural network also show promising results. Construction of additional independent training files is under way to improve the performance of the classifier. We suggest that a very robust and efficient classifier can be constructed by using shape features as inputs to artificial neural networks.
Acknowledgements

This research has made use of the Digitized Sky Surveys that were produced at the Space Telescope Science Institute under U.S. Government grant NAG W-2166. The authors thank all the organizations that are involved in the production of the Digitized Sky Survey. These organizations are listed at the Space Telescope Science Institute web site. Thanks are due to the Astronomical Data Center (ADC) at NASA Goddard Space Flight Center for providing access to the NGC catalog. Thanks are also due to the people who have developed the Stuttgart Neural Network Simulator at the University of Stuttgart. The authors also thank the Department of Physics at Illinois State University for providing the computational and manuscript preparation facilities for this research. Finally, the authors thank the referee for comments that improved the quality of this manuscript.
References

Abraham, R.G., Tanvir, N.R., Santiago, B.X., Ellis, R.S., Glazebrook, K. and Van den Berg, S.: 1996, Mon. Not. R. Astron. Soc. 279, L49.
Clark, J.W.: 1999, in: J.W. Clark, T. Lindenaw and M.L. Ristig (eds.), Scientific Applications of Neural Networks, Springer Lecture Notes in Physics, Vol. 522, Springer-Verlag, Berlin.
Doi, M., Fukugita, M. and Okamura, S.: 1993, Mon. Not. R. Astron. Soc. 264, 832.
Fausett, L.: 1994, Fundamentals of Neural Networks: Architectures, Algorithms and Applications, Prentice Hall.
Lahav, O., Naim, A., Sodré, L., et al.: 1996, Mon. Not. R. Astron. Soc. 283, 207.
Masters, T.: 1993, Practical Neural Network Recipes in C++, Academic Press.
Mahonen, P.H. and Hakala, P.J.: 1995, Automated Source Classification using a Kohonen Network, Astrophys. Lett. 452, L77–L80.
Nielsen, M.L. and Odewahn, S.C.: 1995, Bull. Am. Astron. Soc. 26, no. 4, AAS Meeting held in Tucson, AZ, Jan. 8–12.
Naim, A., Lahav, O., Buta, R.J., et al.: 1995, Mon. Not. R. Astron. Soc. 274, 1107.
Naim, A., et al.: 1996, Mon. Not. R. Astron. Soc. 275, 567.
Parker, J.R.: 1994, Practical Computer Vision using C, John Wiley and Sons Inc., p. 40.
Parker, S. and Roth, J.: 1996, Sky and Telescope, May.
Rumelhart, D., McClelland, J., and the PDP Research Group: 1986, Parallel Distributed Processing, MIT Press, Cambridge, MA.
Sebok, W.: 1980, A Faint Galaxy Counting System: SPIE Conference on Applications of Digital Image Processing, SPIE 264, 213.
Serra-Ricart, M. and Xavier, C.: 1993, Astron. J. 106, 1685.
Spiekermann, G.: 1992, Astron. J. 103 (6), 2102.
Storrie-Lombardi, M.C., Lahav, O., Sodré, L., et al.: 1992, Mon. Not. R. Astron. Soc. 259, 8.
Zell, A., Mache, N. and Sommer, T.: 1990, Applications of neural networks: In Proc. Applications of Neural Networks Conf., SPIE 1469, 535–544, Orlando Florida, Aerospace Sensing Intl. Symposium.
Zell, A., Mache, N., Sommer, T. and Korb, T.: 1991, Recent Developments of the SNNS Neural Network Simulator: In Proc. Applications of Neural Networks Conf., SPIE 1469, 708–719, Orlando Florida, Aerospace Sensing Intl. Symposium.