Genet Program Evolvable Mach (2011) 12:81–83 DOI 10.1007/s10710-010-9119-9 BOOK REVIEW
Gisele L. Pappa, Alex Freitas: Automating the design of data mining algorithms, an evolutionary computation approach Springer, 187 pages, ISBN-13: 978-3642025402 John Woodward
Published online: 24 August 2010 Ó Springer Science+Business Media, LLC 2010
The book proposes using grammar based genetic programming (GP) to evolve innovative data mining algorithms for classification tasks. These algorithms are then extensively tested on a number of data sets. What sets this approach apart from earlier methods is that previous attempts simply tune a numerical parameter, whereas this method actually generates new algorithms in the form of if-then-else statements. In other words, genetic programming is used to automate the design of rule induction algorithms, as opposed to manually designing them which is the conventional approach. Pappa and Freitas concentrate on easy-to-interpret classification rules of the form of ‘‘if (conditions) then (class)’’ as we (humans) are more likely to place confidence in human interpretable rules, rather than black box approaches such as artificial neural networks. Thus the book is about decision support rather than automated decision making. The case for both the use of genetic programming as a methodology and data mining as an application area are well presented and well argued in chapters 2 and 3. Their methodology is an alternative to manually designing algorithms and is therefore one more step in the direction of automation. In addition, not only are algorithms automatically generated, but at the same time they are tailored to the specific training data. Chapter 1 gives an introduction to the book, while chapters 2 and 3 give introductions to data mining and genetic programming. The introductions to data mining and GP may have been included for the sake of completeness; however I would rather have seen chapters 2 and 3 reduced in size and more depth and breadth in the remaining chapters concerning the actual research.
J. Woodward (&) The University of Nottingham, Ningbo, China e-mail:
[email protected]
123
82
Genet Program Evolvable Mach (2011) 12:81–83
Chapter 4 makes the important distinction between a classification algorithm and a classification model; an algorithm being more general than a model in that a classification algorithm can generate a classification model. This ties in with a subtle but important point that a GP individual (in this book) is evaluated, not on a data set, but on a meta-data set (a set of data sets). Similarly a distinction is drawn between optimization and classification. Both distinctions lead to a crucial clarification of the main contribution in the book; a new methodology to generate data mining algorithms. Chapter 5 describes the main reason for using a grammar in the genetic programming system which is to supply the system with background knowledge about the problem being tackled (page 111). The grammar was created after a comprehensive study of a number of basic rule induction algorithms. A number of major new components are also inserted into the grammar, for example, a terminal ‘‘typicalExample’’, which was borrowed from instance-based learning. One of the strengths of the book is that they deal with the problem of over-fitting. Over-fitting means modeling the training data to such an extent it is almost memorizes the given data, and the model fails to generalize to previously unseen data. This has been an issue for GP, where over-fitting is often ignored as experiments are typically run for a predetermined number of generations. This may be a ‘‘hang-up’’ due to the pioneering work of John Koza, who includes the number of generations in a parameter table. In addition, the authors also consider receiver operating characteristics, which are important when dealing with unbalanced classification data. Chapter 6 presents the experimental results, where the evolved rule induction algorithms are evaluated in five phases (page 138). Some of the human designed algorithms (e.g. CN2) are expressible within the grammar while others are not (e.g. C4.5). Unfortunately the authors do not explain why this is the case. The evolved classification algorithms could outperform CN2. This result should not come as a surprise to the reader as it was included in the search space, and was relatively simple to evolve. Thus this result confirms the value of evolving data mining rules over manually designing them. One of the most well known repositories of data sets, the UCI data sets, is used as a test bed for the methodology. Most of the algorithms produced were similar to CN2 in terms of syntax. This suggests that CN2 is therefore a good algorithm in terms of average predictive accuracy for the UCI repository. This is possibly because most of the rule induction algorithms were first designed or later modified targeting these data sets (page 148). There are interesting questions regarding experiments designed to illustrate the effectiveness of different grammars (pages 151–153), for example, an experiment where pruning rules are removed from the grammar. In short, these experiments show that simpler versions of the grammar can be as effective as more sophisticated ones. Genetic programming is used rather than, say, hill climbing, which can suffer from ‘‘getting trapped in local optima’’. The experiments of section 6.2.6 confirm that this was the right choice. However I feel that this experiment is largely redundant as the reader is by this point in the book already aware of the danger of getting stuck at a local peak.
123
Genet Program Evolvable Mach (2011) 12:81–83
83
In some senses the book reads like an extended journal article, beginning with an introduction, taking the reader through the usual stages of experimental set up and results and ending with a chapter on future work. I would have hoped for a more adventurous format in a book. Predictably the book closes with suggestions for future work. I would like to have seen more detail in the introductory chapter, which would be a compact version of the book and would point me to more detail later on in the book. The book would benefit from a little restructuring. A few sections seem rather long and could be broken down into subsections (e.g. section 5.2). Also, the layout of some tables could be improved (e.g. tables 6.6–6.9). Future editions would greatly benefit from having a single list of references at the back of the book, rather than at the end of every chapter. However a helpful table of acronyms is included. The book is targeted at researchers and postgraduate students. As the amount of data being mined continues to grow it demands ever more sophisticated mining algorithms. Therefore there is a need for new algorithms and so Pappa and Freitas’ book will be of interest particularly to researchers in data mining. The book is free of typos, and is written in a confident commanding style. In short, this book will appeal to the target audience of Genetic Programming and Evolvable Machines and, I feel, will align with the research interests of its readership.
123