Computational Statistics (2006) 21:1-7, DOI 10.1007/s00180-006-0247-x, © Physica-Verlag 2006
Interview with Genshiro Kitagawa

Genshiro Kitagawa is Director-General of the Institute of Statistical Mathematics (ISM), Tokyo, Japan. He has written many books, including "Akaike Information Criterion Statistics" (with Y. Sakamoto and M. Ishiguro, 1986, D. Reidel Publishing Company), and has published a large number of papers on subjects such as time series analysis, information criteria, and general state-space modeling. In addition, Genshiro Kitagawa is one of the developers of the time series software TIMSAC. He is currently involved with the AISM and with many other academic journals as an editor. He is an active member of several statistical societies, including the council of the IASC (2005-2009). This interview was conducted by the managing editor, Prof. Dr. W. Härdle, and joint editor, Prof. Dr. Y. Mori, during the ISM workshop on "Statistical Tools in Risk Management" in May 2005.
Statistical Science has many roots. Where do you see the roots of Computational Statistics?
I think Computational Statistics stems from computing as an approach to real-world problems, because we cannot avoid computing when solving such problems. For example, we may find roots in matrix computation for multivariate analysis, in numerical optimization for obtaining MLEs, and, more recently, in Bayesian computation and in visualization or graphical representation for data analysis. I heard that, in 1956, the first general-purpose computer was introduced to the ISM for computing the inverses of matrices with dimensions larger than 10. In any case, I think the cutting edge of Statistical Science is always faced with computational problems. Similarly, I remember that my statistics teacher, Prof. Hirotugu Akaike, told us that statisticians have to work three times harder than other researchers, because we must become experts in statistics, domain science, and computation. The development of the general-purpose computer was another factor in the progress of computational statistics. As a result, statistical software has come to play an important role in expanding the statistical community and the number of individuals using statistical methods. Finally, knowledge discovery from data is the principal task of statistics; therefore, various tools have been developed through computational or graphical methods.
What were the most prominent developments for Computational Statistics in the last 5, 10, 20, and 30 years?

In the past 5 years, the problems associated with prediction and knowledge discovery from huge data sets have arisen in various fields of domain science, such as bio-science, marketing, finance, earth science, and environmental science. Various methods have been developed in areas of science close to Statistical Science, such as those used in statistical learning theory and AI.

In the past 10 years, the development of powerful tools for Bayesian analysis, such as MCMC, was important. I think that, because of these, statistics has now been reevaluated as an indispensable tool for scientific research on complex real-world problems.

In the 20 years since 1985, personal computers have become popular tools for statisticians by providing a variety of practical numerical methods, such as bootstrapping and state-space modeling. As a result, Bayesian modeling became a popular and practical tool for statistical analysis.

In the past 30 years, general-purpose computers, which became popular by 1975, were an important development. As running a general-purpose computer required a program, various software packages and subroutines were developed. In the 1970s, it became feasible to obtain the maximum likelihood estimates of complex models by numerical optimization methods. With this numerical approach, it became sufficient to provide a subroutine for evaluating the log-likelihood, which made statistical modeling very easy. I think these developments were important preconditions for the development of various models and model selection techniques, in particular, the information criterion AIC.
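The numerical approach described above, in which the modeler supplies only a subroutine for evaluating the log-likelihood and a general-purpose optimizer finds the MLE, can be illustrated with a minimal Python sketch. The Gaussian toy model, the synthetic data, and all names below are illustrative assumptions, not taken from the interview; AIC is computed as -2 log L + 2k.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x):
    """Negative log-likelihood of an i.i.d. Gaussian model (illustrative)."""
    mu, log_sigma = params          # log-parametrize sigma to keep it positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)    # synthetic data

# The modeler supplies only the log-likelihood subroutine;
# a general-purpose numerical optimizer does the rest.
result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(x,))
k = len(result.x)                               # number of free parameters
aic = 2 * result.fun + 2 * k                    # AIC = -2 log L(theta_hat) + 2k
print(f"MLE: {result.x}, AIC: {aic:.2f}")
```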
Why would you classify yourself as a computational statistician?

Statistical research naturally leads to computational statistics. I have worked on time series modeling, and most of the statistical methodologies I used were computationally very intensive. As real phenomena are complicated, the familiar simple linear Gaussian models are sometimes inadequate. However, for most nonlinear or non-Gaussian models, it is impossible to solve statistical problems analytically; therefore, we must apply computational methods such as numerical optimization, bootstrap methods, or non-Gaussian filters based on numerical integration or the Monte Carlo method.
Figure 1: Genshiro Kitagawa at the Joint Statistical Meetings, Washington, D.C., 1989, with Prof. Will Gersch.
Some people say that, as a computational statistician, one needs mathematics and C++. Is this reflected in today's education?

If "mathematics and C++" could be replaced by "mathematical thinking and skills in efficient computer languages", I think the first sentence is correct, because mathematics and computers are certainly necessary in statistical computing, particularly in developing statistical methods or in the statistical modeling of real phenomena. Unfortunately, it is difficult to say that mathematics and computer languages are taught effectively in statistics education in Japan. Currently, the number of mathematics and statistics classes at universities is decreasing, and computer science classes are focusing more on operating application software than on programming. We believe that universities should teach basic mathematics and data literacy more adequately, both for statisticians and non-statisticians, and should provide computational statisticians with classes balanced between statistical theory and computer languages.
How is Computational Statistics distinguished from Bioinformatics?

Obviously, computational statistics is a cross-disciplinary research methodology. In contrast, bioinformatics is focused on a specific area of domain science. Bioinformatics is a typical area in which we must manage multiparameter problems, and simple data manipulation cannot provide meaningful results. This is not only a challenging area for statisticians, but also one of the most promising areas for statisticians to make contributions in the field of knowledge discovery. However, I believe that computational statistics can contribute to a broader area of the life sciences than bioinformatics.
Figure 2: Genshiro Kitagawa at the First US/Japan Conference on the Frontier of Statistical Modeling, Knoxville, 1992, with Prof. Emanuel Parzen, Prof. Will Gersch, Prof. Tohru Ozaki, Prof. Makio Ishiguro, Prof. Yoshihiko Ogata, Prof. Yoshiyasu Tamura, and Prof. David Findley.
Is there a future for Computational Statistics?

Certainly. It seems to me that the role of computing will become more important in various fields of science. I rather think that the traditional paradigm of scientific research, namely "theory and experiment", is changing to a new paradigm consisting of "theory, experiment, and computing". However, since computation is essential to statistics, the distinction between statistics and computational statistics may become meaningless.
How should new developments in Computational Statistics be published?

Recently, we have advocated the importance of "statistical metaware". Thus, we held an international symposium on the "Art of Statistical Metaware" in March 2005. Typically, we think of hardware and software when referring to computation or information processing. However, we prefer to think that the models and algorithms that specify the design of the software and the hardware are more important in solving real-world problems. Obviously, published manuscripts are insufficient for properly distributing this type of knowledge; therefore, a systematic platform for presenting models, algorithms, and know-how is required. Identifying a proper platform for handling and distributing this type of research result is an important task.
Where will science proliferation lead us? Integrated documents?

At the moment, the creation of large databases and the extraction of information and knowledge from such databases have a strong impact on various fields of scientific research. This approach is different from the traditional methods of scientific research. We used to consider that data should be carefully designed. But it seems to me that in the post-IT era, statisticians should confront the problem of extracting knowledge by combining various types of data, some of which are not carefully designed. I'd like to see what happens next.
Figure 3: Genshiro Kitagawa at the Workshop on Seasonal Adjustment at the US Census Bureau, 1995, with Prof. David Findley and Prof. Makio Ishiguro.
How did you become a computational statistician?
My use of computers started sometime in 1971, when my professor, Dr. S. Furuya of Tokyo University, began to analyze the behavior of logistic maps. This was a few years before the famous work by Robert May in 1976. I tried to generate the path of the map by computer. It was my first experience using FORTRAN and computers. Until this experience, as a student of mathematics, I was mostly interested merely in the existence and uniqueness of a solution. However, during this research, I realized that the construction of a solution or path is a more interesting task.

Soon after I joined the ISM in 1974, I began cooperative research with Prof. Ohtsu of the Tokyo University of Marine Science & Technology on an automated control system for a ship's motion. At that time, hard disks were very fragile, so we had to use paper tape for storage. Furthermore, for the on-board computer we had to write programs in assembly language. Controlling a big ship via statistical modeling was very exciting for me. Incidentally, at that time there were various types of computers at the ISM, including analogue computers, hybrid computers for simulating nonlinear phenomena, and the wreckage of computers constructed from relays or parametrons, an element similar to the transistor developed in Japan.

In 1978, Prof. Akaike suggested that I develop a time series program based on a systematic use of least squares fitting. We found the paper by Prof. Gene Golub on a square-root algorithm for the least squares method. Since I was used to the Householder transformation from the design of optimal controllers, I was able to write a FORTRAN program for fitting AR models in one day. We soon realized that this method could be applied to a very wide range of time series modeling, such as multivariate modeling, subset regression, Bayesian modeling, locally stationary AR modeling, and so on. Eventually, we developed the TIMSAC-78 software package.
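The least squares AR fitting just mentioned can be sketched as follows: build a design matrix of lagged observations and solve the regression stably via a Householder (QR) decomposition. This is a minimal Python illustration, not the TIMSAC-78 code; the simulated AR(2) example and the function name are assumptions for demonstration.

```python
import numpy as np

def fit_ar_least_squares(y, order):
    """Fit an AR(order) model by least squares using a Householder QR
    decomposition, in the spirit of (not identical to) TIMSAC-78."""
    n = len(y)
    # Design matrix of lagged values: row t holds (y[t-1], ..., y[t-order]).
    X = np.column_stack([y[order - j - 1 : n - j - 1] for j in range(order)])
    z = y[order:]
    # numpy's QR routine uses Householder reflections under the hood.
    Q, R = np.linalg.qr(X)
    coeffs = np.linalg.solve(R, Q.T @ z)
    resid = z - X @ coeffs
    sigma2 = resid @ resid / len(z)     # innovation variance estimate
    return coeffs, sigma2

rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(2, 500):                 # simulate an AR(2) process
    y[t] = 0.6 * y[t-1] - 0.3 * y[t-2] + rng.normal()
print(fit_ar_least_squares(y, order=2))
```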
Prof. Akaike had developed a program for the MLE of the ARMA model based on state-space modeling in TIMSAC-74. However, during the development of TIMSAC-78, I realized the usefulness of the state-space model (SSM) in connection with missing-value problems. In 1976, soon after he developed his AIC criterion, Prof. Akaike became interested in Bayesian modeling, and in 1978 he developed a new Bayesian method for seasonal adjustment. When he explained his model to me, I immediately noticed that it could be handled efficiently using state-space modeling. After my joint work with Prof. Will Gersch, I developed a program called DECOMP. At that time, I also noticed that state-space modeling is a very powerful tool for nonstationary time series analysis. Actually, Prof. Will Gersch and I were able to develop various nonstationary time series models, such as the time-varying variance model, a precursor of the stochastic volatility model, the time-varying AR model, and the decomposition model, during our research at the US Census Bureau.

After I returned to Japan, I noticed that we could develop recursive non-Gaussian filtering algorithms using numerical integration. It took about 20 minutes for a single run of the non-Gaussian filtering and smoothing on the 14-MIPS mainframe computer of the ISM. In fact, we had to estimate the unknown parameters by numerical optimization, so it took a very long time before we reached the final result. I usually ran the program during faculty meetings. After the success of non-Gaussian filtering and smoothing via this computational method, I noticed that it could be generalized to nonlinear models and even to general state-space models.

Beginning in 1990, we conducted joint research with colleagues at the ISM on extending the information criterion. This was successfully realized using the bootstrap method and led to a bootstrap information criterion, the EIC. In 1992, during this research, I suddenly realized that the curse of dimensionality in nonlinear filtering might be mitigated by approximating the distribution with many particles, and I developed the Monte Carlo filter. I presented this method at a Japan-US symposium on time series analysis in 1993, but later learned that an almost identical method had been developed at about the same time by an English research group.
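The Monte Carlo filter mentioned above approximates the filtering distribution of a general state-space model by many particles: each particle is propagated through the system model, weighted by the likelihood of the new observation, and then resampled. The following is a minimal bootstrap-style sketch in Python; the random-walk-plus-noise toy model and all names are illustrative assumptions, not Kitagawa's original formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def monte_carlo_filter(y, n_particles=1000, sys_sd=0.5, obs_sd=1.0):
    """Bootstrap-style Monte Carlo (particle) filter for the toy
    state-space model  x_t = x_{t-1} + v_t,  y_t = x_t + w_t
    (the model choice is illustrative, not from the interview)."""
    particles = rng.normal(0.0, 1.0, n_particles)   # initial distribution
    means = []
    for obs in y:
        # System model: propagate each particle through the state equation.
        particles = particles + rng.normal(0.0, sys_sd, n_particles)
        # Observation model: weight particles by the likelihood of y_t.
        w = np.exp(-0.5 * ((obs - particles) / obs_sd) ** 2)
        w /= w.sum()
        # Resample with replacement to realize the filtering density.
        particles = rng.choice(particles, size=n_particles, p=w)
        means.append(particles.mean())
    return np.array(means)

# Usage: filter a noisy random walk.
truth = np.cumsum(rng.normal(0.0, 0.5, 100))
y = truth + rng.normal(0.0, 1.0, 100)
print(monte_carlo_filter(y)[:5])
```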
What do you think of today's statistical software packages?

I am not qualified to talk about this subject. But I think, in general, that statistical software packages should be useful tools for managing real-world problems by providing tools for modeling, analyzing, predicting, and visualizing the data and results. Since data sets are becoming larger, the ability to handle huge data sets will be required by cutting-edge users. However, I have the impression that current statistical software is not geared toward handling huge amounts of data. Currently, our group is developing Web-based statistical software. With this Web-based software, the user can access the latest statistical software without installing it. For example, WebDECOMP, a seasonal adjustment program developed by Prof. Sato, is being accessed 1000-2000 times per month.

Where do you see the big job market for statisticians?

I think there are many opportunities for researchers who are skilled in statistical modeling and data analysis. Our current society can be characterized as both an Information Society and a Risk Society, and we now face two important tasks. As an Information Society, we must perform prediction and knowledge discovery based on huge data sets. As a Risk Society, we are expected to offer scientific tools for evaluating and managing risks. For both problems, proper modeling and data analysis are essential; therefore, the role of the statistician is becoming more important in various fields of research, such as disaster prevention for earthquakes, tsunamis, and typhoons, and safety for food, drugs, traffic, and radiation. I think that young researchers skilled in modeling and data analysis and possessing a strong interest in these real-world problems will be very well received by these domain sciences.