TEST · DOI 10.1007/s11749-012-0306-2 · Original Paper
Functional projection pursuit regression

F. Ferraty · A. Goia · E. Salinelli · P. Vieu
Received: 23 June 2011 / Accepted: 16 July 2012 © Sociedad de Estadística e Investigación Operativa 2012
Abstract In this paper we introduce a flexible approach to approximate the regression function in the case of a functional predictor and a scalar response. Following the Projection Pursuit Regression principle, we derive an additive decomposition which exploits the most interesting projections of the prediction variable to explain the response. On the one hand, this approach allows us to avoid the well-known curse of dimensionality problem, and, on the other hand, it can be used as an exploratory tool for the analysis of functional datasets. The terms of such a decomposition are estimated with a procedure that combines a spline approximation and the one-dimensional Nadaraya–Watson approach. The good behavior of our procedure is illustrated from both theoretical and practical points of view. Asymptotic results state that the terms in the additive decomposition can be estimated without suffering from the dimensionality problem, while applications to real and simulated data show the high predictive performance of our method.

Keywords Additive decomposition · Consistency · Functional predictor · Functional regression · Predictive directions · Projection pursuit regression

Mathematics Subject Classification 62M10 · 62H12 · 62F12
Communicated by Domingo Morales.

F. Ferraty · P. Vieu
Institut de Mathématiques de Toulouse - Équipe de Statistique et Probabilités (ESP), Université Paul Sabatier, 118, route de Narbonne, 31062 Toulouse Cedex, France

A. Goia (✉) · E. Salinelli
Dipartimento di Studi per l'Economia e l'Impresa, Università del Piemonte Orientale "A. Avogadro", Via Perrone, 18, 28100 Novara, Italy
e-mail:
[email protected]
1 Introduction

A part of the recent methodological and applied statistical literature is devoted to the study of estimation methods in the functional regression framework, where the predictor X is a random function defined on an interval I and the scalar response Y follows the regression model $Y = r[X] + \mathcal{E}$. This kind of model appears in many scientific domains such as chemometrics, biology, economics, and climatology and has been analyzed under different specifications of the regression operator r. Much of the recent work has been devoted to the study of methods for estimating the regression operator r. For a review, see Ferraty (2011), Ferraty and Romain (2010), Ferraty and Vieu (2006), or Ramsay and Silverman (2005).

Nonparametric methods with a Nadaraya–Watson-type estimator provide a flexible means to recover the regression operator r. They have been successfully used as an exploratory tool as well as for forecasting purposes. However, the nonparametric approach does not permit a direct interpretation of the estimates. Moreover, like any fully nonparametric method, it is subject to the problem of high dimensionality, which can limit its effective use in practice.

To avoid the limitations of the nonparametric approach, several parametric and semi-parametric methods have been proposed. The most important example is the functional linear model, where the regression operator r is specified as $r[X] = \mu + \int_I \beta(t)X(t)\,dt$. It is well known that the estimation of the functional coefficient β is an ill-posed inverse problem (see Cardot et al. 2003 and Crambes et al. 2009) and requires a regularization method. One solution to this problem is to project the observations onto a finite-dimensional space spanned by eigenfunctions of the (empirical) covariance operator of X (see, e.g., Cardot et al. 1999 and Cardot et al. 2007). The main advantage of this approach is the interpretability of the estimated coefficient by means of its shape. However, the linearity assumption is quite restrictive and is very difficult to verify in the functional regression context.

Generalizations have been pursued in many different directions. Functional additive models (FAM) (see, e.g., Müller and Yao 2008) incorporate an additive structure in the functional principal component scores of X. Generalized functional linear models (see, e.g., Cardot and Sarda 2005 and James 2002) specify a link function depending on the distribution of the response variable. James and Silverman (2005) have introduced the functional adaptive model estimation (FAME), which combines some characteristics of projection pursuit regression (PPR) (see, e.g., Chen 1991; Friedman and Stuetzle 1981; Huber 1985, and Jones and Sibson 1987) and generalized additive models. The functional index model (see, e.g., Eilers et al. 2009 and Ferraty et al. 2003) follows the same spirit. In the latter, it is assumed that X acts on the scalar response only through its projection on one functional direction. An extension to the case of several directions is explored in Amato et al. (2006). Moreover, generalizations to the case of more regressors (all functional, or both functional and real) have been tackled in the nonparametric, semi-parametric, and parametric frameworks (see, for an overview, the works of Aneiros-Pérez and Vieu 2006, Febrero-Bande and González-Manteiga 2011, and Ferraty and Vieu 2009).
In this paper we propose a new flexible functional regression approach that generalizes the popular functional linear model. We adopt an additive decomposition of r which uses projections in the space of the functional predictor. Note that projection has a direct relation to the predictability and interpretability of the model. We keep this feature by defining the most predictive directions and generalizing the linear relationship to include nonlinear or nonparametric relationships. The additive structure provides a flexible alternative to the nonparametric approach while often retaining a reasonable interpretation. Since our approach extends the PPR principle to the functional context, we call it Functional Projection Pursuit Regression (FPPR). In contrast with the above-cited additive-type decompositions, our approach is distribution-free, and the directions are not fixed but chosen with an optimality criterion which takes the response Y into account. Each term of our decomposition is estimated by a procedure combining a spline approximation and the one-dimensional Nadaraya–Watson kernel regression estimate. The number of terms in the approximation is chosen "a posteriori" on the basis of a user-specified fitting threshold.

The predictive performance of our approach has been studied from both theoretical and practical points of view. Theoretical results show that, under quite mild regularity conditions, this fully data-driven method is insensitive to dimensionality effects. An extended empirical analysis supports these results: the practical study with simulated and real data emphasizes the good predictive performance of our method, whether or not an additive decomposition of the regression function is reasonable.

The outline of the paper is the following. In Sect. 2 we define the most predictive directions and introduce an FPPR-type decomposition based on these directions. The methodological issues are discussed in Sect. 3, where estimates of the various terms of our decomposition are defined. Computational issues are discussed in Sect. 4, and finite-sample performances are illustrated in Sect. 5, first by means of simulated experiments and then through a real curve data set coming from chemometrics. The asymptotic behavior of the prediction abilities of the FPPR approach is presented in Sect. 6. A final Sect. 7 states some general conclusions, while a technical Appendix ends the paper.
2 Most predictive directions and components

In this section we introduce an additive approximation of the regression operator, defining what we call the "most predictive directions" and the associated "most predictive additive components".

Denote by H the separable Hilbert space of square-integrable real functions over the interval I (i.e., $H = \{h : \int_I h^2(t)\,dt < +\infty\}$ with $I \subset \mathbb{R}$), endowed with the inner product $\langle g, f\rangle = \int_I g(t)f(t)\,dt$ and the induced norm $\|g\|^2 = \langle g, g\rangle$. Let (X, Y) be a pair of centered random variables taking values in H and $\mathbb{R}$, respectively. We consider the standard regression model $Y = r[X] + \mathcal{E}$ with $r[X] = E[Y \mid X]$, assuming that $E[\mathcal{E} \mid X] = 0$ and $E[\mathcal{E}^2 \mid X] < \infty$.
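Throughout, the projections $\langle\theta, X\rangle$ are ordinary L² inner products. As a minimal sketch (not part of the paper), for curves observed on a common equispaced grid of I, such inner products can be approximated by a Riemann sum; all object names here are illustrative.

```r
# Approximate the inner product <g, f> = int_I g(t) f(t) dt for curves
# observed on a common equispaced grid of I, via a Riemann sum.
inner_prod <- function(g, f, grid) {
  dt <- grid[2] - grid[1]          # equispaced grid step
  sum(g * f) * dt
}

# Example on I = [0, 1] with a grid of 200 points (hypothetical objects)
grid  <- seq(0, 1, length.out = 200)
theta <- sin(pi * grid)                      # a candidate direction
x     <- grid^2 + cos(2 * pi * grid)         # a curve X
proj  <- inner_prod(theta, x, grid)          # the projection <theta, X>
```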
We approximate the unknown regression functional r by a finite sum of terms
$$r[X] \approx \sum_{j=1}^{m} g_j^*\bigl(\langle\theta_j^*, X\rangle\bigr), \qquad (1)$$
where $\theta_j^* \in H$ with $\|\theta_j^*\|^2 = 1$. The aim is to find the principal (projection pursuit) directions $\theta_j^*$ along which to project X and the functions $g_j^*$ that explain the most variability of Y with respect to X. The functions $g_j^*$ are called the ridge functions in the multivariate statistics literature. The pairs $(\theta_j^*, g_j^*)$ are defined as the solutions of successive minimization problems, as explained below. At the first step, $\theta_1^*$ is determined by solving
$$\min_{g}\ \min_{\|\theta_1\|^2 = 1}\ E\bigl[\bigl(Y - g(\langle\theta_1, X\rangle)\bigr)^2\bigr].$$
Equivalently, $\theta_1^*$ is a solution to
$$\min_{\|\theta_1\|^2 = 1} E\bigl[\bigl(Y - E[Y \mid \langle\theta_1, X\rangle]\bigr)^2\bigr], \qquad (2)$$
and $g_1^*$ is
$$g_1^*(u) = E\bigl[Y \mid \langle\theta_1^*, X\rangle = u\bigr].$$
If we set $\mathcal{E}_{1,\theta_1^*} = Y - g_1^*(\langle\theta_1^*, X\rangle)$, then $\mathcal{E}_{1,\theta_1^*}$ and $\langle\theta_1^*, X\rangle$ are uncorrelated. So, in an iterative way, we can define
$$\mathcal{E}_{j,\theta_j^*} = Y - \sum_{s=1}^{j} g_s^*\bigl(\langle\theta_s^*, X\rangle\bigr), \qquad j = 1, \dots, m,$$
with $E[\mathcal{E}_{j,\theta_j^*} \mid \langle\theta_j^*, X\rangle] = 0$ at each stage. Then, we can obtain, for j > 1, the jth direction $\theta_j^*$ by solving the minimization problem
$$\min_{\|\theta_j\|^2 = 1} E\Bigl[\bigl(\mathcal{E}_{j-1,\theta_{j-1}^*} - E\bigl[\mathcal{E}_{j-1,\theta_{j-1}^*} \mid \langle\theta_j, X\rangle\bigr]\bigr)^2\Bigr] \qquad (3)$$
and then define the jth component as
$$g_j^*(u) = E\bigl[\mathcal{E}_{j-1,\theta_{j-1}^*} \mid \langle\theta_j^*, X\rangle = u\bigr]. \qquad (4)$$
In this way, the directions $\theta_j^*$ entering (1) are explicitly constructed, and so, after the mth step, we have the additive decomposition
$$Y = \sum_{j=1}^{m} g_j^*\bigl(\langle\theta_j^*, X\rangle\bigr) + \mathcal{E}_{m,\theta_m^*} \qquad (5)$$
with $E[\mathcal{E}_{m,\theta_m^*} \mid \langle\theta_m^*, X\rangle] = 0$. A stopping criterion for the choice of m will be proposed in Sect. 4. By setting $\mathcal{E}_{0,\theta_0^*} = Y$, we introduce the following definition, which summarizes our procedure.
Definition 1 For each j = 1, . . . , m, we call the solution $\theta_j^*$ of the minimization problem (3) the "jth most predictive direction", and the corresponding function $g_j^*$ defined in (4) the "jth most predictive additive component".

It is implicitly assumed in the remainder of the paper that the above-introduced minimization problems have solutions. Note that, as in the multivariate PPR context, two different pairs $(\theta_j^{1*}, g_j^{1*})$ and $(\theta_j^{2*}, g_j^{2*})$ may produce the same additive term (i.e., $g_j^{1*}(\langle\theta_j^{1*}, X\rangle) = g_j^{2*}(\langle\theta_j^{2*}, X\rangle)$). However, this lack of uniqueness is not a problem from a prediction perspective.

Remark 2 While it is always possible to derive an additive decomposition as in (5) with $\mathcal{E}_{m,\theta_m^*}$ uncorrelated with $\langle\theta_m^*, X\rangle$, the FPPR model actually assumes that there exists $\tilde m$ such that $\mathcal{E}_{\tilde m,\theta_{\tilde m}^*}$ is uncorrelated with X, and thus equality holds in (1). Roughly speaking, this assumption implies that after the $\tilde m$th step all the information on X has been captured. In this case, there are many different ways to ensure a unique decomposition in (1). For instance, we can impose some orthonormality conditions on the directions and differentiability conditions on the corresponding univariate functions, and then follow the same lines as in Ferraty et al. (2003).

Remark 3 When one fixes the directions $\theta_j^*$ to be the first m eigenfunctions of the covariance operator of X, sorted according to the decreasing sequence of eigenvalues, our procedure is analogous to the FAM of Müller and Yao (2008). Our procedure also has some similarity to the FAME of James and Silverman (2005) when they use the identity link function, but the main difference is that our method is able to construct a distribution-free decomposition.

3 Construction of the estimates

We observe data $\{(X_i, Y_i) : i = 1, \dots, n\}$, where the $(X_i, Y_i)$ are independent and identically distributed as (X, Y). In this section we propose an estimation method for the various terms defining the decomposition (1). In Sect. 3.1 we consider the nonparametric estimation of the predictive components for fixed functional directions. This is used in Sect. 3.2 to estimate the most predictive directions. A method for selecting the number of components is finally proposed in Sect. 3.3. Theoretical results are given in Sect. 6, whereas technical details are deferred to the Appendix.

3.1 Estimation of the one-dimensional functional components

For each j = 1, . . . , m (m a fixed integer), given $(\theta_1, \dots, \theta_j)$, we wish to estimate the functions
$$g_{j,\theta_j}(u) = E\bigl[\mathcal{E}_{j-1,\theta_{j-1}} \mid \langle\theta_j, X\rangle = u\bigr]$$
with $\mathcal{E}_{0,\theta_0} = Y$ and
$$\mathcal{E}_{j,\theta_j} = Y - \sum_{s=1}^{j} g_{s,\theta_s}\bigl(\langle\theta_s, X\rangle\bigr).$$
These can be estimated nonparametrically by using the Nadaraya–Watson kernel approach. For all j = 1, . . . , m, the estimates are constructed as
$$\hat g_{j,\theta_j}(u) = \frac{\sum_{i=1}^{n} \mathcal{E}_{j-1,\theta_{j-1},i}\; K_j\!\left(\frac{u - \langle\theta_j, X_i\rangle}{h_j}\right)}{\sum_{i=1}^{n} K_j\!\left(\frac{u - \langle\theta_j, X_i\rangle}{h_j}\right)},$$
where, for all i = 1, . . . , n, $\mathcal{E}_{0,\theta_0,i} = Y_i$ and
$$\mathcal{E}_{j,\theta_j,i} = Y_i - \sum_{s=1}^{j} \hat g_{s,\theta_s}\bigl(\langle\theta_s, X_i\rangle\bigr), \qquad j = 1, \dots, m.$$
Here the $h_j$ are smoothing parameters depending on n, and the $K_j$ are standard kernel weighting functions. The asymptotic theory, developed further in Sect. 6.1, will show that these estimates reach the usual univariate nonparametric rate of convergence.
3.2 Estimates of the most predictive directions

In this section m is fixed, but the predictive directions $\theta_1^*, \dots, \theta_m^*$ are unknown. The natural idea for estimating the $\theta_j^*$ is to look for directions minimizing some prediction criterion. A usual way to do that is to use the cross-validation principle. In all the following, we will use the standard notation $\hat g^{-i}$ for any leave-one-out kernel estimate, that is, any estimate defined as in Sect. 3.1 but based on the leave-one-out sample $\{(X_j, Y_j), j = 1, \dots, n, j \ne i\}$ rather than on the full sample $\{(X_j, Y_j), j = 1, \dots, n\}$. The estimation is done iteratively, and at each step the functional direction is searched over some subset $\Theta = \Theta_n$ with $\Theta \subset H \cap \{\theta : \|\theta\|^2 = 1\}$. The estimate of the most predictive direction $\theta_1^*$ is defined as
$$\hat\theta_1 = \arg\min_{\theta_1\in\Theta} CV_1(\theta_1)$$
with
$$CV_1(\theta_1) = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat g^{-i}_{1,\theta_1}\bigl(\langle\theta_1, X_i\rangle\bigr)\bigr)^2\, I_{X_i\in G},$$
where G is a subset of H introduced for the usual technical boundedness reasons. Estimates of further directions can then be constructed step by step. Once one has at hand the estimates $\hat\theta_1, \dots, \hat\theta_{j-1}$, the next direction is chosen as
$$\hat\theta_j = \arg\min_{\theta_j\in\Theta} CV_j(\theta_j)$$
with
$$CV_j(\theta_j) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(Y_i - \sum_{s=1}^{j-1}\hat g_{s,\hat\theta_s}\bigl(\langle\hat\theta_s, X_i\rangle\bigr) - \hat g^{-i}_{j,\theta_j}\bigl(\langle\theta_j, X_i\rangle\bigr)\Bigr)^2\, I_{X_i\in G}. \qquad (6)$$
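As an illustration of criterion (6), the following sketch scores a candidate direction for the first step (j = 1) by a leave-one-out kernel fit; the kernel, bandwidth, and object names are assumptions made for the example only.

```r
# Leave-one-out criterion CV_1(theta): prediction error of the leave-one-out NW fit (sketch).
cv_direction <- function(theta, X, Y, grid, h, in_G = rep(TRUE, length(Y))) {
  dt   <- grid[2] - grid[1]
  proj <- as.vector(X %*% theta) * dt               # <theta, X_i>
  loo  <- sapply(seq_along(Y), function(i) {
    w <- dnorm((proj[i] - proj[-i]) / h)            # kernel weights built without observation i
    sum(w * Y[-i]) / sum(w)
  })
  mean(((Y - loo)^2) * in_G)
}

# For j > 1 one would replace Y by the current residuals (response minus previously fitted terms).
# The estimated direction minimizes this score over a finite candidate set Theta (a list of vectors):
# theta_hat <- Theta[[which.min(sapply(Theta, cv_direction, X = X, Y = Y, grid = grid, h = 0.5))]]
```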
The theoretical analysis in Sect. 6.2 shows that these estimated directions are asymptotically optimal in some L² sense. Numerical performance is evaluated in Sect. 5 with simulation studies and real data examples, in which good finite-sample behavior is also illustrated. Because these estimates are based on a minimization criterion, the main consideration when implementing such a procedure concerns the choice of the set Θ: it should be sufficiently large to avoid losing the true values of θ, but also sufficiently small to ensure computational feasibility. Note that the proposed algorithm follows ideas similar to those in Hall (1989), since the cross-validation procedure used in estimating the predictive additive component $g_j^*$ removes extraneous bias in estimating $\theta_j^*$.

3.3 How to choose the number of components

We now discuss a method to select the number of "relevant" components required in our decomposition. Since the estimates are constructed by an iterative procedure, a natural way of evaluating $m^*$ is to measure the contribution of additional terms to the prediction. One might consider the cross-validation criterion
$$\min_{m\in\mathbb{N},\, m>0} CV(m),$$
where $CV(m) = CV_m(\hat\theta_m)$, but it is well known that this procedure may overfit $m^*$. In order to avoid overfitting, we introduce the following penalized version of the cross-validation criterion:
$$PCV(m) = CV_m(\hat\theta_m)(1 + u_n),$$
where $u_n$ is any sequence tending to zero as n grows to infinity. The practical choice of $u_n$ will be discussed in Sect. 4. Hence, we define
$$m^* = \arg\min_{m\in\mathbb{N},\, m>0} PCV(m).$$
Note that this is a quite general way of proceeding with dimension estimation (see, for instance, Vieu 2002 for a general presentation of these ideas).
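A tiny sketch of this selection rule follows; the specific form of $u_n$ used here is an assumption borrowed from the practical choice discussed in Sect. 4, not a prescription.

```r
# Penalized cross-validation choice of the number of components (sketch of Sect. 3.3).
# cv_m[m] stores CV_m(theta_hat_m) for m = 1, 2, ...; n is the sample size.
select_m <- function(cv_m, n, C = 1) {
  m   <- seq_along(cv_m)
  u_n <- C * m * log(n) / n          # assumed form of the vanishing penalty sequence (see Sect. 4.2)
  which.min(cv_m * (1 + u_n))
}
```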
4 Computational issues

The aim of this section is to discuss the computational problems one has to solve when implementing the FPPR procedure. Clearly, the main challenge in implementing the procedure concerns the estimation of the directions $\theta_j^*$, because it involves solving minimization problems with respect to a functional parameter. This is illustrated in detail in Sect. 4.1 below. Other steps of the procedure are more standard and are quickly discussed in Sect. 4.2. Finally, a boosting procedure is presented in Sect. 4.3 to improve the predictive power of the FPPR procedure.
4.1 Constructing the estimates

We now focus on the estimation of the additive components $g_j^*$ and directions $\theta_j^*$ from the data. Following the same principle as illustrated in Sect. 2, we implement an iterative method for j = 1, . . . , m, with m to be determined. We base the procedure on an alternating optimization strategy (see, e.g., Friedman and Stuetzle 1981), combining a spline approximation of the directions and the Nadaraya–Watson kernel regression estimate.

Denote by $S_{d,N}$ the (d + N)-dimensional space of spline functions defined on I with degree d and N − 1 interior equispaced knots (with d > 2 and N > 1 integers), and let $\{B_{d,N,s}\}$ be a normalized B-spline basis (see, e.g., De Boor 2001). For j = 1, . . . , m, the spline approximation of $\theta_j$ is expressed as $\gamma_j^T B_{d_j,N_j}(t)$, where $B_{d_j,N_j}(t)$ is the vector of all the B-splines, and $\gamma_j$ is the vector of coefficients satisfying the normalization condition
$$\gamma_j^T \Bigl(\int_I B_{d_j,N_j}(t)\, B_{d_j,N_j}(t)^T\, dt\Bigr)\gamma_j = 1. \qquad (7)$$
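Before detailing the steps, here is a small sketch of how a normalized spline direction satisfying (7) can be built numerically; the interval, degree, number of knots, and the Riemann approximation of the integral are illustrative assumptions.

```r
library(splines)

# Spline space S_{d,N} on I = [0, 1] with degree d = 3 and N - 1 equispaced interior knots (sketch)
tt    <- seq(0, 1, length.out = 200)                        # fine grid used to approximate integrals
d     <- 3; N <- 5
inner <- seq(0, 1, length.out = N + 1)[-c(1, N + 1)]        # N - 1 equispaced interior knots
knots <- c(rep(0, d + 1), inner, rep(1, d + 1))             # full knot vector with repeated boundaries
B     <- splineDesign(knots, tt, ord = d + 1)               # grid x (N + d) B-spline basis matrix

# Gram matrix int_I B(t) B(t)^T dt, approximated by a Riemann sum on the grid
W <- t(B) %*% B * (tt[2] - tt[1])

# Normalize a coefficient vector gamma so that gamma' W gamma = 1, i.e. condition (7)
normalize <- function(gamma, W) gamma / sqrt(drop(t(gamma) %*% W %*% gamma))

gamma0 <- normalize(rnorm(N + d), W)    # an admissible (random) starting coefficient vector
theta0 <- drop(B %*% gamma0)            # the corresponding candidate direction theta(t) on the grid
```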
The estimation procedure is based on the following steps:

• Step 1—Initialize the algorithm by setting m = 1 and current residuals
$$\mathcal{E}_{m-1,\hat\gamma_{m-1},i} = Y_i, \qquad i = 1, \dots, n.$$

• Step 2—Choose the dimension $N_m + d_m$ of $S_{d_m,N_m}$ and fix the initial direction by setting the vector of initial coefficients $\gamma_m^{(0)}$ so that it satisfies (7). Find an estimate $\hat g^{-i}_{m,\gamma_m^{(0)}}$ of $g_m$ using the Nadaraya–Watson kernel regression approach excluding the ith observation $X_i$ (see Sect. 3.1):
$$\hat g^{-i}_{m,\gamma_m^{(0)}}(z) = \frac{\sum_{l\ne i} \mathcal{E}_{m-1,\hat\gamma_{m-1},l}\; K_m\!\Bigl(\frac{z - (\gamma_m^{(0)})^T b_{m,l}}{h_m}\Bigr)}{\sum_{l\ne i} K_m\!\Bigl(\frac{z - (\gamma_m^{(0)})^T b_{m,l}}{h_m}\Bigr)},$$
where $b_{m,l} = \langle B_{d_m,N_m}, X_l\rangle$. Then, compute an estimate $\hat\gamma_m$ by minimizing
$$CV_m(\gamma_m) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\mathcal{E}_{m-1,\hat\gamma_{m-1},i} - \hat g^{-i}_{m,\gamma_m^{(0)}}\bigl(\gamma_m^T b_{m,i}\bigr)\bigr)^2\, I_{X_i\in G}$$
over the set of vectors $\gamma_m \in \mathbb{R}^{N_m+d_m}$ satisfying (7). Update $\gamma_m^{(0)} = \hat\gamma_m$ and repeat the cycle until convergence: the algorithm stops when the variation of $CV_m$ between the previous and the current iteration (normalized by the variance of the current residuals) is positive and less than a prespecified threshold.

• Step 3—If the penalized criterion
$$PCV(m) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\mathcal{E}_{m-1,\hat\gamma_{m-1},i} - \hat g^{-i}_{m,\hat\gamma_m}\bigl(\hat\gamma_m^T b_{m,i}\bigr)\bigr)^2 (1 + u_n)\, I_{X_i\in G}$$
does not decrease, then stop the algorithm. Otherwise, construct the next set of residuals
$$\mathcal{E}_{m,\hat\gamma_m,i} = \mathcal{E}_{m-1,\hat\gamma_{m-1},i} - \hat g_{m,\hat\gamma_m}\bigl(\hat\gamma_m^T b_{m,i}\bigr),$$
update the term counter m = m + 1, and go to Step 2.

It is worth emphasizing that this algorithm follows the methodology described before and is in conformity with the theoretical developments presented further on. The spline space plays the role of the general space Θ over which the principal directions have to be searched, and the well-known approximation properties of spline functions make the general technical assumptions on Θ made later (see Sect. 6.2) reasonable.

4.2 Other computational issues

While the choice of the space Θ over which the minimization has to be performed is the main issue of our procedure, there are several other computational issues which have to be taken into consideration. These issues are discussed only briefly, as they are quite well known and not central to our purposes.

For each j = 1, . . . , m, while $d_j$ may be taken equal to 3 for all j, it is opportune to vary the number $N_j$ of knots according to the complexity of the shape of the direction $\theta_j$ we have to detect. To select the number of knots we have used the classical Akaike Information Criterion (AIC) and the Schwarz Information Criterion (BIC) (see, e.g., Burnham and Anderson 2002 for an overview): at each step j, we choose the $N_j$ which minimizes
$$n \log CV_j\bigl(\hat\theta_j^{\kappa}\bigr) + \delta(\kappa),$$
where $CV_j(\hat\theta_j^{\kappa})$ is the estimated mean square error when one uses κ knots, and δ(κ) is a penalization equal to $2(\kappa + d_j)$ for AIC and $(\kappa + d_j)\log n$ for BIC.

The second problem we have to face is the choice of the smoothing parameter $h_j$. This should be done by a data-driven method. A cross-validation procedure could be used, but from a computational point of view this strategy could be very expensive. Thus, for our purposes, it is more convenient to use a plug-in bandwidth estimator: here we have adopted the procedure proposed in Gasser et al. (1991).

Finally, concerning the third step of the algorithm and the choice of the penalization sequence $u_n$, we have followed general advice on this point (see, for instance, Ferraty et al. 2010a), and we have used the standard choice
$$u_n = C\, m\, \frac{\log n}{n},$$
where C is a fixed positive constant to be chosen suitably.
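A minimal sketch of the knot-count selection described above follows; the candidate knot numbers and stored CV values are hypothetical inputs.

```r
# Choose the number of knots kappa minimizing n*log(CV_j) + delta(kappa), as in Sect. 4.2 (sketch).
select_knots <- function(cv_by_knots, candidate_knots, n, d = 3, criterion = c("AIC", "BIC")) {
  criterion <- match.arg(criterion)
  delta <- if (criterion == "AIC") 2 * (candidate_knots + d) else (candidate_knots + d) * log(n)
  candidate_knots[which.min(n * log(cv_by_knots) + delta)]
}
```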
4.3 A final boosting step

Once the $m^*$ most predictive directions and the components which approximate the link between the functional regressor and the scalar response have been estimated, it is possible to improve the prediction performance by using a boosting procedure with a final fully nonparametric step: we compute the residuals
$$Y_i - \sum_{j=1}^{m^*} \hat g_{j,\hat\theta_j}\bigl(\langle\hat\theta_j, X_i\rangle\bigr),$$
and we estimate the regression function between these residuals and the whole functional regressors $X_i$ by using the Nadaraya–Watson-type estimator proposed in Ferraty and Vieu (2002). Hence, we derive
$$Y_i = \sum_{j=1}^{m^*} \hat g_{j,\hat\theta_j}\bigl(\langle\hat\theta_j, X_i\rangle\bigr) + \hat r_{m^*+1}(X_i) + \eta_i,$$
where the $\eta_i$ are centered independent random errors with finite variance, uncorrelated with $X_i$, and
$$\hat r_{m^*+1}(x) = \sum_{i=1}^{n}\Bigl(Y_i - \sum_{j=1}^{m^*} \hat g_{j,\hat\theta_j}\bigl(\langle\hat\theta_j, X_i\rangle\bigr)\Bigr)\, \frac{K_{m^*}\bigl(\varphi(X_i, x)/h_{m^*+1}\bigr)}{\sum_{i'=1}^{n} K_{m^*}\bigl(\varphi(X_{i'}, x)/h_{m^*+1}\bigr)},$$
where $K_{m^*}$ is a standard kernel function, $h_{m^*+1} > 0$ is a suitable bandwidth, and $\varphi(\cdot, \cdot)$ is a relevant semi-metric (for more details, see Ferraty and Vieu 2006).
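A minimal sketch of this boosting step follows, using the L² distance between curves as the semi-metric; the kernel, bandwidth, and object names are illustrative assumptions rather than the authors' implementation.

```r
# Kernel regression of the FPPR residuals on the whole curves with the L2 semi-metric (sketch).
# X: n x p matrix of curves on an equispaced grid; resid: residuals after the additive part.
boost_predict <- function(x_new, X, resid, grid, h) {
  dt <- grid[2] - grid[1]
  d  <- sqrt(colSums((t(X) - x_new)^2) * dt)       # phi(X_i, x_new): L2 distance between curves
  w  <- dnorm(d / h)                               # kernel weights K(phi / h)
  sum(w * resid) / sum(w)
}

# The boosted prediction at a new curve x_new adds boost_predict(x_new, X, resid, grid, h)
# to the fitted additive part sum_j ghat_j(<theta_hat_j, x_new>).
```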
5 Simulations and application to real data

The methodology developed in Sect. 4 is applied to both simulated and real spectrometric data in order to assess its performance. The estimation procedure has been implemented in R and exploits an optimization method based on the Nelder–Mead algorithm (see Nelder and Mead 1965). For each case considered, we compute the estimates on a training set, and the goodness of prediction is evaluated on a testing sample. The predicted values are compared with those obtained by using two functional competitors, the functional linear model (FLM) and the nonparametric model (NPM in the sequel). For the first, the estimation of the functional linear coefficient is based on the functional principal components regression approach; the number of components is selected by the standard PRESS criterion. For the nonparametric model, we use the kernel estimator proposed in Ferraty and Vieu (2006) with the family of semi-metrics
$$\varphi_p(u, v) = \Bigl(\int_I \bigl(u^{(p)}(t) - v^{(p)}(t)\bigr)^2\, dt\Bigr)^{1/2} \qquad (p = 0, 1, 2, \dots)$$
and K-nearest-neighbor bandwidths, with K selected by cross-validation over the training set. The derivative-based family of semi-metrics is especially relevant when one has to deal with spectrometric curves (as in our real dataset). Indeed, spectrometric curves suffer from a calibration problem intrinsic to the near-infrared spectrometer analyzer; this problem is dealt with by differentiating the curves instead of using the original
curves (see Sect. 5.2 for more details). The results for the linear, nonparametric, and our estimators are given in terms of the Mean Square Error of Prediction (MSEP)
$$MSEP = \frac{1}{n^{out}}\sum_{i=1}^{n^{out}} (y_i - \hat y_i)^2,$$
where $y_i$ and $\hat y_i$ are the true value and the corresponding prediction, and $n^{out}$ is the size of the testing sample.

To complete the analysis for simulated data, we also compare the results with those obtained by fixing the predictive directions (see Sect. 3.1): here we select as the θ's the estimates of the first eigenfunctions of the covariance operator of X. According to Remark 3, in the following we denote this approach as FAM, even though we use a different estimation technique.

For the application, we refer to the Tecator data set (available at http://lib.stat.cmu.edu/datasets/tecator), a well-known case study in the chemometric domain. The data are derived from Near Infrared (NIR) spectroscopy experiments with the aim of predicting the proportion of a particular chemical constituent using as a predictor the spectrum of a mixture of several compounds, discretized at a sequence of wavelengths. This data set is widely used in the literature and has become a benchmark for assessing the predictive quality of functional regression models.

5.1 Simulation study

A set of 1000 independent curves is generated according to the following model:
$$X_i(t) = a_i + b_i t^2 + c_i \exp(t) + \sin(d_i t), \qquad t \in [-1, 1],\ i = 1, \dots, 1000,$$
where the $a_i$ (respectively $b_i$, $c_i$, and $d_i$) are uniformly distributed on [0, 1] (respectively on [0, 1], [−1, 1], and [−2π, 2π]). Curves are discretized over a very fine mesh of one thousand equispaced points: the set of curves is stored in the matrix $\mathbf{X} = [x_i(t_j)]$, i = 1, . . . , 1000, j = 1, . . . , 1000. A random selection of 30 of these functional data is plotted in Fig. 1. We consider the following regression operators:
$$r_1[x] = \frac{1}{\sqrt{2}}\int_{-1}^{1}\Bigl(\sin\Bigl(\tfrac{3}{2}\pi t\Bigr) + \sin\Bigl(\tfrac{1}{2}\pi t\Bigr)\Bigr)\, x(t)\, dt,$$
$$r_2[x] = \bigl(r_1[x]\bigr)^3,$$
$$r_3[x] = \int_{-1}^{1} x^2(t)\, dt,$$
$$r_4[x] = \int_{-1}^{1} \cos\bigl(x(t)\bigr)\, dt,$$
$$r_5[x] = \int_{-1}^{1} x(t)\,\log\bigl|x(t)\bigr|\, dt,$$
$$r_6[x] = \int_{-1}^{0} \sin\bigl(x(t)\bigr)\, dt + \int_{0}^{1} x(t)\, dt.$$
Fig. 1 A random selection of 30 simulated curves
Response values Y corresponding to each $r_p$ are then computed from sequences of independent random errors as follows:
$$Y_i^{(p)} = r_p[X_i] + \sigma_p\, \mathcal{E}_i, \qquad i = 1, \dots, 1000,\ p = 1, \dots, 6,$$
where the $\mathcal{E}_i$ are i.i.d. random variables with zero mean and unit variance, and $\sigma_p$ is equal to ρ times (ρ > 0) the standard deviation of the regression functional: in our experiments we choose ρ = 0.1 and ρ = 0.3, corresponding to small and moderate noise-to-signal ratios. We consider two error distributions: the standard normal N(0, 1) and the standardized gamma γ(4, 1), which is right-skewed. We denote by D(p, ρ, E) each data set obtained under the different experimental conditions. The generation of a large data matrix X for each model has the sole purpose of obtaining a more accurate approximation of $r_p$ and consequently a better evaluation of $\sigma_p$.

We base our simulation experiments on samples of 300 pairs $(X_i, Y_i)$ extracted randomly from D: we use the first 200 as training set and the remaining 100 as testing set. Since, in practice, functional data are not observed continuously but on a grid of measurement locations, we do not use all the columns of X but only a selection. In our experiments we work with both dense and sparse designs of measurement locations. For the dense case, we consider 100 equispaced points, $\{t_j, j = 1, 11, 21, \dots, 991\}$, whereas for the sparse case there are six design points, $\{t_j, j = 1, 168, 334, 501, 668, 834\}$. All computations were done with R using the random seed 1678321.

For every case, we make from one to three steps of the FPPR algorithm, each one followed by a full nonparametric procedure, as illustrated in Sect. 4.3. We use cubic splines, and the number of knots at each step has been fixed to 5 in both the dense and the sparse case. The initialization values of the vector $\gamma_m^{(0)}$ are random (in general, it is convenient to try different initializations). As for the functional nonparametric estimator, the semi-metric used is $\varphi_0(\cdot, \cdot)$, i.e., the standard L² distance between curves.
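For concreteness, here is a minimal R sketch of this data-generating design; the random seed, grid handling, and the choice of $r_3$ as the illustrated operator are assumptions for the example, not the paper's exact settings.

```r
# Simulated curves X_i(t) = a_i + b_i t^2 + c_i exp(t) + sin(d_i t) on [-1, 1] (sketch),
# with responses built from r_3[x] = int_{-1}^{1} x^2(t) dt and rho = 0.1.
set.seed(1)
n_curves <- 1000
tt <- seq(-1, 1, length.out = 1000)
a  <- runif(n_curves, 0, 1);  b <- runif(n_curves, 0, 1)
cc <- runif(n_curves, -1, 1); d <- runif(n_curves, -2 * pi, 2 * pi)
X  <- t(sapply(seq_len(n_curves),
               function(i) a[i] + b[i] * tt^2 + cc[i] * exp(tt) + sin(d[i] * tt)))

dt  <- tt[2] - tt[1]
r3  <- rowSums(X^2) * dt                       # r_3[X_i] by a Riemann sum
rho <- 0.1
Y   <- r3 + rho * sd(r3) * rnorm(n_curves)     # errors scaled by rho times the sd of the functional
```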
Table 1 MSEP and relative MSEP when we use the data sets D(p = 1, ρ, E), with ρ = 0.1, 0.3 and E normal and gamma distributed

Method        | D(1, 0.1, N(0,1))    | D(1, 0.1, γ(1,4))    | D(1, 0.3, N(0,1))    | D(1, 0.3, γ(1,4))
              | Dense     Sparse     | Dense     Sparse     | Dense     Sparse     | Dense     Sparse
FLM           | 0.0139    0.0145     | 0.0178    0.0178     | 0.0417    0.0436     | 0.0543    0.0541
              | (0.0756)  (0.0793)   | (0.0829)  (0.0829)   | (0.1947)  (0.2037)   | (0.1992)  (0.1987)
NPM           | 0.0233    0.0213     | 0.0282    0.0254     | 0.0535    0.0517     | 0.0722    0.0723
              | (0.1268)  (0.1153)   | (0.1316)  (0.1187)   | (0.2498)  (0.2414)   | (0.2649)  (0.2652)
FPPR (m = 1)  | 0.0137    0.0135     | 0.0178    0.0173     | 0.0403    0.0425     | 0.0532    0.0528
              | (0.0747)  (0.0736)   | (0.0833)  (0.0807)   | (0.1883)  (0.1983)   | (0.1952)  (0.1937)
FAM (m = 3)   | 0.0176    0.0269     | 0.0157    0.0220     | 0.0458    0.0578     | 0.0475    0.0527
              | (0.0959)  (0.1462)   | (0.0734)  (0.1028)   | (0.2137)  (0.2699)   | (0.1742)  (0.1933)
When we work with fixed directions, we use from one to four eigenfunctions of the covariance operator of X: note that the cumulative explained variances of the corresponding linear principal components are 66.9 %, 85.6 %, 98.5 %, and 99.5 %.

Tables 1–6 collect the out-of-sample results: we provide both the MSEP and the MSEP divided by the empirical variance of the Y's (in brackets). For the additive decompositions (FPPR and FAM), we report only the results for the number of steps (reported in brackets in the tables) that gave the best results. The parameters used in the FPPR approach (the number of knots and m) are selected subjectively, instead of using the methods presented in Sect. 4: we prefer to emphasize the abilities of our FPPR method rather than to explore in detail the data-driven procedure. In this way, we can expect to improve these results by optimizing the choice of the parameters.

The aim of the analysis involving the linear functional $r_1$ is to illustrate that FPPR, differently from NPM, is an equivalent competitor of FLM (see Table 1). A natural feature of FPPR is to be more parsimonious than the additive model with fixed directions. As we can see in the left panel of Fig. 2, our approach reveals through $\hat g_1$ the linearity of the link between the variables.

The case $r_2$ provides an example in which the single-index model is appropriate: the one-step FPPR approach produces better forecasting results than the other methods (see Table 2). Moreover, in this case the relationship between the predictor and the response is detected satisfactorily: as we can see from the right panel of Fig. 2, the estimated first additive component $\hat g_1$ highlights the existence of a nonlinear link, allowing us to infer its nature. In both cases $r_1$ and $r_2$, supplementary steps do not produce improvements; this is why we do not present the boosting step in Tables 1 and 2.

Subsequent models are genuinely nonlinear: while the failure of the linear model is expected, nonlinear and additive techniques show their superiority in making predictions. Tables 3, 4, and 5 report the results for the models based on the functionals $r_3$ to $r_5$. In all these examples, a one-step FPPR followed by a full nonparametric step on the residuals achieves superior results with respect to the NPM approach. More in detail, in examples $r_3$ and $r_5$, a one-step FPPR gives a better performance than NPM,
Fig. 2 Estimates of the first additive components with D(p = 1, ρ = 0.1, E ∼ N(0, 1)) (left panel) and D(p = 2, ρ = 0.1, E ∼ N(0, 1)) (right panel). In the graphics, u stands for $\langle\hat\theta_1, x\rangle$ and g(u) for $\hat g_1(u)$

Table 2 MSEP × 10^{-2} and relative MSEP when we use the data sets D(p = 2, ρ, E), with ρ = 0.1, 0.3 and E normal and gamma distributed

Method        | D(2, 0.1, N(0,1))    | D(2, 0.1, γ(1,4))    | D(2, 0.3, N(0,1))    | D(2, 0.3, γ(1,4))
              | Dense     Sparse     | Dense     Sparse     | Dense     Sparse     | Dense     Sparse
FLM           | 0.2730    0.2841     | 0.3616    0.3874     | 0.5088    0.5138     | 0.7122    0.7249
              | (0.1511)  (0.1572)   | (0.1676)  (0.1795)   | (0.2480)  (0.2504)   | (0.2631)  (0.2678)
NPM           | 0.3082    0.2938     | 0.3555    0.3224     | 0.5818    0.5687     | 0.7322    0.7035
              | (0.1706)  (0.1626)   | (0.1647)  (0.1494)   | (0.2836)  (0.2772)   | (0.2705)  (0.2599)
FPPR (m = 1)  | 0.1215    0.1174     | 0.1681    0.1582     | 0.3798    0.3797     | 0.4757    0.4554
              | (0.0672)  (0.0650)   | (0.0779)  (0.0733)   | (0.1851)  (0.1851)   | (0.1757)  (0.1682)
FAM (m = 4)   | 0.2418    0.2594     | 0.3085    0.2528     | 0.4967    0.5139     | 0.6123    0.5051
              | (0.1338)  (0.1435)   | (0.1430)  (0.1171)   | (0.2421)  (0.2504)   | (0.2262)  (0.1866)
and adding a supplementary nonparametric step, the squared errors are equivalent to those obtained with a three-step FAM approach. As for the experiments with $r_4$, our decomposition makes good predictions, but for the best results a nonparametric step is needed. These results suggest that when we use an additive decomposition where the directions are chosen in order to take the response into account, the number of terms involved may be smaller than in the case of directions selected a priori.

The complexity of the regression operator in the last model requires a two-step FPPR method. According to Table 6, the FPPR method with m = 2 produces results comparable to those derived from the nonparametric approach. With an additional nonparametric step it is possible to slightly improve the predictive performance. Moreover, in this example we can point out the lack of flexibility of FAM, which results in a deterioration of the predictive quality.

To conclude this section, we summarize the main suggestions emerging from the examples treated: first, the FPPR decomposition works in both the linear and nonlinear cases, and it allows one to reveal the nature of the links; second, it appears robust with respect to the sparseness of the design and to asymmetries in the error distribution. Hence, FPPR offers an interesting alternative to NPM, and when we boost FPPR by adding an NPM step, the predictive performances are further improved. Finally, with respect to the case in which the directions are fixed, FPPR produces a more parsimonious model than the one obtained with FAM: a smaller number of informative directions with similar predictive power.
Table 3 MSEP and relative MSEP when we use the data sets D(p = 3, ρ, E), with ρ = 0.1, 0.3 and E normal and gamma distributed

Method        | D(3, 0.1, N(0,1))    | D(3, 0.1, γ(1,4))    | D(3, 0.3, N(0,1))    | D(3, 0.3, γ(1,4))
              | Dense     Sparse     | Dense     Sparse     | Dense     Sparse     | Dense     Sparse
FLM           | 0.8900    0.9024     | 0.9162    0.9420     | 1.2720    1.2789     | 1.3572    1.3883
              | (0.4215)  (0.4273)   | (0.3810)  (0.3917)   | (0.5042)  (0.5070)   | (0.4391)  (0.4491)
NPM           | 0.2975    0.2986     | 0.3553    0.3971     | 0.6601    0.6774     | 0.8397    0.9004
              | (0.1409)  (0.1414)   | (0.1477)  (0.1651)   | (0.2617)  (0.2685)   | (0.2717)  (0.2913)
FPPR (m = 1)  | 0.2443    0.2722     | 0.2462    0.2538     | 0.5742    0.6163     | 0.6091    0.6549
              | (0.1157)  (0.1289)   | (0.1024)  (0.1055)   | (0.2276)  (0.2443)   | (0.1971)  (0.2119)
FAM (m = 3)   | 0.1816    0.2369     | 0.2249    0.2607     | 0.4961    0.5627     | 0.6303    0.6378
              | (0.0859)  (0.1122)   | (0.0935)  (0.1084)   | (0.1967)  (0.2231)   | (0.2039)  (0.2063)
FPPR & NPM    | 0.1898    0.2111     | 0.1960    0.2077     | 0.5370    0.5559     | 0.6015    0.6158
              | (0.0899)  (0.0999)   | (0.0815)  (0.0864)   | (0.2129)  (0.2203)   | (0.1946)  (0.1992)
Table 4 MSEP and relative MSEP when we use the data sets D(p = 4, ρ, E), with ρ = 0.1, 0.3 and E normal and gamma distributed

Method        | D(4, 0.1, N(0,1))    | D(4, 0.1, γ(1,4))    | D(4, 0.3, N(0,1))    | D(4, 0.3, γ(1,4))
              | Dense     Sparse     | Dense     Sparse     | Dense     Sparse     | Dense     Sparse
FLM           | 0.1046    0.1123     | 0.0963    0.1054     | 0.1179    0.1254     | 0.1060    0.1160
              | (1.0234)  (1.0989)   | (1.0183)  (1.1139)   | (1.0366)  (1.1029)   | (1.0164)  (1.1115)
NPM           | 0.0144    0.0151     | 0.0165    0.0172     | 0.0341    0.0344     | 0.0414    0.0440
              | (0.1406)  (0.1475)   | (0.1747)  (0.1817)   | (0.2996)  (0.3021)   | (0.3968)  (0.4220)
FPPR (m = 1)  | 0.0201    0.0216     | 0.0255    0.0269     | 0.0366    0.0420     | 0.0529    0.0538
              | (0.1968)  (0.2118)   | (0.2696)  (0.2847)   | (0.3222)  (0.3689)   | (0.5068)  (0.5156)
FAM (m = 3)   | 0.0119    0.0165     | 0.0133    0.0189     | 0.0304    0.0372     | 0.0357    0.0435
              | (0.1162)  (0.1612)   | (0.1409)  (0.1993)   | (0.2671)  (0.3265)   | (0.3423)  (0.4170)
FPPR & NPM    | 0.0124    0.0123     | 0.0155    0.0174     | 0.0319    0.0327     | 0.0420    0.0457
              | (0.1209)  (0.1201)   | (0.1642)  (0.1839)   | (0.2809)  (0.2875)   | (0.4024)  (0.4384)
5.2 Chemometric application

The data set consists of 215 finely chopped pieces of meat (i.e., 215 individuals) illuminated by a light beam at 100 equally spaced wavelengths in the near infrared
Table 5 MSEP and relative MSEP when we use the data sets D(p = 5, ρ, E), with ρ = 0.1, 0.3 and E normal and gamma distributed

Method        | D(5, 0.1, N(0,1))    | D(5, 0.1, γ(1,4))    | D(5, 0.3, N(0,1))    | D(5, 0.3, γ(1,4))
              | Dense     Sparse     | Dense     Sparse     | Dense     Sparse     | Dense     Sparse
FLM           | 0.0849    0.0817     | 0.1059    0.1020     | 0.1378    0.1337     | 0.1838    0.1786
              | (0.2629)  (0.2530)   | (0.2741)  (0.2638)   | (0.3626)  (0.3519)   | (0.3679)  (0.3576)
NPM           | 0.0423    0.0422     | 0.0627    0.0652     | 0.0953    0.0957     | 0.1426    0.1466
              | (0.1310)  (0.1306)   | (0.1622)  (0.1688)   | (0.2509)  (0.2520)   | (0.2856)  (0.2936)
FPPR (m = 1)  | 0.0389    0.0400     | 0.0502    0.0507     | 0.0846    0.0854     | 0.1252    0.1107
              | (0.1205)  (0.1238)   | (0.1298)  (0.1313)   | (0.2228)  (0.2248)   | (0.2507)  (0.2217)
FAM (m = 3)   | 0.0352    0.0368     | 0.0491    0.0495     | 0.0823    0.0839     | 0.1140    0.1112
              | (0.1091)  (0.1140)   | (0.1269)  (0.1281)   | (0.2166)  (0.2207)   | (0.2282)  (0.2227)
FPPR & NPM    | 0.0304    0.0320     | 0.0370    0.0380     | 0.0803    0.0817     | 0.1086    0.0979
              | (0.0942)  (0.099)    | (0.0956)  (0.0983)   | (0.2114)  (0.2151)   | (0.2174)  (0.1959)
Table 6 MSEP and relative MSEP when we use the data sets D(p = 6, ρ, E), with ρ = 0.1, 0.3 and E normal and gamma distributed

Method        | D(6, 0.1, N(0,1))    | D(6, 0.1, γ(1,4))    | D(6, 0.3, N(0,1))    | D(6, 0.3, γ(1,4))
              | Dense     Sparse     | Dense     Sparse     | Dense     Sparse     | Dense     Sparse
FLM           | 0.0626    0.0675     | 0.0601    0.0663     | 0.0793    0.0837     | 0.0771    0.0839
              | (0.7916)  (0.8532)   | (0.6672)  (0.7368)   | (0.8096)  (0.8542)   | (0.6411)  (0.6980)
NPM           | 0.0144    0.0131     | 0.0170    0.0169     | 0.0342    0.0325     | 0.0413    0.0429
              | (0.1822)  (0.1651)   | (0.1888)  (0.1880)   | (0.3494)  (0.3315)   | (0.3438)  (0.3567)
FPPR (m = 1)  | 0.0324    0.0325     | 0.0388    0.0366     | 0.0499    0.0505     | 0.0543    0.0619
              | (0.4101)  (0.4104)   | (0.4314)  (0.4062)   | (0.5094)  (0.5150)   | (0.4516)  (0.5150)
FPPR (m = 2)  | 0.0123    0.0132     | 0.0156    0.0158     | 0.0280    0.0294     | 0.0422    0.0373
              | (0.1558)  (0.1666)   | (0.1734)  (0.1752)   | (0.2854)  (0.3002)   | (0.3512)  (0.3101)
FAM (m = 4)   | 0.0234    0.0235     | 0.0289    0.0293     | 0.0407    0.0424     | 0.0527    0.0535
              | (0.2952)  (0.2971)   | (0.3206)  (0.3252)   | (0.4155)  (0.4327)   | (0.4381)  (0.4452)
FPPR & NPM    | 0.0115    0.0127     | 0.0135    0.0137     | 0.0285    0.0301     | 0.0388    0.0377
              | (0.1451)  (0.1598)   | (0.1501)  (0.1522)   | (0.2911)  (0.3075)   | (0.3228)  (0.3136)
(NIR) range 850–1050 nm. For each wavelength and each piece of meat, the absorption of radiation is recorded on a Tecator Infratec Food Analyzer. Therefore, we have a collection of 215 NIR absorbance spectra (i.e., spectrometric curves) discretized over 100 measurements. To each curve corresponds a content in percentage of water, fat, and protein determined by analytic chemistry. Our goal is to predict the fat content from the spectrometric curves.

The left panel of Fig. 3 shows a random selection of 30 spectrometric curves obtained from a near-infrared spectrometer analyzer. Before studying these data, it is important to point out that the particle size induces some shift in the NIR spectrometric
Fig. 3 A sample of selected curves (left panel) and corresponding second derivatives (right panel)
curve. This problem is well known as the calibration problem in the chemometric community. The shift acts as a perturbation which does not reflect the chemical composition and hence adds noise to the whole NIR spectrum. To cancel its effect, it is useful to take the first (or second) derivatives of the curves rather than the original data. The reader will find more details about the derivative-based semi-metric in Ferraty and Vieu (2002) and a general overview of the calibration problem in Martens and Naes (1991). When curves are sufficiently smooth, derivatives can be well approximated by means of first (or second) differences. Since these successive differences can be unstable, we estimate the derivatives by fitting a local polynomial (see, e.g., Fan and Gijbels 2000). The right panel of Fig. 3 displays such second derivatives, corresponding to the selected curves.

The data set has been split into two parts: a training set including the first 160 elements and a testing set with the remaining 55. The second-derivative spectra are used to predict the fat content. We have run an FPPR procedure based on cubic splines, with the number of knots selected by AIC. According to the penalized cross-validation criterion, we have stopped the algorithm at m = 2. A purely nonparametric step on the residuals after these two steps has been performed in order to capture some further links.

The out-of-sample performances are compared in Table 7. The functional linear coefficient has been estimated by using the first 15 linear functional principal components. Concerning the nonparametric estimator, following Ferraty and Vieu (2006), we have used the semi-norm $\varphi_2(\cdot, \cdot)$. For completeness, we report the results of FPPR obtained at each step: they show that our method is equivalent to the nonparametric estimation and that, using the boosting procedure, we obtain the best results. When the directions are fixed, the best result is obtained with m = 2, but the performance of FAM is not as good as that obtained with FPPR. The goodness of fit of our estimation method can be further appreciated in Fig. 4. Estimates of the directions and the additive components are shown in Figs. 5 and 6.
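One way to obtain such local polynomial derivative estimates is sketched below; the bandwidth, degree, and toy spectrum are illustrative assumptions, not the authors' settings.

```r
# Second derivative of a spectrum by local polynomial fitting (sketch).
library(KernSmooth)

wl       <- seq(850, 1050, length.out = 100)                     # the 100 NIR wavelengths
spectrum <- 3 + exp(-(wl - 940)^2 / 800) + 0.002 * rnorm(100)    # a toy absorbance curve

fit <- locpoly(wl, spectrum, drv = 2, degree = 3,
               bandwidth = 10, gridsize = 100, range.x = range(wl))
d2  <- fit$y    # estimated second derivative on an equispaced grid over 850-1050 nm
```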
Table 7 MSEP on the testing set for the linear regression, the nonparametric approach, and the FPPR method

Method                            | MSEP
FLM                               | 7.1744
NPM                               | 1.9152
FPPR—Step 1 (8 knots)             | 3.2893
FPPR—Step 1 & Step 2 (15 knots)   | 2.0368
FAM (m = 2)                       | 4.7313
FPPR & NPM                        | 1.6473
Fig. 4 True vs. predicted values in the testing-set
Fig. 5 Estimates of the first most predictive direction θ1 (left panel) and of the corresponding additive component g1 (right panel)
Fig. 6 Estimates of the second most predictive direction θ2 (left panel) and of the corresponding additive component g2 (right panel)
Note that $\hat g_1$ and $\hat g_2$ clearly suggest nonlinear relationships: the first is monotonically decreasing, whereas the second shows a more pronounced nonlinearity. This also explains the difficulty in prediction with the linear model.
6 Theoretical results

We study the convergence of the kernel estimates of the components in Sect. 6.1. In particular, we show in Theorem 4 that the functional nature of the predictor does not affect the rate of convergence. The result is stated for fixed/known directional functional parameters $\theta_1, \dots, \theta_m$. Section 6.2 shows how this result can be used to estimate the optimal directions: the optimality of the estimates $\hat\theta_1, \dots, \hat\theta_m$ is established in Theorem 5 for a fixed value of m. The theoretical study is completed in Sect. 6.3 by showing how the dimension estimate $m^*$ behaves asymptotically with respect to the $\tilde m$ introduced in Remark 2 (see Theorem 6). In this section, when no confusion arises, we will denote by $C^{st}$ any nonnegative real constant.

6.1 Convergence of the estimated univariate components

Assume that the parameters m and $(\theta_1, \dots, \theta_m)$ are fixed. The following result establishes the rate of almost sure convergence of the estimates of the univariate functions involved in the FPPR decomposition. The asymptotics are obtained by assuming that
$$\text{for all } \theta \in H, \text{ the real random variable } \langle\theta, X\rangle \text{ takes values in a set } C, \text{ where } C \text{ is a compact subset of } \mathbb{R}, \qquad (8)$$
while the nonparametric modeling is defined through the following smoothness assumption on the targets $g_{j,\theta_j}$: it is assumed that for some β > 0 and some $C = C(\theta_1, \dots, \theta_m) \in \mathbb{R}^+$ we have, for all j = 1, . . . , m,
$$\forall (u, v) \in C^2,\quad \bigl|g_{j,\theta_j}(u) - g_{j,\theta_j}(v)\bigr| \le C\,|u - v|^{\beta}. \qquad (9)$$
The usual conditional moment requirement
$$\forall r \ge 1,\ \exists C < \infty,\quad E\bigl[|Y|^r \mid X\bigr] = \sigma_r(X) < C\, r!, \qquad (10)$$
with $\sigma_r(\cdot)$ continuous and bounded, is needed for the scalar response variable Y, while for the explanatory functional variable it is assumed that, for all θ ∈ H, the distribution of the real random variable $\langle\theta, X\rangle$ is absolutely continuous with respect to the Lebesgue measure, with density $f_\theta$ satisfying
$$0 < \inf_{u\in C} f_\theta(u) \le \sup_{u\in C} f_\theta(u) < \infty. \qquad (11)$$
Finally, the specific conditions required for the nonparametric kernel smoother are the following: for any j = 1, . . . , m,
$$K_j \text{ is integrable and bounded with support } (-1, 1), \qquad (12)$$
and
$$\exists \tau_j > 0,\ \exists \alpha_j \ge 0,\quad h_j \sim C^{st}\left(\frac{(\log n)^{\alpha_j}}{n}\right)^{\frac{1}{2\tau_j + 1}}. \qquad (13)$$
This set of conditions is needed to get the almost sure asymptotic properties (see (17) below). A little more is needed to get the L² asymptotic expansions (see (18) below). These additional conditions consist in assuming extra smoothness of the targets $g_j$,
$$\forall j = 1, \dots, m,\quad g_j \text{ has } q_j\ (q_j > 0) \text{ continuous derivatives}, \qquad (14)$$
and in using higher-order kernels $K_j$ able to capture the information included in these additional regularity structures. Precisely, it is assumed that each kernel $K_j$ is of order $k_j$ in the sense that, for all j = 1, . . . , m,
$$\forall k = 1, \dots, k_j - 1,\quad \int u^{k} K_j(u)\, du = 0 \quad\text{and}\quad \int u^{k_j} K_j(u)\, du > 0, \qquad (15)$$
with
$$k_j \ge q_j \quad\text{and}\quad k_j < k_{j-1}. \qquad (16)$$
Theorem 4
(i) Assume that conditions (8)–(13) hold with $\tau_j = \beta$ and $\alpha_j = 1$ in (13). Then we have
$$\sup_{u\in C}\bigl|\hat g_{j,\theta_j}(u) - g_{j,\theta_j}(u)\bigr| = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+1}}\right)\quad a.s. \qquad (17)$$
(ii) Assume that conditions (8)–(16) hold with $\tau_j = k_j$ and $\alpha_j = 0$ in (13). Then we have
$$E\left[\int_C \bigl(\hat g_{j,\theta_j}(u) - g_{j,\theta_j}(u)\bigr)^2\, du\right] \sim C^{st}\left(\frac{1}{n}\right)^{\frac{2k_j}{2k_j+1}}. \qquad (18)$$
The proof of this theorem is given in Sect. A.1. It is worth noting that various theorems of this kind can be obtained, for instance by relaxing the density condition into small ball probability considerations, as in Ferraty et al. (2010b). This was not done in order not to mask the main conclusion of Sect. 6.1: one can estimate each component involved in the FPPR decomposition without being affected by the dimensionality of the problem, since the rates of convergence obtained in Theorem 4 have been shown by Stone (1982) to be the optimal ones for univariate regression problems.

Another important point is that the leading terms in the asymptotic expansions obtained in the proof of Theorem 4 depend on θ only through the density functions $f_\theta$ and their derivatives. As a consequence, if all the assumptions needed for Theorem 4 are met for any $\theta_j \in \Theta$ and the uniform versions of assumptions (11) and (9) are made, namely
$$0 < \inf_{\theta\in\Theta,\, u\in C} f_\theta(u) \le \sup_{\theta\in\Theta,\, u\in C} f_\theta(u) < \infty \qquad (19)$$
and, for some β > 0 and for any $(\theta_1, \dots, \theta_m) \in H^m$,
$$\forall j = 1, \dots, m,\ \forall (u, v) \in C^2,\quad \bigl|g_{j,\theta_j}(u) - g_{j,\theta_j}(v)\bigr| \le C^{st}|u - v|^{\beta}, \qquad (20)$$
then the results in Theorem 4 are valid uniformly over $\theta_j \in \Theta$.

6.2 Optimality of the estimated directions

The aim of this section is to prove that the estimated directions $\hat\theta_j$, j = 1, . . . , m, defined in Sect. 3.2 are L²-asymptotically optimal, in the sense that they minimize, as n grows to infinity, some quadratic measure of accuracy. This is stated explicitly in Theorem 5 below. The L² theoretical measure of accuracy we use for $\theta_j$ is defined as
$$MISE_j(\theta_j) = E\left[\int_C \bigl(g_{j,\theta_j^*}(u) - \hat g_{j,\theta_j}(u)\bigr)^2\, du\right],$$
and the theoretical L²-optimal value of $\theta_j$ is defined as
$$\tilde\theta_j = \arg\min_{\theta_j\in\Theta} MISE_j(\theta_j).$$
Theorem 5 below shows how the data-driven parameters $\hat\theta_j$ are asymptotically good approximations of these uncomputable theoretical optimal functional parameters $\tilde\theta_j$. But before that, we need to impose the following standard conditions on the set Θ. The space Θ is allowed to grow with n under the restriction
$$\exists \zeta \in \mathbb{R}^+:\quad \mathrm{Card}(\Theta) \sim C^{st} n^{\zeta}, \qquad (21)$$
and, in a natural way, we have to consider a set Θ which is not too far from the true target directions $\theta_j^*$. This is made precise in the following condition:
$$\forall j = 1, \dots, m,\ \exists \theta^{\,j} \in \Theta,\quad \int_C \bigl(g_{j,\theta_j^*}(u) - g_{j,\theta^{\,j}}(u)\bigr)^2\, du = o\bigl(n^{-1}\bigr). \qquad (22)$$
Finally, the distribution of the functional variable X is supposed to be such that
$$P\bigl(\omega : X(\omega) \in G\bigr) = \tau > 0, \quad\text{and}\quad \forall \theta \in \Theta,\ \forall x \in G,\ \langle\theta, x\rangle \in C. \qquad (23)$$
Theorem 5 Assume that the conditions of Theorem 4(ii) hold for any θ ∈ Θ. Assume also that (19)–(23) hold. Then, for any j = 1, . . . , m, we have
$$\frac{MISE_j(\hat\theta_j)}{MISE_j(\tilde\theta_j)} \to 1 \quad a.s. \text{ as } n \to \infty. \qquad (24)$$
It is worth noting that both Theorems 4 and 5 are quite general. This is because, up to this stage, the integer m is fixed and the only decomposition that has been used is (5), which is always true.

6.3 Asymptotic behavior of the dimension estimate

For each fixed number m of components, we now have at hand optimal estimates $\hat\theta_1, \dots, \hat\theta_m$ of the relevant directions $\theta_1^*, \dots, \theta_m^*$. The next result gives the asymptotic behavior of $m^*$, constructed from these estimated optimal functional directions $\hat\theta_1, \dots, \hat\theta_m$ by means of the penalized cross-validation criterion introduced in Sect. 3.3. This is done under the following simple additional condition on the penalty term:
$$\lim_{n\to\infty} u_n = 0. \qquad (25)$$
From now on we assume that the FPPR decomposition (1) holds with equality for a value $\tilde m$. Using the notation $\sigma_j^2 = \mathrm{Var}(\mathcal{E}_{j,\theta_j^*}\, I_{X\in G})$, the fact that $\tilde m$ corresponds to the "true" FPPR decomposition can be stated by means of the following condition:
$$\forall j \ge \tilde m,\quad \sigma_j^2 = \sigma_{\tilde m}^2, \qquad (26)$$
while, to ensure that $\tilde m$ is the smallest value such that (1) holds, we need to assume that
$$\forall j < j' \le \tilde m,\quad \sigma_j^2 > \sigma_{j'}^2. \qquad (27)$$
Theorem 6 Assume that the conditions of Theorem 5 hold for any integer m > 0. If, in addition, the FPPR model holds (see Remark 2) together with conditions (25)–(27), then, for n large enough, we have
$$P\bigl(m^* \ge \tilde m\bigr) = 1. \qquad (28)$$
7 Concluding remarks

In this paper we have proposed a flexible functional regression approach which adapts the idea of projection pursuit to situations where the explanatory variable is possibly of infinite-dimensional nature. Through theoretical analysis we have shown that the required components of the so-called functional projection pursuit regression decomposition can be estimated without being affected by the high/infinite dimensionality of the variables. The applicability of the procedure is illustrated with simulation studies and a real data example. The main features of our procedure are its predictive ability and interpretability. The estimated functional directions and the associated additive
components are often interpretable, and thus our procedure also provides an informative explanatory tool. It is worth noting that, even though the application of the method deals with curves, the general methodology developed in Sect. 3 and the theoretical results stated in Sect. 6 can be formulated within a general space H. This may include more general functional objects, such as images and arrays, or conventional finite-dimensional objects. In addition, it is straightforward to extend our FPPR approach to the situation of a categorical response variable.

Acknowledgements This work is part of the current research advances on functional statistics developed through the working group STAPH in Toulouse (http://www.math.univ-toulouse.fr/staph). The first and fourth authors would like to thank all the participants in the activities of this group for their continuous and fruitful support. All the authors want to thank the two anonymous referees and the Associate Editor for their pertinent remarks, which have deeply improved the paper. The authors also wish to thank Prof. Juhyun Park (Lancaster) for her very fruitful comments and proofreading of an earlier version of this work.
Appendix: Proofs of theoretical results

A.1 Proof of Theorem 4

For both assertions (i) and (ii), we first give the proofs for the first estimated component $\hat g_{1,\theta_1}$, and we then quickly indicate how the results can be iterated for the higher-order components $\hat g_{j,\theta_j}$.

(a) Proof of (i): Case j = 1. The problem of estimating the function $g_{1,\theta_1}$ is a standard problem of estimating a regression function between a scalar variable U and a functional variable V valued in some semi-metric space $(\mathcal V, d)$. It corresponds here to U = Y, V = X and, for fixed $\theta_1$, to the semi-metric $d(v, v') = |\langle\theta_1, v - v'\rangle|$. The results in Theorem 2 of Ferraty et al. (2010b) concern this general situation and state that
$$\sup_{u\in C}\bigl|\hat g_{1,\theta_1}(u) - g_{1,\theta_1}(u)\bigr| = O\bigl(h_1^{\beta}\bigr) + O_{a.s.}\left(\sqrt{\frac{\Psi_C\bigl(\frac{\log n}{n}\bigr)}{n\,\phi_C(h_1)}}\right),$$
where the functions $\Psi_C$ and $\phi_C$ control the topological structure induced by the semi-metric d. More precisely, the function $\phi_C$ is the small ball probability function
$$\phi_C(\epsilon) = \inf_{v\in C} P\bigl(d(V, v) \le \epsilon\bigr),$$
which controls the concentration of the variable V (in the sense of the topology induced by d), and the function $\Psi_C$ is the Kolmogorov entropy
$$\Psi_C(\epsilon) = \log\bigl(\text{minimal number of balls of radius } \epsilon \text{ needed to cover } C\bigr),$$
which controls the complexity of the set C. The compactness condition (8) on C allows us to write directly that, for the semi-metric $d(v, v') = |\langle\theta_1, v - v'\rangle|$, we have
$$\Psi_C\Bigl(\frac{\log n}{n}\Bigr) \sim C^{st}\log n,$$
while the additional density condition (11) allows us to get $\phi_C(\epsilon) \sim C^{st}\epsilon$. Finally, the claimed result (17) follows directly for j = 1 from the bandwidth condition (13).

(b) Proof of (i): Case j > 1. We just present the proof for j = 2; by induction, it will then be valid for j = 2, . . . , m. We have the following decomposition:
$$\hat g_{2,\theta_2}(u) - g_{2,\theta_2}(u) = A(u) + B(u) \qquad (29)$$
with
$$A(u) = \frac{\sum_{i=1}^{n}\bigl(Y_i - g_1(\langle\theta_1, X_i\rangle)\bigr)\, K_2\bigl(\frac{u - \langle\theta_2, X_i\rangle}{h_2}\bigr)}{\sum_{i=1}^{n} K_2\bigl(\frac{u - \langle\theta_2, X_i\rangle}{h_2}\bigr)} - g_{2,\theta_2}(u)$$
and
$$B(u) = \frac{\sum_{i=1}^{n}\bigl(g_1(\langle\theta_1, X_i\rangle) - \hat g_1(\langle\theta_1, X_i\rangle)\bigr)\, K_2\bigl(\frac{u - \langle\theta_2, X_i\rangle}{h_2}\bigr)}{\sum_{i=1}^{n} K_2\bigl(\frac{u - \langle\theta_2, X_i\rangle}{h_2}\bigr)}.$$
Because $E[Y - g_1(\langle\theta_1, X\rangle) \mid \langle\theta_2, X\rangle] = g_2(\langle\theta_2, X\rangle)$, we obtain directly, by using the same methodology as in point (a),
$$\sup_{u\in C}\bigl|A(u)\bigr| = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+1}}\right)\quad a.s.$$
On the other hand, using condition (12) on the kernel and then the rate of convergence stated in part (a) of this proof, we directly have that
$$\sup_{u\in C}\bigl|B(u)\bigr| \le C^{st}\max_{i=1,\dots,n}\bigl|\hat g_1(\langle\theta_1, X_i\rangle) - g_1(\langle\theta_1, X_i\rangle)\bigr| \le C^{st}\sup_{u\in C}\bigl|\hat g_{1,\theta_1}(u) - g_{1,\theta_1}(u)\bigr| = O\!\left(\left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+1}}\right)\quad a.s.$$
These last two results are enough to show that assertion (17) is true for j = 2, and the argument extends in an obvious iterative way to any j ≥ 2.

(c) Proof of (ii): Case j = 1. This has been obtained in various earlier papers on the nonparametric one-dimensional regression setting (see, for instance, Vieu 1991 for a general form of the result (18)).

(d) Proof of (ii): Case j > 1. We just present the proof for j = 2; by induction, it will then be valid for j = 2, . . . , m. It suffices to use again the decomposition (29) and to follow step by step the scheme of proof of (b) above. In this way:
– the term A can be treated by means of the usual results on univariate kernel regression (see again Vieu 1991, for instance), and the rate of convergence $C^{st} n^{-2k_j/(2k_j+1)}$ appears naturally;
– the term B involves quantities like $\hat g_{1,\theta_1} - g_{1,\theta_1}$, which are of lower order than $n^{-2k_j/(2k_j+1)}$ because of point (c) above and because of condition (16), which ensures that $k_2 < k_1$.
This is enough to prove that (18) holds for j = 2, and, in an obvious way, the same reasoning can be carried out for any j ≥ 2.

A.2 Proof of Theorem 5

We first give the proof for the first two estimated directions $\hat\theta_1$ and $\hat\theta_2$. It then extends quickly by iterating the argument for the higher-order directions $\hat\theta_j$. All these proofs are presented in a rather sketched way because they follow the standard route for cross-validation.

(a) Case j = 1. Because the functional single index model is a special case (with m = 1) of the FPPR studied here, we can directly apply Theorem 1 in Ait-Saidi et al. (2008) to get the result (24) for j = 1. We recall, because it will be helpful in the following, two intermediary results stated in Ait-Saidi et al. (2008). First of all, note that the main line of the proof in Ait-Saidi et al. (2008) consists in showing that, uniformly over Θ, we have
$$CV_1(\theta_1) = MISE_1(\theta_1) + \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}_{1,\theta_1^*,i}^2\, I_{X_i\in G} + o\bigl(MISE_1(\theta_1)\bigr)\quad a.s. \qquad (30)$$
Secondly, note that one key tool used in Ait-Saidi et al. (2008) was the equivalence between various quadratic measures of error; in particular, it was stated that, uniformly over Θ, one has the equivalence
$$MISE_1(\theta_1) \sim \frac{1}{n}\sum_{i=1}^{n}\bigl(g_1(\langle\theta_1^*, X_i\rangle) - \hat g_1(\langle\theta_1, X_i\rangle)\bigr)^2\, I_{X_i\in G} \quad\text{as } n \to \infty. \qquad (31)$$
(b) Case j = 2. The proof is based on the following decomposition:
$$CV_2(\theta_2) = A_2(\theta_2) + B_2 + C_2 + D_2(\theta_2) \qquad (32)$$
with
$$A_2(\theta_2) = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - g_1(\langle\theta_1^*, X_i\rangle) - \hat g_2^{-i}(\langle\theta_2, X_i\rangle)\bigr)^2\, I_{X_i\in G},$$
$$B_2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(g_1(\langle\theta_1^*, X_i\rangle) - \hat g_1(\langle\hat\theta_1, X_i\rangle)\bigr)^2\, I_{X_i\in G},$$
$$C_2 = \frac{2}{n}\sum_{i=1}^{n}\bigl(g_1(\langle\theta_1^*, X_i\rangle) - \hat g_1(\langle\hat\theta_1, X_i\rangle)\bigr)\bigl(Y_i - g_1(\langle\theta_1^*, X_i\rangle) - g_2(\langle\theta_2^*, X_i\rangle)\bigr)\, I_{X_i\in G},$$
$$D_2(\theta_2) = \frac{2}{n}\sum_{i=1}^{n}\bigl(g_1(\langle\theta_1^*, X_i\rangle) - \hat g_1(\langle\hat\theta_1, X_i\rangle)\bigr)\bigl(g_2(\langle\theta_2^*, X_i\rangle) - \hat g_2^{-i}(\langle\theta_2, X_i\rangle)\bigr)\, I_{X_i\in G}.$$
We will now treat each of the four terms involved in this decomposition.
– The term $A_2(\theta_2)$ corresponds exactly to the cross-validation criterion studied in point (a) with the new response variable $Y - g_1(\langle\theta_1^*, X\rangle)$, and without any additional computation we can write directly from (30) that, uniformly for $\theta_2 \in \Theta$,
$$A_2(\theta_2) = MISE_2(\theta_2) + \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}_{2,\theta_2^*,i}^2\, I_{X_i\in G} + o\bigl(MISE_2(\theta_2)\bigr)\quad a.s.$$
– Although $B_2$ does not depend on $\theta_2$ and has no influence on the minimization of the criterion $CV_2$, we give its asymptotic expansion because it will be helpful later. Note that Theorem 4(ii) (with j = 1), combined with (31) and condition (22), gives directly that $B_2 = O(n^{-2k_1/(2k_1+1)})$ a.s. Because $k_2 < k_1$, one gets directly, by using again Theorem 4(ii) (this time with j = 2),
$$B_2 = o\bigl(MISE_2(\theta_2)\bigr)\quad a.s. \qquad (33)$$
It is worth noting that the small o(·) involved in this last equation is uniform over Θ (see the discussion at the end of Sect. 6.1).
– The term $C_2$ can be written as a linear combination
$$C_2 = \frac{2}{n}\sum_{i=1}^{n} W_i U_i$$
of the centered independent variables $U_i = Y_i - g_1(\langle\theta_1^*, X_i\rangle) - g_2(\langle\theta_2^*, X_i\rangle)$, and so it can be treated by means of standard higher-order moment inequalities. More precisely, the term $C_2$ has exactly the same structure as the one treated in Lemma 4 of Ait-Saidi et al. (2008), and the proof in that paper can be followed line by line, using, as often as needed, the Cauchy–Schwarz inequality and (33) to bound the terms involving the factors $W_i$. In the end, we easily get that, uniformly over Θ,
$$C_2 = o\bigl(MISE_2(\theta_2)\bigr)\quad a.s. \qquad (34)$$
– The term $D_2(\theta_2)$ can be treated directly by using the Cauchy–Schwarz inequality together with the result (33), and we have, uniformly over Θ,
$$D_2(\theta_2) = o\bigl(MISE_2(\theta_2)\bigr)\quad a.s. \qquad (35)$$
In the end, from the results (32)–(35) we arrive at the same expression as (30), namely
$$CV_2(\theta_2) = MISE_2(\theta_2) + \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}_{2,\theta_2^*,i}^2\, I_{X_i\in G} + o\bigl(MISE_2(\theta_2)\bigr)\quad a.s. \qquad (36)$$
Because this last result is uniform over $\theta_2 \in \Theta$ and because the second term on the right-hand side of (36) does not depend on $\theta_2 \in \Theta$, the claimed result (24) is shown for j = 2.

(c) Case j > 2. The result (24) can be shown for higher values of j in an obviously similar way.

This proof is finished, but it will be useful for the remainder of the paper to note that we have obtained precisely that, for any j ≥ 1,
$$CV_j(\theta_j) = MISE_j(\theta_j) + \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}_{j,\theta_j^*,i}^2\, I_{X_i\in G} + o\bigl(MISE_j(\theta_j)\bigr)\quad a.s.$$
This result, being uniform for $\theta_j \in \Theta$, can be combined with Theorem 4(ii) to lead, for any j ≥ 1, to
$$CV_j(\hat\theta_j) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{E}_{j,\theta_j^*,i}^2\, I_{X_i\in G} + O\bigl(n^{-2k_j/(2k_j+1)}\bigr)\quad a.s. \qquad (37)$$
A.3 Proof of Theorem 6

Condition (25) ensures that the result (37) holds for the penalized criterion PCV as well as for the standard one CV. That means that we have
$$PCV(j) = \sigma_j^2 + o(1)\quad \forall j > 0,\ a.s.,$$
and finally (27) ensures that, for n large enough,
$$\forall j < \tilde m,\quad PCV(j+1) - PCV(j) < 0\quad a.s.$$
This is enough to see that, for n large enough,
$$P\bigl(m^* \le \tilde m - 1\bigr) \le \sum_{j=1}^{\tilde m - 1} P\bigl(PCV(j+1) > PCV(j)\bigr) = 0.$$
References

Ait-Saidi A, Ferraty F, Kassa R, Vieu P (2008) Cross-validated estimation in the single-functional index model. Statistics 42:475–494
Amato U, Antoniadis A, De Feis I (2006) Dimension reduction in functional regression with application. Comput Stat Data Anal 50:2422–2446
Aneiros-Pérez G, Vieu P (2006) Semi-functional partial linear regression. Stat Probab Lett 76:1102–1110
Burnham KP, Anderson DR (2002) Model selection and multimodel inference, 2nd edn. Springer, Berlin
Cardot H, Sarda P (2005) Estimation in generalized linear models for functional data via penalized likelihood. J Multivar Anal 92:24–41
Cardot H, Ferraty F, Sarda P (1999) Functional linear model. Stat Probab Lett 45:11–22
Cardot H, Ferraty F, Sarda P (2003) Spline estimators for the functional linear model. Stat Sin 13:571–591
Cardot H, Mas A, Sarda P (2007) CLT in functional linear regression models. Probab Theory Relat Fields 138:325–361
Chen H (1991) Estimation of a projection-pursuit type regression model. Ann Stat 19:142–157
Crambes C, Kneip A, Sarda P (2009) Smoothing splines estimators for functional linear regression. Ann Stat 37:35–72
De Boor C (2001) A practical guide to splines. Springer, Berlin
Eilers PHC, Li B, Marx BD (2009) Multivariate calibration with single-index signal regression. Chemom Intell Lab 96:196–202
Fan J, Gijbels I (2000) Local polynomial fitting. In: Schimek MG (ed) Smoothing and regression. Approaches, computation, and application. Wiley series in probability and statistics
Febrero-Bande M, González-Manteiga W (2011) Generalized additive models for functional data. In: Ferraty F (ed) Recent advances in functional data analysis and related topics. Physica-Verlag, Heidelberg
Ferraty F (ed) (2011) Recent advances in functional data analysis and related topics. Physica-Verlag, Heidelberg
Ferraty F, Romain Y (2010) Oxford handbook on functional data analysis. Oxford University Press, London
Ferraty F, Vieu P (2002) The functional nonparametric model and applications to spectrometric data. Comput Stat 17:545–564
Ferraty F, Vieu P (2006) Nonparametric functional data analysis. Springer, New York
Ferraty F, Vieu P (2009) Additive prediction and boosting for functional data. Comput Stat Data Anal 53:1400–1413
Ferraty F, Peuch A, Vieu P (2003) Modèle à indice fonctionnel simple. C R Acad Sci Paris 336:1025–1028
Ferraty F, Hall P, Vieu P (2010a) Most predictive design points for functional data predictors. Biometrika 97:807–824
Ferraty F, Laksaci A, Tadj A, Vieu P (2010b) Rate of uniform consistency for nonparametric estimates with functional variables. J Stat Plan Inference 140:235–260
Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76:817–823
Gasser T, Kneip A, Koehler W (1991) A flexible and fast method for automatic smoothing. J Am Stat Assoc 86:643–652
Hall P (1989) On projection pursuit regression. Ann Stat 17:573–588
Huber PJ (1985) Projection pursuit. Ann Stat 13:435–475
James G (2002) Generalized linear models with functional predictors. J R Stat Soc B 64:411–432
James GM, Silverman BW (2005) Functional adaptive model estimation. J Am Stat Assoc 100:565–576
Jones MC, Sibson R (1987) What is projection pursuit? J R Stat Soc A 150:1–37
Martens H, Naes T (1991) Multivariate calibration. Wiley, New York
Müller HG, Yao F (2008) Functional additive model. J Am Stat Assoc 103:1534–1544
Nelder JA, Mead R (1965) A simplex algorithm for function minimization. Comput J 7:308–313
Ramsay JO, Silverman BW (2005) Functional data analysis, 2nd edn. Springer, New York
Stone C (1982) Optimal global rates of convergence for nonparametric estimators. Ann Stat 10:1040–1053
Vieu P (1991) Quadratic errors for nonparametric estimates under dependence. J Multivar Anal 39:324–347
Vieu P (2002) Data-driven model choice in nonparametric regression estimation. Statistics 36:231–246