© Allerton Press, Inc., 2007. ISSN 1066-5307, Mathematical Methods of Statistics, 2007, Vol. 16, No. 4, pp. 298–317.
Kernel Regression Estimation for Continuous Spatial Processes
S. Dabo-Niang¹* and A.-F. Yao²**
¹ Univ. Charles De Gaulle, Lille, France
² Univ. Aix-Marseille 2, France
Received January 19, 2007; in final form, July 27, 2007
Abstract—We investigate here a kernel estimate of the spatial regression function r(x) = E(Y_u | X_u = x), x ∈ R^d, of a stationary multidimensional spatial process {Z_u = (X_u, Y_u), u ∈ R^N}. The weak and strong consistency of the estimate is shown under sufficient conditions on the mixing coefficients and the bandwidth, when the process is observed over a rectangular domain of R^N. Special attention is paid to achieving optimal and suroptimal strong rates of convergence. It is also shown that this suroptimal rate is preserved by using a suitable spatial sampling scheme.
Key words: kernel density estimation, kernel regression estimation, spatial process, spatial prediction, optimal rate of convergence.
2000 Mathematics Subject Classification: 62G07, 62G08, 62M20, 62M30.
DOI: 10.3103/S1066530707040023
1. INTRODUCTION
Spatial regression is applied to data collected at different locations (e.g., points of the earth) of a subset of R^N, N ≥ 1, in a variety of fields including soil science, geology, oceanography, econometrics, epidemiology, environmental science, forestry, and many others. Data sets indexed by Z^N arise either because the studied process is essentially discrete (e.g., forestry data) or because they result from sampling a continuous process (as with oil data). In the latter case, the data come from discretization of realizations of a random process (Z_u, u ∈ I), I being a subset of R^N. There are settings where it is not reasonable to assume that the stochastic process under study varies over Z^N, such as oil engineering, geostatistics, the study of spatial heterogeneity in biology (bioturbation processes), image analysis, and computer vision. As for continuous-time processes (where discrete observations are available in practice), one has to take into account the continuous character of the spatial process. This requires, for example, more flexibility in data collection. The available literature on spatial parametric models is rather extensive (we refer to Guyon [24], Anselin and Florax [1], Cressie [16], or Ripley [43] for methods, applications, and a list of references). However, the nonparametric treatment of spatial data is limited. In fact, most nonparametric results for dependent random variables are derived from time series or continuous-time processes (see, for example, Yakowitz [50], Roussas [44], Bosq [8, 9], Györfi et al. [25], Truong and Stone [48], Masry and Tjøstheim [37], Lu [35], and Lu and Chen [36]), where the construction of models follows the natural order relation of the real line. Naturally, one tries to extend the nonparametric time series models to the spatial case by mimicking the order relation of R.
In the spatial discrete case, this extension leads us to consider the rectangular regions I_n = {i = (i_1, …, i_N) ∈ (N*)^N, 1 ≤ i_k ≤ n_k, k = 1, …, N}, n = (n_1, …, n_N) ∈ (N*)^N (or, more generally, lattices in R^N). Such regions are used to estimate the spatial density nonparametrically; see the key
* E-mail: [email protected]
** E-mail: [email protected]
references: Tran [46], Tran and Yakowitz [47], Carbon, Hallin, and Tran [14], Carbon, Tran, and Wu [15], Hallin, Lu, and Tran [26], and Carbon [13]. These density estimation results are often the first steps towards the study of regression estimation. Except for a few papers (Biau and Cadre [5], Lu and Chen [33, 34], Hallin, Lu, and Tran [27], Carbon, Francq, and Tran [12]), little attention has been paid to nonparametric estimation of spatial regression functions. The literature on the nonparametric treatment of continuously indexed spatial processes is very limited, with one published paper (Biau [4]) dealing with kernel estimation of the density of continuously indexed spatial random processes (he gives optimal, parametric, and intermediate rates of mean-square convergence). Different approaches to continuously indexed random field models have been used by Efros and Leung [22] and Levina and Bickel [31] for nonparametric texture synthesis methods. The method of Efros and Leung [22] is based on a heuristic texture synthesis method, while that of Levina and Bickel [31] is a nonparametric bootstrap of stationary random fields. Levina and Bickel [31] also prove strong consistency properties of their algorithm under mixing conditions on the random field; they deal with uniform almost sure consistency of the kernel estimate of the joint distribution function of some continuous spatial processes. In this paper, we contribute to Biau's [4] investigations: we are interested in kernel regression estimation for continuously indexed random fields. We mainly investigate weak and strong consistency, as well as convergence rates, of the kernel estimator. This paper also contains some modest ideas about the construction and consistency study of a nonparametric predictor of a continuously indexed spatial process.
Our results represent an extension of the abundant continuous-time results (see, for example, Banon and Nguyen [3], Bosq [9, 10], Leblanc [30], Bosq, Merlevède, and Peligrad [11], Blanke and Bosq [7], Blanke and Pumo [6], among others). This extension is far from trivial. In fact, in order to mimic time series models, we construct an estimator based on hyperrectangles of R^N, corresponding to intervals in the case N = 1. As already mentioned, nonparametric temporal models are based on the natural order of the real line, which does not exist in our case. We obtain similar results by using a partial order, namely, the lexicographic order.
The paper is organized as follows. In Section 2 we provide the notation and assumptions and introduce the kernel regression estimate. Section 3 is devoted to some preliminary results. Section 4 states the strong convergence of the regression estimate under various types of asymptotic and mixing assumptions. A sampling scheme is studied in Section 5. In Section 6, we give a strong consistency result for a first-step nonparametric predictor. Proofs and technical lemmas are given in Section 7.

2. GENERAL SETTING AND NOTATION
We consider an R^d × R-valued measurable and stationary spatial process Z_u = (X_u, Y_u), u ∈ R^N, defined on a probability space (Ω, A, P), where the Z_u's have the same distribution as (X, Y), admitting an unknown density f_{X,Y} with respect to the Lebesgue measure λ over R^{d+1}, d ≥ 1, N ≥ 1. The pair of variables (X, Y) is such that Y is integrable and X has a density f such that inf_{x∈D} f(x) > 0, where D is a compact subset of R^d. Then the regression function r of Y on X is defined by
r(x) = ϕ(x)/f(x) if f(x) ≠ 0, and r(x) = EY if f(x) = 0,
where ϕ(x) = ∫_R y f_{X,Y}(x, y) dy, x ∈ R^d.
In the following, a point u = (u_1, …, u_N) ∈ R^N will be referred to as a site, and 1_N = (1, …, 1) ∈ R^N. Throughout the paper, ‖·‖ will denote any norm over R^d and C an arbitrary positive constant. If A is a set, let 1_A(u) = 1 if u ∈ A and 0 otherwise. The notation W_T = O_p(V_T) (respectively, W_T = O_{a.s.}(V_T)) means that W_T = V_T S_T for a sequence S_T which is bounded in probability (respectively, almost surely).
2.1. The Spatial Estimator
We deal with the problem of estimating r from observations of the process (Z_u, u ∈ R^N) on some region. Since rectangular sets span the Borel sets of R^N, we can restrict the study to a rectangular region without loss of generality. For all T = (T_1, …, T_N), we will denote by I_T the rectangular region I_T = {u ∈ R^N_+, 0 ≤ u_i ≤ T_i, T_i ≥ 1, i = 1, …, N}. We set T̂ = T_1 × ··· × T_N and [T] = ([T_1], …, [T_N]), where [t] denotes the integer part of a real number t. The notation T → +∞ means that min_{i=1,…,N} T_i → +∞. Then the kernel regression estimator based on (Z_u = (X_u, Y_u), u ∈ I_T) is defined by
r_T(x) = ϕ_T(x)/f_T(x) if f_T(x) ≠ 0, and r_T(x) = (1/T̂) ∫_{I_T} Y_u du if f_T(x) = 0,
with
f_T(x) = (1/(T̂ h_T^d)) ∫_{I_T} K((x − X_u)/h_T) du, ϕ_T(x) = (1/(T̂ h_T^d)) ∫_{I_T} Y_u K((x − X_u)/h_T) du, x ∈ R^d,
where f_T is a kernel density estimate of f, K: R^d → R is a bounded integrable kernel such that ∫K(x) dx = 1, and the bandwidth h_T is such that lim_{T→+∞} h_T = 0(+).
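Numerically, the estimator can be approximated by replacing the integrals over I_T with Riemann sums on a fine discretization of the observation domain. Below is a minimal sketch for N = d = 1 with a Gaussian kernel; the function name, toy data, and kernel choice are ours, purely for illustration:

```python
import numpy as np

def kernel_regression_continuous(x, u_grid, X_u, Y_u, h):
    """Riemann-sum approximation of r_T(x) = phi_T(x)/f_T(x), where the
    integrals over I_T are discretized on u_grid (N = d = 1)."""
    du = u_grid[1] - u_grid[0]                 # mesh of the discretization grid, plays the role of 'du'
    w = np.exp(-0.5 * ((x - X_u) / h) ** 2)    # Gaussian kernel values K((x - X_u)/h_T)
    f_T = w.sum() * du                         # ~ integral of K((x - X_u)/h_T) du (constants cancel in the ratio)
    if f_T == 0.0:
        return Y_u.mean()                      # convention (1/T) * integral of Y_u du when f_T(x) = 0
    return (Y_u * w).sum() * du / f_T

# toy path: X_u oscillates, Y_u = X_u**2 + noise, so r(x) = x**2
rng = np.random.default_rng(0)
u = np.linspace(0.0, 50.0, 5001)
X = np.sin(u)
Y = X ** 2 + 0.05 * rng.standard_normal(u.size)
est = kernel_regression_continuous(0.5, u, X, Y, h=0.1)
```

Here est approximates r(0.5) = 0.25 for the toy model, up to the usual smoothing bias and noise.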
2.2. Spatial Dependence Measures
2.2.1. Mixing condition. The spatial dependence of the process will be measured by means of α-mixing. We consider the α-mixing coefficients of the field (Z_u, u ∈ R^N) defined by
α(v) = sup_{u,u′∈R^N, ‖u−u′‖=v} α(σ(Z_u), σ(Z_{u′})), v ≥ 0,
where, for two sub-σ-fields B and C of A,
α(B, C) = sup_{B∈B, C∈C} |P(B ∩ C) − P(B)P(C)|.
We recall that the strong mixing condition (α(u) → 0 as u → ∞) is difficult to check in general; see Doukhan [21], Rio [42], and the references therein for a complete discussion of mixing and examples. Nevertheless, there is a type of mixing condition which is less difficult to verify: the geometric strong mixing condition. A process (Z_u) is called geometrically strongly mixing (GSM) if there exist some s > 0 and C > 0 such that
α(u) ≤ C exp(−su).   (1)
In fact, geometric ergodicity implies GSM (see Diebolt and Guégan [20]); this is why the GSM condition is often used in the literature, both for continuous-time processes (e.g., Bosq [10]) and for discrete random fields (e.g., Carbon et al. [14, 15]). We will consider such a property later on.
2.2.2. Local dependence condition. Since we aim to get the same rate of convergence as in the i.i.d. case, we need some local dependence assumptions. We consider the local measures of dependence of the processes (X_u) and (Z_u), respectively defined by
g_{u,w} = f_{(X_u,X_w)} − f ⊗ f, u, w ∈ I_T,
and
g*_{u,w} = f_{(Z_u,Z_w)} − f_{X,Y} ⊗ f_{X,Y}, u, w ∈ I_T,
where f_{(X_u,X_w)} and f_{(Z_u,Z_w)} denote, respectively, the joint probability densities of (X_u, X_w) and (Z_u, Z_w), and we make the following assumptions. The function g_{u,w} satisfies
Assumption H(Γ, q). There exists a Borel set Γ in R^N × R^N containing D_N = {(u, w) ∈ R^N × R^N: u = w} and q ∈ ]2, +∞] such that
(1) g_{u,w} exists for (u, w) ∉ Γ;
(2) δ_q(Γ) = sup_{(u,w)∉Γ} ‖g_{u,w}‖_q < +∞;
(3) there exists a positive constant ℓ_Γ such that
lim sup_{T→+∞} (1/T̂) ∫∫_{I_T×I_T ∩ Γ} du dw = ℓ_Γ < +∞.
The functions g*_{u,w} and G_{u,w}(x, x′) = ∫∫ y y′ g*_{u,w}(x, y; x′, y′) dy dy′ satisfy
Assumption H*(Γ*, q*). There exists a Borel set Γ* of R^N × R^N containing D_N = {(u, w) ∈ R^N × R^N: u = w} and q* ∈ ]2, +∞] such that
(1) g*_{u,w} and G_{u,w} exist for (u, w) ∉ Γ*;
(2) δ_{q*}(Γ*) = sup_{(u,w)∉Γ*} ‖G_{u,w}‖_{q*} < +∞;
(3) there exists a positive constant ℓ_{Γ*} such that
lim sup_{T→+∞} (1/T̂) ∫∫_{I_T×I_T ∩ Γ*} du dw = ℓ_{Γ*} < +∞.
Assumption A(γ, β). α(‖u − w‖) ≤ γ‖u − w‖^{−β} for (u, w) ∉ Γ, where γ > 0 and β > 0.
Comments. Assumptions H(Γ, q), H*(Γ*, q*), and A(γ, β) are generalizations of those used in the context of continuous-time processes (see Bosq [10]). Assumption A(γ, β) is the classical polynomial mixing condition and is satisfied by a large class of processes, for example, stationary diffusion processes (see Doukhan [21] for more information). Assumption H(Γ, q) (or H*(Γ*, q*)) is the continuous counterpart of the classical assumption used in the discrete case (see, for example, Carbon et al. [15] and Biau and Cadre [5] for spatially dependent processes and Bosq [10] for discrete-time processes). More precisely, the similarity with the assumption used by Bosq [10] (Section 2.2) is more apparent if one replaces assumption H(Γ, q) by the following (as suggested in Biau [4] and Bosq [10]):
Assumption H′(Γ, q).
(1) g_{u,w} exists for (u, w) ∉ Γ and is Lipschitzian uniformly in (u, w);
(2) δ_q(Γ) = sup_{(u,w)∉Γ} ‖g_{u,w}‖_q < +∞;
(3) there exists a positive constant ℓ_Γ such that
lim sup_{T→+∞} (1/T̂) ∫∫_{I_T×I_T ∩ Γ} du dw = ℓ_Γ < +∞.
Moreover, replacing H(Γ, q) by H′(Γ, q) does not change any of the results given here. Condition H(Γ, q) is obtained, for example, by choosing sets Γ of the form
Γ_ε = {(u, v) ∈ R^N × R^N: ‖u − v‖ < ε}, 0 < ε < ∞.
Then condition H(Γ, q)-(3) is automatically satisfied, since ℓ_{Γ_ε} < V(B_N) ε^N, where V(B_N) is the volume of the unit ball B_N = {u ∈ R^N: ‖u‖ ≤ 1} in R^N. If, in addition, there exists ε > 0 such that f_{(X_u,X_w)} exists for (u, w) ∉ Γ_ε, ‖f‖_q < +∞, and sup_{(u,w)∉Γ_ε} ‖f_{(X_u,X_w)}‖_q < +∞, then assumptions H(Γ, q)-(1), (2) hold. Note that this argument is also valid for q = +∞.
The most well-known version of assumption H(Γ, q) in the discrete case is the following (see Carbon et al. [15] or Biau and Cadre [5]):
• there exists a constant D ≥ 0 such that f_{(X_i,X_j)} (or, equivalently, g_{i,j}), i, j ∈ N^N, exists as soon as ‖i − j‖ > D and, additionally, |f_{(X_i,X_j)} − f ⊗ f| ≤ C for some constant C ≥ 0.
• Clearly, this implies that conditions H(Γ, q)-(1) and H(Γ, q)-(2) are satisfied with Γ = Γ̃_D = {(u, v) ∈ Z^N × Z^N: ‖u − v‖ < D}; moreover, condition H(Γ, q)-(3), which corresponds here to Card(I_n × I_n ∩ Γ̃_D) ≤ n̂ D^N, is also satisfied.
All these comments are also valid for assumption H*(Γ*, q*).
2.2.3. Link between both measures. Bosq [10] shows that the relationship between the mixing coefficient and the local measure can be expressed as
α(‖u − w‖) ≤ (1/2) ‖g_{u,w}‖_1, (u, w) ∈ R^{2N},
and, if g_{u,w} is Lipschitzian, then
‖g_{u,w}‖_∞ ≤ (V(B_N)^{−2} + C) (√2 α(‖u − w‖))^{1/(2d+1)}
for some constant C > 0.

3. PRELIMINARY RESULTS
In this section, we establish some preliminary results needed to obtain the rates of convergence of the regression estimate.
A Borel–Cantelli Result
Our almost sure convergence results are obtained by using the following lemma, a Borel–Cantelli type result for continuous spatial processes. It is an extension of the case N = 1 given in Bosq [10]. The proof for N = 1 is based on the natural order relation of R, which does not exist in our case; we show that a similar result is obtained by taking the lexicographic order instead.
Lemma 3.1. Let (Q_t, t ∈ R^N_+), N ≥ 1, be a continuous spatial process such that:
(i) for each η > 0, there exists a real function φ_η, decreasing with respect to the lexicographic order¹, integrable over R^N_+, and satisfying P(|Q_t| > η) ≤ φ_η(t);
(ii) the sample paths of (Q_t) are uniformly continuous with probability 1.
Then lim_{t→∞} Q_t = 0 a.s.
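The lexicographic order on R^N used throughout is straightforward to implement; a small sketch (the function name is ours):

```python
def lex_le(u, w):
    """Lexicographic comparison on R^N tuples: u <= w iff u equals w or,
    at the first coordinate where they differ, the coordinate of u is smaller."""
    for ui, wi in zip(u, w):
        if ui < wi:
            return True
        if ui > wi:
            return False
    return True  # all coordinates equal

# examples in R^3: the first coordinate decides; ties move to the next coordinate
lex_le((1.0, 5.0, 9.0), (2.0, 0.0, 0.0))   # True
lex_le((1.0, 3.0, 0.0), (1.0, 2.0, 9.0))   # False
```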
A Large Deviation Inequality
In order to get rates of convergence, we need the following two large deviation lemmas for bounded, spatially dependent random variables.
Lemma 3.2. Let (ζ_v, v ∈ N^N) be a zero-mean real-valued random spatial process such that sup_v |ζ_v| ≤ b, and let S_n = Σ_{v∈I_n} ζ_v for n ∈ (N*)^N, where I_n = {i = (i_1, …, i_N) ∈ N^N, 1 ≤ i_k ≤ n_k} and n̂ = n_1 × ··· × n_N. Then for each t ∈ (N*)^N such that 1 ≤ t_i ≤ (1/2)n_i and each ε > 0,
P(|S_n| > n̂ε) ≤ 2^{N+1} exp(−(ε²/(4v²(t))) t̂) + 2^{2N+2} (b/ε) α([p]) t̂,   (2)
where t̂ = t_1 × ··· × t_N, p = n_i/(2t_i), v²(t) = (2/p^{2N}) σ²(t) + bε, and
σ²(t) = max_{j∈I_{t+1_N}} E( Σ_{[2pj_k]+1 ≤ v_k ≤ [(2j_k+1)p]+1, 1≤k≤N} w(j, v) ζ_v )²,   (3)
with w(j, v) ∈ R^N such that
w(j, v)_k = 1_{[[2pj_k]+2, [(2j_k+1)p]+1]}(v_k) + ([2pj_k] + 1 − 2pj_k) 1_{{[2pj_k]+1}}(v_k) + ((2j_k+1)p − [(2j_k+1)p]) 1_{{[(2j_k+1)p]+1}}(v_k),
w(j, v) = w(j, v)_1 × ··· × w(j, v)_N, and
α([p]) = sup_{A∈σ(ζ_s, s∈I_n), B∈σ(ζ_s, ∃ s_i ≥ n_i+[p])} |P(A ∩ B) − P(A)P(B)|.
If the process (ζ_v, v ∈ N^N) is strictly stationary, then
σ²(t) = E( Σ_{1≤v_k≤[p], k=1,…,N} ζ_v + Σ_{1≤v_k≤[p+1], ∃ v_k=[p+1]} (p − [p])^{N_v} ζ_v )²,
where N_v is the number of components v_k of the vector v equal to [p+1].
Remark 3.3. (1) Note that if p is an integer, then σ²(t) = var(Σ_{1≤v_k≤p, k=1,…,N} ζ_v).
(2) Although this lemma is the spatial counterpart of Theorem 1.3 of Bosq [10], where N = 1, one does not get exactly the same result by taking N = 1. This is due, on the one hand, to the fact that we do not use Bradley's lemma in the proof; on the other hand, the main difference comes from the decomposition in inequality (20) of the proof, which differs from inequality (1.30) of Bosq [10]. The second bound 2^{2N+2}(b/ε)α([p])t̂ obtained here is then smaller than the one obtained by sketching the result of Bosq. So, if we set N = 1, our lemma improves the time series result of [10].
(3) A similar result can be obtained by replacing the assumption sup_u |ζ_u| ≤ b by E exp(a|ζ_u|^c) < ∞ (for some positive constants a and c); the use of Bradley's lemma is then needed.

¹ u ≤ w, u, w ∈ R^N_+ ⇐⇒ u_1 < w_1 or (∃ i > 1 | u_i < w_i and u_k = w_k, ∀ 1 ≤ k < i).
A Large Deviation Result for the Kernel Density Estimator
The previous Lemma 3.2 yields the following large deviation inequality for the process (f_T(x) − Ef_T(x), T ∈ R^N).
Lemma 3.4. Let T ∈ R^N, n = [T], T_i = n_i δ_i (T_i ≥ 1). Without loss of generality, we assume that there exist an integer p and a vector t ∈ (N*)^N such that n_i = 2pt_i. If ‖f‖_∞ < ∞ and H(Γ, q) and A(γ, β) hold for some Γ, q, γ, and β ≥ 2N(q−1)/(q−2), then for any ε > 0,
P(|f_T(x) − Ef_T(x)| > ε) ≤ 2^{N+1} exp(−(1/(4A_0)) ε² t̂ h_T^d) + 2^{2N+3} (C/(ε h_T^d)) α(p)   (4)
for some positive constants A_0 and C.
Remark 3.5. A similar result holds for the process (ϕ_T(x) − Eϕ_T(x), T ∈ R^N) under the assumptions H*(Γ*, q*), A(γ, β), and an additional condition specified in the following section.

4. CONSISTENCY RESULTS
In this section, we deal with weak and strong convergence rates of the regression estimate. The convergence of the bias Er_T(x) − r(x) to zero behaves as in the i.i.d. case and is therefore omitted. From now on, we assume that the following conditions are satisfied:
• the process (Y_u, u ∈ R^N) is bounded: |Y_u| ≤ b, b > 0;
• the function ψ(x) = E[Y²|X = x] is continuous;
• H(Γ, q) holds for some Γ and q;
• H*(Γ*, q*) holds for some Γ* and q*.
4.1. Weak and Strong Consistency
In addition, we assume in this subsection that:
• the functions f and ϕ satisfy a Lipschitz condition;
• the kernel K satisfies a Lipschitz condition and is such that ∫‖x‖K(x) dx < ∞.
First, we study the convergence rate
Ψ_T = (log T̂ / (T̂ h_T^d))^{1/2}   (5)
and set β_1 = d(dN + 3N + β)/(β − (d+1)N), where β is such that α(u) ≤ γu^{−β}, u > 0.
4.1.1. Weak consistency. As a consequence of Lemma 3.4 we obtain the weak consistency given in the following theorem.
Theorem 4.1. If α(u) ≤ γu^{−β}, γ > 0, u > 0, with
β > max( 2N(q−1)/(q−2), 2N(q*−1)/(q*−2), N(d+1) ),
T̂ h_T^{2+d} (log T̂)^{−1} → 0, and T̂ h_T^{β_1} (log T̂)^{−1} → ∞, then
sup_{x∈D} |r_T(x) − r(x)| = O_p(Ψ_T).   (6)
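For intuition, Ψ_T can be evaluated numerically along a bandwidth sequence compatible with the first condition of Theorem 4.1. The exponent 1/(d+1) below is our illustrative pick (it makes T̂h_T^{2+d}(log T̂)^{−1} → 0; whether the second condition also holds depends on β):

```python
import math

def psi_T(That, d, c=1.0):
    """Rate Psi_T = sqrt(log That / (That * h_T^d)) for the bandwidth
    h_T = c * (log That / That)**(1/(d+1)) (an illustrative choice)."""
    h = c * (math.log(That) / That) ** (1.0 / (d + 1))
    return math.sqrt(math.log(That) / (That * h ** d))

# the rate decreases towards 0 as the observation domain grows (d = 2 here)
rates = [psi_T(That, d=2) for That in (1e2, 1e4, 1e6, 1e8)]
```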
Remark. Theorem 4.1 holds if, instead of our assumptions on K, ϕ, and f (see Biau [4]), we assume that f and ϕ are continuous on a dilated set D̃ = {x ∈ R^d: dist(x, D) = inf_{y∈D} ‖x − y‖ ≤ κ} of the compact set D for some κ > 0, where inf_{D̃} f > 0 and K is a Lipschitz function with bounded support.
4.1.2. Strong consistency. To get the almost sure consistency result, we apply the Borel–Cantelli Lemma 3.1 to the sample paths of Q_T = Ψ_T(V_T(x) − EV_T(x)) (where V_T = f_T or ϕ_T). In addition to the conditions given at the beginning of this section, we assume that the paths of Q_T are uniformly continuous with probability 1 and that the bandwidth h_T is such that
φ_η = C( T̂ h_T^{β_1} (log T̂)^{−1} )^{(−β+N(d+1))/(2N)} is decreasing   (7)
and
T̂ h_T^{β_2} (log T̂)^{β_3} g(T)^{−2N/(β−(d+3)N)} → ∞,   (8)
where
β_2 = d(dN + 3N + β)/(β − (d+3)N), β_3 = (N(d+1) − β)/(β − (d+3)N),
g(T) = Π_{i=1}^N (log T_i)(log log T_i)^{1+ε}, and ε > 0 is an arbitrarily small real number. Then we get the following strong consistency result for the regression function under the polynomial mixing condition.
Theorem 4.2. If α(u) ≤ γu^{−β} with u > 0, γ > 0,
β > max( 2N(q−1)/(q−2), 2N(q*−1)/(q*−2), N(d+1) ),
and T̂ h_T^{2+d}(log T̂)^{−1} → 0, then
sup_{x∈D} |r_T(x) − r(x)| = O_{a.s.}(Ψ_T).   (9)
The next theorem gives a strong consistency result under the geometric mixing condition.
Theorem 4.3. If (X_u) is GSM, T̂ h_T^{2+d}(log T̂)^{−1} → 0, and T̂ h_T^d (log T̂)^{−2N−1} → ∞, then
sup_{x∈D} |r_T(x) − r(x)| = O_{a.s.}(Ψ_T).
4.2. Optimal and Suroptimal Uniform Convergence Rate
In order to achieve the same optimal uniform rates of convergence of the regression estimate as in the i.i.d. case, we consider spatial processes which satisfy the following additional conditions:
• f ∈ C_r^d(l) and ϕ ∈ C_r^d(l′) for some l and l′, and both are bounded;
• K ∈ H(k, c);
• h_T = C_T (log T̂/T̂)^{1/(2r+d)} (C_T → c > 0),
where C_r^d(l) (r = k + λ, 0 < λ ≤ 1, k ∈ N) is the space of real-valued k times differentiable functions G defined on R^d such that
| ∂^{(k)}G/(∂x_1^{j_1} … ∂x_d^{j_d})(x) − ∂^{(k)}G/(∂x_1^{j_1} … ∂x_d^{j_d})(y) | ≤ l‖x − y‖^λ,
x, y ∈ R^d, j_1 + ··· + j_d = k; and K ∈ H(k, c) (k ∈ N*, 0 < c ≤ 1) means that
∫_{R^d} ‖u‖^c |u_1|^{α_1} … |u_d|^{α_d} |K(u)| du < +∞,   (10)
∫_{R^d} u_1^{α_1} … u_d^{α_d} K(u) du = 0, α_1, …, α_d ∈ N, α_1 + ··· + α_d = j, j ∈ [1, k].
Such a kernel is usually called a kernel of order (k, c). Note that traditional kernels are of order (1, 1). In the non-spatial case (N = 1) it is well known (see Bosq [10]) that if the conditions above are satisfied, the optimal uniform strong convergence rate for r_T is log_m T̂ (log T̂/T̂)^{r/(2r+d)}, and the suroptimal strong rate is log_m T̂ (log T̂/T̂)^{1/2}. Here, we show that the same rates hold in the spatial case.
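Kernels of order (k, c) with k > 1 can be constructed explicitly. Below is a standard Gaussian-based fourth-order kernel on R (our example, not from the paper): it integrates to 1 while its moments of orders 1 through 3 vanish, which we check by numerical quadrature:

```python
import numpy as np

def K4(u):
    """Fourth-order Gaussian-based kernel K(u) = (3 - u^2)/2 * phi(u),
    with phi the standard normal density."""
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - u ** 2) * phi

# numerical check of the moment conditions on a fine symmetric grid
u = np.linspace(-12.0, 12.0, 200001)
du = u[1] - u[0]
moments = [np.sum(u ** j * K4(u)) * du for j in range(4)]
# moments is approximately [1, 0, 0, 0]
```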
This optimal rate is obtained by applying the Borel–Cantelli Lemma 3.1 to the process
P_T = (1/log_m T̂)(T̂/log T̂)^{r/(2r+d)} (V_T(x) − EV_T(x)) (where V_T = f_T or ϕ_T)
considered in this section. The following two lemmas ensure that conditions (i) and (ii) of Lemma 3.1 are satisfied for Q_T = P_T.
Lemma 4.4. (1) If α(u) ≤ γu^{−β}, γ > 0, u > 0, where
β > max( 2N(q−1)/(q−2), 2N(q*−1)/(q*−2), N(3r+2d)/r ),
then
P(|P_T| > η) ≤ A/T̂^{1+μ}, η > 0, T_i ≥ 1, i = 1, …, N,   (11)
where A and μ do not depend on x.
(2) If (Z_u) is GSM, then
P(|P_T| > η) ≤ B/T̂^{C(log_m T̂)²}, η > 0, T_i ≥ 1, i = 1, …, N,   (12)
where B and C do not depend on x.
The following lemma is the spatial counterpart of Bosq's [10] Lemma 4.4 (p. 112) for N = 1; its proof is similar and will be omitted.
Lemma 4.5. The process (P_T) satisfies the uniform Lipschitz condition
sup_{x∈R^d, ω∈Ω} |P_T(x, ω) − P_S(x, ω)| ≤ Λ‖T − S‖,   (13)
T = (T_1, …, T_N), S = (S_1, …, S_N), T_i > 1, S_i > 1, i = 1, …, N, where Λ does not depend on (x, ω, S, T).
4.2.1. Optimal rates. We obtain the same rates as in the non-spatial case; see Bosq [10]. The following theorem gives a strong rate of convergence of the regression estimate under the polynomial mixing condition. This result is a consequence of Lemmas 3.1, 4.4, and 4.5.
Theorem 4.6. If α(u) ≤ γu^{−β}, γ > 0, with
β > max( 2N(q−1)/(q−2), 2N(q*−1)/(q*−2), N(3r+2d)/r ),
then
(1/log_m T̂)(T̂/log T̂)^{r/(2r+d)} |r_T(x) − r(x)| → 0 a.s. as T → ∞,   (14)
m ≥ 1, x ∈ R^d.
We now state the following uniform result, which is equivalent to Theorem 4.10 (p. 114) of Bosq [10] for the case N = 1. The proof in [10] is essentially based on arguments concerning the observed values x of the random variable X rather than on the indices. So, although N > 1, the proof is obtained by using arguments similar to those in Bosq [10] and will be omitted.
Theorem 4.7. If (Z_u) is GSM, then
sup_{‖x‖≤T̂^a} |P_T(x)| → 0 a.s.,   (15)
m ≥ 1, a > 0.
We deduce from this theorem the uniform optimal rate of convergence under the geometric mixing condition.
Theorem 4.8. If (Z_u) is GSM, if f and ϕ are ultimately decreasing² with respect to the norm ‖·‖, if sup_{u∈I_T} ‖X_u‖ is measurable for each T, and if E(sup_{u∈I_T} ‖X_u‖)^a < ∞ for some a > 0, then
(1/log_m T̂)(T̂/log T̂)^{r/(2r+d)} sup_{x∈D} |r_T(x) − r(x)| → 0 a.s. as T → ∞,   (16)
m ≥ 1.
4.2.2. Suroptimal rate. If we replace the assumptions H and H* by
H_r: g_{u,w} = g_{u−w} exists for u ≠ w and the mapping u ↦ u^{N−1}‖g_u‖_∞ belongs to L¹(]0, +∞[),
and
H_r*: g*_{u,w} = g*_{u−w} and G_{u,w} = G_{u−w} exist for u ≠ w and the mapping u ↦ u^{N−1}‖G_u‖_∞ belongs to L¹(]0, +∞[),
respectively, then we get:
Theorem 4.9. Under the conditions of Theorem 4.8, with H(Γ, q) and H*(Γ*, q*) replaced by H_r and H_r*, and assuming that h_T ∼ T̂^{−γ}, where 1/(2r) ≤ γ ≤ 1/(2d), we have for all m ≥ 1
(1/log_m T̂)(T̂/log T̂)^{1/2} sup_{x∈D} |r_T(x) − r(x)| → 0 a.s. as T → ∞.   (17)
Remark 4.10. The integrability condition on the mapping u ↦ u^{N−1}‖G_u‖_∞ is the spatial equivalent of assumption A(Γ, λ) in Bosq [10], and its consequences are the same as in [10]. This condition combines an asymptotic independence condition and a local irregularity condition. It means that the information provided by (X_u, X_{u+h}) differs significantly from that given by X_u alone, even if h is small. It also means that the sample trajectories are not smooth. Local irregularity of the observed paths provides more information than discrete data; this partly explains the suroptimal rates (see above). We refer to Biau [4] for an example of a spatial Gaussian process that satisfies the conditions of Theorem 4.9.

² That is, lim_{‖u‖→∞} f(u) = 0.
5. SAMPLING
In practice, data are collected using a sampling scheme; that is, a continuously indexed spatial process is often observed at a finite number of suitably chosen points. Various sampling designs can be employed, depending on how the data are collected, but two kinds are most useful: deterministic designs (points chosen according to a deterministic rule, e.g., periodic sampling) and random designs (points chosen randomly, e.g., Poisson sampling). Here we treat two deterministic designs: irregular sampling and admissible sampling. To our knowledge, little attention has been paid to the study of spatial regression models under sampling designs. Biau [4] studied the influence of discretization on the rate of convergence of his kernel estimator. Lahiri and Zhu [29] studied the properties of the M-estimator of a class of spatial regression models under a stochastic sampling design.
5.1. Irregular Sampling
We consider the process (Z_u, u ∈ R^N) observed on a grid G_n = {t_{1,1}, …, t_{1,n_1}} × ··· × {t_{N,1}, …, t_{N,n_N}}, with n = (n_1, …, n_N) ∈ N^N and n̂ = n_1 × ··· × n_N, where the t_{k,j}'s are such that, for k = 1, …, N,
min_{1≤j≤n_k−1} (t_{k,j+1} − t_{k,j}) ≥ m > 0 for some constant m.
The associated kernel estimator of r is
r̄_n(x) = ϕ̄_n(x)/f̄_n(x) if f̄_n(x) ≠ 0, and r̄_n(x) = (1/n̂) Σ_{i∈G_n} Y_i if f̄_n(x) = 0,
where
f̄_n(x) = (1/(n̂ h_n^d)) Σ_{i∈G_n} K((x − X_i)/h_n) and ϕ̄_n(x) = (1/(n̂ h_n^d)) Σ_{i∈G_n} Y_i K((x − X_i)/h_n), x ∈ R^d.
Since we are dealing with a discrete set, it appears obvious that the behavior of these estimators is the same as that of the estimators studied in Biau and Cadre [5], where estimators of f and r for discrete random fields are considered. Thus all the results of [5] remain valid with slight modifications.
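A minimal sketch of this discrete-site estimator, with a d = 1 covariate observed at N = 2-dimensional sites; the function name, grid construction, and toy data are ours:

```python
import numpy as np

def nw_regression(x, X_obs, Y_obs, h):
    """Kernel regression estimate over the observed sites of the grid:
    kernel-weighted average of the responses (Gaussian kernel, d = 1)."""
    w = np.exp(-0.5 * ((x - X_obs) / h) ** 2)
    if w.sum() == 0.0:
        return Y_obs.mean()          # convention when the density estimate vanishes
    return (w * Y_obs).sum() / w.sum()

rng = np.random.default_rng(1)
# irregular coordinates on each axis with minimal spacing m = 0.1, as required above
t1 = np.cumsum(rng.uniform(0.1, 0.4, 40))
t2 = np.cumsum(rng.uniform(0.1, 0.4, 40))
s1, s2 = np.meshgrid(t1, t2)                        # 1600 sites in the plane (N = 2)
X = (np.cos(s1) * np.cos(s2)).ravel()               # covariate field at the sites
Y = 2.0 * X + 0.05 * rng.standard_normal(X.size)    # responses with r(x) = 2x
est = nw_regression(0.3, X, Y, h=0.15)
```

Here est approximates r(0.3) = 0.6 in the toy model, up to smoothing bias and noise.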
5.2. Admissible Sampling
Biau [4] shows that the parametric rate in mean square is obtained by using an admissible sampling. Here, we state that the almost sure suroptimal rate is also reached if one uses the same sampling. We consider the process (Z_u, u ∈ R^N) observed at sampling locations. As in Biau [4], in order to model the fact that the observations are frequent on a large region of the space, we assume that the process (Z_u, u ∈ R^N) is observed on a discrete grid I_n = {δ_n, 2δ_n, …, nδ_n}^N, where n = n·1_N, n ∈ N, δ_n → 0(+), and T_n = nδ_n → +∞ as n → +∞. Then the kernel estimator of r based on the observations on I_n is
r_n*(x) = ϕ_n*(x)/f_n*(x) if f_n*(x) ≠ 0, and r_n*(x) = n^{−N} Σ_{i∈I_n} Y_i if f_n*(x) = 0,
where
f_n*(x) = (1/(n^N h_n^d)) Σ_{i∈I_n} K((x − X_i)/h_n) and ϕ_n*(x) = (1/(n^N h_n^d)) Σ_{i∈I_n} Y_i K((x − X_i)/h_n), x ∈ R^d.
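The admissibility arithmetic recalled from Biau [4] can be checked directly: combining the mesh δ_n = T_n^{−d/2r} with T_n = nδ_n gives T_n = n^{2r/(2r+d)}. A quick numerical sanity check under illustrative values of d and r (our choice):

```python
# covariate dimension d and regularity r, illustrative values with r > d
d, r = 2, 4
ns = (10, 100, 1000, 10000)
T_vals, delta_vals = [], []
for n in ns:
    T_n = n ** (2 * r / (2 * r + d))       # T_n = n^{2r/(2r+d)}, solved from T_n = n*delta_n
    delta_n = T_n ** (-d / (2 * r))        # admissible mesh delta_n = T_n^{-d/2r}
    # consistency: the grid {delta_n, 2*delta_n, ..., n*delta_n} indeed ends at T_n
    assert abs(n * delta_n - T_n) < 1e-9 * T_n
    T_vals.append(T_n)
    delta_vals.append(delta_n)
```

As n grows, T_n increases to +∞ while δ_n decreases to 0, as the sampling scheme requires.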
As in the case N = 1, (δ_n) is said to be an admissible sampling if the parametric rate of mean square error remains valid with observations on I_n and minimal sample size n^N. Biau [4] shows that this rate is
obtained with δ_n ∼ T_n^{−d/2r} for f_n*. Now, since the Y_u's are bounded, the same result remains valid for r_n* under similar conditions. The following theorem shows that the suroptimal uniform convergence rate remains valid if the sampling is admissible. The proof of this theorem is simple and is omitted (see the proof of Theorem 4.9 and [4]).
Theorem 5.1. Under the conditions of Theorem 4.9, with δ_n = T_n^{−d/2r}, h_n ∼ T_n^{−N/2r}, and if r > d, then
sup_{x∈D} |r_n*(x) − r(x)| = o( log_m T_n^N (log T_n^N / T_n^N)^{1/2} ) a.s., m ≥ 1.

6. SPATIAL PREDICTION — FIRST APPROACH
We consider the set I_T ⊂ R^N defined as before for some T ∈ R^N_+. Let (ξ_t, t ∈ R^N_+) be an R-valued strictly stationary random spatial process, assumed to be bounded and observed over a subset O_T ⊂ I_T. The aim of this section is to predict the square-integrable value ξ_{t_0} at a given fixed point t_0 ∈ I_T\O_T. In practice (e.g., for simplicity), we expect that ξ_{t_0} depends only on the values of the process in a bounded neighborhood V_{t_0} ⊂ O_T; in other words, we expect the process (ξ_t) to have a Markov property. In the continuous-time context, the Markov property means that "the past and the future are independent" given the present and is based on the natural total order of the real line R. One would like to extend this property to the spatial context by considering a partial order relation such as the lexicographic order (this corresponds to the "simultaneous vertical and horizontal" Markov property for N = 2), as was done in the previous sections. Unfortunately, whereas the properties of the Lebesgue integral allow us to limit the study of regression estimation to hyperrectangles (without loss of generality), this is not the case for spatial Markov properties. Therefore, we cannot limit the prediction study to the case of Markov random fields satisfying a "simultaneous vertical and horizontal" property. Apart from this property, two other types of Markov properties can be considered: the sharp and germ Markov properties; these three Markov properties are not equivalent. The sharp Markov property was first introduced by Lévy [32] for hyperrectangles and generalized later by Russo [45] to more general sets. The advantage of the latter setting is that it does not require a partial order.
A process (X_t) is said to satisfy the sharp Markov property with respect to a set A if the σ-fields F_A and F_{A^c} (where A^c is the complement of A) are conditionally independent given the so-called sharp σ-field F_{∂A}, where F_D = σ(X_t, t ∈ D) and ∂D denotes the boundary of a set D ⊆ R^d. Then the values of the process in A, ∂A, and A^c can be regarded as the "past," the "present," and the "future," respectively. This property is very restrictive, and processes with this property are difficult to handle. For example, Walsh [49] showed that the Brownian sheet does not possess the sharp Markov property with respect to the triangle defined by (0,0), (0,1), and (1,0). In fact, it seems that a process satisfying the sharp Markov property is somewhat cramped in the sharp σ-field. This leads some authors (the first was McKean [28]) to consider the larger germ σ-field F̄_{∂A} = ∩{F_G: G open, ∂A ⊆ G} and to introduce the germ Markov property. A process (X_t) is said to satisfy the germ Markov property with respect to a set A if the σ-fields F_A and F_{A^c} are conditionally independent given the germ σ-field F̄_{∂A}. In general, F̄_{∂A} is different from F_{∂A}, but there are cases where these σ-fields are equal; see Pitt and Robeva [40] for examples where F̄_{∂A} = F_{∂A}. For more details on processes satisfying the germ Markov property (with respect to some set A) and conditions on the set ∂A, see, for example, Moura and Goswami [38] and Pitt and Robeva [39]. In the following, we limit ourselves to germ and sharp Markov random fields, but other Markov properties exist in the literature. Balan and Ivanoff [2] introduced the nice set-Markov property, where the notions of present, past, and future are more tractable. This property is based on properties of set-indexed processes; moreover, it appears to be a natural extension of the continuous-time Markov property.
Hereafter, the symbols $\mathring S$, $\bar S$, and $\partial S$ denote the interior, the closure, and the boundary of a set $S\subset\mathbb R^N$, respectively. We suppose that the process $(\xi_t,\ t\in\mathbb R^N)$ is such that each connected set $S$ with smooth boundary satisfies $\mathcal F^-_{\partial S}=\mathcal F_{\partial S}$, where
$$\mathcal F^-_{\partial S}=\bigcap_{\epsilon>0}\mathcal F_{\bar S\cap\partial_\epsilon S},\qquad \partial_\epsilon S=\{t\in\mathbb R^N:\ \operatorname{dist}(t,\partial S)<\epsilon\}$$
is the $\epsilon$-neighborhood of $\partial S$, and $\operatorname{dist}(t,\partial S)=\inf_{u\in\partial S}\|t-u\|$. This condition corresponds to the left-continuity condition for continuous-time processes; it is needed for the construction of the associated process $(Z_t)$ of the next section.
6.1. The Functional Predictor

We suppose that the bounded neighborhood $V_{t_0}$ of $t_0$ is a connected open set. For simplicity, we consider the special case where $\mathcal F^*_{\partial V_{t_0}}=\mathcal F_{\partial V_{t_0}}$ (the general case will be the subject of future work). Then the minimum mean-square error predictor of $\xi_{t_0}$ given the data in $V_{t_0}$ is
$$E(\xi_{t_0}\mid \xi_u,\ u\in V_{t_0})=E(\xi_{t_0}\mid \xi_u,\ u\in\partial V_{t_0}).$$
So the predictor depends on a continuous set of values, namely the function $\tilde\xi_{t_0}$ defined by
$$\tilde\xi_{t_0}:\ \partial V_{t_0}\longrightarrow\mathbb R,\qquad u\longmapsto\xi_u,$$
which belongs to the space $C_b(\mathbb R)$ of continuous and bounded functions. If we set $V_0=V_{t_0}-t_0$ and $V_t=V_0+t=\{u+t,\ u\in V_0\}$ for each $t\in\mathbb R^N$, we can consider the process
$$Z_t=(X_t,Y_t)=(\tilde\xi_t,\xi_t),\qquad t\in\mathbb R^N,$$
where $\tilde\xi_t\in C_b(\mathbb R)$ is the function defined by $\tilde\xi_t(u)=\xi_u$, $u\in\partial V_t$, with $\partial V_t\subset O_{\mathbf T}$. Then the kernel regression estimator based on the data $(Z_t,\ t\in O_{\mathbf T},\ \partial V_t\subset O_{\mathbf T})$ is an estimator of a regression with functional explanatory variable:
$$\hat r_{\mathbf T}(x)=\begin{cases}\dfrac{\displaystyle\int_{E_{\mathbf T}}Y_t\,K\big(d(x,X_t)_s/h_{\mathbf T}\big)\,dt}{\displaystyle\int_{E_{\mathbf T}}K\big(d(x,X_t)_s/h_{\mathbf T}\big)\,dt} & \text{if }\displaystyle\int_{E_{\mathbf T}}K\big(d(x,X_t)_s/h_{\mathbf T}\big)\,dt\neq0,\\[3mm] \dfrac1{V(E_{\mathbf T})}\displaystyle\int_{E_{\mathbf T}}Y_t\,dt & \text{otherwise},\end{cases}\qquad x\in C_b(\mathbb R),$$
where $E_{\mathbf T}=\{t\in O_{\mathbf T}:\ \partial V_t\subset O_{\mathbf T}\}$, $V(E_{\mathbf T})$ is the volume of $E_{\mathbf T}$, $K$ is a real-valued function defined on $\mathbb R_+$, $d(\cdot,\cdot)_s$ is a metric on the space $C_b(\mathbb R)$, and $h_{\mathbf T}$ is a bandwidth such that $h_{\mathbf T}\ge0$ and $\lim_{\mathbf T\to\infty}h_{\mathbf T}=0$. Note that for $N=1$ such estimators have been studied in the i.i.d. case, for example, by Dabo-Niang [18] and Dabo-Niang and Rhomari [19], and by Ferraty and Vieu [23] in the case of dependent observations. The general mixing case ($N\ge1$) is the subject of two papers: in Dabo-Niang and Yao [17] we study the consistency of the density estimate for functional random fields, and in a work in progress we study the properties of the regression estimator $\hat r_{\mathbf T}$ for functional spatial random fields and propose a predictor based on such an estimator. Nevertheless, as a first and basic approach, we suggest here a predictor based on the projection of $\tilde\xi_t$ onto some functional basis.
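As an illustration, the functional estimator $\hat r_{\mathbf T}$ can be approximated in practice by replacing the integrals over $E_{\mathbf T}$ with sums over the observed sites. The sketch below (Python, with hypothetical names; the triangular kernel, the sup-metric and the synthetic data are our illustrative choices, not prescribed by the paper) implements this discretized Nadaraya–Watson form, including the fallback branch used when all kernel weights vanish:

```python
import numpy as np

def functional_nw(x_new, X, Y, h, metric):
    """Discretized functional Nadaraya-Watson estimator: the integrals
    defining r^_T are replaced by sums over the observed sites."""
    # triangular kernel on R+ (compact support, so the fallback is reachable)
    K = lambda u: np.maximum(0.0, 1.0 - u)
    dist = np.array([metric(x_new, xi) for xi in X])
    w = K(dist / h)
    if w.sum() == 0.0:
        # analogue of the (1/V(E_T)) * integral-of-Y_t fallback branch
        return float(np.mean(Y))
    return float(np.dot(w, Y) / w.sum())

# sup-distance between curves sampled on a common grid: one choice of d(.,.)_s
sup_dist = lambda f, g: float(np.max(np.abs(f - g)))
```

For example, with two observed "curves" `X = [np.zeros(2), np.ones(2)]`, responses `Y = [0., 1.]`, bandwidth `h = 2` and `x_new = np.zeros(2)`, the kernel weights are `(1, 0.5)` and the estimate is `1/3`; a far-away `x_new` triggers the mean fallback.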
6.2. The Spatial Predictor Based on Projection

We suppose that for all $t\in I_{\mathbf T}$, $\tilde\xi_t$ belongs to the space $L^2(\mathbb R)$. Let $(\varphi_k)_{k\in\mathbb N^*}$ be a basis of $L^2(\mathbb R)$ (for example, the Laguerre or Hermite basis). Then for all $t\in I_{\mathbf T}$ there exists a sequence of real numbers $(c_{t,k})_{k\in\mathbb N^*}$ such that $\tilde\xi_t(u)=\sum_{k=1}^{+\infty}c_{t,k}\varphi_k(u)$. For a fixed known integer $d<\infty$, let $\xi_t^d=(c_{t,k})_{1\le k\le d}$. We suppose that $d$ is small and chosen so that the vector $\xi_t^d$ provides as much information (about the link between $\xi_t$ and $\tilde\xi_t$) as $\tilde\xi_t$. Then we can predict $\xi_{t_0}$ by
$$\hat\xi_{t_0}=\frac{\displaystyle\int_{E_{\mathbf T}}\xi_t\,K\Big(\frac{\xi_t^d-\xi_{t_0}^d}{h_{\mathbf T}}\Big)\,dt}{\displaystyle\int_{E_{\mathbf T}}K\Big(\frac{\xi_t^d-\xi_{t_0}^d}{h_{\mathbf T}}\Big)\,dt}=r_{\mathbf T}^*(\xi_{t_0}^d).$$
The following theorem shows that this predictor is strongly consistent and gives the suroptimal strong rate of convergence.
Theorem 6.1. Under the conditions of Theorem 4.9 with $H(\Gamma,q)$ and $H^*(\Gamma^*,q^*)$ replaced by $H_r$ and $H_r^*$, and assuming that $h_{\mathbf T}\sim\hat{\mathbf T}^{-\gamma}$, where $\frac1{2r}\le\gamma\le\frac1{2d}$, we have for all $m\ge1$
$$\Big(\frac{\hat{\mathbf T}}{\log\hat{\mathbf T}}\Big)^{1/2}\frac1{\log_m\hat{\mathbf T}}\,\big|r_{\mathbf T}^*(\xi_{t_0}^d)-r(\xi_{t_0}^d)\big|\,\mathbf 1_{\{\xi_{t_0}^d\in D\}}\ \xrightarrow[\mathbf T\to\infty]{}\ 0\quad\text{a.s.},$$
where $D$ is a compact subset of $\mathbb R^d$.

Proof. The proof is similar to that of Theorem 4.9 and hence is omitted.

6.2.1. Remark. As mentioned in Section 6, in practice data are collected at sampling locations. So another "natural" approach is to consider the "discrete" nonparametric spatial predictor proposed by Biau and Cadre [5] for data located on the integer lattice $(\mathbb N^*)^N$. Note that this way of predicting corresponds to a particular spatial kernel regression estimation with the functional predictor mentioned before, namely the one based on sampling the set-value of the regressor.
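The projection predictor of Section 6.2 can be sketched numerically as follows (Python; the cosine design, the least-squares computation of the coefficients $c_{t,k}$ and the Gaussian kernel are our illustrative assumptions — the paper suggests, e.g., a Laguerre or Hermite basis):

```python
import numpy as np

def basis_coeffs(curve, grid, d):
    """First d coefficients of a curve in a cosine design on [0, 1]
    (a stand-in for the Laguerre/Hermite expansion of xi~_t)."""
    u = (grid - grid.min()) / (grid.max() - grid.min())  # assumes non-degenerate grid
    Phi = np.stack([np.cos(np.pi * k * u) for k in range(1, d + 1)], axis=1)
    c, *_ = np.linalg.lstsq(Phi, curve, rcond=None)      # L2 projection onto span(Phi)
    return c

def projection_predictor(curve0, curves, Y, grid, d, h):
    """Predict xi_{t0}: kernel regression of xi_t on the d-dimensional
    projected regressor xi_t^d, evaluated at xi_{t0}^d."""
    K = lambda u: np.exp(-0.5 * u ** 2)  # Gaussian kernel
    z0 = basis_coeffs(curve0, grid, d)
    Z = np.array([basis_coeffs(c, grid, d) for c in curves])
    w = K(np.linalg.norm(Z - z0, axis=1) / h)
    return float(np.dot(w, Y) / w.sum())
```

By symmetry of the weights, a boundary curve lying exactly halfway (in the projected coordinates) between two observed curves receives the average of their responses.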
Conclusion

This study was motivated by the desire to provide a nonparametric methodology for spatial statistics of continuously indexed processes. A first step was taken by Biau [4], who considered the mean-square error consistency of the kernel density estimate. The present paper extends his result to regression estimation, but does not completely solve the problem: for the remaining prediction issue, only a first modest step is taken here. In further work, special attention will be paid to real-data applications.

7. APPENDIX

Proof of Lemma 3.1. We follow the proof of Lemma 4.2 in Bosq [10]. Let $a$ be a constant and $\mathbf T_n=(T_{1,n},\dots,T_{N,n})$ a sequence which satisfies $T_{i,n+1}-T_{i,n}\ge a$, $n\ge1$, for all $1\le i\le N$. Since $\phi_\eta$ is decreasing, we have
$$\int_{T_{1,Q}}^{+\infty}\!\!\cdots\int_{T_{N,Q}}^{+\infty}\phi_\eta(\mathbf t)\,d\mathbf t\ \ge\ \sum_{n\ge Q}\prod_{i=1}^N(T_{i,n+1}-T_{i,n})\,\phi_\eta(\mathbf T_{n+1})\ \ge\ a^N\sum_{n\ge Q}\phi_\eta(\mathbf T_{n+1}).$$
Thus $\sum_n\phi_\eta(\mathbf T_n)<\infty$ and $P(\limsup_n|Q_{\mathbf T_n}|>\eta)=0$ for every $\eta>0$, by the Borel–Cantelli lemma. This implies that $Q_{\mathbf T_n}\to0$ a.s. as $n\to\infty$.

Let now $(\mathbf T_n)$ be a sequence satisfying $\mathbf T_n\uparrow+\infty$. With each integer $k$ we associate a subsequence $(\mathbf T_p^{(k)})$ of $(\mathbf T_n)$ defined as follows:
$$\mathbf T_1^{(k)}=\mathbf T_{n_1},\quad\text{where } n_1=1,$$
$$\mathbf T_2^{(k)}=\mathbf T_{n_2},\quad\text{where } T_{i,n_2}-T_{i,n_1}\ge\frac1{k\sqrt N},\ \ T_{i,n_2-1}-T_{i,n_1}<\frac1{k\sqrt N},\quad\forall\,i=1,\dots,N,$$
$$\vdots$$
$$\mathbf T_p^{(k)}=\mathbf T_{n_p},\quad\text{where } T_{i,n_p}-T_{i,n_{p-1}}\ge\frac1{k\sqrt N},\ \ T_{i,n_p-1}-T_{i,n_{p-1}}<\frac1{k\sqrt N},\quad\forall\,i=1,\dots,N,$$
$$\vdots$$
The first part of the current proof shows that $Q_{\mathbf T_p^{(k)}}\to0$ a.s. as $p\to\infty$ for each $k$. Now let us set
$$\Omega_0=\big\{\omega:\ t\mapsto Q_t(\omega)\ \text{is uniformly continuous},\ Q_{\mathbf T_p^{(k)}}(\omega)\to0,\ k\ge1\big\}.$$
It is clear that $P(\Omega_0)=1$. If $\omega\in\Omega_0$ and $\eta>0$, there exists $k=k(\eta,\omega)$ such that $\|t-s\|\le\frac1k$ implies $|Q_t(\omega)-Q_s(\omega)|<\frac\eta2$. Considering the sequence $(\mathbf T_p^{(k)})$, for each $p$ and each $n$ such that $n_p\le n<n_{p+1}$, we have $\|\mathbf T_n-\mathbf T_{n_p}\|<\frac1k$ (here $\|\cdot\|$ is the Euclidean norm), hence $|Q_{\mathbf T_n}(\omega)-Q_{\mathbf T_{n_p}}(\omega)|<\frac\eta2$. We have $|Q_{\mathbf T_{n_p}}(\omega)|<\frac\eta2$ for $p$ large enough. This implies that $|Q_{\mathbf T_n}(\omega)|<\eta$ for $n$ large enough. This is valid for each $\eta>0$ and each $\omega\in\Omega_0$, thus $Q_{\mathbf T_n}\to0$ a.s.

Remark 7.1. In the proof we use the Euclidean norm; nevertheless, any other norm can be used. For example, the use of the sup-norm allows us to replace $T_{i,n_p-1}-T_{i,n_{p-1}}<\frac1{k\sqrt N}$ by $T_{i,n_p-1}-T_{i,n_{p-1}}<\frac1k$.

Proof of Lemma 3.2. Let $Y_{\mathbf u}=\zeta_{\mathbf v}$ with $\mathbf v=([u_i]+1,\ 1\le i\le N)$, $\mathbf u\in\mathbb R^N$. The following spatial blocking idea is due to Tran [46] and Politis and Romano [41]. Let $\Delta_{\mathbf i}=\int_{i_1-1}^{i_1}\cdots\int_{i_N-1}^{i_N}Y_{\mathbf u}\,d\mathbf u$. Then
$$S_{\mathbf n}=\int_0^{n_1}\!\!\cdots\int_0^{n_N}Y_{\mathbf u}\,d\mathbf u=\sum_{1\le i_k\le n_k,\ k=1,\dots,N}\Delta_{\mathbf i}.$$
So $S_{\mathbf n}$ is the sum of $2^Np^N\,t_1t_2\cdots t_N$ terms $\Delta_{\mathbf i}$, each of which is an integral of $Y_{\mathbf u}$ over a cubic block of side $p$. Let
$$U(1,\mathbf n,\mathbf j)=\sum_{k_i=2j_ip+1,\ 1\le i\le N}^{(2j_i+1)p}\Delta_{\mathbf k},$$
$$U(2,\mathbf n,\mathbf j)=\sum_{k_i=2j_ip+1,\ 1\le i\le N-1}^{(2j_i+1)p}\ \ \sum_{k_N=(2j_N+1)p+1}^{2(j_N+1)p}\Delta_{\mathbf k},$$
$$U(3,\mathbf n,\mathbf j)=\sum_{k_i=2j_ip+1,\ 1\le i\le N-2}^{(2j_i+1)p}\ \ \sum_{k_{N-1}=(2j_{N-1}+1)p+1}^{2(j_{N-1}+1)p}\ \ \sum_{k_N=2j_Np+1}^{(2j_N+1)p}\Delta_{\mathbf k},$$
$$U(4,\mathbf n,\mathbf j)=\sum_{k_i=2j_ip+1,\ 1\le i\le N-2}^{(2j_i+1)p}\ \ \sum_{k_{N-1}=(2j_{N-1}+1)p+1}^{2(j_{N-1}+1)p}\ \ \sum_{k_N=(2j_N+1)p+1}^{2(j_N+1)p}\Delta_{\mathbf k},$$
and so on. Note that
$$U(2^{N-1},\mathbf n,\mathbf j)=\sum_{k_i=(2j_i+1)p+1,\ 1\le i\le N-1}^{2(j_i+1)p}\ \ \sum_{k_N=2j_Np+1}^{(2j_N+1)p}\Delta_{\mathbf k}.$$
Finally,
$$U(2^N,\mathbf n,\mathbf j)=\sum_{k_i=(2j_i+1)p+1,\ 1\le i\le N}^{2(j_i+1)p}\Delta_{\mathbf k}.$$
So
$$S_{\mathbf n}=\sum_{i=1}^{2^N}T(\mathbf n,i),\qquad\text{with}\quad T(\mathbf n,i)=\sum_{j_l=0,\ l=1,\dots,N}^{t_l-1}U(i,\mathbf n,\mathbf j).\tag{18}$$
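For $N=2$ the decomposition (18) splits the $n_1\times n_2$ array of terms $\Delta_{\mathbf i}$ into $2^2=4$ interleaved families of $p\times p$ blocks. A short numerical check of this bookkeeping (Python; the random array merely stands in for the integrals $\Delta_{\mathbf i}$):

```python
import numpy as np

# Block decomposition (18) for N = 2: the n1 x n2 array of terms Delta_i is
# split into 2^2 = 4 families of p x p blocks U(1..4, n, j); summing all
# blocks over all j recovers S_n exactly.
rng = np.random.default_rng(0)
p, t1, t2 = 3, 2, 4                      # n_k = 2 p t_k
n1, n2 = 2 * p * t1, 2 * p * t2
Delta = rng.normal(size=(n1, n2))        # stand-in for the Delta_i

S = Delta.sum()
total = 0.0
for j1 in range(t1):
    for j2 in range(t2):
        r0, c0 = 2 * j1 * p, 2 * j2 * p  # corner of the 2p x 2p cell (j1, j2)
        for a in (0, 1):                 # (a, b) selects one of the 4 families
            for b in (0, 1):
                total += Delta[r0 + a * p : r0 + (a + 1) * p,
                               c0 + b * p : c0 + (b + 1) * p].sum()

assert abs(total - S) < 1e-10            # the partition is exact
```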
If there are no integers $t_1,\dots,t_N$ such that $n_i=2pt_i$, $i=1,\dots,N$, then a term, say $T(\mathbf n,2^N+1)$, containing all the $\Delta_{\mathbf k}$'s at the ends not included in the blocks above, can be added (see Tran [46] or Biau and Cadre [5]). This term does not change the proof much. By (18) it suffices to show that
$$P\Big(|T(\mathbf n,i)|>\frac{\hat n\varepsilon}{2^N}\Big)\le2\exp\Big(-\frac{\varepsilon^2\,\hat t}{4v^2(\mathbf t)}\Big)+\frac{4b}{\varepsilon}\,\alpha(p)\tag{19}$$
for each $1\le i\le2^N$. Without loss of generality, we show (19) for $i=1$. We enumerate in an arbitrary way the $\hat t=t_1t_2\cdots t_N$ terms $U(1,\mathbf n,\mathbf j)$ of the sum $T(\mathbf n,1)$ and denote them $W_1,\dots,W_{\hat t}$. Note that the $U(1,\mathbf n,\mathbf j)$ are measurable with respect to the σ-field generated by the $Y_{\mathbf u}$ with $\mathbf u$ such that $2j_ip\le u_i\le(2j_i+1)p$, $i=1,\dots,N$. These sets of sites are separated by a distance of at least $p$, and since the $Y_{\mathbf u}$'s are bounded by $b$, we have $|W_m|\le bp^N$ for all $m=1,\dots,\hat t$. According to Lemma 4.4 in Carbon, Tran, and Wu [15], there exist independent random variables $W_1^*,\dots,W_{\hat t}^*$ such that for all $m=1,\dots,\hat t$,
$$E|W_m-W_m^*|\le2bp^N\alpha(p).$$
The Markov inequality leads to
$$P\Big(\sum_{m=1}^{\hat t}|W_m-W_m^*|>\varepsilon\Big)\le\frac{2\hat t\,bp^N\alpha(p)}{\varepsilon}.$$
By the Bernstein inequality we have
$$P\Big(\Big|\sum_{m=1}^{\hat t}W_m^*\Big|>\varepsilon\Big)\le2\exp\bigg(-\frac{\varepsilon^2}{4\sum_{m=1}^{\hat t}E(W_m^*)^2+2bp^N\varepsilon}\bigg).$$
Now, since for all $m=1,\dots,\hat t$ there exists a $\mathbf j(m)$ such that $W_m=U(1,\mathbf n,\mathbf j(m))$ has the same distribution as $W_m^*$, we get
$$EW_m^2=E(W_m^*)^2=E\bigg(\int_{2j_1(m)p}^{(2j_1(m)+1)p}\!\!\cdots\int_{2j_N(m)p}^{(2j_N(m)+1)p}Y_{\mathbf u}\,d\mathbf u\bigg)^2.$$
To get the equality (3), note that if all the components of the index vector $\mathbf u$ are fixed except the $k$th one, then
$$\int_{2j_k(m)p}^{(2j_k(m)+1)p}Y_{\mathbf u}\,du_k=\int_{2j_k(m)p}^{[2j_k(m)p]+1}Y_{\mathbf u}\,du_k+\sum_{v_k=[2j_k(m)p]+2}^{[(2j_k(m)+1)p]}\zeta_{\mathbf v}+\int_{[(2j_k(m)+1)p]}^{(2j_k(m)+1)p}Y_{\mathbf u}\,du_k$$
$$=\big([2j_k(m)p]+1-2j_k(m)p\big)\,\zeta_{(\mathbf v,\ v_k=[2j_k(m)p]+1)}+\sum_{v_k=[2j_k(m)p]+2}^{[(2j_k(m)+1)p]}\zeta_{\mathbf v}+\big((2j_k(m)+1)p-[(2j_k(m)+1)p]\big)\,\zeta_{(\mathbf v,\ v_k=[(2j_k(m)+1)p]+1)}$$
$$=\sum_{v_k=[2j_k(m)p]+1}^{[(2j_k(m)+1)p]+1}w(\mathbf j,\mathbf v)_k\,\zeta_{\mathbf v}.$$
Since $T(\mathbf n,1)=\sum_{m=1}^{\hat t}W_m$, we obtain
$$P\Big(|T(\mathbf n,1)|>\frac{\hat n\varepsilon}{2^N}\Big)\le P\Big(\Big|\sum_{m=1}^{\hat t}W_m^*\Big|>\frac{\hat n\varepsilon}{2^{N+1}}\Big)+P\Big(\sum_{m=1}^{\hat t}|W_m-W_m^*|>\frac{\hat n\varepsilon}{2^{N+1}}\Big).\tag{20}$$
By Bernstein’s and Markov’s inequalities we have ˆ 2 ε2 2ˆtbpN α(p) n N ˆ ε/2 ≤ 2 exp − P |T (n, 1)| > n + ˆ ε/2N +1 n ˆ ε/2N ) 22N +2 (4 ˆt σ 2 (t) + bpN n 22 bα(p) ε2 ˆ t + , ≤ 2 exp − σ2 (t) ε 4 4 p2N + bε ˆ = 2N pN ˆt. Then, we get inequality (19), so since ˆt = t1 × · · · × tN and n ε2 ˆ 2N +2 bα(p) N N N +1 ˆ ε) ≤ 2 P |T (n, i)| > n ˆ ε/2 ≤ 2 . exp − 2 t + P (|Sn | > n 4v (t) ε
Proof of Lemma 3.4. We set $n_k\delta_k=T_k$, $k=1,\dots,N$, $\hat\delta=\delta_1\times\cdots\times\delta_N$, $\mathbf n=[\mathbf T]$ ($T_k\ge1$); consequently $2>\delta_k\ge1$. We set
$$\Delta_{\mathbf j,\mathbf n}(x)=\frac1{\hat\delta}\int_{(j_1-1)\delta_1}^{j_1\delta_1}\!\!\cdots\int_{(j_N-1)\delta_N}^{j_N\delta_N}K_{h_{\mathbf T}}(x-X_{\mathbf u})\,d\mathbf u,\qquad\mathbf j\in I_{\mathbf n},$$
where $K_h(x)=(1/h^d)K(x/h)$. Without loss of generality, assume that $n_k=2pt_k$, where $p$ is an integer and $\mathbf t=(t_1,\dots,t_N)\in(\mathbb N^*)^N$. Then let
$$S_{\mathbf n}=\hat n\big(f_{\mathbf T}(x)-E f_{\mathbf T}(x)\big)=\sum_{1\le j_k\le n_k,\ k=1,\dots,N}\big(\Delta_{\mathbf j,\mathbf n}-E\Delta_{\mathbf j,\mathbf n}\big).$$
In order to apply inequality (2) in Lemma 3.2, note that here $p$ is an integer and
$$\sigma^2(\mathbf t)=\frac1{p^{2N}}\operatorname{var}\Big(\sum_{1\le j_k\le p,\ k=1,\dots,N}\Delta_{\mathbf j,\mathbf n}\Big)=\operatorname{var}\bigg(\frac1{p^N\hat\delta}\int_0^{p\delta_1}\!\!\cdots\int_0^{p\delta_N}K_{h_{\mathbf T}}(x-X_{\mathbf u})\,d\mathbf u\bigg).$$
To this aim we may use inequality (6.1) of Biau [4] with $\tilde{\mathbf T}=(T_1=p\delta_1,\dots,T_N=p\delta_N)$ instead of $\mathbf T$. Then
$$\sigma^2(\mathbf t)=\operatorname{var}\big(f_{\tilde{\mathbf T}}(x)\big)\le\Big(4V(S_{N-1})\gamma\|K\|_\infty^2+V(B_N)\delta_q(\Gamma)\|K\|_{q_1}^2+\frac{2d}{\beta-N}\,h_{\mathbf T}^{q_1(1-\frac N\beta)-d}\Big)\frac1{p^N\hat\delta\,h_{\mathbf T}^d},$$
where $q_1=\frac q{q-1}$, $V(B_N)$ is the volume of the unit ball $B_N=\{u\in\mathbb R^N:\|u\|\le1\}$ in $\mathbb R^N$, and $V(S_{N-1})$ is the volume of the unit sphere $S_{N-1}=\{u\in\mathbb R^N:\|u\|=1\}$ in $\mathbb R^N$. Therefore, since $\beta>\frac{2N(q-1)}{q-2}$, we have
$$\sigma^2(\mathbf t)\le\frac a{p^N\hat\delta\,h_{\mathbf T}^d},$$
where $a=a(K,\|f\|_\infty,d,\gamma,\beta)$ does not depend on $x$. Now, since $|\Delta_{\mathbf j,\mathbf n}-E\Delta_{\mathbf j,\mathbf n}|\le\frac{2C\|K\|_\infty}{h_{\mathbf T}^d}$, we obtain
$$v^2(\mathbf t)\le\frac{4a}{p^N\hat\delta\,h_{\mathbf T}^d}+\frac{2C\|K\|_\infty\,\varepsilon}{h_{\mathbf T}^d}.$$
If we choose $p=\varepsilon^{-1/N}\hat\delta^{-1/N}$, then $v^2(\mathbf t)\le\frac{A_0\,\varepsilon}{h_{\mathbf T}^d}$, where $A_0$ is a positive constant. Therefore, by (2), we get inequality (4).
Proof of Theorem 4.1. We have
$$r_{\mathbf T}(x)-r(x)=\frac{\varphi_{\mathbf T}(x)-\varphi(x)}{f_{\mathbf T}(x)}+\varphi(x)\,\frac{f(x)-f_{\mathbf T}(x)}{f_{\mathbf T}(x)f(x)}.$$
We first prove that $\sup_{x\in D}|f_{\mathbf T}(x)-f(x)|=O(\Psi_{\mathbf T})$ in probability. For $x\in\mathbb R^d$ we have $f_{\mathbf T}(x)-f(x)=\big(f_{\mathbf T}(x)-Ef_{\mathbf T}(x)\big)+\big(Ef_{\mathbf T}(x)-f(x)\big)$. The condition $\hat{\mathbf T}h_{\mathbf T}^{2+d}(\log\hat{\mathbf T})^{-1}\to0$ implies that
$$\sup_{x\in D}|Ef_{\mathbf T}(x)-f(x)|=o(\Psi_{\mathbf T}).\tag{21}$$
Since $D$ is compact, it can be covered by $\nu$ balls $B_k$ of radius $v_{\mathbf T}$ and centers $x_k$. Choose $v_{\mathbf T}=h_{\mathbf T}^{d+1}\Psi_{\mathbf T}$, so that
$$\nu\le C(v_{\mathbf T})^{-d}=C\big(h_{\mathbf T}^{d+1}\Psi_{\mathbf T}\big)^{-d}.\tag{22}$$
Then we can write
$$\sup_{x\in D}|f_{\mathbf T}(x)-Ef_{\mathbf T}(x)|\le S_{1n}+S_{2n}+S_{3n},$$
where
$$S_{1n}=\max_{1\le k\le\nu}\sup_{x\in B_k}|f_{\mathbf T}(x)-f_{\mathbf T}(x_k)|,\qquad S_{2n}=\max_{1\le k\le\nu}\sup_{x\in B_k}|Ef_{\mathbf T}(x_k)-Ef_{\mathbf T}(x)|,\qquad S_{3n}=\max_{1\le k\le\nu}|f_{\mathbf T}(x_k)-Ef_{\mathbf T}(x_k)|.$$
Now, since $K$ satisfies a Lipschitz condition, it is easy to show that both $S_{1n}$ and $S_{2n}$ are $O(\Psi_{\mathbf T})$. It remains to show that $S_{3n}=O(\Psi_{\mathbf T})$ in probability. In order to apply (4) in Lemma 3.4, we set $\varepsilon=\eta\Psi_{\mathbf T}$, where $\eta>0$ is an appropriately chosen constant, and $\beta_{1\mathbf T}=h_{\mathbf T}^{-d}\alpha(p)\Psi_{\mathbf T}^{-1}$. Now, we set $p\sim\Psi_{\mathbf T}^{-1/N}$ (see the proof of Lemma 3.4); then $\hat t\ge\hat{\mathbf T}\Psi_{\mathbf T}/2^N$ and, using (5), we get
$$P\big(|S_{3n}|>\varepsilon\big)\le C\nu\big(\hat{\mathbf T}^{-c}+\beta_{1\mathbf T}\big),$$
where $c$ is an arbitrarily large positive constant.

To complete the proof, we will show that $\nu\hat{\mathbf T}^{-c}\to0$ and $\nu\beta_{1\mathbf T}\to0$. Using (22) and (5) we get
$$\nu\le C(\log\hat{\mathbf T})^{-d/2}\,\hat{\mathbf T}^{d/2}\,h_{\mathbf T}^{-d(d/2+1)}.$$
Notice that $\beta>d$ and that $\hat{\mathbf T}h_{\mathbf T}^{\beta_1}(\log\hat{\mathbf T})^{-1}\to\infty$ implies $\hat{\mathbf T}h_{\mathbf T}^d\to\infty$ (recall that $\beta_1=\frac{d(dN+3N+\beta)}{\beta-(d+1)N}$). Hence we have $\hat{\mathbf T}>Ch_{\mathbf T}^{-d}$, or $h_{\mathbf T}^{-d(d/2+1)}\le C\hat{\mathbf T}^{d/2+1}$. Therefore
$$\nu\hat{\mathbf T}^{-c}\le C\hat{\mathbf T}^{-c}(\log\hat{\mathbf T})^{-d/2}\hat{\mathbf T}^{d/2}h_{\mathbf T}^{-d(d/2+1)}\le C\hat{\mathbf T}^{d+1-c}(\log\hat{\mathbf T})^{-d/2}=\Delta_1(\mathbf T),$$
which goes to zero if $c>d+1$. Using (22), we get
$$\nu\beta_{1\mathbf T}=\nu h_{\mathbf T}^{-d}\alpha(p)\Psi_{\mathbf T}^{-1}\sim h_{\mathbf T}^{-d(d+2)}\,p^{-\beta}\,\Psi_{\mathbf T}^{-(d+1)}.\tag{23}$$
This last relation and (5) allow us to show that $\hat{\mathbf T}h_{\mathbf T}^{\beta_1}(\log\hat{\mathbf T})^{-1}\to\infty$ is equivalent to $(\nu\beta_{1\mathbf T})^{-1}\to\infty$, which implies that $\nu\beta_{1\mathbf T}\to0$, because
$$\nu\beta_{1\mathbf T}\le Ch_{\mathbf T}^{-d(d+2)}\Big(\frac{\hat{\mathbf T}h_{\mathbf T}^d}{\log\hat{\mathbf T}}\Big)^{\frac{N(d+1)-\beta}{2N}}=C\big(\hat{\mathbf T}h_{\mathbf T}^{\beta_1}(\log\hat{\mathbf T})^{-1}\big)^{\frac{-\beta+N(d+1)}{2N}}.$$
It is necessary that $\beta>(d+1)N$. Since $(Y_{\mathbf u},\ \mathbf u\in\mathbb R^N)$ is bounded, we obtain for $|\varphi_{\mathbf T}(x)-\varphi(x)|$ the same result as for $|f_{\mathbf T}(x)-f(x)|$.
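The identification $\beta_1=\frac{d(dN+3N+\beta)}{\beta-(d+1)N}$ used in the last display can be checked by matching exponents (a verification sketch, under the assumption $\Psi_{\mathbf T}=\big(\log\hat{\mathbf T}/(\hat{\mathbf T}h_{\mathbf T}^d)\big)^{1/2}$ consistent with the rates of Section 4):

```latex
% Exponent matching for the identity
%   h^{-d(d+2)}\,(\hat T h^{d}/\log\hat T)^{\frac{N(d+1)-\beta}{2N}}
%     = (\hat T h^{\beta_1}/\log\hat T)^{\frac{N(d+1)-\beta}{2N}} .
% Comparing the powers of h on both sides gives
\[
  \beta_1\,\frac{N(d+1)-\beta}{2N} \;=\; d\,\frac{N(d+1)-\beta}{2N} \;-\; d(d+2),
\]
% hence
\[
  \beta_1 \;=\; d + \frac{2N\,d(d+2)}{\beta-N(d+1)}
          \;=\; \frac{d\,(\beta+Nd+3N)}{\beta-(d+1)N}.
\]
```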
Proof of Theorem 4.2. Let us focus first on $f_{\mathbf T}$. We deduce from the previous proof that
$$\nu\beta_{1\mathbf T}\le C\big(\hat{\mathbf T}h_{\mathbf T}^{\beta_1}(\log\hat{\mathbf T})^{-1}\big)^{\frac{-\beta+N(d+1)}{2N}}=\Delta_2(\mathbf T)=\phi_\eta(\mathbf T).$$
Then
$$\hat{\mathbf T}g(\mathbf T)\,\nu\beta_{1\mathbf T}\le C\,\hat{\mathbf T}^{1-\beta/2N+(d+1)/2}\,h_{\mathbf T}^{-d(d+2)-d\beta/2N+d(d+1)/2}\,(\log\hat{\mathbf T})^{\beta/2N-(d+1)/2}\,g(\mathbf T)=\hat{\mathbf T}g(\mathbf T)\Delta_2(\mathbf T)\le C\big(\hat{\mathbf T}h_{\mathbf T}^{\beta_2}(\log\hat{\mathbf T})^{\beta_3}g(\mathbf T)^{-2N/(\beta-(d+3)N)}\big)^{\frac{-\beta+N(d+3)}{2N}},$$
where $g(\mathbf T)=\prod_{i=1}^N(\log T_i)(\log\log T_i)^{1+\epsilon}$ is such that $\int_{\mathbb R_+^N}\frac{d\mathbf T}{\hat{\mathbf T}g(\mathbf T)}<\infty$. Then condition (8) is equivalent to $\hat{\mathbf T}g(\mathbf T)\Delta_2(\mathbf T)\to0$, which implies that $\int_{[0,\infty[^N}\Delta_2(\mathbf t)\,d\mathbf t<\infty$.

Let $h_{\mathbf T}$ be such that $\Delta_2$ is decreasing; if the sample paths of $\Psi_{\mathbf T}^{-1}\big(f_{\mathbf T}(x)-Ef_{\mathbf T}(x)\big)$ are uniformly continuous with probability 1, then the theorem follows from Lemma 3.1, since $\Delta_1(\mathbf T)$ is integrable and decreasing for $\mathbf T$ large enough. The result $|\varphi_{\mathbf T}(x)-\varphi(x)|=O(\Psi_{\mathbf T})$ follows from the fact that $(Y_{\mathbf u},\ \mathbf u\in\mathbb R^N)$ is bounded. This yields the proof.

Proof of Theorem 4.3. It is similar to that of Theorem 4.2. The condition $\hat{\mathbf T}h_{\mathbf T}^d(\log\hat{\mathbf T})^{-2N-1}\to\infty$ implies that
$$\big(\hat{\mathbf T}h_{\mathbf T}^d/\log\hat{\mathbf T}\big)^{1/2N}(\log\hat{\mathbf T})^{-1}\to\infty.$$
Suppose that $a$ is an arbitrarily given large positive number. Since $p\sim\big(\hat{\mathbf T}h_{\mathbf T}^d/\log\hat{\mathbf T}\big)^{1/2N}$, for all $\mathbf T$ except finitely many,
$$2p\ge\big(\hat{\mathbf T}h_{\mathbf T}^d/\log\hat{\mathbf T}\big)^{1/2N}\ge(a/s)\log\hat{\mathbf T},\qquad s>0.$$
Therefore, since $Z_{\mathbf u}$ is GSM, there exists $s>0$ such that
$$\alpha(p)\le C\exp(-sp)\le C\exp\big(-s(a/s)\log\hat{\mathbf T}\big)=C\hat{\mathbf T}^{-a}.\tag{24}$$
If necessary, the constant $C$ in (24) can be increased so that the inequality holds for all $\mathbf T$. As in the proof above, using (22) and (24), it is easy to show that $\nu\beta_{1\mathbf T}\to0$ and $\hat{\mathbf T}g(\mathbf T)\nu\beta_{1\mathbf T}\to0$, which implies that $\int_{[0,\infty[^N}\nu\beta_{1\mathbf T}\,d\mathbf T<\infty$. Lemma 3.1, via the Borel–Cantelli argument, yields the proof.

Proof of Lemma 4.4. We apply inequality (4) of Lemma 3.4, choosing $\varepsilon=h_{\mathbf T}^r(\log_m\hat{\mathbf T})\eta$ and $p=[(\varepsilon\hat\delta)^{-1/N}]$ ($\eta>0$). Since
$$\hat t=\frac{\hat n}{2^Np^N}\ge\frac{\hat n\,\varepsilon\hat\delta}{2^N}\ge\frac{C\varepsilon\hat{\mathbf T}}{2^N},\qquad \varepsilon\hat t\,h_{\mathbf T}^d\ge\frac C{2^N}\,(\log\hat{\mathbf T})(\log_m\hat{\mathbf T})^2\eta^2,$$
we have
$$2^{N+1}\exp\Big(-\frac{\varepsilon\hat t\,h_{\mathbf T}^d}{4A_0}\Big)\le\frac4{\hat{\mathbf T}^{A\eta^2}(\log_m\hat{\mathbf T})^2},\qquad A>0.$$
After some calculations we obtain, since $\beta>\frac{N(3r+2d)}{c_1r}$ and $\frac{r\beta-N(d+r)}{N(2r+d)}>0$,
$$\frac{2^{N+3}\alpha(p)}{\varepsilon h_{\mathbf T}^d}\le c_1\,\eta^{-\beta/N}\,\hat{\mathbf T}^{-\frac{r\beta-N(d+r)}{N(2r+d)}}\,(\log\hat{\mathbf T})^{\frac{r\beta-N(d+r)}{N(2r+d)}}\,(\log_m\hat{\mathbf T})^{-(\frac\beta N-1)},$$
and we get
$$\frac{2^{N+3}\alpha(p)}{\varepsilon h_{\mathbf T}^d}\le\frac{c_2}{\hat{\mathbf T}^{1+\mu}}\qquad(c_2>0,\ \mu>0).$$
If $\eta>1$, replace $c_2$ by $c_2\eta^{\beta/N}$. If $\alpha(\cdot)$ tends to zero at an exponential rate (GSM), we get
$$\frac{2^{N+3}\alpha(p)}{\varepsilon h_{\mathbf T}^d}\le c_3\,\varepsilon^{-1}h_{\mathbf T}^{-d}\,\gamma e^{-\beta p}\le c_5\exp\big(-c_4\,\hat{\mathbf T}^{\frac r{N(2r+d)}-\xi}\big),\tag{25}$$
where $\xi>0$ is arbitrarily small, $c_3>0$, $c_4>0$. Then the second inequality of the theorem is proved.
Proof of Theorem 4.6. Inequalities (11) and (13) imply conditions (i) and (ii) in Lemma 3.1, respectively. Hence (14) is obtained from Theorem 4.2, the facts that the process $(Y_{\mathbf u},\ \mathbf u\in\mathbb R^N)$ is bounded and the function $\psi$ is continuous, and Lemma 6.1 of Biau [4].

Proof of Theorem 4.8. The proof follows directly from Theorem 4.7.

Proof of Theorem 4.9. As mentioned above, the proof of this theorem is based on an upper bound for $\operatorname{var}(f_{\mathbf T})$ given in the proof of Proposition 6.2 of Biau [4]. The proof follows the lines of that of Theorem 4.11 of Bosq [10] and relies on Theorem 4.7 and Lemma 3.1.

ACKNOWLEDGMENTS

The authors thank Professors G. Biau (University Montpellier 2), L. Bel and A. Bar-Hen (INAPG), and the referee for helpful comments that improved the paper.

REFERENCES
1. L. Anselin and R. J. G. M. Florax, New Directions in Spatial Econometrics (Springer, Berlin, 1995).
2. R. M. Balan and B. G. Ivanoff, "A Markov Property for Set-Indexed Processes", J. Theor. Probab. 15, 553–588 (2002).
3. G. Banon and H. T. Nguyen, "Sur l'estimation récurrente de la densité pour un processus de Markov", C. R. Acad. Sci. Paris 286, 691–694 (1978).
4. G. Biau, "Spatial Kernel Density Estimation", Math. Methods Statist. 12, 371–390 (2003).
5. G. Biau and B. Cadre, "Nonparametric Spatial Prediction", Statist. Inference Stochastic Proc. 7, 327–349 (2004).
6. D. Blanke and B. Pumo, "Optimal Sampling for Density Estimation in Continuous Time", J. Time Series Anal. 24 (1), 1–23 (2003).
7. D. Blanke and D. Bosq, "A Family of Minimax Rates for Density Estimators in Continuous Time", Stochastic Anal. and Appl. 18, 871–900 (2000).
8. D. Bosq, "Estimation et prévision nonparamétrique d'un processus stationnaire", C. R. Acad. Sci. Paris 308, 453–456 (1989).
9. D. Bosq, "Parametric Rates of Nonparametric Estimators and Predictors for Continuous Time Processes", Ann. Statist. 25, 982–1000 (1997).
10. D.
Bosq, Nonparametric Statistics for Stochastic Processes — Estimation and Prediction, in Lecture Notes in Statist., 2nd ed. (Springer, New York, 1998).
11. D. Bosq, F. Merlevède, and M. Peligrad, "Asymptotic Normality for Density Kernel Estimators in Discrete and Continuous Time", J. Multivar. Anal. 68, 78–95 (1999).
12. M. Carbon, C. Francq, and L. T. Tran, "Kernel Regression Estimation for Random Fields", J. Statist. Plann. Inference 137 (3), 778–798 (2007).
13. M. Carbon, "Polygone des fréquences pour des champs aléatoires", C. R. Acad. Sci. Paris Sér. I 342 (9), 693–696 (2006).
14. M. Carbon, M. Hallin, and L. T. Tran, "Kernel Density Estimation for Random Fields", Statist. and Probab. Lett. 36, 115–125 (1996).
15. M. Carbon, L. T. Tran, and B. Wu, "Kernel Density Estimation for Random Fields: The L1 Theory", Nonparam. Statist. 6, 157–170 (1997).
16. N. A. C. Cressie, Statistics for Spatial Data, in Wiley Series in Probability and Mathematical Statistics (Wiley, New York, 1991).
17. S. Dabo-Niang and A. F. Yao, Density Estimation for Spatial Functional Random Variables, Preprint (2007).
18. S. Dabo-Niang, "Kernel Density Estimator in an Infinite Dimensional Space with a Rate of Convergence in the Case of Diffusion Process", Appl. Math. Lett. 17 (4), 381–386 (2004).
19. S. Dabo-Niang and N. Rhomari, "Estimation non paramétrique de la régression avec variable explicative dans un espace métrique", C. R. Acad. Sci. Paris Sér. I 336, 75–80 (2003).
20. J. Diebolt and D. Guégan, Probabilistic Properties of the General Nonlinear Auto-Regressive Process of Order One, Technical Report No. 125 (Univ. Paris VI, Paris, 1992).
21. P. Doukhan, Mixing — Properties and Examples, in Lecture Notes in Statist. (Springer, New York, 1994).
22. A. A. Efros and T. Leung, "Texture Synthesis by Non-Parametric Sampling", in Proc. IEEE Internat. Conf. on Computer Vision, Corfu, Greece (1999), pp. 1033–1038.
23. F. Ferraty and Ph.
Vieu, Nonparametric Functional Data Analysis (Springer, New York, 2006).
24. X. Guyon, Random Fields on a Network — Modeling, Statistics, and Applications (Springer, New York, 1995).
25. L. Györfi, W. Härdle, P. Sarda, and Ph. Vieu, Nonparametric Curve Estimation from Time Series, in Lecture Notes in Statist. (Springer, New York, 1990), Vol. 60.
26. M. Hallin, Z. Lu, and L. T. Tran, "Kernel Density Estimation for Spatial Processes: The L1 Theory", J. Multivar. Anal. 88 (1), 61–75 (2004).
27. M. Hallin, Z. Lu, and L. T. Tran, "Local Linear Spatial Regression", Ann. Statist. 32 (6), 2469–2500 (2004).
28. H. P. McKean, Jr., "Brownian Motion with a Several-Dimensional Time", Theory Probab. Appl. 8, 335–354 (1963).
29. S. N. Lahiri, "Resampling Methods for Spatial Regression Models under a Class of Stochastic Designs", Ann. Statist. 34 (4), 1774–1813 (2006).
30. F. Leblanc, "Density Estimation for a Class of Continuous Time Processes", Math. Methods Statist. 6, 171–199 (1997).
31. E. Levina and P. J. Bickel, "Texture Synthesis and Non-Parametric Resampling of Random Fields", Ann. Statist. 34 (4), 1751–1773 (2006).
32. P. Lévy, "Exemples de processus doubles de Markoff", C. R. Acad. Sci. Paris 226, 307–308 (1948).
33. Z. Lu and X. Chen, "Spatial Nonparametric Regression Estimation: Non-Isotropic Case", Acta Mathematicae Applicatae Sinica, English Series 18 (4), 641–656 (2002).
34. Z. Lu and X. Chen, "Spatial Kernel Regression Estimation: Weak Consistency", Statist. and Probab. Lett. 68, 125–136 (2004).
35. Z. Lu, "Weak Consistency of Nonparametric Kernel Regression under Alpha-Mixing Dependence", Chinese Science Bull. 41, 2219–2221 (1996).
36. Z. Lu and P. Chen, "Distribution-Free Strong Consistency for Nonparametric Kernel Regression Involving Nonlinear Time Series", J. Statist. Plann. Inference 65, 67–86 (1997).
37. E. Masry and D. Tjøstheim, "Nonparametric Estimation and Identification of Nonlinear ARCH Time Series: Strong Convergence and Asymptotic Normality", Econom. Theory 11, 258–289 (1995).
38. M. F. Moura and S.
Goswami, "Gauss–Markov Random Fields (GMrf) with Continuous Indices", IEEE Trans. Inform. Theory 46, 1560–1573 (1997).
39. L. Pitt and R. Robeva, "On the Sharp Markov Property for Gaussian Random Fields and Spectral Synthesis in Spaces of Bessel Potentials", Ann. Probab. 31, 1138–1176 (2003).
40. L. Pitt and R. Robeva, "On the Equality of Sharp and Germ Sigma-Fields for Gaussian Processes and Fields", Pliska Studia Mathematica 16, 183–205 (2004).
41. D. N. Politis and J. P. Romano, "Nonparametric Resampling for Homogeneous Strong Mixing Random Fields", J. Multivar. Anal. 47 (2), 301–328 (1993).
42. E. Rio, Théorie asymptotique des processus aléatoires faiblement dépendants (Springer, Berlin, 2000).
43. B. Ripley, Spatial Statistics (Wiley, New York, 1981).
44. G. G. Roussas, "Nonparametric Estimation in Mixing Sequences of Random Variables", J. Statist. Plann. Inference 18, 135–149 (1988).
45. F. Russo, "Étude de la propriété de Markov étroite en relation avec les processus planaires à accroissements indépendants", in Lecture Notes in Math. (Springer, 1984), Vol. 1059, pp. 353–378.
46. L. T. Tran, "Kernel Density Estimation on Random Fields", J. Multivar. Anal. 34, 37–53 (1990).
47. L. T. Tran and S. Yakowitz, "Nearest Neighbor Estimators for Random Fields", J. Multivar. Anal. 44, 23–46 (1993).
48. Y. K. Truong and C. J. Stone, "Nonparametric Function Estimation Involving Time Series", Ann. Statist. 20, 77–97 (1992).
49. J. B. Walsh, Martingales with a Multidimensional Parameter and Stochastic Integrals in the Plane (Laboratoire de Probabilités, Univ. Paris VI, Paris, 1976–1977), Cours de troisième cycle.
50. S. Yakowitz, "Nearest-Neighbor Methods for Time Series Analysis", J. Time Series Anal. 8, 235–247 (1987).