ROCK MECHANICS AND ROCK PRESSURE

USE OF REGRESSION AND CORRELATION ANALYSIS FOR PROCESSING EMPIRICAL DATA ON THE PHYSICAL PROPERTIES OF ROCKS

V. V. Bulatov and E. Z. Lipelis

UDC 622.02 : 519.272.119
In this article we will use the following notation: $M(\eta/x)$ is the conditional mathematical expectation of a random variate $\eta$ corresponding to a given value of a random variate $x$; $\sigma^2$ is the dispersion of the general population; $MX$ is the mathematical expectation of a random variate $x$; $S_x$ is the standard deviation of a random variate $x$; $r$ is the correlation coefficient; $\rho$ is the correlation ratio; $\hat{y}(x_i)$ is the value of the dependent random variate $y$ determined from the regression curve at the point $x_i$.

Regression-correlation analysis has an exhaustive theoretical justification under the following conditions: the random variates under investigation must be stochastically independent; the dispersion of the dependent variate must remain approximately constant during changes of the argument, or must be proportional to some function of this argument,

$$D(\eta/x) = \sigma^2 = \text{const} \quad \text{or} \quad D(\eta/x) = \sigma^2 h^2(x);$$
the random variates must have normal distributions.

In practice, we have to deal with statistical populations which do not fully satisfy these requirements. However, work on such statistical data has revealed that in this case also regression-correlation analysis is formally applicable. It has been shown that many parameters of populations which do not satisfy the above requirements do remain consistent, unbiased, and efficient for large samples.

In research on the physical properties of rocks over some area, or in investigations of the variation of the physical properties of homotypic rocks with depth, we get a field of random events. These events are evidently stochastically independent. Stochastic independence between the results of separate measurements is governed by various natural, industrial, and laboratory factors. The physical properties of rocks are determined by innumerable random phenomena which have occurred throughout the history of the formation and the development of the given section of the Earth's crust, by the particular features of the exposure of the given horizon, by the transportation and storage of the specimen, by the measuring apparatus, by the skill and individual traits of the researchers, and by other factors.

Satisfaction of the second condition can be tested by means of one of the criteria of uniformity of a series of dispersions. If we divide the range of the independent variate into $k$ intervals, and if $x_1, x_2, \ldots, x_k$ are the centers of these intervals, then for a given fixed $x_i$ the sampling value of the conditional dispersion $S_y^2(x_i)$ is given by

$$S_y^2(x_i) = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2,$$

where

$$\bar{y}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} y_{ij}, \qquad j = 1, 2, 3, \ldots, m_i.$$

Siberian Scientific Research Institute of Geology, Geophysics, and Mineral Raw Materials, Novosibirsk. Translated from Fiziko-Tekhnicheskie Problemy Razrabotki Poleznykh Iskopaemykh, No. 6, pp. 3-9, November-December, 1967. Original article submitted July 15, 1966.
To see whether the condition of constancy of the sampling dispersions is satisfied, we find the criterion $A$:

$$A = \frac{1}{C} \sum_{i=1}^{k} m_i \ln \frac{S^2}{S_y^2(x_i)},$$

where

$$C = 1 + \frac{1}{3(k - 1)} \left[ \sum_{i=1}^{k} \frac{1}{m_i} - \frac{1}{n} \right]; \qquad S^2 = \frac{1}{n} \sum_{i=1}^{k} m_i S_y^2(x_i);$$

$S^2$ is the overall dispersion of the population, and $n$ is the total number of points,

$$n = \sum_{i=1}^{k} m_i.$$
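A homogeneity check of this kind can be sketched with the textbook Bartlett-type chi-square statistic for unequal sample sizes. This is an illustration under stated assumptions, not necessarily the authors' exact weighting: it uses $m_i - 1$ degrees of freedom per interval and unbiased conditional variances, and the function name `bartlett_statistic` and the data are hypothetical. The critical value is meant to be read from chi-square tables.

```python
import math
from statistics import variance


def bartlett_statistic(groups):
    """Return (statistic, degrees_of_freedom) for the classical Bartlett test.

    groups: list of lists, one list of y-values per interval of the argument.
    Each list needs at least two values and a non-zero variance.
    """
    k = len(groups)
    dof = [len(g) - 1 for g in groups]        # m_i - 1 for each interval
    s2 = [variance(g) for g in groups]        # unbiased conditional variances
    total_dof = sum(dof)                      # n - k
    pooled = sum(d * v for d, v in zip(dof, s2)) / total_dof
    # Correction factor C of the classical formula (assumed form).
    c = 1.0 + (sum(1.0 / d for d in dof) - 1.0 / total_dof) / (3.0 * (k - 1))
    stat = (total_dof * math.log(pooled)
            - sum(d * math.log(v) for d, v in zip(dof, s2))) / c
    return stat, k - 1


# Hypothetical data: three intervals with clearly different scatter.
groups = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 6.0, 8.0]]
stat, dof = bartlett_statistic(groups)
print(stat, dof)  # compare stat with the tabulated chi-square point for dof
```

If the statistic exceeds the tabulated chi-square point for $k - 1$ degrees of freedom, the hypothesis of equal conditional dispersions would be rejected.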
The statistical study of the constancy of the sampling dispersion is based on the fact that if the theoretical conditional dispersions are constant the quantity $A$ (when $m_i > 2$) has approximately the chi-square distribution with $(k - 1)$ degrees of freedom. If the estimated value of $A$ falls in the region of improbable values of a random $\chi^2(k - 1)$ distribution, then we must reject the hypothesis that the theoretical conditional dispersions are equal. In this case, to use the methods of regression-correlation analysis we must select a particular form of the function $h^2(x)$ expressing the law of variation of the conditional theoretical dispersions $D(\eta/x)$ with the argument $x$ [1]. If $A < \chi^2_\alpha(k - 1)$, we conclude that

$$D(\eta/x_1) = D(\eta/x_2) = \ldots = D(\eta/x_k) = \sigma^2.$$

In most cases
application of the $A$ criterion to estimate the dispersions of the physical parameters of rocks reveals that the theoretical partial dispersions are not in fact equal. Since rocks are highly anisotropic, measurements of their characteristics will give a large scatter between individual results. Thus as a rule it is impossible to assess the constancy of the dispersions of the results of measurements of the physical properties of rocks by the $A$ parameter. Clearly the $A$ condition is of little use in this type of statistical material.

We can, however, test the hypothesis that the sampling dispersions are equal by using the fact that their ratio obeys the $F$ distribution. By comparing the ratios of the sample dispersions we can get an unbiased estimate. Calculations reveal that in using the $F$ criterion at the $\alpha = 0.05$ significance level, in the overwhelming majority of cases we are adopting the hypothesis that the theoretical conditional dispersions are equal.

Let us consider one more method of testing the hypothesis of equality between a number of dispersions: Bartlett's method, which presupposes that the partial dispersions are calculated from equal-sized samples. Now, in the construction of correlation graphs of the relations between different physical parameters of rocks, to each fixed interval of the argument there corresponds, as a rule, a different number of values of the dependent variate; thus Bartlett's criterion can be applied only as an approximate assessment of the hypothesis of uniformity of a series of dispersions, provided that the sample sizes do not differ too much. In estimating the conditional dispersions by Bartlett's criterion, as the number of tests for each value $\Delta x_i$ ($i = 1, 2, \ldots, k$) we arbitrarily took the value

$$\bar{m} = \frac{1}{k} \sum_{i=1}^{k} m_i.$$
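The variance-ratio check mentioned above can be sketched as follows. This is a minimal illustration: the function name `variance_ratio` and the data are hypothetical, the sample dispersions are assumed to be the unbiased ones, and the critical value $F_\alpha$ is assumed to be read from tables for the returned degrees of freedom.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return (F, dof_numerator, dof_denominator) for two samples.

    The larger sample variance is placed in the numerator so that F >= 1.
    """
    va, vb = variance(sample_a), variance(sample_b)
    if va >= vb:
        return va / vb, len(sample_a) - 1, len(sample_b) - 1
    return vb / va, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical conditional samples from two intervals of the argument.
f_value, dof_num, dof_den = variance_ratio([1.0, 2.0, 3.0],
                                           [10.0, 20.0, 30.0, 40.0])
print(f_value, dof_num, dof_den)  # compare f_value with the tabulated F point
```

Equality of the two theoretical dispersions would be accepted when the ratio stays below the tabulated point for the chosen significance level.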
From analyses of a large number of different relations we concluded that the theoretical partial dispersions of the results of measurements of the physical properties of rocks do not vary with the argument. Variations of the conditional empirical dispersions with changes in the argument are not regular, owing to the effects of various random factors. This inference is of great practical importance, because in the case $\sigma^2 = \sigma^2(x_i)$ correction factors (the so-called "weights") would have to be introduced into all the formulas, and in many cases the formulas would become much more complicated.

Analysis of the empirical probability density functions of various physical parameters (permeability, residual water content, available static capacity), and also of the density of distribution of the number of measurements at different depths, reveals that these are not normally distributed. In addition, in some cases the empirical distribution density functions of the physical characteristics of the rock have several modes. When the distribution law of a random variate deviates from the normal, the methods of correlation and regression analysis can be used formally in processing empirical data, especially with large samples.
Cramér [2] states: "A knowledge of the first four moments for any density function belonging to this system is sufficient for complete determination of the function." It follows from what has been said that the empirical parameters $MX$, $MY$, $S_x$, $S_y$, $r$, $\rho$ are valid statistical criteria for representing sample populations whose distribution law differs from the normal.

It has been found that using the method of least squares to find a particular form of the regression equation is very laborious in view of the large scatter of the random values of the physical parameters of rocks about their mathematical expectations. In most cases it is impossible to give a concrete description in the form of relations $\hat{y}(x_i) = f(x_i, a, b, \ldots)$. The system of differential equations which arises in the method of least squares cannot always be solved in general form, and it is necessary to assume a particular form for the regression equation $\hat{y}(x_i)$. We therefore found the regression equations by trial and error. This method is less laborious and quite permissible in treating empirical material on the physical properties of rocks.

The regression equations found were checked by the $v^2$ criterion, which is based on the fact that Fisher's distribution, with $m = k - 2$ degrees of freedom for the numerator and $l = n - k$ for the denominator, is obeyed, in the case of correct representation of the regression equation, by the expression

$$v^2 = \frac{(n - k) \sum_{i=1}^{k} m_i \left[ \bar{y}_i - \hat{y}(x_i) \right]^2}{(k - 2) \sum_{i=1}^{k} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2}$$

(where $i = 1, 2, \ldots, k$; to each $x_i$ there corresponds a set of values $y_{ij}$, $j = 1, 2, \ldots, m_i$).
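A check of this kind can be sketched as a lack-of-fit ratio: the scatter of the interval means about the fitted curve against the scatter of the individual values about their interval means. The sketch below assumes a two-parameter regression equation (hence $k - 2$ in the numerator degrees of freedom); the function name `lack_of_fit_v2` and the data are hypothetical.

```python
def lack_of_fit_v2(groups, fitted):
    """Return (v2, dof_numerator, dof_denominator).

    groups[i]: list of measured y-values at the i-th value of the argument.
    fitted[i]: value of the trial regression equation at that argument.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    # Weighted squared deviations of the interval means from the curve.
    between = sum(len(g) * (m - f) ** 2
                  for g, m, f in zip(groups, means, fitted))
    # Squared deviations of individual values from their interval means.
    within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    v2 = (n - k) * between / ((k - 2) * within)
    return v2, k - 2, n - k


# Hypothetical data: four argument values, two measurements at each.
groups = [[1.0, 3.0], [2.0, 4.0], [5.0, 7.0], [6.0, 8.0]]
fitted = [2.0, 3.0, 6.0, 9.0]   # trial regression values at the four points
print(lack_of_fit_v2(groups, fitted))
```

The returned ratio is then compared with the tabulated $F$ point for the two returned degrees of freedom.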
In the general case, the validity of the chosen form of the regression equation is checked as follows: a low enough value is chosen for the significance level $\alpha$; then from the tables of the $F(k - 2, n - k)$ distribution we find the $100\alpha\%$ point $F_\alpha$. The tabulated value of $F_\alpha$ is compared with $v^2$ calculated from the experimental points. If $v^2 > F_\alpha(m, l)$, then the chosen regression equation $\hat{y}(x_i)$ is incorrect. On the other hand, if $v^2 < F_\alpha(m, l)$, the regression equation satisfactorily fits the statistical data. Assessment of the validity of the regression equations, obtained by trial and error for the relation between the various physical characteristics of rocks and their relation to depth, revealed that these equations are quite satisfactory.

Before choosing the general form of the relation between the quantities under test it is important to determine whether or not the relation is linear. For this purpose we can use the $W^2$ criterion. This is based on the fact that, if the hypothesis of linear regression is correct, the expression

$$W^2 = \frac{(n - k)(\hat{\rho}^2 - \hat{r}^2)}{(k - 2)(1 - \hat{\rho}^2)}$$

obeys the $F(m, l)$ distribution with $m = k - 2$ degrees of freedom for the numerator and $l = n - k$ degrees of freedom for the denominator. We set a low enough significance level $\alpha$, and from the tables we find the $100\alpha\%$ point $F_\alpha(m, l)$. If we find that the calculated value of $W^2$ is greater than $F_\alpha(m, l)$, then the hypothesis of linear regression is statistically unjustified. If, however, $W^2 < F_\alpha(m, l)$, this hypothesis is consistent with the experimental data.

Investigation of the relation between two variates reduces to seeking the most probable mean value of one variate for a fixed value of the other. Various accuracy requirements will arise for the calculated regression equation, according to the physical nature of the statistical data and the particular aims of the investigation. If we know the confidence limits, we can say with a given degree of probability that the conditional mathematical expectation of the dependent variate appertaining to the line of theoretical regression does not fall outside these limits.

For linear regression equations, the confidence limits are found from the usual formulas. In some cases, by transforming to a new coordinate system, we can convert a curvilinear equation to a rectilinear one. For the latter we can find the confidence range, and then by transforming back the coordinates we can find the confidence ranges for the original curvilinear regression curves. Methods of linearization are given for a number of equations in works by Aivazyan [1] and others.
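The linearity check with the $W^2$ criterion can be sketched as below. The sketch assumes that the sample correlation ratio is computed as the between-interval sum of squares divided by the total sum of squares, and that the correlation coefficient is the ordinary Pearson coefficient over all points; the function name `linearity_w2` and the data are hypothetical.

```python
def linearity_w2(xs, groups):
    """Return (w2, rho2, r2) for data grouped by argument value.

    xs[i]: i-th value of the argument; groups[i]: y-values measured there.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    all_x = [x for x, g in zip(xs, groups) for _ in g]
    all_y = [y for g in groups for y in g]
    mean_x = sum(all_x) / n
    mean_y = sum(all_y) / n
    sxx = sum((x - mean_x) ** 2 for x in all_x)
    syy = sum((y - mean_y) ** 2 for y in all_y)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(all_x, all_y))
    r2 = sxy ** 2 / (sxx * syy)                     # squared correlation coefficient
    between = sum(len(g) * (sum(g) / len(g) - mean_y) ** 2 for g in groups)
    rho2 = between / syy                            # squared correlation ratio
    w2 = (n - k) * (rho2 - r2) / ((k - 2) * (1.0 - rho2))
    return w2, rho2, r2


# Hypothetical data whose interval means rise and then fall (non-linear).
xs = [1.0, 2.0, 3.0]
groups = [[0.0, 2.0], [3.0, 5.0], [0.0, 2.0]]
print(linearity_w2(xs, groups))
```

A value of `w2` above the tabulated $F$ point for $(k - 2, n - k)$ degrees of freedom would speak against a linear relation.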
Fig. 1. Regression graph in original coordinates XOY: $\bar{x} = 2420.5$; $S_y = 21.856$; $\rho = 0.9380$; $k = 40$; the values 68.156 and 70.508 are also given, for $\bar{y}$ and $S_x$ (assignment not legible). Formations marked on the graph: Achimovskaya, Kulomzinskaya, Tarskaya, Lokosovskaya.
As an example, let us consider the construction of confidence limits for the regression equation $y = 63 e^{-1.25\sqrt{x}}$:

1) By means of the transformation $u = \sqrt{x}$, $z = \ln y$, the curve $y = 63 \exp(-1.25\sqrt{x})$ is converted to the straight line $z = \ln 63 - 1.25u$ (there is no need to draw the UOZ plane graphically). For the $x_i$ ($i = 1, 2, \ldots, n$) we find the $u_i = \sqrt{x_i}$; for the $y_i$ we find the $z_i = \ln y_i$. We then calculate $\bar{u}$, $\bar{z}$, $S_u$, $S_z$, and $r$.

2) To locate the specific points belonging to the confidence limits, we take a number of successive values $x_0$ and for these we calculate the values

$$u_0 = \sqrt{x_0}, \qquad z_0 = \ln 63 - 1.25 u_0;$$

$$\Delta z_0 = t_{\alpha/2}(n - 2)\, \frac{S_z \sqrt{1 - r^2}}{\sqrt{n - 2}} \sqrt{1 + \frac{(u_0 - \bar{u})^2}{S_u^2}}, \qquad (1)$$

where $t_{\alpha/2}(n - 2)$ is the value of Student's distribution for the $\alpha$ significance level and $(n - 2)$ degrees of freedom.
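The two steps above can be sketched in code for the example curve, with the limits mapped back to the original coordinates through the inverse transformation $y = e^z$. This is a sketch under assumptions: $S_u$ and $S_z$ are taken as standard deviations with divisor $n$, the Student value `t_crit` is supplied from tables, and the function name `confidence_limits` and the data are hypothetical.

```python
import math


def confidence_limits(xs, ys, x0_values, t_crit,
                      intercept=math.log(63.0), slope=-1.25):
    """Return a list of (x0, lower_y, upper_y) for the curve y = 63*exp(-1.25*sqrt(x)).

    xs, ys: experimental points; t_crit: tabulated Student value for n - 2
    degrees of freedom (assumed to be supplied by the user).
    """
    n = len(xs)
    us = [math.sqrt(x) for x in xs]          # u = sqrt(x)
    zs = [math.log(y) for y in ys]           # z = ln y
    u_bar = sum(us) / n
    z_bar = sum(zs) / n
    s_u = math.sqrt(sum((u - u_bar) ** 2 for u in us) / n)
    s_z = math.sqrt(sum((z - z_bar) ** 2 for z in zs) / n)
    r = sum((u - u_bar) * (z - z_bar) for u, z in zip(us, zs)) / (n * s_u * s_z)
    limits = []
    for x0 in x0_values:
        u0 = math.sqrt(x0)
        z0 = intercept + slope * u0          # point on the straight line
        dz = (t_crit * s_z * math.sqrt(max(0.0, 1.0 - r * r)) / math.sqrt(n - 2)
              * math.sqrt(1.0 + (u0 - u_bar) ** 2 / s_u ** 2))
        # Back to the original coordinates: y = exp(z).
        limits.append((x0, math.exp(z0 - dz), math.exp(z0 + dz)))
    return limits


# Hypothetical measurements scattered about the example curve.
xs = [1.0, 4.0, 9.0, 16.0, 25.0]
noise = [1.1, 0.9, 1.05, 0.95, 1.0]
ys = [63.0 * math.exp(-1.25 * math.sqrt(x)) * f for x, f in zip(xs, noise)]
# 3.182 is the two-sided 5% Student value for 3 degrees of freedom.
for row in confidence_limits(xs, ys, [4.0, 9.0, 16.0], 3.182):
    print(row)
```

The band is narrowest where $u_0$ equals $\bar{u}$ and widens away from it, as the last factor of Eq. (1) implies.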
We then find the values of $z_0 + \Delta z_0$, $z_0 - \Delta z_0$. Going back to the XOY system of coordinates by means of the formulas $y = e^z$, $x = u^2$, we get a set of points representing the desired confidence limits. We draw the lines representing the confidence limits through these points.

However, there are many complicated functions which cannot be linearized. Sometimes the regression curve can be linearized by means of a transformation in which some of the values of the dependent variate are imaginary. In these cases we can construct approximate confidence limits by converting the given curve to a nearly straight line.

Example. For boreholes in the Upper and Lower Vartovsk areas, the dependence of the residual water content $a$ of the rock on the depth $H$ is as follows (see Fig. 1):

$$a = 30 + 46\, e^{-0.\ldots \cdot 10^{-\ldots} (H - 2400)^2}.$$

This function can be linearized by means of the substitution $z = \ln(a - 30)$, $u = (H - 2400)^2$, but then we cannot include any points with ordinates $a_i \le 30$, because for them $a_i - 30 \le 0$ and of course zero and negative numbers have no logarithms (in the real number field). Therefore for the numerical values of the experimental points and some fixed points belonging to the line of regression we make the transformation $z_i = \log a_i$, $u_i = (H_i - 2400)^2$; as a result we get an almost linear regression curve in the new system UOZ (Fig. 2). We then find the parameters $\bar{u}$, $\bar{z}$, $S_z$, $S_u$, and $r$. By means of Eq. (1) we find $\Delta z_0$ and calculate the confidence limits $(u_0, z_0 \pm \Delta z_0)$, after which we return to the original coordinates XOY (see Fig. 1). The confidence limits in this case are, of course, approximate.

As a rule, the depth dependences of the physical properties of rocks have been constructed from a number of boreholes of any area, and therefore the stratigraphic boundaries on the graphs were averaged in depth (see Fig. 1).

Figure 1 gives the principal statistical characteristics of the population: the mathematical expectations $\bar{x}$, $\bar{y}$, the rms deviations $S_x$, $S_y$, the correlation coefficient $r$, and the correlation ratio $\rho$. We also give the total number of observations $n$ and the number of fixed values of the argument $k$, and the values of $W^2$ and $V$.
Fig. 2. Graph of regression in transformed coordinates UOZ (axes: $u = (H - 2400)^2$, $z = \log a$).

The criterion

$$V = \frac{\hat{\rho} \sqrt{n - 2}}{\sqrt{1 - \hat{\rho}^2}}$$

is used to judge the significance of the correlation ratio at the level $\alpha$. For large $n$, the correlation ratio has an approximately normal distribution with parameters $\rho$, $(1 - \rho^2)/\sqrt{n - 1}$. Then provided that $V < t_\alpha(n - 2)$, we can infer that there is no correlation between the random variates. If, on the other hand, $V > t_\alpha(n - 2)$, then the correlation is significant at the $\alpha$ level. Thus for the relation $a = a(H)$ in Fig. 1, $V = 38.386$, $t_{0.05} = 2.06$, i.e., $V > t_{0.05}$, and therefore $\rho$ is significant at the 0.05 level.

Knowing the empirical correlation coefficient and the number of measurements $n$, and given a certain significance level $\alpha$, from the tables we can find the confidence range for the correlation coefficient of the general population. In this case, $r = -0.389$; $n = 96$; $\alpha = 0.05$; $P(-0.75 < r < 0.13) = 0.95$. Consequently, with probability 0.95 we can assert that the value of the theoretical correlation coefficient lies between $-0.75$ and $0.13$.
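The significance check of the correlation ratio described above can be sketched as a small helper. The function name `correlation_ratio_v` and the input numbers are hypothetical and are not taken from the article's data; the Student value is assumed to come from tables.

```python
import math


def correlation_ratio_v(rho, n, t_crit):
    """Return (V, significant) for a sample correlation ratio rho and n points.

    t_crit: tabulated Student value for n - 2 degrees of freedom.
    """
    v = rho * math.sqrt(n - 2) / math.sqrt(1.0 - rho ** 2)
    return v, v > t_crit


# Hypothetical case: rho = 0.6 from n = 27 points; 2.06 is the two-sided
# 5% Student value for 25 degrees of freedom.
print(correlation_ratio_v(0.6, 27, 2.06))
```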
In this case $W^2 = 6.1727 > F_{0.05}(m, l) = 2.6870$, and therefore the correlation is not linear at the 0.05 significance level.

LITERATURE CITED

1. S. A. Aivazyan, "Use of methods of correlation and regression analysis in the processing of experimental results," Zavod. Lab., No. 7, 8 (1964).
2. H. Cramér, Mathematical Methods of Statistics [Russian translation], Moscow (1948).