Methodological and computational considerations for multiple correlation analysis
GWOWEN SHIEH and CHIEN-FENG KUNG
National Chiao Tung University, Hsinchu, Taiwan
The squared multiple correlation coefficient has been widely employed to assess the goodness-of-fit of linear regression models in many applications. Although there are numerous published sources that present inferential issues and computing algorithms for multinormal correlation models, the statistical procedure for testing substantive significance by specifying the nonzero-effect null hypothesis has received little attention. This article emphasizes the importance of determining whether the squared multiple correlation coefficient is small or large in comparison with some prescribed standard and develops corresponding Excel worksheets that facilitate the implementation of various aspects of the suggested significance tests. In view of the extensive accessibility of Microsoft Excel software and the ultimate convenience of general-purpose statistical packages, the associated computer routines for interval estimation, power calculation, and sample size determination are also provided for completeness. The statistical methods and available programs of multiple correlation analysis described in this article purport to enhance pedagogical presentation in academic curricula and practical application in psychological research.
The study of correlation coefficients among variables is
one of the most fundamental issues across a variety of
disciplines including psychological research. In particular,
the majority of the literature has been focused on the
multiple correlation coefficient between a criterion variable
and one set of predictor variables in the context of linear
regression models. As the squared multiple correlation
coefficient or the strength of association ρ² represents the
fraction of reduction in the variance of the criterion variable
accounted for by the predictor variables and the overall
usefulness of the regression model, extensive results have
been derived that give various expressions,
approximations, and computing algorithms for the theoretical
properties of the sample squared multiple correlation
coefficient or the coefficient of determination, R², when the
criterion and predictor variables have a joint multivariate
normal distribution. See Johnson, Kotz, and Balakrishnan
(1995, chap. 32) and Stuart and Ord (1994, chap. 16) for
comprehensive discussions and further details. A primary
concern regarding regression analysis is the conception
of the two distinct scenarios of fixed (conditional) and
random (unconditional) modeling formulations that
ultimately lead to different inferential procedures. One must
have a clear understanding of the respective setups and
how they can be utilized before the issues involved in the
construction of an appropriate regression model can be
fully explained. Notably, Sampson (1974) gave an
excellent and thorough description of the two modeling
formulations in which the random setting adopts the convenient
assumption that all variables have a joint multivariate
normal distribution. The procedures for power calculation,
interval estimation, and sample size determination under the
fixed regression models are well known (see Murphy &
Myors, 2004; Smithson, 2003, and the references therein
for further details). However, the corresponding methods
are more complex under the random model. Specifically,
we focus our attention here on the situation that the
criterion and predictor variables have a joint multinormal
distribution.
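For reference, the population quantity and its sample analogue can be written as follows. This is a sketch in standard textbook notation that is assumed here rather than taken from this article: σ_Y² is the variance of the criterion, Σ_X is the covariance matrix of the P predictors, σ_XY is the vector of covariances between the predictors and the criterion, and Ŷ_i is the least squares fitted value for the ith of N observations.

```latex
\rho^2 \;=\; \frac{\boldsymbol{\sigma}_{XY}^{\top}\,\boldsymbol{\Sigma}_{X}^{-1}\,\boldsymbol{\sigma}_{XY}}{\sigma_Y^{2}},
\qquad
R^2 \;=\; 1 - \frac{\sum_{i=1}^{N}\bigl(Y_i - \hat{Y}_i\bigr)^2}{\sum_{i=1}^{N}\bigl(Y_i - \bar{Y}\bigr)^2}.
```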
METHODOLOGICAL CONSIDERATION
Although the sample squared multiple correlation
coefficient is routinely computed with all major commercial
software packages, these packages do not offer a full range
of inferential procedures for multiple correlation analysis.
For the purpose of interval estimation, power calculation,
and sample size determination for the squared multiple
correlation coefficient, various methodological
developments and computer programs are presented in Algina and
Olejnik (2003), Dunlap, Xin, and Myers (2004), Mendoza
and Stafford (2001), Shieh (2006), and Steiger and
Fouladi (1992). The currently available computing algorithms
presented in these articles are useful in important and
distinctive ways, in that they implement different
statistical methods under diverse user interfaces. In
considering inference on multiple correlation coefficients, the R2
package developed by Steiger and Fouladi (1992) appears
to be the most versatile numerical routine because it
includes not only the percentile and cumulative distribution
function of R², exact confidence interval estimation for
ρ², power calculation, and sample size determination for the
standard significance test H₀: ρ² = 0, but also the
hypothesis testing procedure H₀: ρ² = ρ₀² (ρ₀² > 0) not generally
available in the other aforementioned computer programs.
However, there is room for some further extensions in two
aspects. First, the calculation of minimum sample size
required for the sample squared multiple correlation
coefficient R2 to fall into a prescribed interval with adequate
accuracy was illustrated in Algina and Olejnik (2003) and
Shieh (2006). See Kelley and Maxwell (2003) for a recent
treatment of sample size planning for accuracy in
parameter estimation within the multiple regression framework.
Second, it is a natural generalization to incorporate both
the power calculation and sample size determination for
testing hypotheses involving nonzero target values for ρ².
Note that Wilcox (1980) presented the exact sample sizes
using the indifference zone approach to the problem of
determining whether ρ² is above or below a known constant.
Moreover, Fowler (1985), Murphy and Myors (1999), and
Steiger (2004) repeatedly stressed the notion of testing
substantive significance in the context of general linear
models.
COMPUTATIONAL CONSIDERATION
From a purely computational perspective, the R2
package of Steiger and Fouladi (1992) for multiple correlation
analysis is a stand-alone MSDOS program, whereas the
existing algorithms of Algina and Olejnik (2003), Dunlap,
Xin, and Myers (2004), Mendoza and Stafford (2001),
and Shieh (2006) are associated with more advanced
software systems, such as FORTRAN, Mathematica, SAS,
and SPSS. Since these large commercial packages may
not be generally available and have their own unique user
interfaces, extra effort is required for performing the
necessary analyses, thus incurring greater complexity.
Conceivably, it should be of practical interest for researchers
or students to conduct multiple correlation analysis on
a readily accessible computing platform. In view of the
widespread availability of personal computers operated
with Microsoft Windows systems, this article aims to
present a Microsoft Excel program that contains a full range
of inference procedures for multiple correlation analysis.
The program is referred to as RHO-SQUARE, and
detailed information is described in the following section.
The proposed package has the distinct features of being
accessible, accurate, flexible, and free.
It should be emphasized that there has been some
concern regarding the accuracy of statistical procedures in
Microsoft Excel (see, for example, Knüsel, 2005; McCullough
& Wilson, 2005). In the RHO-SQUARE program, only
limited statistical functions of Excel are employed in our
developed routines, along with other standard
mathematical functions. Specifically, the central F cumulative
distribution function and its inverse are utilized for
computing the cumulative probability and quantile of the central F
distribution. For the core numerical computation,
the formulation and expansion of Lee (1972) are
incorporated to evaluate the notoriously complicated cumulative
distribution function of R². The computation is
theoretically exact provided that the auxiliary functions can be
evaluated exactly. In conjunction with basic computation
techniques that require the standard numerical methods
of one-dimensional integration and interval-halving
algorithms, the suggested Excel program provides alternative
routines for performing multiple correlation analysis. In
order to verify the accuracy of the RHO-SQUARE
program, four sets of comparisons were conducted. First, the
lower bounds of the upper 95% confidence intervals
presented in Table 5 of Mendoza and Stafford (2001) with
N = 50 for given values of the number of predictor variables
P and the observed value of R² were recalculated. The
results are almost identical, and the minor discrepancies are
evidently due to different rounding schemes. Second, the
present program yielded exactly the same minimum
sample sizes required for the prescribed interval (0, ρ² + b)
of the squared multiple correlation coefficient with coverage
probability of at least 0.95 and P = 5 given in Table 4 of
Shieh (2006). Third, the computed powers and those
presented in Table 1 of Dunlap, Xin, and Myers (2004) are
practically equal for the selected sample size N, number
of predictor variables P, target value ρ₀² = 0 under the null
hypothesis, true value ρ₁² = ρ² under the alternative hypothesis,
and significance level α = 0.05. The last comparison was
performed for the minimum sample size N needed to test the
hypothesis in order to attain the specified power for the
chosen number of predictor variables P = 5, target value
ρ₀² = 0 under the null hypothesis, true value ρ₁² = ρ² under the
alternative hypothesis, and significance level α = 0.05.
All the results generated by the RHO-SQUARE program
coincided with those presented in Gatsonis and Sampson
(1989) except for three cases that differed only by one
unit. The differences seem to be negligible with respect to
the magnitude of the resulting sample sizes. According to
these comparisons, we conclude that there is an excellent
agreement between the RHO-SQUARE program and the
existing algorithms in all cases. Other Excel programs that
attempt to address related but different aspects of normal
correlation analysis can be found in Alf and Graf (2002)
and Barnette (2005). Finally, it is noteworthy that the
inferential procedures considered in the RHO-SQUARE
program depend on the specific multinormal assumption of
the criterion and predictor variables as in the other
abovementioned software packages for multiple correlation
analysis. When the underlying normality assumption is
not present, it is questionable that the procedures will give
accurate and reasonable results. In view of this significant
limitation, therefore, it seems prudent to ensure that the
properties of the associated variables are well understood
before the standard inferential methods are adopted by
researchers as general analytic procedures.
THE RHO-SQUARE PROGRAM
With the ultimate goal of presenting a full account of
exact procedures for the analysis of the squared multiple
correlation coefficient, the RHO-SQUARE program includes
three sets of algorithms for performing the calculations
related to the basic distributional properties of R² and the
statistical methods of interval estimation and hypothesis
testing for ρ². The program has four pages of worksheets.
The first page contains a brief introduction, followed by
three worksheets that are organized to present the
following features.
Distributional Properties of R²
The probability density function and cumulative
distribution function of R² are plotted for given values of the
population squared multiple correlation coefficient ρ², sample
size N, and number of predictor variables P. Moreover,
the percentile and cumulative probability for the
prescribed model configuration can be computed by
specifying the selected cumulative probability and percentage
point of R², respectively.
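The cumulative probability and percentile calculations described above can also be sketched outside Excel. The Python sketch below does not reproduce the Lee (1972) expansion used in RHO-SQUARE. It assumes instead the standard representation of the distribution of R² under joint multinormality as a negative binomial mixture of beta distributions, and it finds percentiles by interval halving. The function names betainc_reg, r2_cdf, and r2_quantile are hypothetical and are not part of the program.

```python
import math


def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b) by a continued fraction."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # Symmetry I_x(a, b) = 1 - I_{1-x}(b, a) keeps the fraction fast to converge.
    if x > (a + 1.0) / (a + b + 2.0):
        return 1.0 - betainc_reg(b, a, 1.0 - x)
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front) / a
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 500):
        # Even-numbered coefficient of the continued fraction.
        aa = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd-numbered coefficient of the continued fraction.
        aa = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return front * h


def r2_cdf(x, rho2, n_obs, n_pred, tol=1e-12, max_terms=10000):
    """P(R^2 <= x) for population value rho2 (< 1), sample size N, P predictors.

    Assumed representation: a mixture of Beta(P/2 + k, (N - P - 1)/2) cdfs with
    negative binomial weights built from (N - 1)/2 and rho2.
    """
    half_n = (n_obs - 1) / 2.0
    a = n_pred / 2.0
    b = (n_obs - n_pred - 1) / 2.0
    weight = (1.0 - rho2) ** half_n      # weight for k = 0
    total = 0.0
    mass = 0.0
    for k in range(max_terms):
        total += weight * betainc_reg(a + k, b, x)
        mass += weight
        if 1.0 - mass < tol:             # remaining weight bounds the error
            break
        weight *= (half_n + k) / (k + 1.0) * rho2
    return total


def r2_quantile(prob, rho2, n_obs, n_pred, iters=60):
    """Percentile of R^2 by interval halving on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r2_cdf(mid, rho2, n_obs, n_pred) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, calling r2_quantile(0.95, 0.3, 100, 5) returns the upper 5% point of R² for ρ² = .3, N = 100, and P = 5 under the assumed representation, which can be compared with the corresponding critical value reported in Example 2.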
Hypothesis Testing for ρ²
The following one- and two-tailed tests of hypotheses
can be conducted: H₀: ρ² ≤ ρ₀², H₀: ρ² ≥ ρ₀², and H₀: ρ² =
ρ₀². In each case, the critical values and p value are
calculated for the given sample size N, number
of predictor variables P, significance level α, target value
ρ₀² under the null hypothesis, and observed value of R². For
the purpose of power calculation, the exact power is
computed for the input values of sample size N, number of
predictor variables P, significance level α, target value ρ₀²
under the null hypothesis, and true value ρ₁² under the alternative
hypothesis. The power approach to sample size
determination can be performed as well. The program calculates
the minimum sample size N needed to test hypotheses in
order to attain the specified power for the chosen number
of predictor variables P, significance level α, target value
ρ₀² under the null hypothesis, true value ρ₁² under the alternative
hypothesis, and desired power level.
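The size and power of an upper-tailed test that rejects when R² exceeds a critical value can be checked informally by simulation. The following Python sketch is a Monte Carlo approximation, not the exact procedure implemented in RHO-SQUARE. It assumes the usual stochastic representation of R²/(1 − R²) under joint multinormality as a ratio of chi-square-type variables, and the function names simulate_r2 and rejection_rate are hypothetical.

```python
import math
import random


def simulate_r2(rho2, n_obs, n_pred, rng):
    """Draw one R^2 under the multinormal model (requires 0 <= rho2 < 1).

    Assumed representation:
      R^2 / (1 - R^2) = [(Z + sqrt(theta * A))^2 + B] / C,
    with theta = rho2 / (1 - rho2), Z standard normal, and independent
    chi-square variables A (N - 1 df), B (P - 1 df), and C (N - P - 1 df).
    """
    theta = rho2 / (1.0 - rho2)
    # A chi-square variable with v df is a gamma variable with shape v/2, scale 2.
    chi_a = rng.gammavariate((n_obs - 1) / 2.0, 2.0)
    numerator = (rng.gauss(0.0, 1.0) + math.sqrt(theta * chi_a)) ** 2
    if n_pred > 1:
        numerator += rng.gammavariate((n_pred - 1) / 2.0, 2.0)
    denominator = rng.gammavariate((n_obs - n_pred - 1) / 2.0, 2.0)
    ratio = numerator / denominator
    return ratio / (1.0 + ratio)


def rejection_rate(critical, rho2, n_obs, n_pred, draws=20000, seed=1):
    """Estimate P(R^2 > critical) for a given population value rho2."""
    rng = random.Random(seed)
    hits = sum(simulate_r2(rho2, n_obs, n_pred, rng) > critical
               for _ in range(draws))
    return hits / draws


if __name__ == "__main__":
    # Estimated Type I error at the boundary rho2 = .3 and estimated power at
    # rho2 = .5 for a hypothetical critical value of .4566 with N = 100, P = 5.
    print(rejection_rate(0.4566, 0.3, 100, 5))
    print(rejection_rate(0.4566, 0.5, 100, 5))
```

Evaluating the rejection rate at the null boundary value ρ₀² estimates the Type I error rate, and evaluating it at ρ₁² estimates the power. Repeating the calculation over increasing N gives a rough, simulation-based counterpart to the sample size search, subject to Monte Carlo error.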
In order to facilitate the application and illustrate the
features of the RHO-SQUARE program, the following
numerical examples are presented.
Example 1
Suppose a linear regression analysis is performed with
N = 50, P = 5, and the sample squared multiple
correlation coefficient R² = .3. Then the computed lower, upper,
and two-sided 95% confidence intervals are (0, .4245),
(.0589, 1), and (.0337, .4603), respectively. For research
planning purposes, RHO-SQUARE can easily determine
the precise minimum sample size for R² to fall into the
interval (R²_L, R²_U) with a prescribed probability. Assuming
that ρ² = .4 and P = 5, the required sample sizes to
ensure the desired accuracy with probability .95 for the selected
intervals (R²_L, R²_U) = (0, .6), (.3, 1), and (.3, .6) are 58, 99,
and 109, respectively.
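The confidence limits in Example 1 are consistent with the usual construction that inverts the cumulative distribution function of R² with respect to ρ². The equations below sketch that construction. The notation F(r² | ρ², N, P) for the cumulative distribution function of R² evaluated at the observed value r² is introduced here for illustration, and the article itself does not state these equations.

```latex
% F(r^2 \mid \rho^2, N, P) = \Pr(R^2 \le r^2), which decreases as \rho^2 increases.
\begin{aligned}
\text{one-sided interval } (0,\ \rho_U^2):&\quad F(r^2 \mid \rho_U^2, N, P) = \alpha,\\
\text{one-sided interval } (\rho_L^2,\ 1):&\quad F(r^2 \mid \rho_L^2, N, P) = 1 - \alpha,\\
\text{two-sided interval } (\rho_L^2,\ \rho_U^2):&\quad F(r^2 \mid \rho_L^2, N, P) = 1 - \alpha/2,
\quad F(r^2 \mid \rho_U^2, N, P) = \alpha/2.
\end{aligned}
```

Because F is monotone in ρ², each limit can be located by interval halving over [0, 1). When F(r² | 0, N, P) is already below the required upper probability, no positive solution exists for the lower limit, and it is conventionally set to 0.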
Example 2
Consider the hypothesis testing problem for the
squared multiple correlation coefficient ρ² with N = 100
and P = 5. The test for confirming a substantial level of
strength of association in terms of H₀: ρ² ≤ .3 versus H₁:
ρ² > .3 can be readily conducted with RHO-SQUARE.
The respective critical values for Type I error rates .05 and
.01 are .4566 and .5068. Suppose the observed value is
R² = .5; then the associated p value is .0128.
Hence, the rejection of the null hypothesis at the .05
significance level demonstrates that the strength of association
exceeds the chosen threshold of .3. On the other hand, one
may be interested in determining whether the strength of
association is trivial or minimal with H₀: ρ² ≥ .2 versus
H₁: ρ² < .2. In this situation, the critical values for α = .05
and .01 are .1242 and .0865, respectively. If one observed
R² = .10, the corresponding p value is .0192 and the
null hypothesis is rejected at the .05 significance level.
Accordingly, the result suggests that the level of strength
of association is not high enough to make a real difference.
Moreover, the power approach to sample size
determination can be performed as well. The minimum sample size
N = 153 is required for testing the hypothesis H₀: ρ² ≥ .2
with specified parameter values of ρ₀² = .2 and ρ₁² = .05,
significance level α = .05, and nominal power .90.
Given the complex interrelationships that exist among
multiple variables in psychology and other social science
settings, it is important for researchers to become
conversant with various analytic techniques for the squared multiple
correlation coefficient. Knowledge of the corresponding
inferential procedures is often critical for investigators to
address scientific hypotheses and confirm credible
effects. Furthermore, the a priori determination of a proper
sample size necessary to achieve some specified power
and accuracy is a salient problem encountered frequently
in applied settings. More importantly, it is more efficient
for researchers or students to be able to conduct multiple
correlation analysis on a readily accessible computing
platform with modern personal computers. The developed
Excel program offers a wide range of potentially useful
tools for multiple correlation analysis and concurrently
accounts for some prominent statistical notions that were
not found in the existing routines.
This research was partially supported by National Science Council
Grant NSC-95-2416-H-009-031-MY2. The authors thank the referees for
several valuable comments. The complete set of numerical verifications
for the accuracy of the RHO-SQUARE program with other published
results is available from the first author at . The
program is also available at no cost to interested researchers upon request.
It is hoped that the proposed multiple correlation analysis software will
facilitate pedagogical presentation in academic curriculum and practical
application in psychological research. Correspondence concerning this
article should be addressed to G. Shieh, Department of Management
Science, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu,
Taiwan 30050, Taiwan (e-mail: ).