Estimation of the simple correlation coefficient

Behavior Research Methods, Nov 2010

This article investigates some unfamiliar properties of the Pearson product-moment correlation coefficient for the estimation of the simple correlation coefficient. Although Pearson’s r is biased, except for limited situations, and the minimum variance unbiased estimator has been proposed in the literature, researchers routinely employ the sample correlation coefficient in their practical applications, because of its simplicity and popularity. In order to support such practice, this study examines the mean squared errors of r and several prominent formulas. The results reveal specific situations in which the sample correlation coefficient performs better than the unbiased and nearly unbiased estimators, facilitating recommendation of r as an effect size index for the strength of linear association between two variables. In addition, related issues of estimating the squared simple correlation coefficient are also considered.


GWOWEN SHIEH
National Chiao Tung University, Hsinchu, Taiwan

[...] with sample means $\bar{X}$ and $\bar{Y}$ [...] Alternatively, exact inferential procedures are available, and interested readers are referred to a recent article by Shieh (2006). Here, we focus on the point estimation problem of $\rho$ under the ultimate notion of choosing a profound correlational effect size measure for the strength of association between the two variables X and Y.
It can be seen from Equation A3 that r is a biased estimator of $\rho$, and the mean and variance of r can be approximated by E[r] ≈ [...] and Var[r] ≈ [...].

Table 1
Bias for Estimators of ρ With N = 20
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r     .000000 .001295 .002573 .003816 .005006 .006126 .007156 .008080 .008876 .009526 .010007 .010300 .010379 .010222 .009801 .009090 .008058 .006671 .004894 .002686 .000578
OP1   .000000 .000138 .000272 .000400 .000518 .000623 .000711 .000781 .000830 .000855 .000856 .000832 .000783 .000709 .000614 .000500 .000375 .000247 .000128 .000038 .000002
OPA   .000000 .000007 .000015 .000027 .000043 .000065 .000095 .000131 .000176 .000228 .000287 .000351 .000417 .000480 .000535 .000574 .000586 .000557 .000468 .000293 .000070
OP2   .000000 .000025 .000049 .000071 .000091 .000108 .000121 .000130 .000134 .000134 .000128 .000119 .000105 .000089 .000070 .000051 .000033 .000018 .000007 .000001 .000000
OP5   .000000 .000001 .000001 .000002 .000002 .000003 .000003 .000003 .000003 .000002 .000002 .000002 .000001 .000001 .000001 .000000 .000000 .000000 .000000 .000000 .000000
M     .000000 .002386 .004742 .007038 .009243 .011326 .013254 .014994 .016511 .017768 .018725 .019341 .019569 .019361 .018660 .017407 .015531 .012955 .009584 .005311 .001151
Note. Entries are magnitudes; minus signs are not preserved in this text.

Note that r and $\hat{\rho}_M$ are asymptotically equivalent and yield similar estimation performance in finite samples. In addition to the unbiasedness consideration, MSE is another useful performance criterion obtained by incorporating the bias (accuracy) and variability (precision) of an estimator.
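The two ingredients of MSE named here, bias and variability, can be illustrated by simulation. The following is a minimal Python sketch, not the article's exact computation: it assumes bivariate normal data, and the function names `sample_r` and `bias_var_mse`, the seed, and the replication count are illustrative choices. It estimates the bias, variance, and MSE of r by Monte Carlo and lets one check that the simulated MSE equals the squared bias plus the variance.

```python
import math
import random


def sample_r(rho, n, rng):
    """Draw one Pearson r from n bivariate normal pairs with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        # y has unit variance and correlation rho with x
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bias_var_mse(rho, n, reps=20000, seed=1):
    """Monte Carlo estimates of the bias, variance, and MSE of r."""
    rng = random.Random(seed)
    rs = [sample_r(rho, n, rng) for _ in range(reps)]
    mean_r = sum(rs) / reps
    bias = mean_r - rho
    var = sum((r - mean_r) ** 2 for r in rs) / reps
    mse = sum((r - rho) ** 2 for r in rs) / reps
    return bias, var, mse


bias, var, mse = bias_var_mse(rho=0.5, n=20)
print(f"bias={bias:.5f}  var={var:.5f}  mse={mse:.5f}  bias^2+var={bias**2 + var:.5f}")
```

Because the variance is computed around the simulated mean with divisor `reps`, the identity MSE = Bias² + Var holds up to rounding in the simulated values as well.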
Specifically, the MSE of an estimator $\hat{\rho}$ of $\rho$ is the function

$$\mathrm{MSE}(\hat{\rho}, \rho) = E[(\hat{\rho} - \rho)^2] = \{\mathrm{Bias}(\hat{\rho}, \rho)\}^2 + \mathrm{Var}[\hat{\rho}].$$

Table 2
Bias for Estimators of ρ With N = 50
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r     .000000 .000506 .001005 .001490 .001953 .002386 .002782 .003135 .003435 .003676 .003850 .003948 .003962 .003884 .003705 .003417 .003011 .002475 .001802 .000981 .000209
OP1   .000000 .000022 .000044 .000064 .000083 .000099 .000113 .000123 .000129 .000132 .000131 .000126 .000116 .000104 .000088 .000070 .000052 .000033 .000017 .000005 .000000
OPA   .000000 .000001 .000002 .000002 .000002 .000000 .000003 .000008 .000014 .000022 .000031 .000041 .000051 .000061 .000069 .000075 .000077 .000073 .000061 .000038 .000009
OP2   .000000 .000002 .000003 .000005 .000006 .000007 .000008 .000009 .000009 .000009 .000008 .000007 .000006 .000005 .000004 .000003 .000002 .000001 .000000 .000000 .000000
OP5   .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000
M     .000000 .000980 .001945 .002883 .003780 .004621 .005393 .006082 .006672 .007150 .007498 .007702 .007744 .007607 .007274 .006725 .005940 .004899 .003578 .001954 .000418

Table 3
Bias for Estimators of ρ With N = 100
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r     .000000 .000251 .000499 .000739 .000968 .001182 .001378 .001551 .001699 .001816 .001900 .001946 .001950 .001909 .001818 .001674 .001472 .001208 .000878 .000476 .000102
OP1   .000000 .000006 .000011 .000016 .000021 .000025 .000028 .000031 .000032 .000033 .000032 .000031 .000028 .000025 .000021 .000017 .000012 .000008 .000004 .000001 .000000
OPA   .000000 .000000 .000001 .000001 .000001 .000001 .000000 .000001 .000003 .000005 .000007 .000009 .000012 .000014 .000016 .000018 .000018 .000017 .000014 .000009 .000002
OP2   .000000 .000000 .000000 .000001 .000001 .000001 .000001 .000001 .000001 .000001 .000001 .000001 .000001 .000001 .000000 .000000 .000000 .000000 .000000 .000000 .000000
OP5   .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000
M     .000000 .000496 .000984 .001456 .001906 .002327 .002713 .003056 .003348 .003582 .003750 .003844 .003856 .003779 .003604 .003322 .002925 .002404 .001749 .000951 .000203
Note. In Tables 2 and 3, entries are magnitudes; minus signs are not preserved in this text.

[...] estimator is better only for certain combined configurations of $\rho$ and N. Specifically, when N = 20, the order of MSE among $\hat{\rho}_{OPA}$, $\hat{\rho}_{OP5}$, and r [...]. On the other hand, when N = 50 and 100, the resulting behavior changes at $\rho$ = .15 [...].

Second, it is important to note that the interrelationships between r, $\hat{\rho}_M$, and $\hat{\rho}_{OPA}$ are as follows:

$\mathrm{MSE}(\hat{\rho}_M, \rho) < \mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_{OPA}, \rho)$ for $\rho \le .50$ when N = 20, 50, and 100,
$\mathrm{MSE}(\hat{\rho}_M, \rho) < \mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_{OPA}, \rho)$ for $\rho = .55$ when N = 20,
$\mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_M, \rho) < \mathrm{MSE}(\hat{\rho}_{OPA}, \rho)$ for $\rho = .55$ when N = 50,
$\mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_{OPA}, \rho) < \mathrm{MSE}(\hat{\rho}_M, \rho)$ for $\rho = .55$ when N = 100,
$\mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_{OPA}, \rho) < \mathrm{MSE}(\hat{\rho}_M, \rho)$ for $\rho = .60$ when N = 20,
$\mathrm{MSE}(\hat{\rho}_{OPA}, \rho) < \mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_M, \rho)$ for $\rho = .60$ when N = 50 and 100,
$\mathrm{MSE}(\hat{\rho}_{OPA}, \rho) < \mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_M, \rho)$ for $\rho > .60$ when N = 20, 50, and 100.

The corresponding situations among r, $\hat{\rho}_M$, and $\hat{\rho}_{OP5}$ are identical to those of r, $\hat{\rho}_M$, and $\hat{\rho}_{OPA}$ just described, with the only exception in the case of $\mathrm{MSE}(\hat{\rho}_{OP5}, \rho) < \mathrm{MSE}(r, \rho) < \mathrm{MSE}(\hat{\rho}_M, \rho)$ for $\rho = .60$ when N = 20.

Table 4
Root-Mean Squared Error for Estimators of ρ With N = 20
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r     .229416 .228920 .227431 .224944 .221449 .216933 .211379 .204764 .197064 .188245 .178271 .167097 .154672 .140935 .125816 .109230 .091078 .071242 .049578 .025906 .005373
OP1   .234879 .234337 .232711 .229997 .226188 .221277 .215250 .208096 .199795 .190329 .179675 .167805 .154690 .140296 .124584 .107513 .089034 .069099 .047651 .024635 .005059
OPA   .235562 .235015 .233373 .230632 .226787 .221830 .215749 .208531 .200161 .190619 .179884 .167932 .154735 .140262 .124478 .107346 .088825 .068872 .047443 .024496 .005024
OP2   .235414 .234865 .233219 .230473 .226620 .221653 .215564 .208338 .199963 .190421 .179691 .167751 .154575 .140132 .124389 .107307 .088842 .068943 .047552 .024600 .005057
OP5   .235529 .234978 .233327 .230570 .226705 .221723 .215616 .208372 .199978 .190418 .179672 .167720 .154534 .140087 .124344 .107267 .088810 .068923 .047543 .024598 .005057
M     .224268 .223820 .222474 .220223 .217053 .212948 .207883 .201828 .194748 .186597 .177324 .166866 .155147 .142078 .127555 .111447 .093598 .073816 .051855 .027396 .005741

Table 5
Root-Mean Squared Error for Estimators of ρ With N = 50
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r     .142857 .142520 .141508 .139820 .137453 .134402 .130664 .126231 .121097 .115253 .108688 .101391 .093349 .084547 .074969 .064597 .053409 .041382 .028491 .014708 .003017
OP1   .144258 .143907 .142855 .141100 .138642 .135477 .131604 .127019 .121718 .115697 .108950 .101472 .093256 .084295 .074582 .064108 .052865 .040842 .028031 .014421 .002949
OPA   .144319 .143967 .142914 .141156 .138694 .135525 .131646 .127055 .121747 .115718 .108964 .101478 .093255 .084288 .074569 .064091 .052845 .040822 .028014 .014410 .002947
OP2   .144317 .143966 .142911 .141152 .138688 .135516 .131635 .127041 .121731 .115700 .108945 .101460 .093238 .084274 .074559 .064086 .052846 .040828 .028023 .014418 .002949
OP5   .144322 .143971 .142915 .141156 .138691 .135519 .131636 .127042 .121731 .115700 .108944 .101458 .093236 .084272 .074557 .064084 .052845 .040828 .028023 .014418 .002949
M     .141489 .141167 .140200 .138584 .136315 .133388 .129795 .125527 .120572 .114917 .108547 .101444 .093586 .084951 .075513 .065240 .054099 .042052 .029054 .015056 .003099

Table 6
Root-Mean Squared Error for Estimators of ρ With N = 100
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r     .100504 .100260 .099527 .098305 .096594 .094390 .091694 .088501 .084810 .080618 .075921 .070714 .064993 .058754 .051990 .044695 .036862 .028485 .019555 .010063 .002059
OP1   .101001 .100752 .100005 .098758 .097013 .094767 .092021 .088773 .085021 .080765 .076002 .070731 .064949 .058655 .051845 .044517 .036669 .028296 .019396 .009965 .002036
OPA   .101012 .100762 .100015 .098768 .097022 .094775 .092028 .088779 .085026 .080768 .076004 .070732 .064949 .058653 .051843 .044514 .036665 .028293 .019393 .009963 .002036
OP2   .101012 .100763 .100015 .098768 .097021 .094774 .092026 .088776 .085023 .080765 .076001 .070728 .064946 .058651 .051841 .044513 .036665 .028294 .019395 .009965 .002036
OP5   .101013 .100763 .100015 .098768 .097021 .094774 .092026 .088776 .085023 .080765 .076001 .070728 .064946 .058651 .051841 .044513 .036665 .028294 .019395 .009965 .002036
M     .100010 .099773 .099059 .097866 .096191 .094032 .091389 .088258 .084633 .080510 .075884 .070746 .065091 .058910 .052193 .044930 .037111 .028722 .019751 .010183 .002086

[...] Simple Correlation Coefficient

[...] for certain values of $\rho$ and N. In this case, the so-called adjusted $R^2$ formula reduces to

$$\hat{\rho}_E^2 = 1 - \frac{(N - 1)(1 - r^2)}{N - 2}.$$

Furthermore, simplified approximations of $\hat{\rho}_{OP1}^2$, $\hat{\rho}_{OP2}^2$, and $\hat{\rho}_{OP5}^2$ can be obtained from the expansion of $U^2$ [...].
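For a single predictor, the usual adjusted R² formula becomes 1 - (N - 1)(1 - r²)/(N - 2). A minimal Python sketch of that reduction follows; the function name `adjusted_r2` and the example values of r and N are illustrative assumptions, not taken from the article.

```python
def adjusted_r2(r, n):
    """Adjusted R^2 with one predictor: 1 - (N - 1) / (N - 2) * (1 - r^2)."""
    if n <= 2:
        raise ValueError("n must exceed 2 for the adjustment to be defined")
    return 1.0 - (n - 1) / (n - 2) * (1.0 - r * r)


r, n = 0.30, 20
print(f"r^2 = {r * r:.4f}, adjusted = {adjusted_r2(r, n):.4f}")
```

The adjustment always shrinks r² toward smaller values and can produce a negative estimate when r is close to zero, which is one reason its MSE behaves differently from that of r² across the range of the population correlation.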
To delineate the disparate performance by estimators of $\rho^2$, the exact bias and MSE of $r^2$, $\hat{\rho}_E^2$, $\hat{\rho}_{OP1}^2$, $\hat{\rho}_{PA}^2$, $\hat{\rho}_{OP2}^2$, [...]

Table 7
Bias for Estimators of ρ² With N = 20
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r²    .052632 .052275 .051210 .049455 .047037 .043997 .040387 .036273 .031733 .026863 .021773 .016592 .011469 .006575 .002108 .001704 .004597 .006266 .006351 .004432 .001115
E     .000000 .000238 .000945 .002103 .003683 .005642 .007925 .010462 .013171 .015950 .018684 .021236 .023450 .025144 .026109 .026104 .024853 .022030 .017260 .010094 .002283
OP1   .020050 .019926 .019557 .018949 .018117 .017076 .015851 .014468 .012959 .011360 .009713 .008061 .006451 .004932 .003552 .002357 .001390 .000678 .000234 .000034 .000000
PA    .003212 .003157 .002996 .002733 .002379 .001948 .001456 .000924 .000375 .000163 .000664 .001100 .001440 .001660 .001739 .001662 .001430 .001065 .000620 .000202 .000010
OP2   .005230 .005191 .005073 .004880 .004618 .004295 .003919 .003502 .003057 .002599 .002141 .001699 .001288 .000922 .000611 .000365 .000187 .000075 .000019 .000002 .000000
OP5   .000257 .000254 .000246 .000233 .000215 .000194 .000171 .000145 .000120 .000095 .000072 .000052 .000035 .000021 .000012 .000006 .000002 .000001 .000000 .000000 .000000

Table 8
Bias for Estimators of ρ² With N = 50
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r²    .020408 .020261 .019823 .019103 .018112 .016871 .015404 .013740 .011916 .009975 .007963 .005938 .003960 .002099 .000433 .000953 .001964 .002496 .002432 .001647 .000405
E     .000000 .000098 .000389 .000864 .001510 .002309 .003234 .004255 .005335 .006432 .007496 .008470 .009291 .009888 .010183 .010088 .009505 .008329 .006441 .003712 .000828
OP1   .003201 .003179 .003113 .003005 .002858 .002676 .002463 .002225 .001969 .001702 .001431 .001166 .000913 .000682 .000477 .000307 .000175 .000082 .000027 .000004 .000000
PA    .000543 .000533 .000504 .000457 .000394 .000317 .000231 .000139 .000047 .000042 .000123 .000189 .000238 .000266 .000270 .000249 .000207 .000148 .000083 .000026 .000001
OP2   .000362 .000359 .000350 .000334 .000313 .000288 .000259 .000227 .000194 .000161 .000128 .000098 .000072 .000049 .000031 .000017 .000008 .000003 .000001 .000000 .000000
OP5   .000002 .000002 .000002 .000002 .000001 .000001 .000001 .000001 .000001 .000001 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000

Bias for Estimators of ρ² With N = 100
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r²    .010101 .010027 .009806 .009442 .008943 .008318 .007581 .006746 .005834 .004865 .003864 .002860 .001884 .000970 .000157 .000514 .000996 .001242 .001197 .000804 .000197
E     .000000 .000049 .000196 .000436 .000762 .001163 .001627 .002139 .002678 .003223 .003749 .004228 .004628 .004913 .005045 .004983 .004680 .004086 .003148 .001807 .000402
OP1   .000800 .000794 .000777 .000749 .000711 .000664 .000609 .000548 .000483 .000416 .000348 .000281 .000219 .000162 .000113 .000072 .000040 .000019 .000006 .000001 .000000
PA    .000138 .000135 .000128 .000115 .000099 .000079 .000057 .000034 .000010 .000012 .000032 .000048 .000060 .000066 .000067 .000061 .000050 .000035 .000019 .000006 .000000
OP2   .000047 .000046 .000045 .000043 .000040 .000037 .000033 .000028 .000024 .000020 .000016 .000012 .000009 .000006 .000004 .000002 .000001 .000000 .000000 .000000 .000000
OP5   .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000 .000000
Note. In the three bias tables for ρ², entries are magnitudes; minus signs are not preserved in this text.

For ease of exposition, the following results are summarized for $r^2$, $\hat{\rho}_E^2$, and $\hat{\rho}_{PA}^2$ for all three different sample sizes:

$\mathrm{MSE}(\hat{\rho}_E^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_{PA}^2, \rho^2) < \mathrm{MSE}(r^2, \rho^2)$ for $\rho \le .15$,
$\mathrm{MSE}(\hat{\rho}_E^2, \rho^2) < \mathrm{MSE}(r^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_{PA}^2, \rho^2)$ for $\rho = .20$ and $.25$,
$\mathrm{MSE}(r^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_E^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_{PA}^2, \rho^2)$ for $\rho$ from .30 [...],
$\mathrm{MSE}(r^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_{PA}^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_E^2, \rho^2)$ for $\rho$ from .70 [...],
$\mathrm{MSE}(\hat{\rho}_{PA}^2, \rho^2) < \mathrm{MSE}(r^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_E^2, \rho^2)$ [...].

The relative performance among $r^2$, $\hat{\rho}_E^2$, and $\hat{\rho}_{OP5}^2$ is analogous to the above by replacing $\hat{\rho}_{PA}^2$ with $\hat{\rho}_{OP5}^2$. The only modification is $\mathrm{MSE}(\hat{\rho}_E^2, \rho^2) < \mathrm{MSE}(r^2, \rho^2) < \mathrm{MSE}(\hat{\rho}_{OP5}^2, \rho^2)$ for $\rho = .15$ when N = 20.

Root-Mean Squared Error for Estimators of ρ² With N = 20
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
r²    .086711 .088394 .093190 .100457 .109401 .119261 .129373 .139172 .148164 .155897 .161934 .165831 .167120 .165286 .159751 .149849 .134792 .113632 .085197 .047988 .010581
E     .072739 .075241 .082189 .092322 .104326 .117143 .129978 .142212 .153333 .162879 .170406 .175456 .177544 .176135 .170621 .160303 .144353 .121772 .091326 .051437 .011338
OP1   .078993 .081460 .088333 .098385 .110295 .122965 .135544 .147367 .157880 .166594 .173050 .176797 .177374 .174296 .167046 .155062 .137724 .114349 .084172 .046355 .009997
PA    .078710 .081353 .088684 .099340 .111889 .125166 .138287 .150565 .161434 .170394 .176978 .180728 .181179 .177845 .170214 .157732 .139802 .115775 .084953 .046598 .010008
OP2   .079300 .081944 .089278 .099935 .112478 .125733 .138809 .151013 .161778 .170604 .177027 .180597 .180859 .177344 .169560 .156980 .139033 .115097 .084480 .046410 .009997
OP5   .080321 .083000 .090428 .101212 .113886 .127258 .140420 .152668 .163428 .172197 .178513 .181929 .181999 .178266 .170254 .157452 .139312 .115227 .084519 .046414 .009997

Root-Mean Squared Error for Estimators of ρ² With N = 50 (only the OP5 row survives in this text)
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
OP5   .029750 .032794 .040444 .050330 .060943 .071459 .081373 .090324 .098016 .104183 .108576 .110951 .111067 .108680 .103540 .095393 .083977 .069019 .050237 .027335 .005837

Root-Mean Squared Error for Estimators of ρ² With N = 100 (only the OP5 row survives in this text)
ρ     .00     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50     .55     .60     .65     .70     .75     .80     .85     .90     .95     .99
OP5   .014502 .017543 .024379 .032450 .040695 .048644 .056017 .062602 .068217 .072692 .075864 .077572 .077654 .075949 .072294 .066525 .058474 .047972 .034844 .018914 .004030

[...] of explained variance when a researcher has some basic conceptual idea about $\rho$.

Concluding Remarks

[...] performance and computational ease for estimating $\rho^2$. However, $r^2$ and $\hat{\rho}_E^2$, which are the simplified versions of $R^2$ and adjusted $R^2$, demonstrate their own usefulness in terms of MSE for some subsets of the population correlation coefficient. In view of the use of r across a wide variety of disciplines within the social sciences, the updated consideration of its benefits and costs presented here should be essential to researchers for making sound statistical analysis.

[...] F(a, b; c; x) [...]

Specifically, the exact bias and MSE for an estimator $\hat{\rho} = \hat{\rho}(r)$ of $\rho$ are computed as

$$\mathrm{Bias}(\hat{\rho}, \rho) = \int_{-1}^{1} \{\hat{\rho}(r) - \rho\}\, f(r)\, dr \quad \text{and} \quad \mathrm{MSE}(\hat{\rho}, \rho) = \int_{-1}^{1} \{\hat{\rho}(r) - \rho\}^2 f(r)\, dr,$$

where f(r) is given in Equation A1.
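Similarly, the exact bias and MSE for an estimator $\hat{\rho}^2 = \hat{\rho}^2(r^2)$ of $\rho^2$ can be computed as

$$\mathrm{Bias}(\hat{\rho}^2, \rho^2) = \int_{-1}^{1} \{\hat{\rho}^2(r^2) - \rho^2\}\, f(r)\, dr \quad \text{and} \quad \mathrm{MSE}(\hat{\rho}^2, \rho^2) = \int_{-1}^{1} \{\hat{\rho}^2(r^2) - \rho^2\}^2 f(r)\, dr.$$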
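The integrals above can be evaluated numerically once a density for r is supplied. The sketch below is a hedged illustration rather than the article's own computation: Equation A1 is not reproduced in this text, so the code assumes the standard exact density of r under bivariate normality (Hotelling's hypergeometric form) and N > 4. The helper names `hyp2f1`, `density_r`, `integrate`, and `exact_bias_mse`, and the choice of a composite Simpson rule, are assumptions made for the example.

```python
import math


def hyp2f1(a, b, c, x, tol=1e-16, max_terms=2000):
    """Gauss hypergeometric series; used here only for 0 <= x < 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def density_r(r, rho, n):
    """Assumed exact density of Pearson r for bivariate normal data, n > 4."""
    if abs(r) >= 1.0:
        return 0.0  # the density vanishes at the endpoints when n > 4
    log_const = (math.log(n - 2) + math.lgamma(n - 1)
                 - 0.5 * math.log(2.0 * math.pi) - math.lgamma(n - 0.5))
    log_kernel = (0.5 * (n - 1) * math.log(1.0 - rho ** 2)
                  + 0.5 * (n - 4) * math.log(1.0 - r ** 2)
                  - (n - 1.5) * math.log(1.0 - rho * r))
    series = hyp2f1(0.5, 0.5, n - 0.5, 0.5 * (1.0 + rho * r))
    return math.exp(log_const + log_kernel) * series


def integrate(g, a=-1.0, b=1.0, m=4000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    s = g(a) + g(b)
    for i in range(1, m):
        s += (4 if i % 2 else 2) * g(a + i * h)
    return s * h / 3.0


def exact_bias_mse(estimator, rho, n):
    """Bias and MSE of estimator(r) as a point estimate of rho."""
    bias = integrate(lambda r: (estimator(r) - rho) * density_r(r, rho, n))
    mse = integrate(lambda r: (estimator(r) - rho) ** 2 * density_r(r, rho, n))
    return bias, mse


bias, mse = exact_bias_mse(lambda r: r, rho=0.5, n=20)
print(f"bias = {bias:.6f}, RMSE = {math.sqrt(mse):.6f}")
```

Passing a different function of r as `estimator`, and comparing against the squared target, gives the corresponding quantities for estimators of the squared correlation. The printed values for r can be compared with the r rows of Table 1 and Table 4.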


This is a preview of a remote PDF: http://link.springer.com/content/pdf/10.3758%2FBRM.42.4.906.pdf

Gwowen Shieh. Estimation of the simple correlation coefficient, Behavior Research Methods, 2010, 906-917, DOI: 10.3758/BRM.42.4.906