1 Department of Biostatistics and Epidemiology and Department of Diagnostic Radiology, The Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195.
Received January 24, 2000;
revised February 25, 2000;
Address correspondence to N. A. Obuchowski.
Abstract
MATERIALS AND METHODS. I computed the number of patients and observers needed as a function of five parameters: the measure of diagnostic accuracy (area under the ROC curve, sensitivity at a false-positive rate ≤ 0.10, or specificity at a false-negative rate ≤ 0.10), conjectured level of accuracy, suspected difference in accuracy between the two imaging techniques, observer variability, and ratio of patients without to patients with the condition.
RESULTS. The numbers of patients and observers required vary dramatically with these five parameters, increasing with more refined measures of accuracy, with lower accuracy levels, with smaller suspected differences, with greater observer variability, and with less balanced designs. The number of patients required for a study can be reduced by increasing the number of observers, and vice versa. When the intra- and interobserver variability is large, a study design with just four observers is usually inadequate.
CONCLUSION. Many factors must be considered when determining the appropriate sample sizes for multiobserver ROC studies. My tables serve only as initial ballpark estimates. Investigators should compute sample size using parameters that reflect their clinical application.
A mathematic model of the relationship between the number of observers and number of patients for ROC studies has been previously proposed [1]. I used this model to prepare tables of sample size. The sample sizes were determined for a study with a 5% type I error rate and 80% power. The type I error rate is the probability of wrongly rejecting the null hypothesis (i.e., concluding that the two imaging techniques have different accuracies when, in truth, the accuracies are equal). The power of a study is the probability of correctly rejecting the null hypothesis (i.e., concluding that the two imaging techniques have different accuracies when, in truth, the accuracies are different). Studies with a type I error rate of less than 5% or power of greater than 80% require sample sizes larger than those reported in my tables. The Appendix provides further details about the construction of the tables.
To use my tables, you must answer five questions about the study. First, what measure of diagnostic accuracy will be used to characterize the observers' performance interpreting the images? For my tables, I consider three measures of diagnostic accuracy: the area under the ROC curve, the sensitivity at a false-positive rate less than or equal to 0.10, and the specificity at a false-negative rate less than or equal to 0.10. The area under the ROC curve is the most commonly used measure of accuracy. It is a good global measure of accuracy [2]; however, for particular clinical applications, we are usually interested in only a certain portion of the ROC curve [3]. For example, in breast cancer screening, in which false-negative test results have serious consequences, we demand high sensitivity; therefore, I provide sample size estimates for the specificity at a false-negative rate of less than or equal to 0.10 (sensitivity fixed at ≥ 0.90). However, for detecting asymptomatic cerebral aneurysms, when the treatment that follows positive test results (i.e., true-positives and false-positives) is very risky, we demand high specificity. For these studies, I provide sample size estimates for the sensitivity at a false-positive rate less than or equal to 0.10 (specificity ≥ 0.90).
Second, what is the conjectured accuracy of the observers with these imaging techniques? A synthesis of the relevant literature can often provide a reasonable estimate of the level of accuracy to expect. For my tables, I considered two levels of accuracy: high and moderate. For the area under the ROC curve, I set high accuracy at an area of 0.90 and moderate accuracy at 0.75. For the sensitivity at a fixed false-positive rate of less than or equal to 0.10, I define high accuracy as a sensitivity of 0.80 at a false-positive rate of 0.10 and moderate accuracy as a sensitivity of 0.60 at a false-positive rate of 0.10. Similarly, for specificity at a fixed false-negative rate of less than or equal to 0.10, I define high accuracy as a specificity of 0.80 at a false-negative rate of 0.10 and moderate accuracy as a specificity of 0.60 at a false-negative rate of 0.10.
Third, what is the suspected difference in accuracy between the two imaging techniques? Again, a synthesis of the relevant literature or a pilot study can help determine the difference to expect. For my tables, I considered three levels of suspected difference: small, moderate, and large. A small difference is 0.05 (an absolute difference in the areas under the ROC curve of 0.05 or an absolute difference in sensitivities [specificities] of 0.05 at a false-positive rate [false-negative rate] of 0.10), a moderate difference is 0.10, and a large difference is 0.15.
Fourth, what is the relative frequency of patients with and without the condition in the study sample? The relative frequency in the sample may not reflect the prevalence of the condition in the population. Especially for rare conditions, investigators often use sampling approaches that provide a more balanced number of patients with and without the condition. For my tables, I define R as the ratio of the number of patients without the condition to the number of patients with the condition in the study sample. I consider values for R of 1/1 (balanced design), 2/1, and 4/1.
Finally, what is the variability within (intra-) and between (inter-) the observers? Observer variability probably depends on many factors, including the condition of interest, the setting of image interpretations, and the heterogeneity in the observers' training and experience. The easiest way to quantify observer variability is in terms of the range of differences in accuracy. For example, if we expect that for a common set of images, observers' accuracies will differ by about 0.10 (using the area under the ROC curve, the lowest area of any observer in our sample might be 0.85 and the highest area of any observer in our sample might be 0.95), then we can derive an estimate of variance (Appendix). In Table 1, I set up three levels of observer variability for my sample size tables: small, moderate, and large. The interobserver ranges are based on a review article by Rockette et al. [4] in which the observer variability was estimated and reported from 49 studies. My small range corresponds with the smallest variability reported by Rockette et al.; my large range corresponds with the largest variability reported by Rockette et al.; and my moderate range is a convenient midpoint. Rockette et al. do not report values of the intraobserver variability because of the paucity of available data. In my experience, the intraobserver variability can be quite large. I arbitrarily set the intraobserver variability at half the interobserver variability.
In the following text, I illustrate the use of my tables with two examples. First, suppose we are designing a study to compare the accuracies of MR imaging and CT for the detection of cerebral aneurysms, and we choose as our measure of accuracy the sensitivity at a false-positive rate of less than or equal to 0.10. We anticipate the two imaging techniques to have high accuracy (sensitivities near 0.80 at a false-positive rate of 0.10). We would like to be able to detect a difference in sensitivities of 0.10 or more at a false-positive rate of 0.10. We anticipate that there will be four times more patients without an aneurysm than with an aneurysm in our sample, and we expect large observer variability (sensitivities of all the observers at a false-positive rate of 0.10 may differ by as much as 0.10; and the sensitivities at a false-positive rate of 0.10 by a single observer on two different interpreting occasions may differ by as much as 0.05). Referring to Table 3, a study design with 353 patients (71 with an aneurysm and 282 without) and six observers or 123 patients (25 with an aneurysm and 98 without) and 10 observers would be adequate. Note that a study design with only four observers would be inadequate for this example because of the expected large observer variability.
For my second example, suppose we use the tables as an aid to evaluate the adequacy of the sample sizes of an article. Consider the study by Powell et al. [5] that compared the diagnostic accuracies of film-screen radiography and digitized mammograms. (Note that Powell et al. were testing if the two display modes were equivalent, not if one was superior; the statistics are different, but we can still use this study as an example.) Seven observers interpreted 60 cases using both display formats. The study included 30 patients with at least one malignant lesion and 30 patients without a malignant lesion. Each breast was divided into five regions, and each region of each breast was interpreted by the observers. One of the measurements of accuracy was the area under the ROC curve. The average area under the ROC curve of the seven observers was 0.86. The range of areas under the ROC curve for the seven observers was 0.107 (0.827-0.934) for film and 0.098 (0.792-0.890) for digitized images. Five observers interpreted the images in two separate interpreting sessions. The average range of the difference in the areas under the ROC curve from these two interpretations was 0.05 (the absolute values of the differences for the five observers were 0.072, 0.047, 0.089, 0.011, and 0.029). Refer to Table 2 for the area under the ROC curve and look under high accuracy (because the average area under the ROC curve was nearly 0.90) and large observer variability (because the inter- and intraobserver variabilities were 0.10 and 0.05, respectively). Our particular observer sample size of seven is not included in the table; nor is the situation of multiple regions from the same patient considered in the table. However, we can interpolate that a moderate to large difference in the areas under the ROC curve can be detected with this study design.
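The arithmetic that places the Powell et al. study in the "large observer variability" category can be checked directly. The short Python sketch below uses only the figures quoted in the preceding paragraph; the variable names are illustrative and not part of the original article.

```python
from statistics import mean

# Areas under the ROC curve quoted in the text for Powell et al. [5].
film_low, film_high = 0.827, 0.934
digital_low, digital_high = 0.792, 0.890

# Absolute differences in area between two interpreting sessions
# for the five observers who read the images twice.
session_differences = [0.072, 0.047, 0.089, 0.011, 0.029]

# Interobserver range: highest minus lowest area among the seven observers.
inter_range_film = film_high - film_low
inter_range_digital = digital_high - digital_low

# Intraobserver range: average absolute session-to-session difference.
intra_range = mean(session_differences)

print(f"interobserver range (film):      {inter_range_film:.3f}")
print(f"interobserver range (digitized): {inter_range_digital:.3f}")
print(f"average intraobserver range:     {intra_range:.3f}")
```

Both interobserver ranges are close to 0.10 and the average intraobserver range is close to 0.05, which is why the example reads the table under large observer variability.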
Other study design issues also affect sample size. In many applications, there is more than one unit per patient: for example, five regions from each of two breasts in evaluating screening mammography [5], or two or more lesions in the same patient. These units from the same patient are not statistically independent and cannot be analyzed as independent observations. My tables of sample size assume one unit per patient. When there are multiple units per patient, the sample sizes I presented will usually overestimate the number of patients required. However, it would not be appropriate to assume that the number of units needed is equivalent to the number of patients estimated in my tables; this strategy would result in a study with inadequate power.
Another issue to consider in sample size estimation is the plan for subgroup analyses. Often, investigators want to report accuracy and the comparative accuracies of the two imaging techniques for particular subgroups (subgroups categorized by gender, age, or symptoms). For these analyses, sample sizes larger than those reported here are generally required.
My study of sample size for multiobserver ROC studies has several limitations. First, the sample sizes are derived from one mathematic model [1] of the relationship among accuracy, observer variability, patient variability, and the correlation in accuracy imposed by study design. Other more sophisticated models exist [6, 7], but no associated algorithms are available for calculating sample size. A study by Rockette et al. [4] suggested that for sample size estimation, the simpler model used here is adequate. Second, for simplicity, I made several assumptions about the distributions of the unobserved test results (i.e., binormality with equal variances) (Appendix). These assumptions may not be applicable for all studies; therefore, I encourage readers to assess these assumptions for their particular studies. Finally, I have focused on only one issue of study design: sample size. It is important to recognize that an adequate sample size cannot compensate for systematic biases that are common in our literature [8,9,10,11,12,13]. Therefore, it is critical that we sample patients and observers carefully for our studies, that we identify and apply appropriate procedures for determining the true diagnoses, and that images are interpreted in a blinded fashion.
APPENDIX: Details Regarding the Derivation of Tables 2, 3, and 4
Let $\theta_1$ and $\theta_2$ denote the accuracies of the two imaging techniques. The hypotheses are

$$H_0\colon \theta_1 = \theta_2 \quad \text{versus} \quad H_1\colon \theta_1 \neq \theta_2 \tag{1}$$

Under the alternative, $|\theta_1 - \theta_2| = \Delta$, where $\Delta$ is the suspected difference. Obuchowski and Rockette [14] proposed a modified F statistic to test the null hypothesis in equation 1. The noncentrality parameter of the F statistic, denoted $\lambda$, can be used to derive sample size estimates [1]. The parameter $\lambda$ is a function of three variances, four correlations, the number of observers (indicated by J), and $\Delta$,

$$\lambda = \frac{J\Delta^2}{2\left\{\sigma_b^2(1 - r_b) + \sigma_w^2/K + \sigma_c^2\left[(1 - r_1) + (J - 1)(r_2 - r_3)\right]\right\}} \tag{2}$$

where $\sigma_b^2$ is the variability in observers' accuracies when using the same imaging technique for the same sample of patients (interobserver variability), $\sigma_w^2$ is the variability in a single observer's accuracies when using the same imaging technique for the same sample of patients on different interpreting occasions (intraobserver variability), $\sigma_c^2$ is the variability between samples of patients and is a function of the number of patients in the sample, $r_1$ is the correlation between accuracies estimated from the same sample of patients by the same observer using different imaging techniques, $r_2$ is the correlation between accuracies estimated from the same sample of patients by different observers using the same imaging technique, $r_3$ is the correlation between accuracies estimated from the same sample of patients by different observers using different imaging techniques, $r_b$ is the correlation between accuracies obtained when a set of observers examines the same sample of patients using different diagnostic tests, and K is the number of times each observer interprets the same images from the same imaging technique.
For the tables in this paper, I used three different levels of $\sigma_b^2$ and $\sigma_w^2$. I estimated $\sigma_b^2$ and $\sigma_w^2$ from the ranges in Table 1 by assuming that observers' accuracies follow a normal distribution. For example, for a study design with 10 observers and a large variability, $\hat{\sigma}_b$ = (range) × (constant derived from the normal distribution [15]) = (0.10) × (0.3249) = 0.0325. From two interpreting sessions by the same observer, $\hat{\sigma}_w$ = (range) × (constant derived from the normal distribution [15]) = (0.05) × (0.8862) = 0.0443. I assumed that K = 1 because in most multiobserver investigations, the images from each imaging technique are interpreted only once by each observer.
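The range-to-standard-deviation conversion above can be sketched in a few lines of Python. The constants 0.3249 and 0.8862 quoted in the text equal the reciprocals of the standard tabulated expected range of 10 and 2 normal observations (3.078 and 1.128 standard deviations). Extending the same table to other numbers of observers is an assumption of this sketch, as are the names `EXPECTED_RANGE` and `sd_from_range`.

```python
# Expected range of n independent normal observations, in standard
# deviation units (standard tabulated values; using them for every n
# is an assumption of this sketch, not something stated in the text).
EXPECTED_RANGE = {
    2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534,
    7: 2.704, 8: 2.847, 9: 2.970, 10: 3.078,
}


def sd_from_range(value_range, n):
    """Estimate a standard deviation from the range of n normal values."""
    return value_range / EXPECTED_RANGE[n]


# Large interobserver variability with 10 observers (range 0.10).
sd_between = sd_from_range(0.10, 10)
# Intraobserver variability from two interpreting sessions (range 0.05).
sd_within = sd_from_range(0.05, 2)

print(f"interobserver SD: {sd_between:.4f}")
print(f"intraobserver SD: {sd_within:.4f}")
```

Squaring these standard deviations gives the variance terms that enter equation 2.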
For the correlations $r_1$, $r_2$, $r_3$, and $r_b$, I used values from the review article by Rockette et al. [4]. Specifically, I set $r_1$ equal to the average correlation of all the studies reviewed by Rockette et al. (0.47). Note that the range of values of $r_1$ was 0.35-0.59. The average value of $r_2$ minus $r_3$ was -0.0011. Rockette et al. recommend using zero for sample size estimation, and this is the value I used. For $r_b$, Rockette et al. recommend a value of 0.8 for sample-size estimation on the basis of data from two large studies; I used this value for my tables.
Note that these variances and correlations were estimated by Rockette et al. [4] for the area under the ROC curve and not for the partial area under the ROC curve. For my tables, I assumed that these variances and correlation values are also reasonable when the partial area under the ROC curve is used.
For different values of J, I used the function PROBF in SAS software (SAS Institute, Cary, NC) to determine the value of the noncentrality parameter that would provide 80% power with a 5% type I error rate. These values of $\lambda$ are 18.12, 12.36, and 9.92, for J equal to 4, 6, and 10, respectively. I substituted the above values of $\sigma_b^2$, $\sigma_w^2$, $r_1$, $r_2$, $r_3$, and $r_b$ into equation 2. Then for the different combinations of J and $\Delta$, I solved for $\sigma_c^2$. The value of $\sigma_c^2$, when positive, represents the maximum patient variance permitted for the study design. Because $\sigma_c^2$ is a function of the number of patients, I was able to calculate the number of patients needed for each combination of J and $\Delta$. When the estimate of $\sigma_c^2$ was negative or almost zero, I concluded that the number of observers was inadequate for the study design.
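The noncentrality values quoted above (18.12, 12.36, and 9.92) can be approximated without SAS. The sketch below is not the PROBF call itself: it assumes the F statistic has 1 numerator and J - 1 denominator degrees of freedom (an assumption of this sketch, not stated in the text), uses the identity that an F(1, df) variable is the square of a t variable, and hard-codes two-sided 5% critical values of t from a standard table. All function and variable names are illustrative.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf

# Two-sided 5% critical values of Student's t from a standard table.
# Their squares are the 5% critical values of F(1, df).
T_CRIT = {3: 3.182, 5: 2.571, 9: 2.262}


def power_f1(lam, df, steps=2000, s_max=6.0):
    """Power of a 5%-level F(1, df) test with noncentrality lam.

    Writes F = T^2 with T = (Z + delta) / S, delta = sqrt(lam) and
    S = sqrt(chi2_df / df), then integrates the rejection probability
    over the density of S with the midpoint rule.
    """
    delta = math.sqrt(lam)
    crit = T_CRIT[df]
    log_const = (math.log(2.0) + (df / 2.0) * math.log(df / 2.0)
                 - math.lgamma(df / 2.0))
    h = s_max / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        density = math.exp(log_const + (df - 1) * math.log(s)
                           - df * s * s / 2.0)
        # P(|Z + delta| > crit * s) for a standard normal Z.
        reject = PHI(delta - crit * s) + PHI(-delta - crit * s)
        total += reject * density
    return total * h


def noncentrality_for_power(n_observers, power=0.80):
    """Bisection search for the noncentrality giving the target power."""
    df = n_observers - 1  # assumed denominator degrees of freedom
    low, high = 0.0, 100.0
    for _ in range(40):
        mid = (low + high) / 2.0
        if power_f1(mid, df) < power:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


for observers in (4, 6, 10):
    print(observers, round(noncentrality_for_power(observers), 2))
```

The lookup table covers only the three observer counts used in the tables (J = 4, 6, and 10); other values of J would need their own critical values.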
For the area under the ROC curve summary measure, an estimate of $\sigma_c^2$ for sample size estimation is [16]

$$\hat{\sigma}_c^2 = \frac{0.0099\, e^{-A^2/2}\left[(5A^2 + 8) + (A^2 + 8)/R\right]}{N} \tag{3}$$

where $A = \Phi^{-1}(\theta) \times 1.414$, $\theta$ is the anticipated area under the ROC curve, $\Phi^{-1}$ is the inverse of the cumulative normal distribution function, R is the ratio of sample sizes of patients without the condition to with the condition, and N is the number of patients with the condition. Substituting into equation 3 the anticipated value of the area under the ROC curve (0.75 or 0.90), the value of R (1.0, 2.0, or 4.0), and the maximum patient variance calculated from equation 2, the number of patients needed with the condition can be determined. Then the total patient sample size needed is N (1 + R).
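The two-step calculation described above (solve equation 2 for the maximum patient variance, then solve equation 3 for N) can be sketched as follows. Because the equation images are not available in this copy, the sketch assumes the commonly published forms: equation 2 as λ = JΔ² / (2[σ_b²(1 - r_b) + σ_w²/K + σ_c²((1 - r_1) + (J - 1)(r_2 - r_3))]) and equation 3 as σ_c² = 0.0099 exp(-A²/2) [(5A² + 8) + (A² + 8)/R] / N. Both forms, the rounding-up rule, and all function names are assumptions of this sketch, so its output should not be read as a reproduction of the published tables.

```python
import math
from statistics import NormalDist


def max_patient_variance(n_observers, lam, delta, sd_between, sd_within,
                         r1=0.47, r2_minus_r3=0.0, rb=0.8, k=1):
    """Solve the assumed form of equation 2 for the patient variance.

    Returns None when the result is not positive, which the text
    interprets as too few observers for the design.
    """
    budget = n_observers * delta ** 2 / (2.0 * lam)
    observer_part = sd_between ** 2 * (1.0 - rb) + sd_within ** 2 / k
    patient_weight = (1.0 - r1) + (n_observers - 1) * r2_minus_r3
    variance = (budget - observer_part) / patient_weight
    return variance if variance > 0 else None


def auc_variance_function(auc, ratio):
    """Numerator of the assumed form of equation 3 (variance times N)."""
    a = NormalDist().inv_cdf(auc) * 1.414
    return 0.0099 * math.exp(-a * a / 2.0) * (
        (5 * a * a + 8) + (a * a + 8) / ratio)


def patients_needed(auc, ratio, patient_variance):
    """Patients with the condition, and total patients N(1 + R).

    Rounding up to whole patients is an assumption of this sketch.
    """
    n_with = math.ceil(auc_variance_function(auc, ratio) / patient_variance)
    return n_with, math.ceil(n_with * (1 + ratio))


# Ten observers, large variability (SDs 0.0325 and 0.0443 from the
# worked example), difference 0.10, high accuracy, balanced design.
variance = max_patient_variance(10, 9.92, 0.10, 0.0325, 0.0443)
print(patients_needed(0.90, 1.0, variance))

# Four observers with a small difference leave no room for patient
# variance, so the design is flagged as inadequate.
print(max_patient_variance(4, 18.12, 0.05, 0.0325, 0.0443))
```

Keeping the two steps as separate functions mirrors the text: the first depends only on the observer-level inputs, and the second only on the accuracy measure and the ratio R.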
For the partial area under the ROC curve in the false-positive rate range from 0.0 to 0.10 (or for the partial area under the ROC curve in the false-negative rate range from 0.0 to 0.10), we first need to convert the values of $\Delta$ and the values for the interobserver and intraobserver ranges given in Table 1 into values for the partial area under the ROC curve. (Note that these values have been defined in terms of the sensitivity [specificity] at a false-positive rate [false-negative rate] of 0.10.) To do this we need to specify an ROC curve among all possible ROC curves that pass through the point (false-positive rate = 0.10; sensitivity = 0.60 [or 0.80]). We assume that the test results of patients with and without the condition follow a binormal distribution with equal variance (binormal parameter b = 1.0). Therefore, for sensitivities of 0.60 and 0.80 at a false-positive rate of 0.10, the corresponding binormal parameter a's (the standardized difference in the means of the two distributions) are 1.535 and 2.120, and the corresponding partial areas under the curves are 0.0424 and 0.0637, respectively. For a suspected difference between the two imaging techniques of 0.05 (in terms of sensitivity at a false-positive rate of 0.10), we can define two ROC curves with parameter a's of 1.47 and 1.60, which correspond to sensitivities of 0.575 and 0.625 and partial areas of 0.0400 and 0.0447, respectively. Therefore, I translated a difference of 0.05 in terms of sensitivities at a false-positive rate of 0.10 into a difference in terms of the partial areas of 0.0047. I performed this transformation for each of the values of $\Delta$ and the values for the interobserver and intraobserver ranges given in Table 1. I used these transformed values in equation 2 to determine the maximum patient variance for each study design.
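The conversion from a sensitivity at a false-positive rate of 0.10 to a partial area can be illustrated numerically. The sketch below assumes the equal-variance binormal curve described above, sensitivity(t) = Φ(a + Φ⁻¹(t)), and integrates it over false-positive rates from 0 to 0.10 with a simple midpoint rule. The function names and the numerical method are choices made for this sketch; the article does not say how its partial areas were computed.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_a(sensitivity, fpr=0.10):
    """Binormal parameter a (with b = 1) for the ROC curve through
    the point (fpr, sensitivity)."""
    return STD_NORMAL.inv_cdf(sensitivity) - STD_NORMAL.inv_cdf(fpr)


def partial_area(a, fpr_max=0.10, steps=20000):
    """Area under sensitivity(t) = Phi(a + Phi^-1(t)) for t in [0, fpr_max].

    Substituting x = Phi^-1(t) turns the area into the integral of
    Phi(a + x) * phi(x) over x up to Phi^-1(fpr_max).
    """
    upper = STD_NORMAL.inv_cdf(fpr_max)
    lower = -10.0  # the integrand is negligible below this point
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h
        total += STD_NORMAL.cdf(a + x) * STD_NORMAL.pdf(x)
    return total * h


for sens in (0.60, 0.80):
    a = binormal_a(sens)
    print(f"sensitivity {sens:.2f}: a = {a:.3f}, "
          f"partial area = {partial_area(a):.4f}")

# A 0.05 difference in sensitivity centered on 0.60.
difference = (partial_area(binormal_a(0.625))
              - partial_area(binormal_a(0.575)))
print(f"difference in partial areas: {difference:.4f}")
```

The same two functions can be applied to the interobserver and intraobserver ranges, by converting the sensitivities at each end of the range and taking the difference of the resulting partial areas.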
Then an estimate of $\sigma_c^2$ for the partial areas used for sample-size estimation is given by equation 4 [17].

[Equation 4: image not available in this copy] (4)