Original Research
1 Department of Radiology, Imaging Research Division, University of Pittsburgh,
300 Halket St., Ste. 4200, Pittsburgh, PA 15213.
2 Departments of Medicine and Epidemiology and University of Pittsburgh Cancer
Institute, University of Pittsburgh, Pittsburgh, PA 15213.
3 General Electric Global Research Center, One Research Cir., Niskayuna, NY
12309.
Received August 2, 2004;
accepted after revision November 19, 2004.
Supported in part by GE Healthcare, Waukesha, WI, and by grant number P50
CA90440 from the National Cancer Institute, Specialized Program of Research
Excellence (SPORE) in Lung Cancer at the University of Pittsburgh, National
Institutes of Health, Bethesda, MD.
Abstract
MATERIALS AND METHODS. Two hundred ninety-three selected low-dose CT examinations of the lung were independently interpreted by three radiologists to detect and classify pulmonary nodules. The data set selected was enriched with examinations depicting pulmonary nodules. A subset of 30 examinations was interpreted twice. All pulmonary nodules greater than 1.0 mm were marked. All nodules greater than 3.0 mm were marked, measured, and scored as to their probability of being benign or malignant. Nodule-based and examination-based relative reviewer agreements were evaluated using percentage of agreement and kappa statistics. Similar assessments were performed on the subset of examinations interpreted twice.
RESULTS. The three radiologists identified a total of 470, 729, and
876 pulmonary nodules of which 395, 641, and 778 were rated as noncalcified
with some level of suspicion for being malignant. Nodule-based interobserver
agreement among the radiologists was poor (highest kappa value in a paired
comparison, 0.120). Examination-based agreement was higher (highest kappa
value in a paired comparison, 0.458). Intraobserver agreement was higher than
interobserver agreement for examination-based agreement (highest κ = 0.889)
but lower for nodule-based agreement (highest κ = -0.035).
Agreement improved as the suspicion of malignancy increased.
CONCLUSION. Unaided intra- and interobserver agreement in detecting pulmonary nodules in low-dose CT of the lung is relatively low. Computer-assisted detection may provide the consistency that is needed for this purpose.
Introduction
Lung cancer had been commonly detected and diagnosed clinically or on chest radiography, but since the early 1990s X-ray CT has been reported to improve detection and characterization of both benign and malignant pulmonary nodules [9-11]. Lung cancer screening is currently implemented using low-dose CT examinations, which are generally defined as scanning techniques that use less than 100 mAs [12-14]. There are several methodologic issues regarding the optimal practice for low-dose CT screening (e.g., tube current, pitch, section thickness, reviewing format) [15-20]. In addition, the general desire to reduce motion artifacts and improve spatial resolution by rapid image acquisition with thinner image sections has resulted in advances in CT technology (e.g., multidetector scanners). Hence, the typical examination generates large-volume data sets. These large data sets challenge both the display systems and the interpreting radiologist.
Interobserver agreement for the detection of individual pulmonary nodules is reported to be relatively poor [15, 21]; one study reported a large number of missed nodules on retrospective review [22]. However, one study reported excellent interobserver agreement for examination-based interpretations (namely, whether any nodule was visible in a complete examination of the lung) [10]. Reports describing interobserver agreement for sizing nodules have been mixed [21, 23, 24]. To our knowledge, there are no reports of intraobserver agreement for either examination-based interpretation or detection of individual nodules. Computer-assisted detection (CAD) schemes and nodule characterization algorithms are being developed to aid the radiologist during the interpretation of chest CT examinations [25-32]. These tools have the potential to improve radiologists' performance [25-29]. Because "ground truth" (i.e., benign vs malignant finding) is unknown for those pulmonary nodules that are not biopsied or resected, studies typically use the consensus of expert radiologists as the reference when evaluating the performance of radiologists (without and with CAD support).
In this study we assess the relative intra- and interobserver agreement for pulmonary nodule detection when using low-dose, thin-section CT examinations for the early detection of lung cancer. The term "relative intra- or interobserver agreement" indicates that observer agreement was evaluated against the other observers or themselves and not against a consensus panel or verified truth (outcome).
Materials and Methods
MDCT Data Acquisition
The CT examinations were performed using LightSpeed Plus 4-MDCT (n
= 282) or LightSpeed Ultra 8-MDCT (n = 11) scanners (GE Healthcare).
The helical CT scans were contiguous (nonoverlapping) volume scans
encompassing the entire lung area acquired with 2.5-mm section thickness in
the axial plane. Images were reconstructed with 512 x 512 pixel matrices
using the GE Healthcare lung reconstruction kernel. The low-dose CT
acquisition protocol varied slightly depending on patient size: tube voltage
range, 120-140 kVp; mean tube current, 29.7 ± 10.7 (SD) mAs; and range
of pixel dimensions, 0.60-0.98 mm. The CT examinations were acquired with an
end-inspiratory breath-holding protocol.
Participating Radiologists
Three board-certified radiologists with 3, 21, and 24 years of experience
in interpreting chest examinations (radiography and CT) participated as
observers in the study. Two of the three specialize in thoracic imaging and
the third is a general radiologist who routinely interprets chest imaging
examinations, among others. All observers participated in the past in
different observer performance studies and are familiar with our general
procedures in performing these undertakings. They performed the
interpretations at their own pace as time permitted, and all interpretations
were completed during a 23-week period. After a minimum separation of 13
weeks, the same radiologists again interpreted (reinterpreted) a subset of 30
selected cases during an 8-week period. The project leader selected 30 cases
that were a representative sample of the entire data set in terms of the
number, size, and clinical importance of pulmonary nodules.
Interpretation Protocol
A GE Healthcare Advantage Workstation running Advanced Lung Analysis 1
(ALA) software was used to review and rate the CT examinations. The
workstation was placed in the main thoracic radiology reading room for
convenience of the participating radiologists, and reviewers were notified by
the project leader if they fell substantially behind in the planned
interpretation schedule. The full functionality of the ALA software was
available to the participating radiologists (e.g., window and level settings,
zoom, cine mode, and maximum intensity projection [MIP]). The radiologist
reviewed the CT images and identified (marked) suspected pulmonary nodules.
For each marked nodule, a computerized scoring form regarding the nodule's
size and characteristics was completed using the computer mouse and keyboard
(Table 1). Features recorded on
the scoring form included the location, size, presence or absence of
calcification, density, surface, cavitation, fat, pleural attachment, and
clinical importance in terms of a 5-category ordinal scoring system as to
whether the nodule in question was likely to be benign or malignant
(Table 1). Nodule size was
measured using the digital calipers in the ALA software. The presence or
absence of calcification could be determined by changing the window and level
values.
Before beginning the interpretations, observers received a detailed Instructions to Observers form that described the task at hand and specifically identified the primary and secondary questions to be answered in each study. Observers were then trained in the use of the workstation for the study. The purpose of the study and the nature of the examinations to be reviewed were explained in general terms, but the mix of positive and negative examinations was not provided. Observers were told that we used an "enriched data set."
Solid nodules were defined as any pulmonary (or pleural) lesion represented on a chest CT image (displayed on lung windows) as sharply defined, discrete, and nearly circular soft-tissue-density opacity with a diameter measuring between 1.0 and 30.0 mm. Nonsolid nodules (e.g., ground-glass opacity) were defined in the same manner except that the density of the opacity was not solid or soft-tissue attenuation but less than soft tissue, allowing visualization of background structures (e.g., blood vessels). Partially solid nodules (i.e., mixed) were defined as a combination of solid and nonsolid nodules. Reviewers were asked to mark the location of all three nodule types larger than 1.0 mm and to provide characterization information only for those larger than 3.0 mm. However, the question regarding calcification (calcified or noncalcified) was responded to for all marked nodules regardless of size.
Data and Statistical Analysis
All pulmonary nodules detected by at least one of the three radiologists
were tabulated and analyzed. This was done because there was no verified
outcome for most of the cases (and nodules). A consensus score of the three
reviewers was determined for nodule size, calcification, and clinical
importance. Individual reviewer's measured nodule size was defined as the
maximum of the length and width on one individual section depicting the
nodule. The consensus nodule size was computed as the average reported size as
indicated by those reviewers detecting (marking) the nodule in question. If a
reviewer did not score the length and width of an identified nodule (i.e., for
nodules < 3.0 mm), his or her reporting was ignored during the computation
of a consensus size. A nodule was defined as a noncalcified nodule (NCN) when
all reviewers rated it as noncalcified. If one radiologist rated a nodule as
calcified, the consensus rating was defined as calcified. The density
consensus rating was the minimum density rating among the radiologists
detecting the specific nodule. In other words, if one radiologist rated a
nodule nonsolid, the assigned consensus rating was nonsolid. The hierarchy
went from solid as the highest assigned density to nonsolid as the lowest
assigned density. The consensus scoring as to whether an individual nodule
represented cancer was defined as the highest scoring assigned to the nodule
among the reviewers detecting that nodule
(Table 1).
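The consensus rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used in the study: the `Rating` fields, the `consensus` function, and the placement of "partially solid" between solid and nonsolid in `DENSITY_RANK` are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# Assumed ordering: nonsolid is the lowest density, solid the highest.
# Placing "partially solid" in the middle is an assumption of this sketch.
DENSITY_RANK = {"nonsolid": 0, "partially solid": 1, "solid": 2}


@dataclass
class Rating:
    """One reviewer's scoring of one marked nodule (hypothetical fields)."""
    calcified: bool
    length_mm: Optional[float] = None  # None when the nodule was not measured
    width_mm: Optional[float] = None
    density: Optional[str] = None
    importance: Optional[int] = None   # ordinal clinical-importance score


def reviewer_size(rating):
    """Reviewer's nodule size: the maximum of length and width, if measured."""
    if rating.length_mm is None or rating.width_mm is None:
        return None
    return max(rating.length_mm, rating.width_mm)


def consensus(ratings):
    """Combine the ratings of the reviewers who marked the same nodule."""
    sizes = [s for s in map(reviewer_size, ratings) if s is not None]
    densities = [r.density for r in ratings if r.density is not None]
    scores = [r.importance for r in ratings if r.importance is not None]
    return {
        # Average size over reviewers who measured the nodule.
        "size_mm": mean(sizes) if sizes else None,
        # Calcified if any single reviewer rated it calcified.
        "calcified": any(r.calcified for r in ratings),
        # Minimum density rating among the detecting reviewers.
        "density": min(densities, key=DENSITY_RANK.__getitem__) if densities else None,
        # Highest clinical-importance score among the detecting reviewers.
        "importance": max(scores) if scores else None,
    }


# Made-up example: two reviewers measured the nodule, a third only marked it.
example = consensus([
    Rating(False, 6.0, 4.0, "solid", 2),
    Rating(False, 8.0, 5.0, "nonsolid", 3),
    Rating(False),
])
```

In this made-up example the unmeasured marking is ignored for size, so the consensus size is the mean of 6.0 and 8.0 mm.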
Descriptive statistics were tabulated for all nodules and for all NCNs with a consensus "clinical importance" equal to or greater than 1. Each group (all nodules, all NCNs) was further stratified by size as follows: less than 3.0 mm, equal to or greater than 3.0 mm but less than 10.0 mm, and equal to or greater than 10.0 mm but less than 30.0 mm. If all nodules observed (marked) by a reviewer in an examination were discounted during the categorization of group 2 (i.e., calcified nodules or those scored "definitely not cancer" [clinical importance = 0]), the examination was considered to be negative (no marked nodules). The number and percentage of missed (not marked) nodules by at least one reviewer were calculated and stratified by clinical importance, size, and density. To evaluate reviewer variability in measuring nodule size, we also calculated the fraction of nodules with differences in reported size that were greater than 3.0 mm among the reviewers and the absolute difference between two reviewers' reported sizes as a percentage of their mean reported size.
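As an illustration of the size strata and the two size-variability measures described above, a minimal Python sketch follows. The function names and the `inclusive` switch are assumptions introduced here; the switch exists only because a threshold can be read either as "greater than" or as "equal to or greater than" 3.0 mm.

```python
def size_stratum(size_mm):
    """Assign a nodule size (mm) to one of the three strata."""
    if size_mm < 3.0:
        return "< 3.0 mm"
    if size_mm < 10.0:
        return "3.0 to < 10.0 mm"
    if size_mm < 30.0:
        return "10.0 to < 30.0 mm"
    return "outside nodule definition"


def percent_difference(size_a, size_b):
    """Absolute difference between two reviewers' sizes as a percentage of their mean."""
    return abs(size_a - size_b) / ((size_a + size_b) / 2) * 100


def fraction_exceeding(size_pairs, threshold_mm=3.0, inclusive=False):
    """Fraction of paired measurements whose difference exceeds the threshold."""
    def exceeds(diff):
        return diff >= threshold_mm if inclusive else diff > threshold_mm
    hits = sum(1 for a, b in size_pairs if exceeds(abs(a - b)))
    return hits / len(size_pairs)


# Made-up paired measurements (mm) from two reviewers for three nodules.
pairs = [(4.0, 8.0), (5.0, 6.0), (10.0, 13.0)]
strict_fraction = fraction_exceeding(pairs)
inclusive_fraction = fraction_exceeding(pairs, inclusive=True)
```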
Relative reviewer agreement was evaluated for NCNs with clinical importance equal to or greater than 1 as follows: nodule-based, for individual nodules and negative examinations equally weighted; and examination-based, for positive examinations with one to six nodules and negative examinations with no marked nodules or those with more than six marked nodules. In the former, interobserver agreement was based on all nodules observed by any of the three observers; hence, a reviewer not involved in specific paired analyses could influence the measured agreement. In other words, when reviewer 1 marked a specific nodule that was not marked by either reviewer 2 or reviewer 3, reviewers 2 and 3 were considered to be in agreement. Percentage of agreement and kappa values for paired reviewers were used as measures of intra- and interobserver agreement in each of the analyses for the different categories. Finally, the positive predictive value of one reviewer scoring an NCN with clinical importance equal to or greater than 3 predicting another reviewer scoring the same nodule with clinical importance equal to or greater than 1 was computed for the six reviewer pairings. Specifically, if a reviewer was concerned about a detected nodule, what was the probability that another reviewer detected the same nodule and assigned it a clinical importance equal to or greater than 1?
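The nodule-based pairing logic can be sketched as follows. This is a hedged illustration: the text specifies only "kappa statistics," so the use of Cohen's kappa for a 2 x 2 table is an assumption, the function names are hypothetical, and the equally weighted negative examinations are left out for brevity.

```python
def agreement_table(marks_a, marks_b, all_nodules):
    """Count agreement cells for two reviewers over every nodule marked by anyone."""
    both = sum(1 for n in all_nodules if n in marks_a and n in marks_b)
    only_a = sum(1 for n in all_nodules if n in marks_a and n not in marks_b)
    only_b = sum(1 for n in all_nodules if n not in marks_a and n in marks_b)
    # Nodules marked only by a third reviewer count as agreement (neither marked).
    neither = sum(1 for n in all_nodules if n not in marks_a and n not in marks_b)
    return both, only_a, only_b, neither


def percent_agreement(table):
    """Percentage of nodules on which the two reviewers agree."""
    both, _, _, neither = table
    return 100 * (both + neither) / sum(table)


def cohen_kappa(table):
    """Chance-corrected agreement for a 2 x 2 table (assumed form of kappa)."""
    both, only_a, only_b, neither = table
    n = sum(table)
    p_obs = (both + neither) / n
    p_a = (both + only_a) / n  # proportion marked by reviewer A
    p_b = (both + only_b) / n  # proportion marked by reviewer B
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_exp == 1:
        return float("nan")    # kappa is undefined when chance agreement is total
    return (p_obs - p_exp) / (1 - p_exp)


# Made-up markings: nodule identifiers marked by each of three reviewers.
r1, r2, r3 = {"n1", "n2"}, {"n2", "n3"}, {"n4"}
union = r1 | r2 | r3
table_12 = agreement_table(r1, r2, union)
```

Here nodule `n4`, marked only by the third reviewer, lands in the "neither" cell for the pairing of reviewers 1 and 2, which is how a reviewer outside a pairing can influence its measured agreement.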
Results
The interpretation sessions were not monitored or timed; however, anecdotally, the reviewers reported that they typically serially scrolled through each examination image by image and occasionally used the MIP feature, with a typical interpretation time of 4-10 min per case. The most common tactic reported by the reviewers was to review one lung at a time.
Measurements of nodule size were inconsistent among the three reviewers, with relatively large differences reported. Fifty-nine percent of all nodule size measurement pairings for the three reviewers had a size difference equal to or greater than 1.0 mm. The percentages of measured size differences equal to or greater than 3.0 mm were 14.5% (39/269), 11.1% (29/261), and 13.7% (57/415) for nodules reported by both reviewers for the pairings of reviewers 1 and 2, 1 and 3, and 2 and 3, respectively. The mean absolute percentages of differences (percentage of the mean size) between the reported sizes for the pairing of reviewers 1 and 2, 1 and 3, and 2 and 3 were 27.0% ± 23.2%, 16.3% ± 16.3%, and 30.0% ± 25.9%, respectively.
Intraobserver agreement was poor in the 30 repeated examinations for the
detection of individual nodules (highest κ = -0.035) but was good to
excellent in the examination-based evaluation (i.e., all examinations with one
or more detected nodules) (Table 3). Reviewer 1 had the highest intraobserver
agreement for the detection of individual nodules, but that reviewer also
detected the lowest number of nodules. Conversely, reviewer 3 had the highest
intraobserver agreement in the examination-based evaluation (κ = 0.889) and detected
the highest number of nodules.
Interobserver agreement was poor for the detection of individual nodules
and marginal for examination-based evaluation (i.e., for examinations with one
or more detected nodules) (Table 4). The agreement between any pair of
radiologists was less than 55% for the detection of individual NCNs with a
clinical importance equal to or greater than 1 based on all nodules detected by
the three reviewers. The interobserver agreement among the three reviewers was
18.9%. The agreement between reviewers 1 and 2 was the highest (κ = 0.120);
those two
reviewers detected the lowest number of nodules. With respect to the group,
interobserver agreement for the detection of individual nodules improved with
increasing nodule size, but agreement between pairs of observers was not
consistent as a function of nodule size. Examination-based interobserver
agreement was relatively constant between pairs of observers. In addition,
examination-based interobserver agreement improved as a function of scored
clinical importance (data not shown). Specifically, agreement increased as the
suspicion of malignancy increased.
Positive predictive value for NCNs of concern (clinical importance ≥ 3)
to one reviewer ranged from 59.5% to 85.4% for the six comparisons. In other
words, when one reviewer scored a nodule with a clinical importance of 3 or
greater, most of these were detected and scored with at least a clinical
importance score of 1 by another reviewer.
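The positive predictive value used here can be illustrated with a short Python sketch. The data layout (one dictionary per reviewer mapping a nodule identifier to that reviewer's clinical-importance score) and the scores themselves are made up for the example and are not the study data.

```python
from itertools import permutations


def ppv(index_scores, other_scores, concern=3, detected=1):
    """Percentage of the index reviewer's nodules of concern (score >= concern)
    that the other reviewer also marked with a score >= detected."""
    flagged = [n for n, s in index_scores.items() if s >= concern]
    if not flagged:
        return None  # undefined when the index reviewer flagged nothing
    confirmed = sum(1 for n in flagged if other_scores.get(n, -1) >= detected)
    return 100 * confirmed / len(flagged)


# Hypothetical scores; a nodule missing from a dictionary was not marked.
scores = {
    "R1": {"a": 3, "b": 4, "c": 1},
    "R2": {"a": 1, "c": 3},
    "R3": {"a": 2, "b": 0},
}

# Six ordered pairings: each reviewer in turn serves as the index reviewer.
all_ppv = {
    (i, j): ppv(scores[i], scores[j]) for i, j in permutations(scores, 2)
}
```

Because the measure is directional, the value for reviewer 1 predicting reviewer 2 generally differs from the value for reviewer 2 predicting reviewer 1, which is why three reviewers yield six comparisons.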
The fraction (percentage) of nodules missed (not marked) by the three reviewers was relatively high for all levels of clinical importance, sizes, and densities (Table 5). The fraction of missed nodules decreased as clinical importance and size of the nodules increased. There was no correlation between missed nodules and nodule density, and the fraction of missed nodules was greater for solid nodules than for either partially solid or nonsolid nodules. The fraction and type of the missed nodules during the individual reviewer's first and second interpretations (intraobserver analysis) were similar to those found for the interobserver missed nodules (data not shown).
Discussion
If screening for early detection of lung cancer proves to be efficacious, the volume of examinations and image data to be interpreted will increase rapidly, and therefore interpretation efficiency and providing consistent results will be important goals. One important diagnostic step that needs to be better understood is the variability among radiologists interpreting chest CT examinations in lung cancer screening programs, and there are few data in this regard. Two smaller studies on examinations with high prevalence and a relatively large number of pulmonary nodules per examination suggest that agreement among radiologists in the detection of individual pulmonary nodules is poor [15, 21]. However, a larger study reported that examination-based agreement (i.e., does an examination depict any nodules?) is quite high, but the CT images in that study were reviewed at a thickness of 5.0 mm and therefore targeted larger nodules than our study [10]. Our results are in general agreement with the reported agreement levels of these studies in regard to both individual nodule detection and examination-based evaluation. As expected, both intra- and interobserver agreement was much better in examination-based interpretations than in the detection of individual pulmonary nodules. More important perhaps is the increasing level of agreement with increasing suspicion that a finding represents cancer.
However, one must remember that these results are directly affected by the number and type of pulmonary nodules depicted in each chest CT examination. The fact is that reviewer agreement per examination may not be a clinically relevant index of performance in lung cancer screening, particularly after the initial screening examination. Our study suggests that even in a laboratory experiment when the task at hand is well defined, agreement among radiologists per individual pulmonary nodule is relatively poor, and the fraction of missed (not marked) nodules is high across all sizes and types (i.e., clinical importance and density) of nodules, but especially in smaller nodules and nodules rated as having low clinical importance. Both of these findings highlight the need for training and standardization of reporting in this area. These observations lead to the idea that even if a CAD scheme does not perform at an extremely high level, its use could help reduce variability and improve agreement within and among reviewers. In addition to possible improvement in reviewer consistency, the use of CAD may also improve consistency and possibly accuracy in the estimates of the size of pulmonary nodules, which will be an increasingly important parameter for assessing change over time in repeated examinations.
In clinical protocols that use a nodule size threshold to initiate "action" (or no action), even a relatively small difference in size measurement may be clinically significant in that different management decisions may result. We observed inconsistent measurements of nodule size and relatively large measurement differences, findings that agree with other studies [21, 23, 24]. Although the magnitude of size differences reported by Wormanns et al. [21] is similar to our findings and those of Revel et al. [23], Wormanns et al. reported good agreement based on Pearson's correlation coefficients. For our study, we defined nodule size as maximum length, but there are a number of methods (e.g., linear, area, volumetric) to do so [24, 33, 34]. The most reliable method for indicating or predicting malignancy has yet to be determined [24, 33, 34].
Our study has several limitations. First, we used chest CT examinations performed at one institution and interpreted by a group of reviewers who are largely academic radiologists. Additional data are needed from other institutions and for other types of radiologists, but we suspect that the results will not differ substantially. Perhaps more training might have reduced the reviewer variability, but we believe that observers who routinely interpret chest CT examinations should be proficient at detecting pulmonary nodules. The results of this laboratory experiment may or may not generalize to the clinical environment. However, it will take several years before actual data on performance in the clinical environment are available, in particular for examinations with verified outcome. We focused on relative reviewer agreement in the detection of pulmonary nodules, but the ultimate performance index of interest will be the agreement in diagnosis and recommendation for follow-up. Unfortunately, a number of the pulmonary nodules detected in our study are currently being followed and the verified diagnostic outcome (pathology or otherwise) for most patients is unavailable. A consensus panel could have been used to define a gold standard (reference) for each case, but it would not have affected relative agreement (or lack thereof) among reviewers. Because of the breadth of missed (not detected) nodules, we do not believe a consensus review would reveal a common cause for missed nodules in this study. Our low-dose screening protocol, which used a somewhat lower dose than is typically used, may have resulted in an increased variability. Data entry errors due to the multiple uses of a computer mouse and keyboard may have also contributed to the variability, but we do not believe they substantially affected any of the results or conclusions presented here.
Finally, because of the asymmetric distribution of examinations (i.e., the number of examinations without nodules was low), often one cell on the diagonal in the contingency tables for measuring agreement was relatively small, particularly in the intraobserver analysis. Therefore, kappa values may not have been the most appropriate measure of agreement.
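The sensitivity of kappa to a nearly empty diagonal cell can be shown with a small worked example. The counts below are hypothetical and chosen only to illustrate the effect; they are not taken from the study's contingency tables, and the 2 x 2 form of kappa is assumed.

```python
def kappa_2x2(a, b, c, d):
    """Return (observed agreement, kappa) for a 2 x 2 agreement table.

    a = both positive, b and c = the two kinds of disagreement, d = both negative.
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Chance agreement from the marginal totals of the two readings.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# 30 hypothetical examinations, 26 agreements in both tables.
skewed = kappa_2x2(26, 2, 2, 0)      # almost every examination is positive
balanced = kappa_2x2(13, 2, 2, 13)   # positives and negatives evenly split

print(f"skewed:   agreement {skewed[0]:.3f}, kappa {skewed[1]:.3f}")
print(f"balanced: agreement {balanced[0]:.3f}, kappa {balanced[1]:.3f}")
```

Both tables have the same observed agreement (26 of 30), yet kappa is slightly negative for the skewed table and about 0.73 for the balanced one, because chance-expected agreement is already very high when nearly all examinations are positive.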
In conclusion, our study indicates that detecting pulmonary nodules depicted in large-volume CT examinations is a daunting task that requires vigilance and diligence. Although intraobserver agreement was reasonably good in the examination-based analysis, intraobserver agreement was poor for the detection of individual nodules. This suggests that there may be a need for the development of consistent search criteria and standardized reporting practices. If nothing else, this preliminary study clearly suggests that there are significant observer-related issues that cannot be ignored regarding the use of low-dose chest CT examinations for the early detection of pulmonary nodules and lung cancer.