
STATISTICS IN MEDICINE. Statist. Med. 2009; 28:762–779. Published online 19 December 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/sim.3506

Conditional estimation of sensitivity and specificity from a phase 2 biomarker study allowing early termination for futility

Margaret Sullivan Pepe∗,†, Ziding Feng, Gary Longton and Joseph Koopmeiners

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., M2-B500, Seattle, WA 98109, U.S.A.


Development of a disease screening biomarker involves several phases. In phase 2, its sensitivity and specificity are compared with established thresholds for minimally acceptable performance. Since we anticipate that most candidate markers will not prove to be useful, and availability of specimens and funding is limited, early termination of a study is appropriate if accumulating data indicate that the marker is inadequate. Yet, for markers that complete phase 2, we seek estimates of sensitivity and specificity to proceed with the design of subsequent phase 3 studies. We suggest early stopping criteria and estimation procedures that adjust for bias caused by the early termination option. An important aspect of our approach is to focus on properties of estimates conditional on reaching full study enrollment. We propose the conditional-UMVUE and contrast it with other estimates, including naïve estimators, the well-studied unconditional-UMVUE and the mean and median Whitehead-adjusted estimators. The conditional-UMVUE appears to be a very good choice. Copyright © 2008 John Wiley & Sons, Ltd.

KEY WORDS: group sequential; diagnostic test; screening; true positive rate


The Early Detection Research Network (EDRN) seeks to develop biomarkers for cancer screening, diagnosis, prognosis and risk prediction. Marker development is a process, a sequence of studies. A 5-phase paradigm for this process has been adopted for the development of screening markers [1]. Briefly, phase 1 concerns marker discovery, phase 2 is retrospective marker validation in specimens from cases concurrent with clinical disease and controls without disease, phase 3 is retrospective marker validation in specimens taken prior to the clinical disease, phase 4 is a prospective population study of test performance and phase 5 is ideally a randomized trial comparing mortality in the

∗Correspondence to: Margaret Sullivan Pepe, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., M2-B500, Seattle, WA 98109, U.S.A.

†E-mail: [emailprotected]

Contract/grant sponsor: NIH/NCI; contract/grant numbers: RO1 GM054438, UO1 CA086368

Received 7 December 2007; Accepted 24 October 2008. Copyright © 2008 John Wiley & Sons, Ltd.


presence and absence of screening. Most of the studies conducted by the EDRN are phase 1 and 2. Here we consider the design of a phase 2 study.

Stored blood or urine specimens are typically used in a phase 2 study. The marker is measured in specimens from a set of cases with clinical disease and from a set of appropriate controls. Considerable effort has been expended to establish high-quality specimen repositories for breast, lung and prostate cancer within the EDRN. Other groups have similarly built specimen banks for biomarker evaluation. It is important to use these resources judiciously and efficiently.

There is great enthusiasm in the scientific and business communities about the potential for technology to measure biomarkers [2]. Biomarker discovery studies abound and we anticipate that a large number of candidate biomarkers will be put forward for validation. However, the false discovery rate from phase 1 is likely to be high. That is, we expect that the majority of markers studied in phase 2 will not have adequate performance for proceeding to further development. This, along with concerns about conserving specimen resources and keeping study costs reasonable, motivates a group sequential approach to phase 2 study design. In particular, designs that allow early termination when accumulating evidence suggests poor marker performance are very attractive.

In this paper, we consider dichotomous markers, with values denoted by Y = 1 for a positive result and Y = 0 for a negative result. Marker performance is quantified by the sensitivity, S = P[Y = 1 | diseased], and the false positive rate (or 1−specificity), F = P[Y = 1 | not diseased]. Higher sensitivities and lower false positive rates indicate better performance.

When a phase 2 study terminates early, the marker is not considered for further development. In contrast, when a study completes its full enrollment, estimates of (S, F) will be calculated to determine if and how marker development should proceed further. Our particular interest is in estimating (S, F) with data from completed phase 2 studies, i.e. from studies that do not terminate early.

Group sequential methods have received scant attention in the diagnostic testing literature. Mazumdar [3] and Mazumdar and Liu [4] consider methods for prospective comparative studies, with early termination possible for either positive or negative conclusions. The context is geared towards phase 4 studies, not towards phase 2 validation studies. There is no existing group sequential methodology for phase 2 biomarker studies.

Phase 2 treatment trials have statistical elements in common with our paradigm for phase 2 biomarker studies. In the prototype phase 2 treatment trial, subjects are classified as responders or not, the parameter of interest is the binomial response probability, and early termination occurs if the observed response rate is low. In our setting there are two binomial probabilities, S in cases and F in controls, and a study terminates early if either is clearly unsatisfactory. For simplicity, we will first describe methodology when only one binomial probability is of interest and later address extensions to simultaneous consideration of two independent binomial proportions. We note that our methods are equally relevant to phase 2 treatment trials, although our motivation is derived from phase 2 biomarker study design.

Substantial methodology has been developed for estimation following the group sequential design of a phase 2 therapeutic study. A key distinction between previous methods and that proposed here is that we are particularly concerned with the estimates when a study reaches its planned full sample size. That is, the distribution of estimates conditional on continuing to full enrollment is our particular concern, since those estimates are used for deciding if and how to proceed with the phase 3 study. In contrast, when a study terminates early, the biomarker is clearly inadequate and estimates for planning phase 3 are not needed. We show that a classic group sequential estimator that is marginally unbiased, in the sense of averaging over studies that do and do not terminate early,


may have substantial upward bias in the subset of studies that reach full enrollment. This implies that the estimates used for planning phase 3 tend to be too large on average, which has serious implications for the integrity of phase 3 study designs. The naïve estimator is also conditionally biased. In this paper, we propose an alternative estimator that avoids this bias.

To simplify the exposition, in Sections 2–6 we discuss estimation of a single binomial probability and denote it generically by P = P[Y = 1]. If P is sensitivity, only the case population is considered. If P is 1−specificity, only the control population is considered. In a phase 2 therapeutic study, Y denotes response to treatment and P is the response rate. The two-stage group sequential design is described in Section 2 and estimators are defined. Simulation studies described in Section 3 are used to compare them. We contrast unconditional estimation with conditional estimation in Section 4 and argue that our conditional estimators are useful even when estimates are sought at the early stopping time as well. In Section 5 we present methods to construct confidence intervals with the bootstrap. Some numerical applications illustrate our approach in Section 6. In Section 7 we return to the context of studying performance of diagnostic tests, illustrating in detail our procedures when two binomial parameters, (S, F), are simultaneously under consideration. Closing remarks and directions for further work are provided in Section 8.


2.1. Design

We consider a single binomial probability, P = P[Y = 1]. To make the discussion concrete, we use terminology from diagnostic studies here, with P being the sensitivity of a biomarker. Suppose that sensitivities below p0 are undesirable while values at or above p1 are desirable. In particular, in phase 2 we will need to show that P > p0, the maximal undesirable sensitivity, in order to proceed with phase 3 development. On the other hand, p1 is minimally desirable in the following sense: if P > p1, we certainly want to proceed with development, while for P ∈ (p0, p1), the equivocal region of sensitivities, there is little enthusiasm. In terms of hypotheses upon which to base the study design, we write

H0: P ≤ p0 versus H1: P > p0

As an example, for detection of ovarian cancer, sensitivities below p0 = 0.6 would be undesirable, since existing markers reach at least this level of detection, while we seek markers with sensitivities of at least p1 = 0.8, since this would be a substantial improvement and worthy of investing resources in further research.

A single-stage study will enroll n cases and reject H0 if the lower two-sided (1−α) confidence limit for P exceeds p0. For the purpose of study monitoring, after m samples are evaluated we propose to construct a two-sided (1−α)×100 per cent confidence interval, and if the upper limit is less than p1, the study terminates. That is, if there is strong evidence that the sensitivity is below the minimally desirable level, the study will not continue to completion. Otherwise, the study continues to evaluate the remaining n−m samples. This stopping rule is reasonable and easy to explain to investigators. Moreover, if P is minimally desirable, i.e. P = p1, there is only a small chance, ≤ α/2, of stopping early, suggesting that the rule will maintain statistical power relative to a single-stage study. However, other early termination criteria could be used instead.
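The interim rule can be made concrete in code. The paper cites its interval method only by reference [10]; the sketch below assumes a Wilson score interval, which happens to reproduce the boundary used in the numerical examples later (continue only if at least 13 of the first 20 results are positive when p1 = 0.8). Function names are ours.

```python
from math import sqrt

def wilson_upper(k, m, z=1.959964):
    """Upper limit of the two-sided 95 per cent Wilson score interval
    for k positives out of m samples."""
    p = k / m
    centre = p + z * z / (2 * m)
    half = z * sqrt(p * (1 - p) / m + z * z / (4 * m * m))
    return (centre + half) / (1 + z * z / m)

def continuation_boundary(m, p1):
    """Smallest first-stage positive count whose upper confidence limit
    reaches p1; the study stops early if fewer positives are observed."""
    for k in range(m + 1):
        if wilson_upper(k, m) >= p1:
            return k
    return m + 1

print(continuation_boundary(20, 0.8))  # -> 13
```

Any interval with similar behaviour near p1 would give the same boundary here; the choice of interval only matters through the count at which its upper limit crosses p1.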



2.2. Estimation at study completion

We now consider how to estimate the sensitivity, P, at the end of a completed phase 2 study. The data are denoted by {Yi, i = 1, . . . , n}, with the index i indicating the order in which samples are evaluated. One option is to calculate the naïve estimator that ignores the early stopping procedure

P(all) = (1/n) Σ_{i=1}^{n} Yi

However, this is likely to be biased upward since it is contingent upon an adequately high response rate amongst the first m samples that results in completing the study. An unbiased estimator that is unaffected by the early stopping option uses only the second-stage samples,

P(stage2) = {1/(n−m)} Σ_{i=m+1}^{n} Yi

Because of the relatively small sample size, this estimator is likely to suffer from imprecision. We now propose an unbiased estimator that incorporates data from both stages. Having used P to denote simple proportions, we write this more complicated estimator as

U = E(P(stage2) | P(all), C = 1)

where C=1 indicates that the criterion for continuation past the first stage was passed.

Result 1
Conditional on C = 1, P(all) is a complete and sufficient statistic for the distribution of P(stage2).

Proof
For sufficiency, we need to show that the conditional distribution of P(stage2) given P(all) and C = 1 does not depend on the parameter P. But conditional on P(all), the distribution of P(stage2) is hypergeometric(n, n−m, P(all)). Moreover, since P(stage1) = m⁻¹{n P(all) − (n−m) P(stage2)}, C can be determined from P(all) and P(stage2). The distribution of P(stage2) conditioning on C = 1 in addition to P(all) can be derived from the distribution of P(stage2) conditioning on P(all):

P(P(stage2) | P(all), C = 1) = I(C = 1) P(P(stage2) | P(all)) / P(C = 1 | P(all))

Therefore, since P(P(stage2) | P(all)) does not depend on P, neither does P(P(stage2) | P(all), C = 1). The proof of completeness follows from detailed tedious arguments given in Appendix A of Jung and Kim [5]. □

Corollary
U is the uniformly minimum variance unbiased estimator of P among all estimators that are unbiased conditional on C = 1.

Proof
This follows from the fact that P(stage2) is independent of C, and hence conditionally unbiased,

E(P(stage2) | C = 1) = E(P(stage2)) = P

and the Rao–Blackwell theorem [6]. □


Two other estimators are inspired by Whitehead [7, 8]. They adjust P(all) for bias caused by the early termination option. The median-adjusted estimator is

Wmed = ρ : P_ρ(P∗(all) > P(all) | C∗ = 1) = 0.5     (1)

where the ∗ superscript denotes random variables generated from our study design using ρ as the binomial response probability, ρ = P_ρ(Y = 1). Intuitively, Wmed is the response probability for which the observed naïve proportion is the median naïve proportion in studies that continue to completion. A mean-adjusted estimator is similarly defined as

Wmean = ρ : E_ρ(P∗(all) | C∗ = 1) = P(all)     (2)

Whitehead also proposed estimators for use when the study terminates early, but these are not our focus here.

2.3. Calculations

We calculate U, Wmed and Wmean numerically using simulations. For U, we noted earlier that the conditional distribution of P(stage2) given P(all) is hypergeometric. Therefore, in each of K simulations, we sample m cases at random from the n available to simulate the first-stage data, and the remaining n−m simulate the second-stage data. Accordingly, in the kth simulation, values of Pk(stage2), Pk(stage1) and Ck are calculated. Averaging Pk(stage2) across simulations where Ck = 1 yields U. Exact calculations using the hypergeometric distribution are also possible. They require more programming effort, but take less time to compute.
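The exact version of this calculation is short enough to sketch directly from the hypergeometric weights. The following is our illustration (names are ours); b denotes the minimum first-stage positive count required to continue.

```python
from math import comb

def conditional_umvue(s, n, m, b):
    """U = E(P(stage2) | P(all), C=1), computed exactly.
    s: total positives among all n samples (so P(all) = s/n);
    m: first-stage sample size; b: minimum first-stage count for continuation."""
    n2 = n - m
    num = den = 0.0
    for k2 in range(max(0, s - m), min(n2, s) + 1):
        w = comb(m, s - k2) * comb(n2, k2)  # hypergeometric weight for stage-2 count k2
        if s - k2 >= b:                     # stage-1 count satisfies C = 1
            num += w * (k2 / n2)
            den += w
    return num / den
```

With b = 0 (no interim look) this reduces to P(all) itself; conditioning on C = 1 pulls the estimate below the naïve P(all), reflecting the correction for the selection imposed by continuation.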

More extensive computations are required for calculating Wmed and Wmean, because they involve searching for ρ to satisfy (1) and (2), respectively. For each value of ρ considered, we simulate two-stage studies with binomial probability equal to ρ and select Pk(all) for studies that satisfy Ck = 1. We calculate P_ρ(P∗(all) > P(all) | C∗ = 1) as the proportion of Pk(all) exceeding the observed P(all), and E_ρ(P∗(all) | C∗ = 1) is calculated as the mean of Pk(all). Wmed is set to the value of ρ for which P_ρ(P∗(all) > P(all) | C∗ = 1) is closest to 0.5, and Wmean to the value of ρ with E_ρ(P∗(all) | C∗ = 1) closest to P(all). In our applications we used K = 5000 simulations to calculate U. Also, for each ρ, P_ρ(P∗(all) > P(all) | C∗ = 1) and E_ρ(P∗(all) | C∗ = 1) were calculated with K = 5000 simulations. We used a simple grid search on ρ: Wmed was set to the first value of ρ where P_ρ(P∗(all) > P(all) | C∗ = 1) was within 0.005 of 0.5, while Wmean was set to the first value of ρ in the search with E_ρ(P∗(all) | C∗ = 1) within 0.005 of P(all).
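A rough sketch of the grid search for Wmed, with a much smaller K than the paper's 5000 and an argmin over the grid in place of the 0.005 tolerance (all choices here are illustrative, and names are ours):

```python
import random

def simulate_p_all(rho, n, m, b, rng):
    """One two-stage study: return the overall proportion P(all) if the study
    passes the interim look, else None (early termination for futility)."""
    ys = [rng.random() < rho for _ in range(n)]
    if sum(ys[:m]) < b:
        return None
    return sum(ys) / n

def w_med(p_all_obs, n, m, b, K=1000, grid=100, seed=7):
    """Grid search for the median-adjusted (Whitehead-type) estimate: the rho
    whose conditional exceedance probability, among continuing studies, is
    closest to 0.5 at the observed P(all)."""
    rng = random.Random(seed)
    best, best_gap = None, float("inf")
    for i in range(1, grid):
        rho = i / grid
        sims = [p for _ in range(K) if (p := simulate_p_all(rho, n, m, b, rng)) is not None]
        if len(sims) < 50:  # skip rho values where continuation is too rare
            continue
        gap = abs(sum(p > p_all_obs for p in sims) / len(sims) - 0.5)
        if gap < best_gap:
            best, best_gap = rho, gap
    return best
```

Wmean is computed analogously, matching the conditional mean of the simulated P(all) values to the observed value instead of matching the exceedance probability to 0.5.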


3.1. Initial assessment

A single-stage study to test H0: P ≤ p0 = 0.6 with 90 per cent power when p1 = 0.8 and allowing type 1 error rate α = 0.05 requires 42 cases according to asymptotic theory formulas [9] and would reject H0 if more than 31 responses are observed. We simulated 1000 studies with n = 40, allowing for early termination after responses from m = 20 are observed if the upper two-sided 95 per cent confidence limit for P [10] does not exceed p1 = 0.8. This corresponds to early termination if fewer than 13 of the first 20 results are positive. Results in Table I show estimates calculated from studies that had complete enrollment of all 40 cases. If the true sensitivity is low, it is likely



Table I. Results of simulation studies with n = 40 and early termination option at m = 20.

True P   Per cent stopping early   P(all)          P(stage2)       Wmed            Wmean           U
0.55     73                        0.623 (0.062)   0.547 (0.112)   0.576 (0.095)   0.550 (0.098)   0.553 (0.102)
0.60     59                        0.654 (0.061)   0.606 (0.109)   0.619 (0.091)   0.599 (0.094)   0.604 (0.096)
0.65     41                        0.685 (0.062)   0.647 (0.102)   0.662 (0.086)   0.644 (0.090)   0.650 (0.091)
0.70     22                        0.720 (0.062)   0.698 (0.099)   0.709 (0.081)   0.693 (0.085)   0.699 (0.084)
0.75     8                         0.761 (0.061)   0.756 (0.097)   0.760 (0.073)   0.746 (0.077)   0.751 (0.075)
0.80     3                         0.804 (0.060)   0.799 (0.090)   0.809 (0.067)   0.798 (0.070)   0.800 (0.067)
0.85     1                         0.850 (0.057)   0.851 (0.083)   0.858 (0.059)   0.848 (0.061)   0.849 (0.059)

Shown are mean (sd) of estimated sensitivities in studies that reached completion. One thousand simulations per true sensitivity, P.

that the study will terminate early. For example, when P = 0.6, 59.2 per cent of studies stop early while 40.8 per cent continue to full enrollment of n = 40. Thus, the means and standard deviations in the corresponding row of Table I relate to 40.8 per cent × 1000 = 408 studies. Consider first the naïve estimator, P(all), that ignores the early stopping option. The anticipated upward bias is evident and most pronounced when P is small. For example, when P = 0.55 the mean is 0.62, a substantial bias. The other naïve estimator, using only second-stage data, P(stage2), is unbiased. However, its precision is low, a problem that is evident when the probability of early stopping is very small (i.e. P is large). Indeed, when no studies terminate early (e.g. P = 0.85), we note that var(P(stage2)) = {n/(n−m)} var(P(all)) in general.
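The early stopping percentages in Table I can be checked against the exact binomial tail probability of falling below the continuation boundary of 13 described above (the function name is ours):

```python
from math import comb

def early_stop_prob(p, m=20, boundary=13):
    """P(fewer than `boundary` positives among the first m samples)
    = Binomial(m, p) CDF evaluated at boundary - 1."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(boundary))

print(round(early_stop_prob(0.60), 3))  # -> 0.584
```

The exact values at p = 0.60 and p = 0.55 are approximately 0.58 and 0.75, in line with the simulated 59 and 73 per cent in Table I.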

The conditional-UMVUE, U, appears to maintain the best properties of both naïve estimators. Like P(stage2), it is unbiased across all values of P. In addition, when early stopping is unlikely, its precision is comparable with that of P(all). These results are encouraging.

The performances of the mean- and median-adjusted estimators are comparable with that of U. They substantially adjust for bias when P is low and are relatively precise when P is large. Despite their good performance, we will not study them further here, for the following reasons: (i) there is no theory to support them, unlike U, which is theoretically unbiased, and a close look at Table I indicates some residual bias in Wmed; (ii) their computation is more difficult than that for U; and (iii) our preliminary simulation studies in Table I indicate no particular improvement in their performance over that of U.

3.2. Additional scenarios

Table II shows additional simulation results for studies with larger sample sizes. The top panel is motivated by the context of ovarian cancer screening, where a very high specificity is desired. False positive screening tests result in subjects undergoing laparoscopic surgery, so the rate must be kept very small. Specificity values at or above 0.98 are desired, while values below 0.95 would be considered unacceptable. A single-stage study would require n = 230 specimens from nondiseased subjects, and we consider early termination after evaluating half that number, m = 115. The bottom


Table II. Results of additional simulation studies with larger sample sizes. 1000 studies were simulated for each scenario. Shown are mean (sd) of estimates in studies that reached completion.

p0 = 0.95, p1 = 0.98, n = 230, m = 115
True P   Per cent stopping early   P(all)          P(stage2)       Wmed            Wmean           U
0.90     98                        0.924 (0.015)   0.889 (0.029)   0.890 (0.029)   0.891 (0.027)   0.888 (0.028)
0.95     50                        0.957 (0.012)   0.948 (0.022)   0.950 (0.019)   0.948 (0.022)   0.948 (0.020)
0.965    22                        0.968 (0.011)   0.965 (0.017)   0.966 (0.014)   0.965 (0.016)   0.965 (0.015)
0.98     3                         0.981 (0.008)   0.981 (0.012)   0.981 (0.009)   0.981 (0.009)   0.980 (0.009)
0.99     0                         0.990 (0.007)   0.991 (0.009)   0.990 (0.006)   0.990 (0.007)   0.990 (0.007)

p0 = 0.60, p1 = 0.70, n = 220, m = 110
True P   Per cent stopping early   P(all)          P(stage2)       Wmed            Wmean           U
0.55     91                        0.595 (0.021)   0.553 (0.036)   0.560 (0.033)   0.553 (0.037)   0.555 (0.037)
0.60     61                        0.621 (0.028)   0.597 (0.050)   0.600 (0.042)   0.594 (0.043)   0.597 (0.044)
0.65     21                        0.659 (0.028)   0.652 (0.045)   0.652 (0.036)   0.649 (0.038)   0.651 (0.036)
0.70     2                         0.700 (0.030)   0.699 (0.044)   0.700 (0.032)   0.698 (0.033)   0.698 (0.032)
0.75     0                         0.750 (0.030)   0.751 (0.040)   0.751 (0.030)   0.750 (0.030)   0.750 (0.030)

panel shows a setting similar to Table I, but with p1 = 0.70 rather than p1 = 0.80. The results corroborate those in Table I.

We also investigated choices of m other than n/2 (Table III). Since the criterion for early stopping is based on the upper confidence limit for P not exceeding p1, the probability of early stopping when P < p1 is larger when more data are available at stage 1. On the other hand, the bias in the naïve estimator P(all) for studies that complete is larger with larger values of m. For example, with m/n = 27/40, if P = 0.55, 84 per cent of studies terminate early and the expectation of P(all) is 0.653. In contrast, with m/n = 13/40, 59 per cent of studies terminate early and the expectation of P(all) is 0.592.

The conditional-UMVUE, U, is by definition conditionally unbiased regardless of m, as is borne out again by Table III. Its variance, however, is larger with larger values of m, a point we return to in Section 6.


Estimation following group sequential designs for phase 2 therapeutic trials has been studied at least since 1958 [11]. We refer to Jennison and Turnbull [12] and Emerson and Fleming [13] as key papers. The UMVUE for binary response data was studied recently by Jung and Kim [5], although related results for the mean of a normal distribution have long been available [14, 15].



Table III. Results of additional simulation studies with various choices for m/n, the fraction of the total sample size that enters into the first-stage evaluation.

True P   m    Per cent stopping early   P(all)          P(stage2)       U
0.55     13   59                        0.592 (0.065)   0.549 (0.090)   0.550 (0.085)
0.60     13   41                        0.628 (0.071)   0.597 (0.098)   0.597 (0.090)
0.65     13   26                        0.668 (0.067)   0.646 (0.091)   0.647 (0.082)
0.70     13   19                        0.709 (0.069)   0.693 (0.090)   0.695 (0.081)
0.75     13   7                         0.756 (0.066)   0.747 (0.084)   0.749 (0.073)
0.80     13   2                         0.804 (0.061)   0.800 (0.078)   0.801 (0.065)
0.85     13   0                         0.852 (0.057)   0.852 (0.068)   0.851 (0.059)
0.55     27   84                        0.653 (0.049)   0.563 (0.138)   0.560 (0.112)
0.60     27   68                        0.670 (0.056)   0.588 (0.136)   0.593 (0.119)
0.65     27   50                        0.698 (0.054)   0.641 (0.130)   0.651 (0.101)
0.70     27   28                        0.728 (0.057)   0.696 (0.127)   0.700 (0.091)
0.75     27   14                        0.761 (0.060)   0.748 (0.120)   0.748 (0.081)
0.80     27   3                         0.803 (0.059)   0.793 (0.112)   0.798 (0.068)
0.85     27   1                         0.852 (0.056)   0.852 (0.099)   0.851 (0.058)

Data were simulated using the same context as Table I: p0 = 0.60, p1 = 0.80, n = 40. 1000 simulated studies per scenario.

For a two-stage study with binary response, the UMVUE is easy to calculate and is likely to be the popular choice, so we consider it here.

The literature on group sequential designs considers that estimation occurs at the end of the study, i.e. at stage 1 if the study terminates there or at stage 2 if it continues. The unconditional-UMVUE is defined as

Ũ = E(P(stage1) | P, stage)

where stage denotes the stopping stage and P denotes the response rate calculated with all data collected in the study by the stopping stage. Thus,

Ũ = P(stage1) if C = 0

Ũ = E(P(stage1) | P(all), C = 1) if C = 1
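The C = 1 case above can be computed exactly from the same hypergeometric weights used for the conditional-UMVUE. The sketch below (function name is ours, with b the continuation boundary) can be used to reproduce the conditional bias reported in Table IV.

```python
from math import comb

def uncond_umvue_given_continue(s, n, m, b):
    """The unconditional-UMVUE evaluated in a study that passed the interim look:
    E(P(stage1) | P(all), C=1). s: total positives among all n samples;
    m: first-stage size; b: minimum first-stage count for continuation."""
    num = den = 0.0
    for k1 in range(max(b, s - (n - m)), min(m, s) + 1):
        w = comb(m, k1) * comb(n - m, s - k1)  # hypergeometric weight for stage-1 count k1
        num += w * (k1 / m)
        den += w
    return num / den
```

With b = 0 this is just P(all); with b > 0 the truncation of low stage-1 counts pushes the conditional expectation above P(all), which is the source of the upward conditional bias discussed next.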


Table IV. Performance of the traditional unconditional-UMVUE, Ũ, in studies that complete evaluation of all n subjects. The scenarios and simulations are the same as in Tables I and II.

p0 = 0.60, p1 = 0.80, n = 40, m = 20
True P     0.550   0.600   0.650   0.700   0.750   0.800   0.850
Mean (Ũ)   0.692   0.705   0.720   0.741   0.771   0.808   0.851
sd (Ũ)     0.023   0.029   0.035   0.043   0.050   0.054   0.055

p0 = 0.95, p1 = 0.98, n = 230, m = 115
True P     0.900   0.950   0.965   0.980   0.990
Mean (Ũ)   0.960   0.966   0.972   0.981   0.990
sd (Ũ)     0.002   0.005   0.007   0.008   0.007

p0 = 0.60, p1 = 0.70, n = 220, m = 110
True P     0.550   0.600   0.650   0.700   0.750
Mean (Ũ)   0.635   0.646   0.667   0.701   0.750
sd (Ũ)     0.005   0.013   0.021   0.028   0.029

Averaging over all studies, including those that terminate at stage 1, Ũ is unbiased because P(stage1) is unbiased. However, if interest is only in estimation for studies that complete both stages, then Ũ is biased upward, i.e. E(Ũ | C=1) > P. Intuitively, this follows from the fact that, since Ũ is marginally unbiased,

P = E(Ũ) = E(Ũ | C=0) P(C=0) + E(Ũ | C=1) P(C=1)

and E(Ũ | C=0) = E(P(stage1) | C=0), the mean response in stage 1 restricted to studies that terminate early for lack of response, which, by definition, is biased low. Therefore E(Ũ | C=1) is biased high. For the scenarios considered in Tables I and II, we calculated the conditional mean and sd of Ũ, shown in Table IV. The estimates calculated from studies that complete stage 2 have substantial bias. Interestingly, the bias is at least as large as that of the naïve uncorrected estimator P(all).

In conclusion, if one is primarily interested in estimates of the response rate for studies that complete evaluation of all n samples, a marginally unbiased estimator may be conditionally biased in the sense that the estimates are too large on average in the subset of studies that reach full enrollment. We have argued that estimates are only used to plan phase 3 when the phase 2 study does not terminate early. This implies that estimates used to plan phase 3 will tend to be too large. We therefore suggest usage of the conditional-UMVUE, U, over the traditional unconditional-UMVUE, Ũ.

We focus on estimation in studies that do not terminate early because our purpose is to determine if and how to design the next study. In particular, the estimates will be used in sample size calculations. If a study terminates early due to lack of response, we conclude that P < p1 and the biomarker is considered inadequate for further development.

Nevertheless, we believe that there may also be a role for the conditional-UMVUE in the traditional group sequential design settings where estimation at the terminating stage is required, be it early or not. One can use the conditional-UMVUE for studies that terminate at stage 2 and another estimator, such as a Whitehead estimator or the naïve estimator, for studies that terminate



at stage 1. For example, define

U∗ = P(stage1) if C = 0

U∗ = U if C = 1

The estimator U∗ is equal to the traditional UMVUE if the study stops early and equal to the conditional-UMVUE if the study completes. It is unbiased conditional on completing both stages, but is not marginally unbiased. Observe that

E(Ũ−P)² − E(U∗−P)² = P(C=1){E((Ũ−P)² | C=1) − E((U−P)² | C=1)}

= P(C=1){var(Ũ | C=1) + bias²(Ũ | C=1) − var(U | C=1)}

From Tables I and IV we see that when P is low, the bias in Ũ dominates and U∗ has smaller (unconditional) mean squared error than Ũ. However, when the response rate is high there is little bias in any of the estimates, including Ũ. In these cases, the small conditional variance of Ũ is attractive. In summary, in terms of mean squared error, Ũ performs better than U∗ when the response rate is high, but worse than U∗ when the response rate is low. In phase 2 biomarker development studies, we anticipate that low response rates will be more common. Hence, we recommend U and U∗ for conditional and unconditional estimation, respectively.


5.1. Confidence intervals

We seek not only an estimate of P at the end of a completed study, but a confidence interval as well. For this we propose two resampling methods. Note that simple bootstrapping, resampling at random from {Yi, i = 1, . . . , n}, is not valid under a group sequential design. The responses in the observed data are biased due to having passed the early stopping criterion.

In the first resampling approach we use the estimated population response rate, U, to simulate b = 1, . . . , B group sequential studies with our design. Selecting those for which the continuation criterion is satisfied, C^b = 1, and calculating the corresponding statistics, U^b, we use their empirical distribution as an estimate of the sampling distribution of U, conditional on C = 1. The α/2 and (1−α/2) empirical quantiles are used as confidence limits. We call this approach the parametric bootstrap because data are simulated with response probability U, though we note that no parametric assumptions are made.
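A minimal sketch of the parametric bootstrap, assuming the exact hypergeometric form of U and a continuation boundary b (function names are ours):

```python
import random
from math import comb

def cond_umvue(s, n, m, b):
    """Exact conditional UMVUE U from the total positive count s among n."""
    n2 = n - m
    num = den = 0.0
    for k2 in range(max(0, s - m), min(n2, s) + 1):
        w = comb(m, s - k2) * comb(n2, k2)
        if s - k2 >= b:
            num += w * (k2 / n2)
            den += w
    return num / den

def parametric_bootstrap_ci(u_hat, n, m, b, B=2000, alpha=0.05, seed=3):
    """Percentile confidence interval: simulate studies at response rate u_hat,
    keep those passing the interim criterion (C = 1), and take empirical
    quantiles of the recomputed conditional UMVUEs."""
    rng = random.Random(seed)
    boots = []
    while len(boots) < B:
        ys = [rng.random() < u_hat for _ in range(n)]
        if sum(ys[:m]) >= b:  # retain only studies with C = 1
            boots.append(cond_umvue(sum(ys), n, m, b))
    boots.sort()
    return boots[int(B * alpha / 2)], boots[int(B * (1 - alpha / 2)) - 1]
```

For example, parametric_bootstrap_ci(0.70, 40, 20, 13) gives a two-sided 95 per cent interval for the design of Table I when the point estimate is U = 0.70.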

We call the second approach the nonparametric bootstrap. Here, in the bth resampling, we resample n responses with replacement from the n observed. We then repeat the numerical calculation described in Section 2.3 for each bootstrap sample, i.e. we calculate U^b based on those among K simulations where the m cases sampled from the n bootstrap observations satisfy C_k^b = 1. Specifically, U^b = E(P^b(stage2) | P^b(all), C = 1). Again, quantiles of the distribution of U^b are used as confidence limits.

Note that the parametric bootstrap, using U, generates data from the binomial distribution, while the nonparametric bootstrap generates data from the empirical hypergeometric distribution.


The parametric bootstrap allows unconditional estimation and confidence interval calculations, if desired. With the nonparametric bootstrap, only conditional estimation and confidence interval calculation are possible, but it can be applied more generally. For example, if the marker is continuous, summary indices pertaining to the receiver operating characteristic (ROC) curve would be of interest and the nonparametric bootstrap could be applied without any distributional assumptions.

Table V shows coverage of confidence intervals under the scenarios and designs of Table I (n = 40, m = 20) and Table II (n = 220, m = 110). Owing to the extensive computation involved, we used K = 500 (rather than K = 5000) in calculating U. We see that coverage is reasonably close to the nominal 95 per cent level for both bootstrap methods, but somewhat lower for the parametric bootstrap than for the nonparametric bootstrap. Correspondingly, the standard deviation tends to be slightly underestimated with the parametric method, but overestimated with the nonparametric bootstrap.

5.2. Power

It is natural to use the confidence interval to formally test H0: P ≤ p0 at the end of the study. Observe that the overall type 1 error rate is less than α/2, since we allow early termination due to futility and in addition we control the conditional type 1 error rate at α/2. One could adjust the confidence level so that the overall type 1 error rate is α/2, but we do not pursue that here, preferring instead to control the conditional error rate, which yields a more intuitively appealing test procedure that is directly related to the conditional confidence interval. We note that the standard hypothesis test that ignores the early stopping option is also marginally conservative, yet we do not recommend it because the estimator upon which it is based is conditionally biased and the conditional type 1 error exceeds the nominal level. Recall that only values of P at or above p1 are considered desirable, so we study power for P ≥ p1. Compared with a fixed sample size study of n samples, power is reduced by the group sequential design for two reasons. First, by allowing studies to stop at stage 1, power is lost if some fraction of those would have proceeded to yield a positive conclusion had they not been terminated. Second, power is lost if the confidence interval for P is located lower, or is wider, when based on an adjusted estimator than when based on the naïve estimator.

The stopping criterion used plays a large role in the first power loss mechanism (although the discussion so far in this paper does not rely on it). Our proposed criterion is to stop after evaluating m subjects if the upper two-sided (1−α) confidence limit lies below p1. Therefore the associated power loss at P ≥ p1 is no more than α/2. It is likely to be less than α/2 even when P = p1, because some of those terminated studies would presumably be in the fraction of studies deemed to be negative even if enrollment continued to n samples.

Table VI displays the power of the standard analysis based on P(all) in a fixed sample size design, that is, the power if all studies continued to full enrollment of n subjects regardless of interim results. Also shown are the powers associated with designs that allow early stopping and use conditional confidence intervals based on U at the end of stage 2 for testing H0: P ≤ θ0. We see that the two-stage studies using the parametric bootstrap confidence interval have power comparable with the fixed sample size studies. That is, their benefit, which is to terminate early those studies in which markers have poor performance, is gained without substantial loss in their capacity to identify good markers as such. The nonparametric bootstrap confidence interval does not achieve the same power, presumably because of its overconservative nature.

Copyright © 2008 John Wiley & Sons, Ltd. Statist. Med. 2009; 28:762–779. DOI: 10.1002/sim



































































































































































































774 M. S. PEPE ET AL.

Table VI. Power based on P(all) in a fixed sample size study of n subjects and power based on U in studies that allow early termination. Early stopping uses a (1−α) confidence interval at the interim analysis.

P      n    Early stopping (per cent)   P-BS(U)   NP-BS(U)   Logit(P(all))
0.80    40   4.2                         0.722     0.638      0.710
0.82    40   1.8                         0.804     0.724      0.808
0.85    40   0.6                         0.918     0.870      0.920
0.70   220   2.8                         0.802     0.708      0.872
0.72   220   0.2                         0.942     0.904      0.974
0.75   220   0.2                         0.992     0.982      1.000

NP-BS(U): nonparametric bootstrap; P-BS(U): parametric bootstrap; Logit(P(all)): normal approximation to the distribution of logit(P(all)). Power for U is the proportion of studies that reach complete enrollment and whose 95 per cent confidence interval does not include θ0. Scenarios of Tables I and II (lower panel) are employed, with m=n/2, θ0=0.60 and α=0.05. 500 simulated studies. Values of U are calculated with K=500.

Our focus here is on power achieved when P ≥ θ1. We defined θ1 as the minimum desirable value of P, meaning that values of P less than θ1 are not desirable. We therefore do not seek high power for P in the range (θ0, θ1). The two-stage design in fact ensures that power in this range is reduced relative to a single-stage study, and we view this as a good attribute. Nevertheless, it underscores that the choice of θ1 should be made judiciously and must be the minimum desirable value. Similarly the choice of θ0 is crucial; θ0 is the maximal unacceptable value. Values in the equivocal range (θ0, θ1) may be reluctantly acceptable, but are not desirable. Specifying (θ0, θ1) is often a difficult challenge in practice.


To fix ideas we now provide in Table VII a few simple illustrations using simulated data. For each we use the design of Table I, i.e. n=40, m=20, θ0=0.6 and θ1=0.8. In the first illustration, at the interim analysis only 5 of 20 samples have a positive response. The 95 per cent confidence interval for P is (0.11, 0.47). Since the upper limit is below θ1=0.80, the study terminates early.

In the second illustration, the response rate at the interim analysis is much higher, with 18 of 20 responses positive and 95 per cent confidence interval for P of (0.70, 0.97). The study continues to accrue responses from 20 more subjects, of which 17 are positive, yielding P(all)=35/40=0.88. The estimates that adjust for the early stopping option, U, Wmed and Wmean, are all equal to 0.88. We calculate 95 per cent confidence intervals for P based on U as (0.77, 0.98) with the nonparametric bootstrap and (0.78, 0.98) with the parametric bootstrap. In either case, we conclude that the response rate exceeds the unacceptable level of 0.60. In fact it appears to be within the desirable range, and deliberations about the next phase of biomarker development ensue.

Six further illustrations are shown in Table VII. Two, studies 3 and 8, terminate early. Two, studies 4 and 5, continue to completion but do not yield positive conclusions about marker performance. Study 6 is inconclusive. Unfortunately, when the design stipulates only 90 per cent power, even with a fixed sample size design, inconclusive studies can occur. Study 7 indicates a 100 per cent response rate (CI = (0.84, 1.00)) in the initial stage. One might be tempted to terminate at that point. However, a more prudent approach is to collect additional data, and indeed
































































































































the second-stage data tempers enthusiasm somewhat, providing adjusted estimates of 0.85 for theresponse rate.

The results in Table VII suggest relationships between U, P(all) and P(stage2). In particular, when P(all) is large, we find that U ≈ P(all). This is reasonable since U = E(P(stage2) | P(all), C=1), and when P(all) is large it follows that C=1 with high probability, so that U ≈ E(P(stage2) | P(all)) = P(all). On the other hand, when P(all) is small, U ≈ P(stage2). This makes sense because a small value of P(all), together with the knowledge that the continuation criterion was passed, indicates that P(stage1) was close to the critical value for continuation. This in turn informs about P(stage2), which is equal to (n−m)^{−1}{n P(all) − m P(stage1)}.
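For this two-stage binomial design the conditional-UMVUE can in fact be computed by direct summation: given the total success count T and the event C=1, the stage-1 count is hypergeometric restricted to the continuation region, and U is the corresponding weighted average of P(stage2). A sketch (our notation; the score-interval futility rule of the illustrations is assumed):

```python
from math import comb, sqrt

def wilson_upper(x, n, z=1.959963984540054):
    # upper limit of the two-sided 95 per cent score interval
    p = x / n
    d = 1 + z * z / n
    return (p + z * z / (2 * n)) / d + z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / d

def conditional_umvue(T, n, m, theta1=0.80):
    """U = E{P(stage2) | T successes in n, continuation}: average (T - x1)/(n - m)
    over the hypergeometric distribution of stage-1 successes x1, restricted to
    outcomes passing the interim look (upper confidence limit >= theta1)."""
    num = den = 0.0
    for x1 in range(max(0, T - (n - m)), min(m, T) + 1):
        if wilson_upper(x1, m) < theta1:
            continue  # this stage-1 outcome would have stopped the study
        w = comb(m, x1) * comb(n - m, T - x1)
        num += w * (T - x1) / (n - m)
        den += w
    return num / den

# Second illustration of Table VII: 35 of 40 positive, m = 20.
print(round(conditional_umvue(35, 40, 20), 3))  # 0.875: equals P(all) here
```

When T is large the continuation restriction is vacuous and U reduces to P(all); when T is small the restriction binds and U is pulled toward P(stage2), matching the behaviour described above.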

These observations also have implications for the performance of U relative to P(all) and P(stage2) in general. When the true response rate is small, U behaves similarly to P(stage2), while U behaves more like P(all) when the response rate is high. The conditional standard deviations reported in Tables I and II bear this out. In addition, we see in Table III that differing values of m have little impact on the conditional performance of U when P is large, but greater impact when P is small. In the former case, U is similar to P(all), which is unaffected by m. In the latter case, U is similar to P(stage2), which is more variable when the second-stage sample size n−m is small.


We now return to the context of evaluating a diagnostic or screening marker, where considerations of both sensitivity (S) and specificity (1−F) must be made simultaneously. Let θ0 and φ0 denote maximal unacceptable values of sensitivity and specificity, respectively, while θ1 and φ1 denote minimum desirable values. The design and analysis of a fixed sample size study are described in detail in Pepe [9, pp. 218–220].

Briefly, using subscripts D and D̄ to denote cases and controls, a fixed sample size study enrolls nD cases and nD̄ controls. A joint (1−α) confidence rectangle for (S, 1−F) is calculated as the Cartesian product of (1−α*) confidence intervals for S and 1−F, where (1−α*) = (1−α)^{1/2}. A positive conclusion is drawn about marker performance if the lower limit for S exceeds θ0 and the lower limit for 1−F exceeds φ0. The sample sizes are chosen so that when S=θ1 and 1−F=φ1, the probability is high, 1−β, that both lower confidence limits exceed the thresholds θ0 and φ0. To illustrate, with (θ0, θ1)=(0.6, 0.8) and (φ0, φ1)=(0.95, 0.98), values appropriate for an ovarian cancer screening marker, the sample size formulae [9, equations (8.2) and (8.3)] yield nD=78 and nD̄=572 to achieve size α=0.05 and power 1−β=0.90.
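As a small sketch, the per-component level α* and the positive-conclusion rule can be coded as follows; the function names and the example interval endpoints are ours, purely illustrative.

```python
import math

def adjusted_level(alpha=0.05):
    """Per-component level alpha* with (1 - alpha*)^2 = 1 - alpha, so that two
    independent (1 - alpha*) intervals form a joint (1 - alpha) rectangle."""
    return 1 - math.sqrt(1 - alpha)

def positive_conclusion(sens_ci, spec_ci, theta0=0.6, phi0=0.95):
    """Positive conclusion if both lower (1 - alpha*) confidence limits clear
    the maximal unacceptable values theta0 (sensitivity) and phi0 (specificity)."""
    return sens_ci[0] > theta0 and spec_ci[0] > phi0

print(round(adjusted_level(0.05), 4))  # 0.0253
# Hypothetical (1 - alpha*) intervals for S and 1 - F:
print(positive_conclusion((0.65, 0.90), (0.955, 0.99)))  # True
```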

The study could be designed to terminate after half the cases and half the controls are evaluated if the joint confidence rectangle does not contain both minimally desirable values for sensitivity and specificity, (θ1, φ1). Otherwise, the study continues to complete enrollment, at which time the conditional-UMVUE estimates of S and 1−F are calculated. Corresponding (1−α*) level confidence intervals yield a joint (1−α) confidence rectangle. A positive conclusion about marker performance ensues if the (1−α*) confidence intervals for S and 1−F exclude θ0 and φ0, respectively. Table VIII shows the results of some simulation studies.

We see that the study is likely to stop early if the true sensitivity or the true specificity is low, but likely to continue if both are at the minimally desirable values. Coverage for the 95 per cent parametric bootstrap confidence rectangle was slightly lower than the nominal rate, although four of the five scenarios achieved at least 93 per cent coverage. We observe that the study has very



Table VIII. Simulated studies using a two-stage design with (nD=78, mD=39, θ0=0.6, θ1=0.8) and (nD̄=572, mD̄=286, φ0=0.95, φ1=0.98).

S     1−F    Per cent terminating early   Conditional† joint coverage   Conditional† power   Unconditional power   Fixed sample∗ power
0.6   0.95   95                           92                            0.00                 0.00                  0.00
0.6   0.98   75                           90                            0.02                 0.01                  0.02
0.8   0.95   77                           96                            0.01                 0.00                  0.00
0.8   0.98    2                           94                            0.83                 0.81                  0.90
0.7   0.97   38                           94                            0.09                 0.06                  0.13

Coverage and power are shown for the conditional-UMVUE estimators of S and F with parametric bootstrapped confidence intervals. Nominal coverage probability = 95 per cent. ∗No option for early termination. †Restricted to studies that complete both stages.

low rejection rate when S < θ0 or 1−F < φ0, as desired. When S ≥ θ1 and 1−F ≥ φ1, we desire high power. We observe that the 81 per cent unconditional power when S=θ1 and 1−F=φ1 represents a 9 per cent decrease from the fixed sample size power.

There are many variations on study design that could be explored. Our choice of interim analysis when both mD = nD/2 cases and mD̄ = nD̄/2 controls are evaluated is arbitrary. One need not enroll cases and controls at the same relative rates. In fact, one option would be to enroll all mD̄ controls first, before using samples from cases. If the study terminates early because of poor specificity, precious samples from cases are saved, yet inference is the same. In practice, however, one may want to mix up the order of cases and controls somewhat in order to expose testers to heterogeneous samples and to aid with blinding. In a similar vein, for S and F we have chosen equal adjusted significance levels α* for construction of their joint confidence rectangle. Unequal values can be employed. Letting α*D and α*D̄ denote adjusted values for S and F, respectively, the requirement for joint (1−α) coverage is

(1−α*D)(1−α*D̄) = 1−α

However, arguments leading to particular choices of (α*D, α*D̄) that are unequal have not been developed yet.


We have proposed the conditional-UMVUE, U, for estimation at the end of a phase 2 group sequential study that does not terminate early. It is appropriate when unbiased estimation is required from studies that reach full enrollment. In our experience with phase 2 biomarker studies, calculation of estimates is of less concern in studies that terminate early, where the conclusion is simply that the biomarker is inadequate for further development and sufficient data for precise estimation are not available in any case. Hence, we focused on estimators with good properties conditional on full enrollment, because estimates from such studies will be used in planning subsequent phase 3 studies. These considerations seem equally relevant for phase 2 group sequential therapeutic studies, and we suggest U for application in that context too. We noted that the standard unconditional-UMVUE can show considerable conditional bias. The naïve unadjusted estimator is also conditionally biased.



Classically, methods for group sequential studies focus on hypothesis testing, where it is natural to calculate type 1 and 2 error rates marginally over all studies, those that stop early and those that do not. In addition, design properties such as expected sample size are calculated marginally. However, good marginal performance does not imply good conditional performance, as we have demonstrated for the unconditional-UMVUE. The conditional performance of other unconditional estimators should also be studied. We have focused here on properties that are conditional on reaching full enrollment. Although outside the scope of this paper, it would be interesting to determine if estimators and tests can be found that have both good conditional performance and good marginal performance.

Conditional inference has been discussed from a decision theoretic point of view [16] and was recently applied to group sequential designs [17]. In particular, Strickland and Casella considered the conditional confidence interval (θL, θU), where the limits are defined in a similar vein to Whitehead's median-adjusted estimator but using target probabilities of α/2 and 1−α/2 instead of 0.5 in equation (1). For normally distributed data, they proved an optimality result for these intervals. This suggests that they should be examined for binary data and compared with the confidence intervals based on U that are proposed here. They also noted that, for normally distributed data, the conditional performance of unconditional confidence intervals can be very poor. Intervals derived from hypothesis testing procedures that control marginal error rates [13] are amongst those with correct marginal coverage, but potentially poor conditional performance.

We studied a very simple two-stage design. Other options that might be considered in future work include designs with more than two stages, different rules for early termination and allowance for early termination if accumulating results are exceptionally good. Complications that arise when sample assays are batched or when multiple markers are being tested also need to be addressed. The simple design we propose for a group sequential study requires choosing values for the confidence level at the interim analysis, 1−α, and for the stage 1 sampling fraction, m/n. The probability of early stopping when P=θ1 is α/2. Since this should be small, we chose α=0.05 in our illustrations. Another attractive feature of the choice α=0.05 is that the practice of calculating 95 per cent confidence intervals is familiar to our collaborators, and they can easily accept abandoning a biomarker study if the 95 per cent confidence interval does not contain θ1. That is, the early stopping criterion makes sense to collaborators. Observe that one can also consider α as the type 1 error for testing H: P=θ1 based on m observations. The corresponding power is the probability of early stopping under H: P=θ0. Larger values of m give rise to higher power. The choice of m might be based on minimizing the expected sample size, which requires postulating a prior probability distribution for P.
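The last point can be sketched numerically: the expected sample size is m + (n−m)·Pr(continue | P), averaged over a postulated prior for P. The futility rule below assumes the score-interval criterion, and the discrete prior and grid of m values are purely illustrative assumptions, not from the paper.

```python
from math import comb, sqrt

def wilson_upper(x, n, z=1.959963984540054):
    # upper limit of the two-sided 95 per cent score interval
    p = x / n
    d = 1 + z * z / n
    return (p + z * z / (2 * n)) / d + z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / d

def expected_sample_size(P, n, m, theta1=0.80):
    """m + (n - m) * Pr(continue | P) under the interim futility rule."""
    p_cont = sum(comb(m, x) * P**x * (1 - P)**(m - x)
                 for x in range(m + 1)
                 if wilson_upper(x, m) >= theta1)
    return m + (n - m) * p_cont

# Average over a postulated discrete prior for P (illustrative values only).
prior = {0.5: 0.5, 0.8: 0.5}
for m in (10, 20, 30):
    EN = sum(w * expected_sample_size(P, 40, m) for P, w in prior.items())
    print(m, round(EN, 1))
```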

This paper considered biomarkers with dichotomous values. However, most biomarkers are measured on a continuous scale, and performance is then evaluated with the ROC curve. Methods for estimating the ROC curve following a group sequential phase 2 study would be worthy of research.


REFERENCES

1. Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson M, Thornquist M, Winget M, Yasui Y. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute 2001; 93:1054–1061. DOI: 10.1093/jnci/93.14.1054.
2. Institute of Medicine. Workshop Summary: Developing Biomarker-based Tools for Cancer Screening, Diagnosis, and Therapy—The State of the Science, Evaluation, Implementation, and Economics. National Academies Press: Washington, DC, 2006.
3. Mazumdar M. Group sequential design for comparative diagnostic accuracy studies: implications and guidelines for practitioners. Medical Decision Making 2004; 24:525–533. DOI: 10.1177/0272989X04269240.
4. Mazumdar M, Liu A. Group sequential design for comparative diagnostic accuracy studies. Statistics in Medicine 2003; 22:727–739. DOI: 10.1002/sim.1386.
5. Jung S-H, Kim K-M. On the estimation of the binomial probability in multistage clinical trials. Statistics in Medicine 2004; 23:881–896. DOI: 10.1002/sim.1653.
6. Bickel PJ, Doksum KA. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day: San Francisco, 1977; 121.
7. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Ellis Horwood: Chichester, 1983.
8. Whitehead J. On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986; 73:573–581. DOI: 10.1093/biomet/73.3.573.
9. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press: Oxford, 2003.
10. Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Statistical Science 2001; 16:101–133. DOI: 10.1214/ss/1009213286.
11. Armitage P. Sequential methods in clinical trials. American Journal of Public Health 1958; 48:1395–1402.
12. Jennison C, Turnbull BW. Confidence intervals for a binomial parameter following a multi-stage test with application to MIL-STD 105D and medical trials. Technometrics 1983; 25:49–58.
13. Emerson SS, Fleming TR. Parameter estimation following group sequential hypothesis testing. Biometrika 1990; 77:875–892. DOI: 10.1093/biomet/77.4.875.
14. Emerson SS. Computation of the uniform minimum variance unbiased estimator of a normal mean following a group sequential trial. Computers and Biomedical Research 1993; 26:68–73. DOI: 10.1006/cbmr.1993.1004.
15. Emerson SS, Kittelson JM. A computationally simpler algorithm for the UMVUE of a normal mean following a group sequential trial. Biometrics 1997; 53:359–365.
16. Kiefer J. Conditional confidence statements and confidence estimators (with Discussion). Journal of the American Statistical Association 1977; 72:789–827.
17. Strickland PA, Casella G. Conditional inference following group sequential testing. Biometrical Journal 2003;
