Introduction, Definitions, and General Considerations

What Is an Outcome?

All studies consist of taking measurements of varying levels of complexity. While finished studies make the included measurements appear obvious and necessary, careful planning is needed to ensure achievement of study goals. There are two key classes of measurements in studies: outcomes (dependent variables) and predictors (independent variables). In the most basic sense, studies ask whether measured predictors can account for variation in outcomes. In clinical trials, the most important predictors are typically therapies or drugs, and the outcomes are measures with clinical significance. Suppose, for example, we study the effectiveness of a cigarette smoking cessation method. The (primary) outcome would likely be cigarette cessation (or not); the study then seeks to answer whether the cessation method was associated with higher rates of smoking cessation. However, decisions regarding outcomes may be more complex than one would naively expect. For smoking cessation, the broad outcome of interest is whether a subject quit smoking, but researchers and regulators may care about how one defines quitting and/or how long the cessation lasts. Which measures most adequately capture relevant clinical concerns and are within the practical limits of conducting a trial?

Outcomes in Clinical Trials

The FDA defines a clinical outcome as “an outcome that describes or reflects how an individual feels, functions, or survives” (FDA-NIH Biomarker Working Group). Choices regarding outcomes in clinical trials are often further constrained compared to studies in general. Ethical considerations guide the choice of outcomes, and the regulators, i.e., the Food and Drug Administration (FDA), the European Medicines Agency (EMA), institutional review boards, and/or other relevant funding agencies, ensure that these ethical considerations are addressed. Many clinical trials involve therapies that carry significant risks, especially in the case of surgical interventions or drugs. In phase III trials, the FDA requires that the effects of therapies under consideration be clinically meaningful for them to come to market (Sullivan n.d.). The expectation is that a therapy’s benefits will sufficiently outweigh its risks. The FDA gives three reasons for which patients reasonably undertake treatment risks:

  1. Increased survival rates

  2. Detectable patient benefits

  3. Decreased risk of disease development and/or complications

Outcomes inherently vary in importance. A primary outcome should be a measure capable of answering the main research question and is expected to directly measure at least one of the above reasons for taking risks. Treatment differences in primary outcomes generally determine whether a therapy is believed to be effective. Researchers often measure a single primary outcome and several secondary outcomes. Secondary outcomes may be related to the primary outcomes but of lower importance, or they may not be feasible to use as primary outcomes because of the study duration needed to assess them or the sample size required to defend the study as adequately powered. Outcomes measuring participant safety help ensure that the benefit-risk ratio is sufficiently high. In a cigarette cessation trial, smoking cessation maintained for 6 weeks might serve as the primary outcome, with cessation at 6 months and 1 year as secondary outcomes. The longer-duration cessation is actually more important but may make the size and/or duration of the trial infeasible due to expected recidivism or losses to follow-up. Note that therapies not involving drugs typically still require recording adverse events. For example, smoking cessation therapy studies may be concerned about depression or withdrawal symptoms (Motooka et al. 2018). When conducting an exercise study to improve fitness in disabled multiple sclerosis patients, we need to be cognizant of falls and thus measure and record their occurrences.

Where Are We Going?

The focus of this chapter is the mathematical, clinical, and practical considerations necessary to determine appropriate outcomes. Section “Types of Outcome Measures” introduces and discusses biomarkers and direct and surrogate outcomes and defines mathematical descriptions of variables. Section “Choosing Outcome Measures” considers the clinical, practical, and statistical considerations in outcome choice. Finally, Section “Multiple Outcomes” examines the benefits and complications of using multiple outcomes.

Types of Outcome Measures

Clinical Distinctions Between Outcomes

Direct Endpoints

The FDA defines direct endpoints as outcomes that directly describe patient well-being; these are categorized as objective or subjective measures. Objective measures explicitly describe and/or measure clinical outcomes and leave little room for individual interpretation. Some common objective measures are as follows:

  1. Patient survival/death.

  2. Disease incidence; e.g., did the subject develop hypertension during the study period given they were free of hypertension at the start of the study?

  3. Disease progression; e.g., did the subject’s neurological function worsen during the study period?

  4. Clinical events; e.g., myocardial infarction, stroke, multiple sclerosis relapse.

Subjective measures often depend upon a subject’s perception. For health outcomes, this is often in terms of disease symptoms or quality of life (QoL) scores. Subjective endpoints are complicated by their openness to interpretation, either between or within subjects’ responses or raters’ assessments, and whether or which measures adequately capture the quality of interest is often debatable. Ensuring unbiased ascertainment and uniform interpretation of the measurement is difficult when the outcome is, say, QoL or a global impression of improvement, compared to objective endpoints such as death or incident stroke. Measure assessment is covered in detail in section “Choosing Outcome Measures.”

Note that regulatory agencies prefer direct endpoints as primary outcomes, particularly for new drug approval. Several issues arise from using what we will denote as the elusive surrogate measures or biomarkers, and these issues can make their use less than optimal.

Surrogate Endpoints

Surrogate endpoints are substitutes for direct or clinically meaningful endpoints and are typically employed in circumstances where direct endpoints are too costly, are too downstream in time or complexity, or are unethical to obtain. Few true surrogates exist if one uses the definition provided by Prentice (1989). In the Prentice definition, the surrogate is tantamount to the actual outcome of interest, but this standard is often unachievable. Where there is some concurrence on the existence of a so-called surrogate, it is often a laboratory measure or measurable physical attribute of subjects, such as CD4 counts in HIV trials, although even these fall short of the Prentice definition.

Surrogate endpoints may avoid costly or unethical situations, but the researcher must provide strong evidence that the surrogate outcome is predictive of, correlated with, and/or preferably in the therapeutic pathway between the drug or treatment and expected clinically significant benefit. Importantly, while the Prentice criteria argue for complete replacement of the endpoint by the surrogate, the generally accepted goal of a surrogate endpoint is to be sufficiently predictive of the direct endpoint.

In the case of sufficiently severe illness, researchers may obtain “accelerated approval” for surrogate endpoints, but further trials demonstrating the relation between surrogate and direct endpoints are typically required despite initial approval. Surrogate endpoints can be classified into the following stages of validation (Surrogate Endpoint Resources for Drug and Biologic Development n.d.):

  1. Candidate surrogate endpoints are in the process of proving their worth as predictors of clinical benefits to subjects.

  2. Reasonably likely surrogate endpoints are “endpoints supported by strong mechanistic and/or epidemiologic rationale such that an effect on the surrogate endpoint is expected to be correlated with an endpoint intended to assess clinical benefit in clinical trials, but without sufficient clinical data to show that it is a validated surrogate endpoint” (FDA-NIH Biomarker Working Group). These are more likely to receive accelerated approval than candidate surrogate endpoints.

  3. Validated surrogate endpoints “are supported by a clear mechanistic rationale and clinical data providing strong evidence that an effect on the surrogate endpoint predicts a specific clinical benefit” (FDA-NIH Biomarker Working Group). Validated surrogate endpoints are generally accepted by funding agencies as primary outcomes in clinical trials, and further studies supporting the relationship between the surrogate and direct endpoint are generally not required.

For validation, regulatory agencies prefer more than one study establishing the relationship between direct and surrogate endpoints. A major drawback to surrogate outcomes is that relationships between surrogate and direct endpoints may not be causal even when the correlation is strong; even if the relationship is (partially) causal, surrogate outcomes may not fully predict the clinically relevant outcome, especially for complicated medical conditions. Two problems thus arise:

  1. A drug could have the desired beneficial effect on the surrogate outcome but also have a negative effect on a (possibly unmeasured) aspect of the disease, rendering the drug less effective than anticipated/believed.

  2. Drugs designed to treat a medical condition may have varying mechanisms of action, and it does not follow that validated surrogate endpoints are equally valid for drugs with differing mechanisms of action.

These drawbacks can bias the estimate of a benefit-risk ratio, especially in smaller or shorter studies, where there may be insufficient sample size or follow-up time to capture a representative number of adverse events. Pairing underestimation of adverse events with too-optimistic beliefs regarding the therapeutic benefits can result in overselling a mediocre or relatively ineffective therapy.

Surrogates are often used initially in phase II studies before they can be accepted as legitimate phase III clinical outcomes, since surrogate endpoints are often not clinically meaningful in their own right. Phase II trials can also use biomarkers, indicators of biological processes that are not necessarily surrogates.

Biomarkers

A biomarker is “a defined characteristic that is objectively measured as an indicator of normal biological processes, pathologic processes, or responses to an exposure or intervention, including therapeutic interventions” (FDA-NIH Biomarker Working Group). Biomarkers are often useful as secondary outcomes regarding subject safety, as validation that a therapy induces the expected biological response, or as primary outcomes in phase II proof-of-concept trials. Most validated surrogate endpoints are in fact biomarkers.

Biomarkers are often chosen as outcomes for the same reasons that surrogates are used: shortened trials, smaller sample sizes, etc. However, biomarkers are often more specific to certain treatments than surrogates. For example, in multiple sclerosis, MRI scans reveal small areas of inflammation when viewed after injection of a special chemical, gadolinium. Gadolinium-enhanced lesions are used in phase II trials as proof-of-concept primary outcomes, but they do not clearly predict disability outcomes, which are the goal of disease-modifying therapy, and they are clinically meaningful only through their repeated linkage to successful drug treatment. Sormani et al. have shown that they act as surrogates at the study level (Sormani et al. 2009, 2010). These counts of enhancing lesions seem to be biomarkers for inflammation, and their absence following treatment has been taken as a sign of efficacy. However, there are now drugs that virtually eliminate these enhancing lesions, yet progression of disability still occurs, so they are not a good choice of outcome for comparing two effective drugs where both may eliminate enhancing lesions but differ in their effects on disability.

Biomarkers are useful not only as outcome variables but also as predictors of outcomes on which to enrich a trial, making it easier to see changes in the biomarkers or primary outcomes. Biomarker-enrichment trials select individuals who, because they carry certain biomarkers or certain levels of biomarkers, are at increased risk of events or are more responsive to treatment. This seems like a rational approach, but there are several caveats to the uncritical use of this selection. Simon and Maitournam point out that the efficiency of these designs is often not seen unless the proportion of biomarker-positive responders is less than 50% and the response in those who are biomarker negative is negligible (Simon and Maitournam 2004). The reason for this counterintuitive finding is that the cost of screening can overwhelm the dampening of the response by the biomarker-negative individuals, making biomarker selection an added logistic issue while not enhancing the design over simple increases in sample size and stratification. In other situations, the biomarker’s behavior needs to be carefully considered. In Alzheimer’s trials, it has been argued that more efficient trials could be done if patients were selected based on a protein, tau, found in their cerebrospinal fluid. This is because patients with these proteins have more rapid declines in their disease as measured by the usual cognitive test outcomes. However, Kennedy et al. (2015) showed that when designing a study based on selection for tau positivity, the gains in sample size reduction due to the greater cognitive declines, which make percent changes easier to detect, were offset by the increased variation in cognitive decline among the biomarker-positive subset. The offset arises because such designs assume that the variance in the biomarker-positive subset is the same as, or smaller than, that in the larger population, which need not hold.

Quantitative and Qualitative Descriptions of Outcomes

Thus far we have definitions of outcomes that relate to what we want to measure rather than how to quantify the measurement. The biological quantity of interest may be clear, but decisions about how to measure that quantity can affect the viability of a study or the reasonableness of the results. Outcomes are described as either quantitative or qualitative. In the following sections, we distinguish between these, give several examples, and discuss several common subtypes of outcomes.

Quantitative Outcomes

Quantitative outcomes are measurements that correspond to a meaningful numeric scale and can be broken down into continuous and discrete measurements (or variables). In mathematical terms, continuous variables can take any of the infinite number of values between any two numbers in their support; they are uncountable. On the other hand, discrete variables are countable. A few examples should help clarify the difference.

Systolic blood pressure is a continuous outcome. In theory, blood pressure can take any value between zero and infinity. For example, a subject participating in a hypertension study may have a baseline systolic blood pressure measurement of 133.6224. We may round this number for simplicity, but it is readily interpretable as it stands. In most cases, discrete quantitative outcomes consist of nonnegative whole numbers and often represent counts. For example, how many cigarettes did a subject smoke in the last week? The answer is constrained to nonnegative whole numbers: 0, 1, 2, 3, …. While perhaps it is possible to conceive of smoking half a cigarette, the researcher needs to decide a priori whether to record fractions and allow the data collection system to accept them, or to develop clear rules so that discrete values are recorded in the same way for all participants.
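Such a recording rule can be enforced in the data collection system itself. A minimal sketch in Python (the round-down rule and function name are purely illustrative, not a recommendation):

```python
def record_cigarette_count(raw_value):
    """Coerce a reported weekly cigarette count to a nonnegative whole number.

    Illustrative rule: fractions are rounded down, so every site records
    partial cigarettes the same way; negative entries are rejected outright.
    """
    if raw_value < 0:
        raise ValueError("count cannot be negative")
    return int(raw_value)  # round down: 3.5 cigarettes is recorded as 3

print(record_cigarette_count(3.5))  # 3
print(record_cigarette_count(12))   # 12
```

The point is not the specific rule but that it is decided a priori and applied identically to all participants.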

Categorical Outcomes

Categorical outcomes, or qualitative variables, have neither natural order nor interpretation on a numeric scale and result from dividing study participants into categories. Many drugs are aimed at reducing the risk of negative health outcomes, which are often binary in nature; common trial aims are reducing the risk of death, stroke, heart attack, or progression in multiple sclerosis. The use of these binary outcomes is not simply convenience or custom; rather, they are much more easily interpreted as clinically meaningful. To say a trial reduced blood pressure by 4.5 mmHg is by convention a positive result, but it is not in and of itself immediately clinically meaningful, whereas if the group that experienced the 4.5 mmHg greater change also had lower mortality rates, it would be easier to say the result is clinically meaningful.

Categories need not be binary in nature. For example, consider a study where it is likely that, in the absence of treatment, patient health is expected to decline over time. A successful drug might slow the decline, stop the decline but not improve patient health, or improve patient health, and so researchers could categorize the subjects as such.

Nominal Versus Ordinal Outcomes

Ordinal outcomes are perhaps best understood as a hybrid of continuous and categorical outcomes. Ordinal outcomes have a natural order but do not correspond to a consistent, objective numerical scale; moving up/down from one rank to another need not correspond to the same magnitude of change and may vary by individual. For example, categories of worse, the same, or better patient health can be interpreted as an ordinal outcome since there is a natural order. However, it is not immediately clear that the difference between the “same” and “better” health indicates the same benefit for all patients or is necessarily equal in magnitude to the difference between “worse” and the “same” health. In contrast, nominal outcomes have neither natural ordering nor objective interpretation on a numeric scale. In the context of clinical trials, nominal variables are more likely to be predictors than outcomes; for example, we may have three treatment groups with no natural order (placebo, drug A, and drug B), or the groups can have an order (placebo, low dose, and high dose). The former requires certain forms of analyses, while the latter allows us to take advantage of the natural ordering among the dose groups.

Common Measures

Often outcomes are raw patient measures, e.g., patient blood pressure or incident stroke. However, summary measures are often relevant. Incidence counts the number of new cases of a disease per unit of time and can be divided into cumulative incidence, such as that occurring over the course of the entire study or over a fixed window (e.g., 30-day mortality following surgery), and incidence per unit of time. Incidence can pertain to both chronic conditions, e.g., diabetes, and discrete health events, e.g., stroke, myocardial infarction, or adverse reaction to a drug.

Another common summary measure is the proportion: how many patients out of the total sample experienced a medical event or possess some quality of interest? The incidence proportion or cumulative incidence is the proportion of previously healthy patients who developed a health condition or experienced an adverse health event. Incidence is also commonly described with an incidence rate; that is, per some number of patients, called the radix, often 1000, how many will develop the condition, experience the event, etc. For example, supposing 4% of patients developed diabetes during the study, an incidence rate would say that we expect 40 out of every 1000 (nondiabetic) subjects to develop diabetes. Note that incidence differs from prevalence, which is the proportion of all study participants who have the condition. Prevalence can be this simple proportion at some point in time, known as point prevalence, or period prevalence, the proportion over some defined period. The period prevalence is often used in studies from administrative databases and counts the number of cases divided by the average population over the period of interest, whereas the point prevalence is the number of cases divided by the population at one specific point in time.
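These summary measures reduce to simple ratios, which a short sketch can make concrete (the function names are ours; the 4% diabetes example mirrors the text):

```python
def incidence_proportion(new_cases, at_risk):
    """Cumulative incidence: new cases among those initially disease-free."""
    return new_cases / at_risk

def incidence_rate_per_radix(new_cases, at_risk, radix=1000):
    """Express incidence per `radix` at-risk subjects (here per 1000)."""
    return incidence_proportion(new_cases, at_risk) * radix

def point_prevalence(existing_cases, population):
    """Proportion of the population with the condition at one time point."""
    return existing_cases / population

# 4% incidence: 40 new diabetes cases among 1000 initially nondiabetic subjects
print(incidence_rate_per_radix(40, 1000))  # 40.0 per 1000
```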

A measure related to incidence is the time to event; that is, how long into the study did it take for the incident event to occur? This is often useful for assessing a therapy’s effect on survival or health state. For example, a cancer therapy may be considered successful in some cases if patient survival is lengthened; similarly, some therapies may be considered efficacious if they extend the time to a stroke or other adverse health events.
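Time-to-event outcomes are typically summarized with the Kaplan-Meier product-limit estimator, which accounts for subjects censored before the event. A self-contained sketch (a real analysis would use a dedicated survival package):

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function S(t).

    times  : follow-up time for each subject
    events : 1 if the event occurred at that time, 0 if censored
    Returns a list of (event_time, estimated S(t)) pairs.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    at_risk, surv, curve = n, 1.0, []
    i = 0
    while i < n:
        t = times[order[i]]
        d = r = 0
        while i < n and times[order[i]] == t:  # gather all subjects tied at t
            d += events[order[i]]              # events occurring at time t
            r += 1                             # subjects leaving the risk set
            i += 1
        if d > 0:
            surv *= 1 - d / at_risk            # step down at each event time
            curve.append((t, surv))
        at_risk -= r
    return curve

# Five subjects: events at months 2 and 5, censoring at months 3, 6, and 7
print(kaplan_meier([2, 3, 5, 6, 7], [1, 0, 1, 0, 0]))
```

Here the estimated survival drops to 0.8 after the first event and to roughly 0.53 after the second, because the censored subject at month 3 has already left the risk set.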

Safety

Measuring/Summarizing Safety

In addition to efficacy outcomes, safety outcomes are also important to consider. We mentioned above that there is a necessary balance between the risks and potential benefits of a therapy. Thus, we need information regarding potential risks, particularly side effects and adverse events. It is possible that a therapy could be highly effective for treating a disease and yet introduce additional negative health consequences that make it a poor option. Safety endpoints can be direct or surrogate endpoints. Some therapies may increase the risk of adverse health outcomes like stroke or heart attack; these direct endpoints can be collected. We often classify events into side effects, adverse events, and serious adverse events.

Side Effects: A side effect is an undesired effect that occurs when the medication is administered, regardless of the dose. Unlike adverse events, side effects are mostly foreseen by the physician, and the patient is told to be aware of the effects that could happen while on the therapy. Side effects further differ from adverse events in that they usually resolve on their own with time.

Adverse Events: An adverse event is any new, undesirable medical occurrence or change (worsening) of an existing condition in a subject that occurs during the study, whether or not considered to be related to the treatment.

Serious Adverse Events: A serious adverse event is defined by regulatory agencies as one that suggests a significant hazard or side effect, regardless of the investigator’s or sponsor’s opinion on the relationship to the investigational product. This includes, but may not be limited to, any event that (at any dose) is fatal, is life threatening (places the subject at immediate risk of death), requires hospitalization or prolongation of existing hospitalization, results in persistent or significant disability/incapacity, or is a congenital anomaly/birth defect. Important medical events that may not be immediately life threatening or result in death or hospitalization, but that may jeopardize the subject, require intervention to prevent one of the outcomes listed above, or result in urgent investigation, may also be considered serious. Examples include allergic bronchospasm, convulsions, and blood dyscrasias.

Collecting and monitoring these events is the responsibility of the researchers as well as oversight committees such as Data and Safety Monitoring Committees. Collection can range from complicated, such as when treatments are tested in intensive care units where nearly all actions could be linked to one or another type of event, to relatively straightforward. Regulators have tried to standardize the recording of these events into System Organ Classes using the Medical Dictionary for Regulatory Activities (MedDRA) coding system. This standardized and validated system allows a virtually infinite vocabulary of events to be mapped into medically meaningful classes of events (infections, cardiovascular, etc.) for comparison between groups and among treatments. These classes aid in the assessment of benefits versus risks by allowing comparisons of the rates of medical events that occur within specific organs or body functions.

Obstacles to Measuring Safety

It is often the case that direct safety-related endpoints have relatively low incidence rates, especially within the time frame of many clinical trials, since many medical conditions manifest only after extended exposure; i.e., often weeks, months, or years pass before health conditions manifest. Thus, surrogate endpoints are often necessary, and biomarkers are useful, especially in cases where information on drug toxicity is needed. Using biomarkers to assess toxicity is integral to altering the patient’s dose or ceasing treatment before more severe health problems develop (FDA-NIH Biomarker Working Group). Laboratory assessments measure ongoing critical functions, and we often use flags or cut points to identify evolving risks, such as three times the upper limit of normal to flag liver function tests or white cell counts to indicate infections. One of the major obstacles to assessing safety is that neither researchers nor regulators can give a specific frequency above which a treatment is considered unsafe. For some events, such as rare fatal events, the threshold may be just a few instances, while in other situations where the participants are extremely sick, such as cancer trials, high rates of adverse events may be tolerated.
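A laboratory flag like the three-times-ULN rule reduces to a simple comparison. A sketch (the ULN value here is an assumption for illustration; actual cut points vary by laboratory and analyte):

```python
def flag_lab_result(value, upper_limit_normal, multiplier=3.0):
    """Flag a laboratory result exceeding a multiple of the upper limit of
    normal (ULN); 3x ULN is a common cut point for liver function tests."""
    return value > multiplier * upper_limit_normal

# Liver enzyme with an assumed ULN of 40 U/L (illustrative only)
print(flag_lab_result(150, 40))  # True: 150 > 3 * 40
print(flag_lab_result(90, 40))   # False: 90 <= 120
```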

Choosing Outcome Measures

While some health-related outcomes have obvious metrics, others do not. For instance, if a study is conducted on a drug designed to lower or control blood pressure or cholesterol, then it is straightforward to see that the patient’s blood pressure is almost certainly the best primary outcome. However, for many complex medical conditions, arriving at a reasonable metric requires a considerably more twisted, forking path. For example, in multiple sclerosis (MS), the aim is to reduce MS-related disability, but “disability” in such patients is a multidimensional problem consisting of both cognitive and physical dimensions and thus requiring a complex summary metric. Sometimes the choice involves selecting a metric, and other times it involves how to use the metric appropriately. For example, smoking cessation therapy studies should record whether patients quit smoking, but at a higher level, we may debate just how long a subject must have quit smoking to be considered a verified nonsmoker, or we may require biological evidence of cessation such as cotinine levels; in MS studies, by contrast, the debate is more often over which metric most adequately captures patient disability.

Nonstatistical and Practical Considerations

The most important consideration in choosing an outcome measure is to ensure that it can capture information that answers the relevant scientific questions of interest, and there can sometimes be debate about which metrics are most appropriate. In MS studies there has been considerable concern that many commonly used measures of MS-related disability cannot sufficiently capture temporal change or adequately incorporate or detect patient-perceived quality of life (Cohen et al. 2012).

More practical concerns involve measure interpretability and funding agency approval. Established metrics are more likely to be accepted by funding agencies such as the FDA and NIH, and a considerable amount of work is often necessary to make the case for a new metric. Some of the regulatory preference for established measures is no doubt based in a disposition toward “historical legacy,” but we note that there can be good reasons for preferring the status quo in this case (Cohen et al. 2012). Specifically, comparing studies becomes more difficult when different outcome measures are used, complicating interpretation of a body of literature. Therefore, new measures must often bring along detailed and convincing cases for their superiority over established measures. Physicians and other medical professionals must be able to readily interpret trial results in terms of practical implications for their patients, and if an outcome is difficult to interpret in practice, it may be resisted even if it possesses other desirable qualities. For example, the Multiple Sclerosis Functional Composite (MSFC) was proposed to answer criticisms of the established and regulator-preferred Expanded Disability Status Scale (EDSS) (Cutter et al. 1999). However, despite its good qualities and improvement on the EDSS in many aspects, the MSFC has been resisted by regulators primarily because its mathematical nature, a composite z-score of three functional tests, is a barrier to physician interpretation (Cohen et al. 2001, 2012). Interpretability is closely tied to ensuring that measures are clinically meaningful in addition to possessing desirable metric qualities.
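To illustrate the composite z-score idea, the sketch below standardizes three hypothetical functional tests against an assumed reference population and averages them; the actual MSFC components, reference population, and sign conventions are more involved than this:

```python
import statistics

def z_scores(values, reference_mean, reference_sd):
    """Standardize each subject's raw score against a reference population."""
    return [(v - reference_mean) / reference_sd for v in values]

def composite_z(component_z_lists):
    """Average the component z-scores for each subject into one composite."""
    return [statistics.mean(zs) for zs in zip(*component_z_lists)]

# Two subjects, three hypothetical tests (reference means/SDs are assumptions)
test_a = z_scores([55, 45], reference_mean=50, reference_sd=10)   # [0.5, -0.5]
test_b = z_scores([30, 20], reference_mean=25, reference_sd=5)    # [1.0, -1.0]
test_c = z_scores([110, 90], reference_mean=100, reference_sd=20) # [0.5, -0.5]
print(composite_z([test_a, test_b, test_c]))  # approximately [0.667, -0.667]
```

The composite is easy to compute but, as the text notes, a score of 0.667 has no direct clinical meaning, which is precisely the interpretability barrier regulators raise.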

Assessing Outcome Measures

Validity

Validity is the ability of the outcome metric to measure what it claims to measure. In cases where the outcome is categorical, it is common to assess validity with sensitivity and specificity. Sensitivity is the ability of a metric to accurately identify patients who have the medical condition or experienced the event, and specificity is the ability of a metric to accurately identify patients who do not have the medical condition or did not experience the event. Both sensitivity and specificity should be high for a good metric. For example, consider the extreme case where a metric always interprets the patient as having the medical condition. In such a case, we will identify 100% of the patients with the medical condition (great job!) and 0% of the patients without the medical condition (poor form!). Note that while these concepts are often understood in terms of medical conditions and events, they need not be confined in such a way.
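The degenerate always-positive metric above makes the trade-off concrete. A minimal sketch (the counts are hypothetical):

```python
def sensitivity(true_pos, false_neg):
    """Proportion of subjects with the condition that the metric detects."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of subjects without the condition that the metric clears."""
    return true_neg / (true_neg + false_pos)

# A metric that calls everyone positive, in a sample of 30 diseased and
# 70 healthy subjects: 30 true positives, 0 false negatives,
# 0 true negatives, 70 false positives.
print(sensitivity(30, 0))  # 1.0 -- catches 100% of cases (great job!)
print(specificity(0, 70))  # 0.0 -- clears 0% of non-cases (poor form!)
```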

For continuous, and often ordinal, measures, assessing validity is somewhat more complicated. One could impose cutoffs on the continuous measure to categorize the variable, and only then use sensitivity and specificity to assess validity. However, this is a rather clumsy approach in many cases; we want continuous outcome measures to capture the continuous value with as little measurement error as possible. This is often more relevant for medical devices. For example, a wrist-worn measure of blood glucose would need to be within plus or minus a certain amount of the actual glucose level in the blood to demonstrate validity. Researchers often use regression analyses to demonstrate that a purported measure agrees with a gold standard, but it should be kept in mind that a high correlation by itself does not demonstrate validity: the regression should also have a slope of 1 and an intercept of 0.
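The point that correlation alone does not establish validity is easy to demonstrate: a hypothetical device reading exactly twice the gold standard plus 10 correlates perfectly yet fails the slope-1/intercept-0 check. A self-contained sketch:

```python
from statistics import mean

def ols_fit(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical device that reads exactly twice the gold standard plus 10
gold = [80.0, 90.0, 100.0, 110.0, 120.0]
device = [2 * g + 10 for g in gold]
print(correlation(gold, device))  # 1.0 -- perfect correlation...
print(ols_fit(gold, device))      # (2.0, 10.0) -- ...but badly biased
```

Despite a correlation of exactly 1, the device is far from valid: it systematically overstates every reading.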

Sensitivity is also used to describe whether an outcome measure can detect change at a reasonable resolution. Consider a metric for disability on a scale from 1 to 3, where higher scores indicate increased disability. This will be a good metric if, generally, an increase in a patient’s disability results in a corresponding increase on the scale. But if the measure is too coarse, it could be the case, for example, that many patients are having disability increases that are not sufficient to move from a 1 to a 2 on the scale. A measure insensitive to worsening at the participant level would have high specificity (because substantial disability would have occurred before the scale recognized it) but poor sensitivity (because a negative result does not indicate the participant has not progressed). When the metrics pertain to patient well-being or health dimensions of which a patient is conscious, it is expected that when the patient notices a change, the metric will reflect that change. This is particularly important for determining the effectiveness of therapies. A measure that is insensitive to change could either mask a therapy’s ineffectiveness by incorrectly suggesting that patient conditions are not generally worsening or, on the flip side, portray an effective therapy as ineffective since it will not detect positive change. Further, if a participant feels they are worsening but the measure is insensitive, this can lead to dropout from the trial.

Reliability

Reliability is a general assessment of the consistency of a measure’s results upon repeated administrations. There are several relevant dimensions to reliability. A measure can be accurate but not reliable: on average the measure gives the correct value, yet individual administrations are highly variable. Various types or aspects of reliability are often discussed. Perhaps most prominent is interrater reliability. Many trials require raters to assess patients and assign scores or measures describing the patient’s condition, and interrater reliability describes the consistency across raters when presented with the same patient or circumstance. A reliable measure will result in (properly trained) raters assigning the same or similar scores to the same patient. When a metric is proposed, interrater reliability is a key consideration and is typically measured using a variant of the intraclass correlation coefficient (ICC), which should be high if the reliability is good (Bartko 1966; Shrout and Fleiss 1979).
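A minimal sketch of the one-way random-effects ICC (in the spirit of Bartko 1966), using invented ratings for five patients each scored by three raters, might look like this:

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC: rows = subjects, columns = raters."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    # Between-subjects and within-subject mean squares from one-way ANOVA.
    ms_between = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_within = np.sum((scores - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical data: raters largely agree within each patient.
ratings = [[2, 2, 3],
           [5, 6, 5],
           [8, 8, 9],
           [4, 4, 4],
           [7, 6, 7]]
print(f"ICC = {icc_oneway(ratings):.2f}")  # near 1 => good interrater reliability
```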

Intersubject reliability is also a concern; that is, subjects with similar or the same health conditions should have similar measures. This differs from interrater reliability in that it is possible for raters to be highly consistent within a subject, but inconsistent across subjects, or vice versa. Interrater reliability measures whether properly trained raters assign sufficiently similar scores to the same patient; that is, is the metric such that sufficiently knowledgeable individuals would agree about how to score a specific subject? Intersubject reliability measures whether a metric assigns sufficiently similar scores to sufficiently similar subjects.

Other Concerns

There are several other issues in evaluating outcome measures. Practice effects occur when patients’ scores on some measure improve over time due to practice rather than therapy. Studies involving novel outcome measures should verify either that no practice effects are present or that the practice effects taper off; for example, in developing the Multiple Sclerosis Functional Composite (MSFC), practice effects were observed, but these tapered off by the fourth administration (Cohen et al. 2001). Practice effects are problematic in that they can lead to overestimates of a treatment effect if the practice-driven improvement is ignored when comparing a post-intervention measure to a baseline measure. In randomized clinical trials, we can assume both groups experience equivalent practice effects, so the difference between the two groups is still an unbiased estimate of the treatment effect; however, the estimate of how much actual improvement was achieved is biased unless the practice effects are eliminated prior to baseline by multiple administrations or adjusted for in the analyses. Practice effects are often complex and require adjustments to the measure or its application; for example, the Paced Auditory Serial Addition Test (PASAT), a measure of information processing speed (IPS) in MS, was shown to have practice effects that increased with the speed of stimulus presentation and were more prominent in relapsing-remitting MS than in chronic-progressive MS (Barker-Collo 2005). Therefore, using the PASAT in MS research requires either slower stimulus presentation or some correction accounting for these effects.

Another source of (unwanted) variability in outcome measures, particularly subjective measures, is response shift. Response shift occurs when a patient’s criteria for a subjective measure change over the course of the study. It is clearly a problem if the meaning of the same recorded outcome differs at different times in a study, and therefore response shift should be considered and addressed when subjective measures and/or patient-reported outcomes are employed (Swartz et al. 2011). This is often the case with long-term chronic conditions such as multiple sclerosis, where participants report on their quality of life in the early stages of the disease and, when reassessed years later with increased disability, record the same quality of life scores. Adaptation and other factors are at the root of these response shifts, but outcome measures subject to this type of variability can be problematic to use.

Statistical Considerations

In addition to whether a measure is informative to the clinical questions of interest, there are statistical concerns relating to the ability to make comparisons between treatment groups and answer scientific or clinical questions of interest with the data on hand. This section defines the relevant statistical measures and then describes their practical import in outcome choice.

Statistical Definitions

In hypothesis testing, variable selection, etc., there are two kinds of errors to minimize. The first is the false positive, formally the Type I error, which is the probability that we detect a therapy effect given that one does not exist. False positives are typically controlled by assignment of the statistical significance threshold, α, which is generally interpreted as the largest acceptable false-positive rate; by convention, α = 0.05 is usually adopted.

The second error class is the false negative, or Type II error, which occurs when we observe no significant therapy effect when in reality one exists. Power is one minus the false-negative rate and refers to the probability of detecting a difference, given that one exists. For a given statistical method, the false-positive rate should be as low as possible and the power as high as possible. However, there is a tension in controlling these errors: controlling one more strictly reduces the ability to control the other.

There are several primary reasons that these errors arise in practice (and theory). Sampling variability allows for occasionally drawing samples that are not representative of the population; this problem may be exacerbated if the study cohort is systematically biased in recruitment or changes over time during the recruitment period so that it is doubtful that the sample can be considered as drawn from the population of interest. The second primary reason for errors is sample size. In the absence of compromising systematic recruitment bias, a larger sample size can often increase the chance that we detect a treatment difference if one exists. The fundamental reason for improvement is that sample estimates will better approximate population parameters with less variation about the estimates, on average. Small sample sizes can counterintuitively make it difficult to detect significant effects, overstate the strength of real effects, and more easily find spurious effects. This is because in small samples a relatively small number of outliers can have a large biasing effect, and in general sampling variability is larger for smaller compared to larger samples.
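A small simulation can illustrate the last point: when only statistically significant results are considered, small samples systematically overstate a true effect. The effect size, group size, and simulation count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
true_delta, sigma, n = 0.3, 1.0, 10      # small true effect, tiny groups
t_crit = 2.101                           # two-sided 5% critical value, df = 18

sig_effects = []
for _ in range(5000):
    a = rng.normal(0.0, sigma, n)                 # control group
    b = rng.normal(true_delta, sigma, n)          # treated group
    diff = b.mean() - a.mean()
    sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)   # pooled SD
    t = diff / (sp * np.sqrt(2 / n))
    if abs(t) > t_crit:                           # "significant" result
        sig_effects.append(diff)

print(f"true effect: {true_delta}")
print(f"mean estimate among significant results: {np.mean(sig_effects):.2f}")
# Small samples reach significance only when the observed difference is large,
# so the significant estimates systematically overstate the true effect.
```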

The choice of statistical significance thresholds can also contribute to errors in inference. If the threshold is not sufficiently severe, then we increase the risk of detecting a spurious effect. Correspondingly, a significance threshold that is too severe may prevent detection of treatment differences in all but the most extreme cases. Errors related to the severity of threshold can affect both large and small samples.

Note that caution must be employed with respect to interpreting differences. Statistically significant differences are not necessarily clinically significant or meaningful. It is rare that two populations will be exactly equal in response to a treatment, but small differences between groups, while statistically significant at a large sample size, may not represent a sufficient benefit to the patients.

Common Statistical Issues in Practice

Statistical significance thresholds, power, and clinically meaningful treatment differences are determined a priori. Using these values and some estimate of variance gleaned from similar studies, we can calculate the necessary sample size. In cases with continuous outcomes, the calculations are often relatively straightforward and tend to have few, if any, additional restrictions beyond (usually) normality of the outcome’s values. However, the situation is more complicated when the outcomes are no longer continuous (or normal). A common problem is near or complete separability when the outcome is binary, that is, when almost all the patients have one or the other outcome. Model fitting problems arise when separability applies to the whole study sample, but also when (nearly) all patients in one group have one outcome and (nearly) all patients in the other group have the other outcome. This is especially true in retrospective and observational studies that seek to make comparisons among subgroups within the population. For example, in a study presented at the European Association for Cardiothoracic Surgery in 2016, a retrospective comparison of vein grafts preserved in a buffered solution versus saline during harvesting for heart bypass attempted to adjust for the two time periods being compared: the saline period prior to the introduction of the new product and the period after its introduction. However, when the propensity scores were plotted by type of storage solution used, there was almost complete separation of the two populations (Haime 2016). Nearly all saline-era patients had a more favorable risk profile because bypasses had been directed toward younger, healthier patients; once the buffered solution was adopted, so too grew the willingness to accept higher-risk patients for bypass.
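For the straightforward continuous case mentioned above, a common normal-approximation sample-size formula for comparing two means can be sketched as follows; the 5-point difference and SD of 12 are invented planning numbers:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample comparison of means:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2 (round up)."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)   # critical value for two-sided alpha
    z_b = z(power)           # quantile corresponding to the target power
    return ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# Hypothetical planning example: detect a 5-point difference on an outcome
# with SD 12 (variance estimate borrowed from a similar prior study).
print(n_per_group(delta=5, sigma=12))   # patients needed per group
```

Demanding higher power (or a stricter α) drives the required n up, which is the practical face of the Type I/Type II tension described earlier.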

Another common way such a situation arises is when the study length is too short for a sufficient number of events to occur. This is an issue whether the study collects time-to-event outcomes or simply records incidence at the study terminus. In such cases it is highly relevant to know the number of events required to detect a clinically significant difference. This problem generalizes to cases where there are more than two categories for the outcome, e.g., ordinal or multinomial data. Where the study cannot be extended to a sufficient length, nonbinary outcomes and/or surrogate outcomes may be necessary alternatives.
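For time-to-event comparisons, one standard planning sketch for the required number of events is Schoenfeld's approximation for a two-arm (1:1) log-rank test; the hazard ratios below are purely illustrative:

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Schoenfeld's approximation to the total number of events needed for a
    1:1 two-arm log-rank comparison: d = 4 * (z_a + z_b)^2 / (ln HR)^2."""
    z = NormalDist().inv_cdf
    return ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2 / log(hazard_ratio) ** 2)

# Detecting a modest hazard ratio takes far more events than a dramatic one,
# which is why short studies with few events are underpowered.
print(required_events(0.75))   # modest effect: hundreds of events
print(required_events(0.50))   # halved hazard: far fewer events
```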

Common Simplifications and Their Up- and Downsides

As noted above, given the ease of interpreting the clinical meaningfulness of binary events, a common approach is to categorize continuous measures, so long as the cutoffs are clinically relevant and decided before conducting the study. As discussed in the previous section, this simplification can cause analysis and interpretation issues if there are not sufficient numbers of patients in each category. There is also a tendency to reanalyze the data to better understand what has happened, which can lead to arbitrary cutoffs; without a priori specification of the outcome, finding a cut point that “works” certainly changes the chances of a false-positive result.

On the other hand, it is sometimes useful to treat a discrete variable as continuous; the most common instance is count data, where the counts are generally very large. In some cases, ordinal data may also be treated as continuous with reasonable results. Even though analyzing ordinal data this way implicitly assumes that each step on the scale has the same meaning, it provides a simple summarization of response. Alternatively, treating ordinal data as ranks can be shown to have reasonable properties in detecting treatment effects. However, one should use caution when applying statistical methods designed for continuous outcomes to ordinal data; in particular, models for continuous outcomes perform badly when the number of ranks is small and/or the distribution of the ordinal variable is skewed or otherwise not approximately normal (Bauer and Sterba 2011; Hedeker 2015).

Reporting Outcomes

There are several main approaches for assessing outcomes: patient-reported outcomes, clinician-reported outcomes, and observer-reported outcomes. We define and discuss each below.

Patient-Reported Outcomes

Patient-reported outcomes (PROs) are outcomes dependent upon a patient’s subjective experience or knowledge; PROs do not exclude assessments of health that could be observable to others and may include the patient’s perception of observable health outcomes. Common examples include quality of life or pain ratings (FDA-NIH Biomarker Working Group). These outcomes have gained considerable acceptance since the Patient-Centered Outcomes Research Institute (PCORI) came into existence. The FDA and other regulators routinely ask for such outcomes, as they are indicative of the meaningfulness of treatments. Rarely have patient-reported outcomes been used as primary outcomes in phase III trials, except in instances, such as pain, where the primary outcomes are only available in this manner. Most often they provide adjunctive information on the patient’s perspective of the treatments or study. Nevertheless, researchers should not simply accept the need for PROs but rather think carefully about what PROs to measure and when. PROs are subjective assessments and can be influenced by a wide variety of variables that may be unrelated to the actual treatments or interventions under study. Asking a cancer patient about their quality of life during chemotherapy may obscure a treatment’s survival benefit because of the timing of the ascertainment. Similarly, a participant in a trial who is severely depressed may be underwhelmed by the benefits of a treatment that does not address this depression. In addition, the frame of reference needs to be carefully considered. For example, when assessing quality of life, should one use a general measure, such as the Short-Form-36 Health Survey (SF-36), or one that is specific to the disease under study? This depends on the question being asked and should be factored into the design for any and all data to be collected.

PROs, like many outcomes, are subject to biases. If participants know that they are on an active treatment arm rather than placebo, their reporting of the outcome being assessed may be biased. Similarly, participants who know or suspect that they are on placebo may report doing poorly simply because of this knowledge rather than providing the accurate assessments the instrument is designed to capture. A general rule is that the more blinded the assessment, the better. A more detailed discussion of the intricacies involved in PROs is found in Swartz et al. (2011), and the FDA provides extensive recommendations and discussion (FDA 2009).

Clinician-Reported Outcomes

Clinician-reported outcomes (CROs) are assessments of patient health by medical or other healthcare-oriented professionals and are characterized by dependence on professional judgment, algorithmic assessment, and/or interpretation. These are typically outcomes requiring medical expertise but do not encompass outcomes or symptoms that depend upon patient judgment or personal knowledge (FDA-NIH Biomarker Working Group). Common examples are rating scales or clinical events, e.g., the Expanded Disability Status Scale or stroke, and biomarker data, e.g., blood pressure.

Observer-Reported Outcomes

Observer-reported outcomes (OROs) are assessments that require neither medical expertise nor patient perception of health. Often OROs are collected from parents, caregivers, or, more generally, individuals with knowledge of the patient’s daily life and often, but not always, are useful for assessing patients who cannot, for reasons of age or impairment, reliably assess their own health (FDA-NIH Biomarker Working Group). For example, in epilepsy studies caregivers often keep seizure diaries to establish the nature and number of seizures a patient experiences.

Multiple Outcomes

Most studies have multiple outcomes (e.g., primary, secondary, and safety outcomes), and it is sometimes desirable or necessary to include multiple primary outcomes. Multiple outcomes generally arise from repeatedly measuring the same (or similar) outcomes over time and/or from including multiple measures, which can encompass multiple primary outcomes or multiple secondary outcomes. This section describes common situations where multiple outcomes are employed and discusses the relevant considerations arising from them.

Multiple (Possibly Related) Measures

When the efficacy of a clinical therapy depends upon more than one dimension, it may be inappropriate to prioritize one dimension or ignore lower-priority dimensions. For example, in a trial of thymectomy, a surgical procedure by which one’s thymus is removed, to control myasthenia gravis, a neuromuscular disease, a joint outcome was needed (Wolfe et al. 2016). The treatments were thymectomy plus prednisone versus prednisone alone. The primary outcome comprised the clinical condition of the participant over 3 years and the amount of prednisone used to control the disease. Both outcomes were needed because the clinical condition could be improved by using more prednisone, so analyzing the clinical condition alone would not correctly answer the question of how well a participant was doing; nor would the amount of prednisone alone, since less prednisone could be used at the expense of the clinical condition.

Primary outcomes are generally analyzed first. For a therapy to achieve an efficacy “win,” it usually must meet some criteria pertaining to success on the primary endpoints (Huque et al. 2013). This may consist of all, some proportion, or at least one of the primary endpoints achieving significance by specified criteria and is typically conditional on demonstration of acceptable safety outcomes. Primary outcomes that must all be significant in order to demonstrate efficacy are called coprimary (FDA 2017). Significance in secondary outcomes tends to be supportive in nature and is generally not considered an efficacy “win” in the absence of significance on the primary outcomes. Additionally, tertiary and exploratory outcomes are often reported, conditional on primary endpoint efficacy. Nevertheless, regulators often refer to the “totality of the evidence” when evaluating any application for licensing, and treatments have been approved on the basis of statistically significant effectiveness on secondary outcomes but not primary outcomes. This is more often done when there are no or few treatments available for a condition.

Composite Outcomes

Composite outcomes are functions of several outcomes. These can be relatively simple, e.g., the union of several outcomes. Such an approach is common for time-to-event data, e.g., major adverse cardiovascular events (MACE) or to define an event as when a patient experiences one of the following: death, stroke, or heart attack. Composite outcomes can also be more complex in nature. For example, many MS trials are focused on MS-related disability metrics, which tend to be composites of multiple outcomes of interest and which may be related to either physical or cognitive disability; two common options are the Expanded Disability Status Scale (EDSS) and the Multiple Sclerosis Functional Composite (MSFC).

Composite events are often used in time-to-event studies to increase the number of outcomes and thus, for the same relative risk reduction, increase the power, as the power of a time-to-event trial is directly related to the number of events. When the component events are not reasonably correlated, however, care must be taken that noise does not dominate signal. For example, as noted previously, MS affects patients differently and variably. The EDSS assesses seven functional systems and combines them into a single ordinal number ranging from 0 to 10 in 0.5 increments. For a person affected in only one functional system, the overall EDSS may not move even with changes in that one functional system, reducing its sensitivity to change. In composites such as z-scores of multiple tests, it is recommended that no more than four or five components be used, because most composites are averages or sums of the individual items, which means that any signal can be dominated by noise. Consider five ordinal scales, four of which vary only due to measurement error or variability, while only one lies in the affected domain. If all five scales score 2 at baseline and, later on, four remain at 2 while the affected one worsens from 2 to 4, the overall average score moves only from 2.0 to 2.4. Many measurements are more variable than those in this example, making changes even more difficult to identify or understand because of the signal-to-noise problem.
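The arithmetic of the five-scale example can be verified directly:

```python
import numpy as np

# The five-scale example from the text: four components stay at 2,
# one worsens from 2 to 4.
baseline = np.array([2, 2, 2, 2, 2])
followup = np.array([2, 2, 2, 2, 4])

print(baseline.mean(), "->", followup.mean())   # 2.0 -> 2.4
# A clinically large change on one component (2 -> 4) moves the composite
# average by only 0.4; with realistic measurement noise on the other four
# components, even that shift can be swamped.
```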

Multiple Comparisons/Testing Considerations

Multiple testing issues arise when individual outcomes are to be evaluated individually, either as components of a composite outcome or without a global test. When efficacy depends on more than one outcome, controlling the false-positive rate becomes more complicated, and there are two general metrics for false-positive control, each of which has multiple approaches to control. The family-wise error rate (FWER) is the probability of one or more false positives in a family of tests; control of the FWER is divided into weak control, which controls the FWER under the complete null hypothesis, i.e., when no outcomes have a significant treatment effect, and strong control, which controls the FWER when any subset of the outcomes has no significant treatment effect. Note that in confirmatory trials, strong control of the FWER is often required (Huque et al. 2013; FDA 2017). Alternatively, the false discovery rate (FDR) is the expected proportion of false rejections in a family. A good heuristic for determining which false-positive rate is appropriate is whether a single false positive would significantly affect the interpretation of the study. FWER is usually appropriate if a false positive would invalidate the study, while in contrast, FDR is often appropriate in the presence of a large number of tests, where a few false positives would not alter the study interpretation.
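As a sketch of the contrast, Holm's step-down procedure (strong FWER control) and the Benjamini-Hochberg step-up procedure (FDR control under independence) can be implemented in a few lines; the p-values below are invented:

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down: strong FWER control; returns a boolean reject vector."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):   # shrinking denominator per step
            reject[idx] = True
        else:
            break                          # stop at the first failure
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the FDR for independent tests."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= rank * alpha / m:     # find the largest passing rank
            k = rank
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.011, 0.02, 0.04, 0.30]
print(holm(pvals))                 # rejects fewer hypotheses (stricter)
print(benjamini_hochberg(pvals))   # rejects more, tolerating some false discoveries
```

On these p-values Holm rejects only the two smallest, while Benjamini-Hochberg rejects four, illustrating why FDR control is typically reserved for settings where a few false positives would not alter the study interpretation.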

The utility of controlling complex false-positive rates was once viewed with suspicion, not least because many traditional methods resulted in severe decreases in statistical power (e.g., Pocock 1997). However, advances in methodology and concern about study reliability have renewed focus on controlling false-positive rates; e.g., Huque et al. (2013) provide an overview of approaches to FWER control, and Benjamini and Cohen (2017) propose a weighted FDR-controlling procedure in similar contexts. In addition to improvements in the quality of the procedures for false-positive control, regulatory agencies are also more aware of and concerned with false-positive control (FDA 2017). The pharmaceutical industry is regulated in this respect, but the academic community has been less well-regulated with regard to these multiple comparison issues. However, the requirement to register trials with ClinicalTrials.gov has made it more concrete that these issues must be decided in advance. Often, such testing approaches are not determined explicitly at the initiation of the protocol but codified at the time the statistical analysis plan is created, prior to locking the database and, in double-blind studies, prior to unblinding. These details are often in the statistical analysis plan and, thus, not available on ClinicalTrials.gov. While the specific corrections are beyond the scope of this chapter, it is useful to contrast traditional approaches to false-positive control with modern extensions.

In many cases traditional methods for controlling FWER or FDR have been adapted to handle multiple testing in a more nuanced manner. Traditionally, one defined a family of tests and then applied a particular method for controlling false positives, but a study may reasonably consist of more than one family of tests; e.g., one may divide primary and secondary outcome analyses into two separate families. Furthermore, families need not be treated as having equal importance, which is the basis for hierarchically ordered families, the so-called step-down approach. This approach controls FWER or FDR by requiring that a “win” criterion be achieved in the family of primary endpoints before testing secondary (and possibly tertiary) outcomes. The two most common frameworks are α-propagation and gatekeeping. α-Propagation divides the significance level across a series of ordered tests, and when a test is significant, its portion of the α is “propagated,” or passed, to the next test. Gatekeeping approaches depend on a hierarchy of families. In regular gatekeeping, a second family of tests is tested only if the first family passes some “win” criterion, but some gatekeeping procedures allow for retesting, and many methods incorporate both gatekeeping and α-propagation. A detailed discussion is found in Huque et al. (2013).
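As a minimal sketch of α-propagation down an ordered hierarchy, consider fixed-sequence testing, in which pre-ordered hypotheses are each tested at the full α until the first failure (real gatekeeping procedures are considerably more elaborate):

```python
def fixed_sequence(pvals, alpha=0.05):
    """Fixed-sequence testing sketch: hypotheses are tested in a pre-specified
    order, each at the full alpha, stopping at the first failure. This is a
    minimal instance of alpha-propagation down an ordered hierarchy."""
    wins = []
    for p in pvals:
        if p < alpha:
            wins.append(True)        # the full alpha passes to the next test
        else:
            wins.append(False)
            break                    # all later hypotheses fail automatically
    wins += [False] * (len(pvals) - len(wins))
    return wins

# Hypothetical ordered endpoints: primary, key secondary, secondary.
print(fixed_sequence([0.012, 0.030, 0.200]))  # [True, True, False]
print(fixed_sequence([0.080, 0.001, 0.001]))  # primary fails: all fail
```

The second call shows the gatekeeping logic at its starkest: highly significant secondary results cannot be claimed if the primary "gate" fails.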

Defining power and calculating the required sample size in complex multiplicity situations may not be straightforward (Chen et al. 2011). However, using traditional methods in the absence of updated methodology is likely to yield conservative results: many methods control the FWER at or below the specified significance level, so they are often more conservative than intended, and higher sample sizes are required to achieve the desired power. Note that there are no multiplicity issues when considering coprimary endpoints, since each must successfully reject the null hypothesis, but power calculations are nonetheless complicated and require considering the dependency between test statistics; unnecessarily large sample sizes will be required if an existing dependency is ignored (Sozu et al. 2011). For a detailed discussion of power and sample size with coprimary endpoints, see Sozu et al. (2012).

Longitudinal Studies

What Are Longitudinal Studies?

Multiple outcomes also arise when following patients over time, recording measurements for outcomes at several time points throughout the trial. The circumstances for such a situation can be quite varied. In Alzheimer’s disease, we are interested in the trajectory or slope of the decline in cognitive function over time; in chronic obstructive pulmonary disease, the rate of lung function decline. In many cases where adverse medical events (e.g., death, stroke) are recorded, it is of interest to know when the event occurred, and follow-up may be conducted at pre-specified times to determine whether or not an event occurred for each (qualifying) patient; in such cases therapies may be distinguished by whether they prolong the time to event instead of, or in addition to, whether they prevent the event entirely. Some events, such as exacerbations (relapses) in MS, can occur repeatedly over time, and assessing the intensity of their occurrence (often summarized as the annualized relapse rate) is common. Other less extreme examples involve recording patient attributes, e.g., quality of life assessments or biomarkers like blood pressure or cholesterol; these and similar situations assess changes over time and address the existence and/or form of differences between treatment groups.

Benefits

A major benefit of recording patients at multiple time points is that it allows for a better understanding of the trajectory of change in a patient’s outcome; for example, nonlinear trends may be discovered and modeled, thereby allowing researchers to seek understanding of the clinical reasons for each part of the curve. For example, longitudinal studies of blood pressure have established that human blood pressure varies according to a predictable nonlinear form (Edwards and Simpson 2014). Such a model may be used to better define and evaluate healthy ranges for patient blood pressure. Second, patient-specific health changes and patient-level inference become available when a patient is followed over time. A simple example is given by comparing a cross-sectional study, a rarity in clinical trials, to a pre-post longitudinal study. Whereas in a cross-sectional study we only have access to observed group differences in raw scores, longitudinal studies provide insight into whether and how a particular patient’s metrics have changed over time; generalizing beyond pre-post to many observations allows better modeling for individuals as well as groups. Additionally, unless the repeated measures are perfectly correlated, the power to detect group differences is generally increased when using longitudinal data, and a smaller sample size may suffice to detect a specified treatment difference.

Drawbacks and Complications

The increased use and acceptance of the (generalized) linear mixed model (GLMM) have allowed increased flexibility in modeling dependence and handling missing outcomes compared to traditional methods such as repeated measures ANOVA or MANOVA, which require strong assumptions and have trouble including patients with missing observations (Hedeker and Gibbons 2006). While GLMMs rely on some assumptions that are rarely testable, they are far more flexible. However, while modeling has greatly improved in recent decades, longitudinal studies still have drawbacks and complications. An obvious drawback is the corresponding increase in study cost; trials must balance the benefit of additional information against the ability to pay for it. Power is generally increased in these repeated measures designs, but few studies seem designed based on the balance of the gain per dollar for each measurement.

A perhaps more pressing concern is that while missing data are no longer an impediment to modeling and inference, the reason that missingness occurs is relevant to the interpretation of trial results (Carpenter and Kenward 2007). Data missing completely at random (MCAR) are associated with neither observed nor unobserved outcomes of interest. Data missing at random (MAR) may be associated with observed, but not unobserved, outcomes. Data missing not at random (MNAR) are associated with unobserved outcomes. The assumptions of MCAR are often not well-substantiated. In particular, complete case analyses, in which participants with missing observations are excluded, can bias results (Mallinckrodt et al. 2003; Powney et al. 2014). Likewise, many common imputation methods, such as last observation carried forward, are often not valid; GLMMs are valid when the assumptions of MAR are reasonable and do not require imputation to include all subjects, but verifying that missingness is not MNAR is difficult. In the interest of transparency, studies should report reasons for dropout and prepare detailed and well-justified approaches for handling missing data (Powney et al. 2014). Performing sensitivity analyses is essential, not just to get agreement on what the primary analysis showed, but to provide evidence that the results are not the product of hidden biases.
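A small simulation (with an invented data-generating model) illustrates how complete case analysis can mislead even under MAR, when dropout depends only on an observed baseline value but the outcome is correlated with that value:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

baseline = rng.normal(50, 10, n)
# Hypothetical model: sicker patients (higher baseline score) improve less.
change = 5 - 0.2 * (baseline - 50) + rng.normal(0, 2, n)

# MAR dropout: the probability of missing the follow-up visit rises with the
# (observed) baseline score, so missingness is predictable from observed data.
p_drop = 1 / (1 + np.exp(-(baseline - 55) / 3))
observed = rng.random(n) > p_drop

print(f"true mean change:          {change.mean():.2f}")
print(f"complete-case mean change: {change[observed].mean():.2f}")
# Completers are disproportionately the healthier patients, so the
# complete-case estimate overstates the average improvement.
```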

Summary/Conclusion

It is imperative that researchers think deeply about which outcomes to employ in studies and are aware of the various issues and complications that can arise from those choices. This chapter has introduced outcomes in clinical trials, delineated their different purposes and manifestations, and discussed regulatory and statistical issues in selecting appropriate study outcomes so that researchers have a clear idea of what to consider as they make choices on which outcomes to include in their studies.

Key Facts

  • Primary outcomes should be measures capable of answering the main research/clinical question and are often expected to measure rates, scale outcomes, detectable patient benefits, risk of disease development, and/or complications, while secondary outcomes are often related to the primary outcomes but typically of lesser importance and/or may be infeasible due to sample size limitations; both are typically related to assessing the efficacy of a therapy.

  • Direct endpoints directly describe patient well-being and may be objective, i.e., explicitly measuring clinical outcomes, or subjective, often depending on subject self-report.

  • Surrogate or biomarker endpoints substitute for direct endpoints and are often employed when direct endpoints are infeasible and/or unethical to measure. Many are biomarkers, which objectively measure biological processes, pathologies, and/or responses to exposures or interventions; surrogate endpoints are treated as standing in for the outcome of interest itself.

  • Safety outcomes measure negative health consequences such as side effects or adverse events and help assess the trade-offs between the benefits and risks of therapies.

  • Choosing outcome measures requires both practical and statistical considerations. Practical considerations include the ability to capture the physical phenomenon of interest and interpretability, while statistical considerations include validity, reliability, and the ability to answer the clinical questions.

  • Outcome measurements may depend on a patient’s subjective experience (patient-reported), require some degree of medical/professional expertise (clinician-reported), or be measurable by some other third party who is neither a medical professional nor the patient (observer-reported).