Gordon H. Guyatt, C. David Naylor, Elizabeth Juniper, Daren K. Heyland, Roman Jaeschke, Deborah J. Cook for the Evidence Based Medicine Working Group
Based on the Users' Guides to Evidence-based Medicine and reproduced with permission from JAMA. (1997;277(15):1232-1237). Copyright 1995, American Medical Association.
You are a physician following a 35-year-old man who has had active Crohn's disease for 8 years. The symptoms were severe enough to require resectional surgery four years ago, and despite treatment with sulphasalazine and metronidazole, the patient has had active disease requiring oral steroids for the last two years. Repeated attempts to decrease the prednisone have failed, and the patient has required doses of greater than 15 mg per day to control symptoms. You are impressed by both the methodology and results of a recent report documenting that such patients benefit from oral methotrexate [1] and suggest to the patient that he consider this medication. When you explain some of the risks of methotrexate, particularly potential liver toxicity, the patient is hesitant. How much better, he asks, am I likely to feel while taking this medication?
There are three reasons we offer treatment to our patients. We believe our interventions increase longevity, prevent future morbidity, or make patients feel better. The first two of these three endpoints are relatively easy to measure. At least in part because of difficulty in measurement, clinicians have for many years been ready to substitute physiological or laboratory tests for the direct measurement of the third. In the last 20 years, however, clinicians have recognized the importance of direct measurement of how people are feeling, and how they are able to function in daily activities. Investigators have developed increasingly sophisticated methods of making these measurements.
Since, as clinicians, we are most interested in aspects of life quality directly related to health rather than issues such as finances, or the quality of the environment, we frequently refer to measurements of how people are feeling as health-related quality of life (HRQL) [2]. Investigators measure HRQL using questionnaires that typically include questions about how patients are feeling or what they are experiencing associated with response options such as yes-no, or seven-point scales, or visual analogue scales. Investigators aggregate responses to these questions into domains or dimensions (such as physical or emotional function) that yield an overall score.
Controversy exists concerning the boundaries of HRQL, and the extent to which individual patient's values must be included in its measurement [3] [4] [5]. Is it sufficient to know that patients with chronic obstructive lung disease in general value being able to climb stairs without getting short of breath, or does one need to establish that the individual patient values climbing stairs without dyspnea? Further controversy exists about how the relative values of items and domains need to be established, and how these values should be determined. Is it enough to know that both dyspnea and fatigue are important to people with lung disease, or does one need to establish their relative importance? If establishing their relative importance is necessary, which of the many available approaches should one use?
In this paper, we take a simple approach. We use HRQL to refer to the health aspects of their lives that people, in general, value, and are ready to accept patients' statement of what they value without precise determination of ranking of items or domains.
Clinicians often have limited familiarity with methods of measuring how patients feel. At the same time, they are facing articles that recommend administering or withholding treatment on the basis of its impact on patients' well-being. This Users' Guide is designed for clinicians asking the question: "Will this treatment make my patient feel better?" As in other guides, we will use the framework of the validity of the methods, interpreting the results, and applying the results to one's patients [Table 1]. In addition, we begin the guide with a commentary on when one should and should not be concerned about HRQL measurement. Our guidelines borrow heavily from our previous work [2] [5]. While this article focuses on using HRQL measures to help with treatment decisions, we hope that it may also improve clinical care by emphasizing aspects of patients' experience, including functional, emotional, and social limitations, that clinicians sometimes neglect.
Table 1: Users' Map for an Article about Health-Related Quality of Life
In the early days of clinical trials, few if any treatment studies included measurements of HRQL, and no one worried much. When should you be concerned if investigators have not paid adequate attention to how patients feel?
In general, delaying mortality is sufficient reason to administer a treatment. Some years ago, investigators showed that round-the-clock oxygen in patients with severe chronic airflow limitation improved mortality [6]. The fact that HRQL data weren't reported in the original paper turns out not to be an important omission. Since the intervention prolongs life, our enthusiasm for continuous oxygen administration is not blunted by a subsequent report suggesting that more intensive oxygen therapy had little or no impact on HRQL [7]. Similarly, while feeling better is important to heart failure patients, when interventions either extend [8] or shorten [9] life span, we usually do not need a HRQL assessment to inform our clinical decisions.
There are exceptions to this rule. While many of our life-prolonging treatments have a negligible impact on, or actually improve, HRQL, this is not always the case. If treatment leads to a deterioration in HRQL, patients may be concerned that small gains in life-span come at too high a cost. Interventions that highlight this concern include chemotherapy for cancer and HIV disease. In the extreme, life may be prolonged, but if the patient's fate is, for example, a persistent vegetative state, the family may wonder whether the patient would not be better off dead. A patient's own preferences expressed through an advance directive may support this view.
When the goal of treatment is to improve how people are feeling (rather than to prolong their lives) and physiological correlates of patients' experience are lacking, HRQL measurement is imperative. For example, we would pay little attention to studies of antidepressants that failed to measure patients' mood, or trials of anti-migraine medication that failed to measure pain.
The difficult decisions occur when the relation between physiologic or laboratory measures and HRQL outcomes is uncertain. Practitioners have relied on substitute endpoints not because they weren't interested in making patients feel better, but because they assumed a strong link between physiologic measurements and patients' well-being. A recent trial in patients with symptomatic postmenopausal osteoporosis examined the effect of sodium fluoride on bone density and vertebral fractures [10]. The investigators believed that increased bone mass and fewer vertebral fractures would lead to decreased pain and increased function. Does their failure to measure the effect of treatment on areas of unequivocal importance to patients, including pain, physical function, and household and leisure activities [11] affect the clinical message of the results? Similarly, investigators measuring the effects of anti-anginal medication have often been satisfied with increased duration of exercise on the treadmill without direct measurement of decreased symptoms or increase in activity in day-to-day life. Are we ready to prescribe medication on the basis of increased laboratory exercise capacity?
Bone density, vertebral fractures, and exercise capacity, or similar measures such as joint count, ejection fraction, or pulmonary function, are surrogate endpoints for what we really want to measure: the effect of treatment on our patients' lives. Whether these surrogate measures are adequate depends on how confident we are of the link with how people feel. When this issue has been investigated empirically, the relation between physiologic and clinical measures and patients' symptoms is usually modest, and often highly variable [12] [13] [14] [15] [16] [17]. Though these findings lead us to recommend caution in assuming that improvement in physiologic or clinical function will result in patients feeling better, each clinician (and, when appropriate, the patient) must decide on her own threshold.
Referring back to the opening scenario, investigators reported the results of a randomized trial of methotrexate in 141 patients with chronically active Crohn's disease despite at least three months of prednisone therapy. Patients who received methotrexate were twice as likely to be in clinical remission following 16 weeks of treatment as those who received placebo (39.4% versus 19.1%, p = 0.025), and actively treated patients received less prednisone and showed less disease activity. Is additional information regarding HRQL necessary to interpret the results of this study? As depicted in the scenario, the decision to give methotrexate depends on weighing the benefits and risks, and the patient's question about how much better he is likely to feel on medication may well be relevant to his decision. Without information about the effect of the medication on HRQL, therefore, neither the clinician nor the patient can make a fully informed choice.
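The remission figures quoted above can be restated as an absolute difference and a ratio. The following is a minimal sketch, not part of the trial report: the function name is hypothetical, and the number needed to treat is derived here as the reciprocal of the absolute difference rather than taken from the source.

```python
def compare_proportions(p_treated: float, p_control: float) -> dict:
    """Compare two event proportions given as fractions between 0 and 1."""
    absolute_difference = p_treated - p_control
    ratio = p_treated / p_control
    # Number needed to treat: reciprocal of the absolute difference.
    # This figure is NOT reported in the text; it is derived for illustration.
    number_needed_to_treat = 1.0 / absolute_difference
    return {
        "absolute_difference": absolute_difference,
        "ratio": ratio,
        "number_needed_to_treat": number_needed_to_treat,
    }


# Remission proportions quoted in the text: 39.4% versus 19.1%.
remission = compare_proportions(0.394, 0.191)
for name, value in remission.items():
    print(f"{name}: {value:.2f}")
```

On these figures the absolute difference is about 20 percentage points and the ratio about 2.1, which is consistent with "twice as likely"; the derived number needed to treat is roughly 5. None of this says how much better a patient in remission actually feels, which is the question the HRQL data must answer.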
We have described how investigators often substitute endpoints that make intuitive sense to them for those that patients value. Clinicians can recognize these situations by asking themselves the question: if the endpoints measured by the investigators were the only thing that changed, would patients be willing to take the treatment? In addition to change in clinical or physiologic variables, patients would require that they feel better, or live longer.
How can clinicians be secure that investigators have measured aspects of life that patients value? Investigators may show that the outcomes they have measured are important to patients by asking them directly. For example, in a study examining HRQL in patients with chronic airflow limitation, we used a literature review and interviews with clinicians and patients to identify 123 items reflecting possible ways their illness might impact on patients' HRQL [18]. We then asked 100 patients which of the items were problems for them and how important those items were. We found that the most important problem areas for patients were their dyspnea during day-to-day activities, and their chronic fatigue. An additional area of difficulty was emotional function, including feeling frustrated and impatient. If the authors don't present direct evidence that their outcome measures are important to patients, they may cite prior work. For example, a randomized trial of respiratory rehabilitation in patients with chronic lung disease used a HRQL measure based on the responses of patients in the study we've described above, and referred back to that study [19]. Ideally the report will include a summary of the developmental process sufficiently detailed to obviate the need to go back to the prior report.
Alternatively, investigators may describe the content of their measures in detail. An adequate description of the content of a questionnaire allows clinicians to use their experience to decide whether what is being measured is important to patients. For instance, the authors of an article describing a randomized trial of surgery versus watchful waiting for benign prostatic hyperplasia "assessed the degree to which urinary difficulties bothered the patients or interfered with their activities of daily living, sexual function, social activities, and general well-being" [20]. Few would doubt the importance of these items.
In the study of methotrexate for patients with inflammatory bowel disease the patients completed the Inflammatory Bowel Disease Questionnaire (IBDQ) which addresses patients' bowel function, emotional function, systemic symptoms, and social function. Although the authors don't mention this in their paper, the 32 items in the IBDQ were chosen because patients with inflammatory bowel disease labelled them as the most important in their daily lives [21].
Measuring how people are feeling is not easy. Investigators must demonstrate that their instruments allow strong inferences about the effect of treatment on HRQL. We will now review how a HRQL instrument should perform (we call the way it performs its measurement properties) if it is going to be useful.
There are two distinct ways in which investigators use HRQL instruments. They may wish to help clinicians distinguish between people who have a better or worse HRQL, or to measure whether people are feeling better or worse over time [22]. For instance, suppose a trial of a new drug for patients with heart failure shows that it works best in patients with New York Heart Association (NYHA) functional classification Class IV symptoms. We could use the NYHA class for two purposes. One would be to discriminate between patients as to their NYHA class in deciding whom to treat. We might also want to determine whether the drug was effective in improving an individual patient's functional status, and therefore monitor changes in the patient's NYHA functional class.
While, for both purposes, we require a high ratio of signal to noise, when we are discriminating between people at a single point in time, the signal comes from differences between patients (if everyone gets the same score, we can't tell who is better off and who is worse off) and the noise comes from variability within subjects (if patients' scores fluctuate wildly, we're not going to be able to say much about their relative well-being) [23]. The technical term usually used for the ratio of variability between patients to the total variability is reliability.
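The ratio described above can be sketched in a few lines. This is a simplified variance decomposition under stated assumptions, not a formal intraclass correlation: the function name and the repeated scores are hypothetical, and each patient is assumed to be clinically stable between administrations.

```python
from statistics import mean, pvariance


def reliability(scores_by_patient):
    """Between-patient variance divided by total variance.

    scores_by_patient: one list of repeated scores per stable patient.
    Simplification: the variance of the patient means stands in for the
    between-patient signal; the average within-patient variance is the noise.
    """
    patient_means = [mean(scores) for scores in scores_by_patient]
    between = pvariance(patient_means)
    within = mean(pvariance(scores) for scores in scores_by_patient)
    return between / (between + within)


# Hypothetical data: two administrations per patient.
consistent = [[2.0, 2.2], [4.0, 3.8], [6.0, 6.2]]   # scores barely fluctuate
fluctuating = [[2.0, 6.0], [4.0, 2.0], [6.0, 3.0]]  # scores fluctuate wildly

print(round(reliability(consistent), 2))
print(round(reliability(fluctuating), 2))
```

The first data set yields a value close to 1 because nearly all the variability lies between patients; the second yields a value close to 0 because within-patient fluctuation swamps the differences between patients.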
Instruments used to evaluate change over time must, in contrast, be able to pick up any important changes in the way patients are feeling, even if those changes are small. Thus, the signal comes from the difference in score in patients who have improved or deteriorated, and the noise from the variability in score in patients who have not changed. The term we use for the ability to detect change (the ratio of signal to noise over time) is responsiveness.
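One way to express this signal-to-noise ratio over time is the mean change in patients who have truly changed divided by the standard deviation of change in patients who have not. The sketch below assumes that definition; the function name and all change scores are hypothetical, and other responsiveness indices exist.

```python
from statistics import mean, stdev


def responsiveness_ratio(change_in_changed, change_in_stable):
    """Mean change among changed patients over the SD of change among stable ones."""
    signal = mean(change_in_changed)
    noise = stdev(change_in_stable)
    return signal / noise


# Hypothetical change scores on a questionnaire.
improved_patients = [0.6, 0.4, 0.5, 0.7, 0.3]   # patients known to have improved
stable_patients = [0.1, -0.1, 0.2, -0.2, 0.0]   # patients known not to have changed

ratio = responsiveness_ratio(improved_patients, stable_patients)
print(f"responsiveness ratio: {ratio:.2f}")
```

A larger ratio means that a given true improvement stands out more clearly from the background fluctuation of unchanged patients, so fewer patients are needed to detect it.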
An unresponsive instrument can result in a false negative trial in which the intervention improves how patients feel, and yet the instrument fails to detect the improvement. This problem may be particularly salient for questionnaires that have the advantage of covering all relevant areas of HRQL, but the disadvantage of covering each area superficially. A crude instrument such as the NYHA functional classification (with only four categories) may work well for stratifying patients, but may not be able to detect small but important improvement with treatment.
In studies that show no difference in change in HRQL when patients receive a treatment versus a control intervention, clinicians should look for evidence that the instruments have been able to detect small or medium-sized effects in previous investigations. In the absence of this evidence, instrument unresponsiveness becomes a plausible reason for the failure to detect differences in HRQL. For example, a randomized trial of a diabetic education program reported no changes in two measures of well-being, and attributed the result to, among other factors, lack of integration of the program with standard therapy [24]. Given that the program improved knowledge and self-care, and patients felt less dependent on physicians, another explanation is inadequate responsiveness of the two HRQL measures.
In the trial of methotrexate in Crohn's disease, concern about responsiveness decreases because the study showed statistically significant differences between treatment and control groups. As it turns out, the IBDQ had detected small to medium-sized differences in previous investigations [21] [25] [26].
To provide such evidence, investigators have borrowed validation strategies from psychologists who have for many years had to decide whether questionnaires assessing intelligence, attitudes, and emotional function were really measuring what was intended. Investigators interested in attitudes may show apparent differences between individuals that really reflect variability in the tendency to provide socially acceptable answers rather than differences in underlying attitudes; investigators may demonstrate apparent effects of rehabilitation on HRQL, but really be detecting differences in satisfaction with care. In either case, the instrument would be detecting a signal, but it would be the wrong signal.
Establishing validity therefore involves examining the logical relationships that should exist between measures. For example, we would expect that in general patients with lower treadmill exercise capacity will have more dyspnea in daily life than those with higher exercise capacity, and we would expect to see substantial correlations between a new measure of emotional function and existing emotional function questionnaires. When we are interested in evaluating change over time, we examine correlations of change scores: patients who deteriorate on their treadmill exercise capacity should, in general, show increases in dyspnea, while those whose exercise capacity improves should experience less dyspnea; a new emotional function measure should show improvement in patients who improve on existing measures of emotional function. The technical term for this process is testing an instrument's construct validity.
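The correlation of change scores described above can be illustrated with a plain Pearson correlation. The sketch below uses hypothetical data: changes in treadmill exercise capacity (in arbitrary units) and changes in a dyspnea score where higher means more dyspnea. The expected direction, a negative correlation, is the prediction being tested.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)


# Hypothetical change scores for five patients (follow-up minus baseline).
treadmill_change = [-30, -10, 0, 15, 40]       # negative = exercise capacity fell
dyspnea_change = [1.0, 0.5, 0.0, -0.5, -1.2]   # positive = more dyspnea

r = pearson(treadmill_change, dyspnea_change)
print(f"correlation of change scores: {r:.2f}")
```

A strongly negative value supports the instrument's ability to track change in the predicted direction. In practice, as the text notes, such correlations between physiologic measures and symptoms tend to be modest, so validation usually rests on a pattern of several predicted relationships rather than a single coefficient.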
Clinicians should look for evidence of the validity of HRQL measures used in clinical studies. Reports of randomized trials using HRQL measures seldom review evidence for the validity of the instruments they use, but clinicians can gain some reassurance from statements (backed by citations) that the questionnaires have been previously validated. In the absence of evident face validity, or empirical evidence of validity, clinicians are entitled to scepticism about the study's measurement of HRQL.
In the study of methotrexate in inflammatory bowel disease, the investigators refer to the IBDQ as "previously validated" and provide two relevant citations [21] [25]. These papers describe extensive validation of the questionnaire, including correlations of change that document the instrument's usefulness for measuring change over time.
Investigators may have addressed HRQL issues, but not done so comprehensively. Exhaustive measurement may be more or less important in a particular context. One can think of a hierarchy that begins with symptoms, moves on to the functional consequences of the symptoms, and ends with more complex elements such as emotional function. If, as a clinician, you believe your patients' sole interest is in whether a treatment relieves the primary symptoms and most important functional limitations you will be satisfied with a limited range of assessment. Recent randomized trials in patients with migraine [27] [28] and post-herpetic neuralgia [29] restricted themselves primarily to the measurement of pain; studies of patients with rheumatoid arthritis [30] [31] and back pain [32] measured pain and physical function, but not emotional or social function.
As a clinician, you can judge whether or not these omissions are important to you or, more importantly, your patients. We would encourage you, however, to bear in mind the broader impact of disease on patients' lives. Disease-specific measures that explore the full range of patients' problems and experience remind us of domains we might otherwise forget. We can trust these measures to be comprehensive if the developers have conducted a detailed survey of patients suffering from the illness or condition.
If you are interested in going beyond the specific illness and comparing the impact of treatments on HRQL across diseases or conditions, you will require a more comprehensive assessment. None of the disease-specific, system or organ-specific, function-specific (such as instruments that examine sleep or sexual function), or problem-specific (such as pain) measures are adequate for comparisons across conditions. These comparisons require generic measures designed for administration to people with any underlying health problem (or no problem at all) that cover all relevant areas of HRQL.
One type of generic measure, health profiles, yields scores for all domains of HRQL (including, for example, mobility, self-care, and physical, emotional, and social function). There are a number of well-established health profiles, including the Sickness Impact Profile [33] and the short forms of the instruments used in the Medical Outcomes Study [34] [35] that have advantages of simplicity, self-administration, and the ability to put changes in specific functions in the context of overall HRQL. Inevitably, such instruments cover each area superficially. This may limit their responsiveness; indeed, several randomized trials have found that generic instruments were less powerful in detecting treatment effects than specific instruments [19] [36] [37] [38] [39] [40]. Ironically, generic instruments may also suffer from not being sufficiently comprehensive: they may completely omit patients' primary symptoms.
Disease-specific measures may comprehensively sample all aspects of HRQL relevant to a specific illness and also be responsive, but they are unlikely to deal with side effects. For instance, the IBDQ measures all important disease-specific areas of HRQL, including symptoms directly related to the primary bowel disturbance, systemic symptoms, and emotional and social function. Coincidentally, it measures some methotrexate side effects, including nausea and lethargy, because these are also experienced by IBD patients not taking methotrexate, but not others such as skin rash or mouth ulcers. The investigators could have administered a generic instrument to tap into non-IBD-related aspects of HRQL, but once again would likely have failed to measure side effects in sufficient detail. Side-effect-specific instruments are limited; the investigators chose a check-list approach, and documented the frequency of occurrence of adverse events, both those severe enough and those not severe enough to warrant discontinuation of treatment.
While providing information about the broad domains of HRQL, and therefore allowing comparisons across conditions, health profiles are ill-suited for health policy decisions that involve integrating costs. Health policy decisions require choices about resource allocation across diseases, conditions, or medical problems, and also involve considerations of cost. These choices require standardized comparisons that allow one to relate the impact of very different treatments (such as drugs, surgery, or rehabilitation programs) on very different conditions (such as chronic lung disease, renal failure, or Parkinson's disease). Inevitably, they involve putting a value on health states, and may thus require sophisticated weighting for patient preferences, and necessitate relating health states to anchors of death and full health. Such measures may aid policy-makers in making the right decisions about how public money is allocated.
Measures that provide a single number that summarizes all of HRQL, are preference or value-weighted, and have the preferences or values anchored to death and full health are called utility measures. Typically, utility measures use a scale from 0 (death) to 1.0 (full health) to summarize HRQL. Since they weight the duration of life according to its quality, their output is often called quality-adjusted life years. Thus, utilities are holistic measures that ask patients to express, in a single value, their strengths of preferences for particular health states.
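The arithmetic behind quality-adjusted life years, and behind the cost per quality-adjusted life-year figures discussed in this guide, can be sketched as follows. The function names and every number in the example are hypothetical; real analyses typically add refinements such as discounting, which are omitted here.

```python
def quality_adjusted_life_years(periods):
    """Sum of utility x duration over successive health states.

    periods: iterable of (utility, years) pairs, utility on the 0 (death)
    to 1.0 (full health) scale described in the text.
    """
    total = 0.0
    for utility, years in periods:
        if not 0.0 <= utility <= 1.0:
            raise ValueError("utility must lie between 0 and 1")
        total += utility * years
    return total


def cost_per_qaly_gained(extra_cost, qalys_with, qalys_without):
    """Incremental cost divided by incremental quality-adjusted life years."""
    gained = qalys_with - qalys_without
    if gained <= 0:
        raise ValueError("treatment must gain QALYs for this ratio to be meaningful")
    return extra_cost / gained


# Hypothetical patient: 5 years at utility 0.8 with treatment,
# versus 4 years at utility 0.5 without it.
with_treatment = quality_adjusted_life_years([(0.8, 5.0)])
without_treatment = quality_adjusted_life_years([(0.5, 4.0)])
cost = cost_per_qaly_gained(20_000, with_treatment, without_treatment)
print(f"QALYs gained: {with_treatment - without_treatment:.1f}, cost per QALY: {cost:,.0f}")
```

In this invented example the treatment adds one year of life but two quality-adjusted life years, because it improves the quality of the years lived as well as their number. That is the sense in which utility measures weight duration of life by its quality.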
Boyle and colleagues, in a classic article, used a utility measure to calculate that treating critically ill infants weighing 1,000 to 1,499 grams at birth cost $3,200 per quality-adjusted life-year gained, while treating infants with a birth weight of 500 to 999 grams cost $22,400 per quality-adjusted life-year gained [41]. Estimates for the cost per quality-adjusted life year for treating patients on renal dialysis have ranged from approximately $30,000 to $50,000 [42] [43]. While different weighting schemes yield different results, and may therefore be considered arbitrary, a number of increasingly simple utility measures are now available, have provided interesting results in clinical trials, and may facilitate integrating cost into policy decisions. The use, measurement, and interpretation of utility measures remain, however, controversial [44]. The investigators in the methotrexate trial did not use a health profile or a utility measure, limiting use of the data for comparisons across disease states, and preventing a formal economic analysis.
Understanding the results of a trial involving HRQL involves special challenges. Patients with acute back pain prescribed bed rest had mean scores on the Oswestry Back-Disability Index, a measure that focuses on disease-specific functional status, 3.9 points worse than control patients [32]. Patients with severe rheumatoid arthritis allocated to cyclosporin had a mean disability score 0.28 units better than control patients [30]. Are these differences trivial, small but important, of moderate magnitude, or do they constitute large and extremely important differences between treatments?
These examples show that the interpretability of most HRQL measures is not self-evident. There are a number of methods available for understanding the magnitude of HRQL effects. Investigators may relate changes in HRQL questionnaire score to well-known functional measures (such as the New York Heart Association Functional classification), to clinical diagnosis (such as the change in score needed to move people in or out of the diagnostic category of depression), or to the impact of major life events [45]. They may relate changes in HRQL score to patients' global ratings of the magnitude of change they have experienced [46], or to the extent they rate themselves as better or worse than other patients [47]. Whatever the strategy, if investigators don't provide an indication of how to interpret changes in HRQL score, the findings are of limited use to clinicians.
Even if we did know that 3.9 points on the Oswestry Back-Disability Index or 0.28 units on a rheumatoid arthritis disability index signified, for instance, small but important changes, mean differences between groups may be difficult to interpret. Clinicians may find the proportion of patients who achieved small, medium and large gains due to treatment more informative.
The investigators who conducted the trial of methotrexate for Crohn's disease do not help clinicians interpret the magnitude of difference in HRQL. The mean difference in IBDQ score between treatment and control groups at 16 weeks was 0.59. Other investigations suggest that differences of approximately 0.5 may represent small but important changes, while large improvements correspond to a difference in score of greater than 1.0 [46] [47] [48] [49]. Thus, the mean difference between treated and control patients in the methotrexate study likely falls into the category of small but important change in HRQL.
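The interpretation above, and the earlier suggestion that proportions of patients achieving gains of different sizes may be more informative than a mean, can be sketched together. The cut-points of 0.5 and 1.0 are the approximate IBDQ figures quoted in the text; the band labels, function names, and individual patient changes are hypothetical, and the text does not say how a difference of exactly 1.0 should be labelled.

```python
from collections import Counter

BANDS = ("below threshold", "important but not large", "large")


def gain_band(change, small=0.5, large=1.0):
    """Place a change in score into one of three bands using approximate cut-points."""
    if change > large:
        return "large"
    if change >= small:
        return "important but not large"
    return "below threshold"


def band_proportions(changes):
    """Proportion of patients whose change falls in each band."""
    counts = Counter(gain_band(change) for change in changes)
    return {band: counts[band] / len(changes) for band in BANDS}


# The mean between-group difference reported in the text.
print(gain_band(0.59))

# Hypothetical individual changes for eight treated patients.
patient_changes = [0.1, 0.6, 1.4, 0.7, -0.2, 1.1, 0.5, 0.3]
print(band_proportions(patient_changes))
```

The same mean can arise from very different distributions: a few patients with large gains and many with none, or most patients with modest gains. Reporting the proportions in each band makes that distinction visible to the clinician and the patient.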
People with the same chronic disease often vary markedly in the problems they experience. Even if the problems are the same, the magnitude of the impact of those problems in their lives may differ. Assessment of HRQL will only help in the care of an individual patient if that patient's problems are similar to those of patients in the trial.
Knowing whether HRQL results of a study are relevant for your patients means understanding their experience of illness. Even the most common problems of a chronic disease don't affect all those afflicted. For instance, 92% of patients with inflammatory bowel disease complain of frequent bowel movements, and 82% of abdominal cramps. With respect to emotional function, 78% feel frustrated, and 76% depressed. The patients who experience these difficulties vary in the extent to which they feel the problems are important. Thinking back to the scenario, before answering the question about how the treatment would impact on the patient's life, the clinician would have to find out the problems the patient was currently experiencing, the importance he attached to those problems, and the value he might attach to having the problems ameliorated.
Reflecting further on the process of communicating with patients, HRQL instruments that focus on specific aspects of patients' experience may be of more use than global measures. Patients with chronic lung disease may find it more informative to know that their compatriots offered a treatment became less dyspneic and fatigued in daily activity, rather than simply that they judged their HRQL as improved. HRQL measures will be most useful when results facilitate their practical use by you and your patients.
Treatments affect HRQL both by reducing disease symptoms and consequences and by creating new problems. Side effects may make the cure worse than the disease. Clinicians conducting clinical trials are usually blind to treatment allocation, and try to maintain patients on study medication as long as possible. Patients may therefore soldier on in the face of considerable side effects, and this may be reflected in their HRQL.
This is not how we conduct our clinical practice. If patients experience significant side effects, we discontinue the medication, particularly if there is a suitable alternative. Thus, the design of the clinical trial may create an artificial situation with misleading estimates of the impact of treatment on HRQL. This issue is of particular concern for treatments such as antihypertensives in which much of the impairment in HRQL may be due not to the medical condition, but to the treatment.
The trial of methotrexate in Crohn's disease simulated clinical practice well. Any side effects of the medication would have compromised HRQL so that the positive effect we see is likely to be a conservative estimate of treatment benefit. If the patient is experiencing similar problems to the trial patients, and if those problems are important to him, he is likely to achieve comparable benefit to patients enrolled in the trial.
We encourage clinicians to consider the impact of their treatments on patients' HRQL, and to look for information regarding this impact in clinical trials. Responsive, valid, and interpretable instruments measuring experiences of importance to most patients should increasingly help guide our clinical decisions.