How to use an Article Reporting Variations in the Outcomes of Health Services Research
C. David Naylor and Gordon H. Guyatt for the Evidence-Based Medicine Working Group
Based on the Users' Guides to Evidence-based Medicine and reproduced with permission from JAMA (1996;275(7):554-8). Copyright 1996, American Medical Association.
- Case Scenario
- The Search
- Introduction
- I. Are the recommendations valid?
- II. What are the recommendations?
- III. Will the recommendations help you in caring for your patients?
- Conclusions and Resolution
- References
Case Scenario
Your patient, a 78-year-old retired internist, has been complaining of increasing symptoms of benign prostatic hypertrophy. He has long-standing hypertension and coronary artery disease, with a remote antero-lateral infarction and bypass surgery ten years ago. His left ventricular ejection fraction was recently documented at 20%, and he has been started on an angiotensin converting enzyme inhibitor. Rectal examination confirms a moderately enlarged prostate, without irregularities, nodularity, or tenderness. As you discuss management options, your patient insists that transurethral prostate surgery is dangerous, and that international studies of thousands of patients have proved that, as he puts it, "old-fashioned open prostatectomy is safer than that keyhole surgery". You prescribe a trial of an alpha-blocker, terazosin, but expect that he will still need surgery at some point. Moreover, apart from this patient's case, you are deeply concerned that you may have harmed others by referring them for the wrong procedure.
The Search
You sit down in the hospital library, using a program that contains the MEDLINE database from January 1990 to October 1994. You start from "Explode Prostatic Hypertrophy", limit the search to English language reports on human subjects, and then combine the resulting set with "transurethral" and "mortality" as text words. This yields 27 citations. As you browse through the resulting abstracts, two appear to address your patient's concern. One, by a Danish group [1], addresses the long-term outcomes of transurethral versus "open" (suprapubic or transvesical) prostatectomy using hospitalization data linked to vital status data for the entire Danish male population from 1977 to 1985. The study relies on administrative data and massive population-based numbers (38,067 men) and shows excess mortality among patients undergoing transurethral resection of the prostate (TURP). The other report, by Concato et al [2], offers long-term outcomes data on only 252 patients who underwent either procedure at a Yale teaching hospital in New Haven, Conn. between 1979 and 1981. However, a detailed chart audit was undertaken, and the results suggested that patients undergoing the more extensive open procedure had lower long-term mortality because they were healthier at the outset.
Introduction
Over the last decade, changes in health care delivery have broadened the range of groups interested in the outcomes of medical care. Concern with costs, and with dramatic inter-regional or international differences in practice among clinicians and institutions, have focused the attention of administrators and politicians on the interplay between the processes and outcomes of health services. The evolution of managed care has sharpened interest in measuring and managing the quality of care delivered by individual practitioners, hospitals, and other institutions.
Implicitly, the questions about quality of care, and the best way of delivering health services, are issues of optimal treatment. From a prior Users' Guide you've learned that the key issues in assessing the validity of a treatment study are randomization and completeness of follow-up [3]. However, investigators are not generally going to be able to randomize patients to different practitioners or hospitals, and focusing on the outcomes associated with these differences in care will require strategies other than randomized trials. Increasingly, investigators have looked to large administrative data bases to examine the outcomes of care associated with different procedures, practitioners, or institutions. Under what circumstances should you believe the inferences made on the basis of such studies?
There is a parallel here with studies assessing potential harm to patients: it is impossible to randomize people to smoke or not, or to various levels of air pollution, and so observational studies or "natural experiments" are used as sources of insight. In a previous Users' Guide we provided criteria for validity for the observational studies that investigators must use when exploring issues of harm [4]. The challenges are fundamentally the same for comparing outcomes of two or more sets of health care practitioners or delivery systems. We felt, however, that observational studies based on administrative data bases have become so prevalent, are having so much influence, and are sufficiently associated with their own particular challenges, that the issues are worthy of a special Users' Guide. Table 1 revisits our criteria for assessing an article about harm, modified here for examining associations between variations in processes and outcomes of health care in the "real world" setting.
I. Are the recommendations valid?
i. Are the outcome measures accurate and comprehensive?
A randomized therapeutic trial must have valid and reliable outcome measures; so must any observational study assessing patients' outcomes. The easiest outcomes for health researchers to measure are those that are defined objectively and usually captured in large insurance data bases or computerized hospital administrative data, e.g. death, those in-hospital complications of surgery that are routinely coded, or readmissions to hospital. Linkage to vital status registries is also performed to track out-of-hospital deaths. However, other outcomes, e.g. disability, discomfort, distress and dissatisfaction, are important to patients. Functional status and quality of life measures are needed to capture these burdens, but these measures are not applied in routine clinical care, and if applied, their results are not incorporated into administrative data bases.
In short, many large data bases are not designed for clinical research and may either mismeasure patients' outcomes or fail to capture outcomes that are important to patients [5]. Researchers should therefore report on the quality and comprehensiveness of the data source. Ideally there should be independent cross-checks to ensure that the same outcomes are measured consistently and completely for whatever unit of comparison is used.
How did our two studies of prostate surgery perform in these respects? Andersen et al [1] used vital status data for the entire population of Denmark, and therefore mortality was measured in a reliable and unbiased fashion across all groups for comparison. Concato et al [2] reported only on all-cause mortality data within 5 years of the procedure obtained by hospital chart review and, where those data were inconclusive, from the national vital status registry.
From the standpoint of relieving obstructive or irritative symptoms of benign prostatic hypertrophy, the complete resection attained by "open" prostatectomy yields excellent results and eliminates the need for the repeat procedures that are occasionally required after TURP. However, neither study provided a tally of these or other outcome measures of interest to patients and physicians, such as overall recovery time, rates of perioperative complications, impotence, incontinence, and so forth. In short, by assessing only mortality, these outcomes studies provided an incomplete tally of the burdens and benefits of the two treatments being compared.
ii. Were the comparison groups similar with respect to important determinants of outcome, other than the one of interest, and were residual differences adjusted for in the analysis?
Clinicians and health care managers are interested in a variety of determinants of outcome, the major categories of which are shown in Table 2. One type of comparison examines differences that may be due to variations in quality of care across individual practitioners or institutions providing care in a specific city or region. State agencies now publish some provider- or institution-specific outcomes, and researchers sometimes relate these outcomes to the provider- or institution-specific volume of the services under scrutiny. This reflects a belief that "practice makes perfect" -- all things being equal, centres (and, by inference, physicians or surgeons) with a higher caseload will generally achieve better outcomes than lower volume centres. For example, various studies suggest that in-hospital post-operative mortality after aortic aneurysm surgery [6], percutaneous transluminal coronary angioplasty [7] and coronary artery bypass graft surgery [8] [9] is lower for centres or surgeons that manage more patients.
However, the greater the difference between service settings being compared, the more difficult it is to be sure that patients were similar, or to isolate which aspects, if any, of the process of care relate to the outcomes observed. This is especially true when comparisons are made on a broad geographic footing between regions or countries in which populations and processes of care differ in many ways. One recent study compared outcomes of Canadian and American patients enrolled in a major trial of thrombolytic therapy for acute myocardial infarction [10]. Rates of revascularization and use of specialist services were much higher in America. The investigators used an appropriately broad range of outcomes measures, and observed that in terms of symptoms, functional status, psychological well-being and health-related quality of life, Canadian patients fared somewhat worse than their American counterparts -- a finding of obvious concern to Canadian practitioners. However, some of the difference may be because the types of patients recruited by Canadian investigators were destined for worse outcomes irrespective of management. Canadians may also have a different cultural threshold for reporting symptoms or functional impairment.
A third source of variations in outcomes that may occur within similar health systems is the type of treatment provided. This is the sort of comparison that was done in the outcomes studies of TURP versus open prostatectomy described in this article's opening scenario. Such comparisons may avoid some of the broad health system effects and sociocultural or even genetic differences that threaten the validity of outcomes comparisons made across widely-disparate populations. However, it is still possible that differences in outcomes may have been due to differences in patients receiving the alternative management strategies, for without randomization, patients will inevitably differ in ways other than the treatment being provided to them. This phenomenon is called selection bias. When two alternative procedures are being compared, selection bias arises from the exercise of good clinical judgement in routine practice. For example, urologists may choose younger, healthier patients to undergo the more extensive open prostatectomy, and older, sicker patients for TURP. Patients then end up differing in obvious or subtle ways that affect their likelihood of having a good or bad outcome. Epidemiologists use the term "confounding" to describe this problem. The validity of any form of observational research is threatened by case selection biases that create non-comparable groups of patients and confound any outcomes comparisons.
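The way selection bias confounds a crude comparison can be illustrated with a small numerical sketch. The counts below are hypothetical and invented purely for illustration (they are not taken from either prostatectomy study), and the Mantel-Haenszel summary risk ratio is used here only as one simple, standard way of adjusting for a single stratifying factor. Within each health stratum the risk of death is identical for the two procedures, yet sicker patients are steered toward TURP.

```python
# Hypothetical counts, invented purely for illustration (not from either study).
# Each stratum: (turp_deaths, turp_total, open_deaths, open_total)
STRATA = {
    "healthier": (10, 200, 40, 800),   # 5% mortality with either procedure
    "sicker": (160, 800, 40, 200),     # 20% mortality with either procedure
}


def crude_risk_ratio(strata):
    """Risk ratio (TURP vs open) after pooling all strata together."""
    turp_deaths = sum(s[0] for s in strata.values())
    turp_total = sum(s[1] for s in strata.values())
    open_deaths = sum(s[2] for s in strata.values())
    open_total = sum(s[3] for s in strata.values())
    return (turp_deaths / turp_total) / (open_deaths / open_total)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel summary risk ratio, adjusted for the stratifying factor."""
    numerator = 0.0
    denominator = 0.0
    for turp_deaths, turp_total, open_deaths, open_total in strata.values():
        stratum_total = turp_total + open_total
        numerator += turp_deaths * open_total / stratum_total
        denominator += open_deaths * turp_total / stratum_total
    return numerator / denominator


crude = crude_risk_ratio(STRATA)
adjusted = mantel_haenszel_risk_ratio(STRATA)
print(f"crude risk ratio:    {crude:.2f}")
print(f"adjusted risk ratio: {adjusted:.2f}")
```

With these made-up numbers the pooled comparison suggests that TURP more than doubles the risk of death, whereas the stratified estimate shows no difference at all. The sketch assumes the confounder is known and accurately recorded, which is exactly what cannot be taken for granted in practice.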
Researchers must therefore somehow adjust for differences between groups of patients. The sophistication of these so-called risk adjustment methods is growing rapidly [11]. However, researchers and quality-of-care evaluators are unlikely to know all the prognostic factors that interact with treatments to affect outcomes. Randomization is important precisely because it accounts for these unknown factors. The problem gets worse when one considers that not all known prognostic features may have been measured, and if they have been measured, they may not have been measured or recorded accurately. Inaccurate measurement or recording is a particular concern when information comes from administrative data bases. For instance, Jollis et al [12] compared information about cardiac risk factors in an administrative database in patients undergoing angiography with information collected prospectively for a clinical database by a cardiac fellow who actually saw the patients. A chance-corrected measure of agreement (kappa statistic) showed good agreement only for diabetes (83% agreement) and whether patients had an acute myocardial infarction (76%); agreement was moderate for hypertension (56%), poor for the presence of heart failure (39%), and no better than chance (9%) for unstable angina. Hannan et al [13] found similar discrepancies in comparing a cardiac surgery registry to an administrative database in New York State. These inaccuracies mattered: the ability of evaluators to predict mortality was clearly higher with the detailed clinical data as opposed to the administrative database [13]. Thus, the accuracy (and fairness) of adjustments for differences in patients can be undermined by poor data quality.
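For readers unfamiliar with the kappa statistic mentioned above, the following minimal sketch shows how chance-corrected agreement is computed when two data sources each record a risk factor as present or absent. The counts are hypothetical and are not taken from Jollis et al; the function name and argument names are likewise illustrative.

```python
def cohens_kappa(both_yes, a_only, b_only, both_no):
    """Cohen's kappa for two sources coding the same patients as yes/no."""
    n = both_yes + a_only + b_only + both_no
    # Raw agreement: proportion of patients coded identically by both sources.
    observed = (both_yes + both_no) / n
    # Agreement expected by chance, given how often each source says "yes".
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical example: 100 patients, risk factor coded in two databases.
raw_agreement = (20 + 60) / 100
kappa = cohens_kappa(both_yes=20, a_only=10, b_only=10, both_no=60)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```

In this made-up example the two sources agree on 80% of patients, but once agreement expected by chance alone is removed, kappa is only about 0.52. This is why chance-corrected figures can look much less reassuring than simple percentage agreement.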
The problem of limited or inaccurate data in insurance databases or computerized hospital discharge abstracts may be partly ameliorated by supplementing the information with chart audits [14]. This is time-consuming and expensive, but may be the only way to reduce the chances of missing, or misconstruing, important differences among groups of patients. A more efficient mechanism may be to establish specific registry mechanisms geared to measuring key patient characteristics, process of care elements, and relevant outcomes.
How, then, can you best assure yourself that, short of randomization, investigators have made the fairest possible outcomes comparison? We summarize the steps in Table 3. First, did the researchers convince you, through their review of the literature and on the basis of what you know about the determinants of prognosis, that they measured all of the important prognostic factors? Second, since these measurements are only as good as the data that go into them, you should consider whether these measures of patients' prognostic factors are reproducible and accurate. Third, did they show the extent to which the groups being compared differed on the prognostic factors that they measured? Fourth, did the researchers use some form of multivariate analysis wherein they tried to adjust simultaneously for all the important prognostic factors?
Localio and colleagues have recently reported on the consequences of not taking into account all possible prognostic factors. A large corporation's managed care program sought to determine which of the hospitals serving the corporation's employees delivered better quality of care as reflected in part by fewer in-hospital deaths [15]. The consultant concluded that the hospitals differed, and this conclusion influenced the company's choices about hospital selection. As it turned out, an appropriate analysis conducted by a group of academic investigators concluded that the difference between even the hospital with the worst record and the rest could easily be attributed to the play of chance. Furthermore, when the investigators included an adjustment for age, a prognostic factor which had been left out of the consultant's initial analysis, the rank order of the hospitals changed [16].
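The role of chance in hospital rankings can be made concrete with a short sketch. It assumes, hypothetically, that every hospital has exactly the same true mortality, and then asks how often at least one of them would nonetheless post a strikingly bad record. All numbers are invented for illustration and do not come from the Localio study; the calculation is a plain binomial tail probability.

```python
from math import comb


def binomial_tail(n, p, k):
    """Probability of k or more deaths among n patients with death risk p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# All numbers below are hypothetical, chosen only to illustrate the idea.
HOSPITALS = 20       # number of hospitals being ranked
PATIENTS = 50        # patients treated at each hospital
TRUE_RISK = 0.10     # identical true mortality at every hospital
WORST_DEATHS = 9     # deaths at the apparently "worst" hospital (18%)

# Chance that one particular hospital has a record at least this bad.
p_single = binomial_tail(PATIENTS, TRUE_RISK, WORST_DEATHS)
# Chance that at least one of the hospitals has a record at least this bad.
p_any = 1 - (1 - p_single) ** HOSPITALS

print(f"one given hospital: {p_single:.3f}")
print(f"at least one of {HOSPITALS}: {p_any:.3f}")
```

Under these assumptions, a mortality nearly double the true rate is unusual for any single named hospital but quite likely to turn up somewhere among twenty identical hospitals. Singling out the worst performer after the fact therefore overstates the evidence that it truly differs from the rest.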
Because observational data are so susceptible to selection biases that may confound the outcome comparisons, the researchers should determine whether their results persist when they analyze the data in different ways. For example, if there is a severe imbalance in allocation of patients with a particularly important prognostic factor, it may make sense to eliminate all patients with that factor, and repeat the analyses. Unfortunately, even relative balance on a prognostic factor does not guarantee comparability. One reason is that administrative data and registries tend to use fairly simple categories, such as whether a disease is or is not present. Yet "disease present" may be associated with a wide range of underlying dysfunction, and therefore equally variable prognosis. Patients with chronic lung disease, or chronic heart failure, for instance, can vary from mild to severe, with very different prognostic implications. Thus, apparent balance on the proportion of patients with these diagnoses can mask a situation in which one group has many more severely affected patients than the other. This is even true for advanced age as a prognostic factor, since elderly persons may vary considerably in their overall robustness.
Because of this problem, a useful double-check in any outcomes comparison is to ensure that the findings are replicable within a relatively low-risk subgroup of the patients being examined. By eliminating patients in categories associated with widely varying physiological states, we increase the likelihood of a "level playing field" for comparisons.
II. What are the recommendations?
How do our two studies of prostate surgery measure up in this regard? Andersen et al considered patients' ages at surgery, but relied only on diagnoses coded in the computerized hospital records as indicating compromised health status. Even with these limited data, fewer open prostatectomy patients had high-risk diagnoses. They were also younger, and had less heart disease and cancer. In a multivariate analysis to try to adjust for these differences, it did appear that TURP continued to confer a 30%-40% relative increase in the risk of death over several years of follow-up. Extensive sensitivity analyses were performed, including a specific examination of low-risk patients (described as "healthiest men"). Although low-risk patients also showed an excess risk with TURP, the relative magnitude of the increased risk of death was smaller for low-risk patients than high-risk patients. As Andersen et al [1] stated: "The extent to which this difference is attributable to the surgical intervention itself remains an open question. The two groups of patients are quite different with regard to age and preoperative health status, and available data may not be sufficient to control such differences through statistical analysis."
III. Will the recommendations help you in caring for your patients?
Concato et al [2] used chart review methods, with a detailed and systematic abstraction of information related to health status based on inpatient and ambulatory care records. They carefully confirmed that two reviewers independently agreed on patients' health status assessments. Patients in the TURP group were again found to be older and sicker. However, in a multivariate analysis, the adjusted excess risk of TURP diminished as the degree of detail on comorbidity was increased. Their best estimate was that TURP actually conferred no increased risk relative to open prostatectomy. Unfortunately, owing to the small sample size, their results were very imprecise, with 95% confidence limits ranging from much reduced to much increased risk with TURP (from 0.57 to 1.87). Thus, the Yale study highlights the issue of non-comparability and selection biases, but does not rule out harms of the magnitude demonstrated by the Danish investigators. Moreover, the study provides data on outcomes for only a single city; the results may not be generalizable.
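The link between sample size and the width of a confidence interval can be sketched as follows. The counts are hypothetical (they are not the Yale or Danish data), and the calculation is the simple unadjusted log-based interval for a risk ratio, not the multivariate method used by Concato et al.

```python
from math import exp, log, sqrt


def risk_ratio_ci(deaths_a, total_a, deaths_b, total_b, z=1.96):
    """Risk ratio of group A vs group B with an approximate 95% interval."""
    rr = (deaths_a / total_a) / (deaths_b / total_b)
    # Standard error of log(risk ratio) for two independent proportions.
    se = sqrt(1 / deaths_a - 1 / total_a + 1 / deaths_b - 1 / total_b)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


# Hypothetical counts: the same 16% mortality in both groups at two sample sizes.
small = risk_ratio_ci(20, 125, 20, 125)            # 250 patients in total
large = risk_ratio_ci(2000, 12500, 2000, 12500)    # 25,000 patients in total

for label, (rr, low, high) in (("small", small), ("large", large)):
    print(f"{label}: RR {rr:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

Both hypothetical studies estimate a risk ratio of 1, but the small one cannot exclude a substantial increase or decrease in risk, while the large one confines the estimate to a narrow band. A null finding from a small study is therefore compatible with an important harm.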
Conclusions and Resolution
Given the limitations of observational studies of large data-bases, can we better define the role of this sort of health services research? First, we should acknowledge that once randomized trials have helped define evidence-based practices, observational studies of outcomes of care will remain useful in generating important information about what happens when these practices are used in the "real world" as opposed to the selected populations of patients and practitioners participating in randomized trials. This information not only deepens our understanding of practical effectiveness as opposed to theoretical efficacy; it may also add information since trials do not measure all the outcomes of interest to patients and physicians.
However, this complementary or supplementary role of large-scale observational studies departs sharply from using administrative data or clinical registries to decide which specific management strategies will yield better outcomes: e.g. surgical versus medical, invasive versus non-invasive, different surgical procedures, and so on. For such comparisons, randomized trials are usually possible. Given the unavoidable biases of observational studies, we should generally insist on randomized trials in the first instance to determine the relative merits of treatments.
Do observational studies have any role at all in choosing best practices? Randomized trials are expensive and difficult to conduct, and we cannot therefore undertake randomized trials of all the questions in which we might be interested. Observational studies may identify situations in which one therapy appears so much better than an alternative that bias would be a very unlikely explanation for the difference. In addition, observational studies may identify hitherto improbable hypotheses worthy of further study. The possibility that open prostatectomy has a lower mortality than transurethral prostatectomy is one example. Finally, if the outcomes of interest are very rare, such as unusual idiosyncratic side effects of a drug, researchers can only obtain adequate sample sizes through use of administrative data bases.
There are other situations in which randomized trials are not possible, such as looking for systematic variations in outcomes of similar procedures provided by different practitioners or institutions. It is untenable to assume that all hospitals or providers practice equally well, and observational outcomes comparisons have a role in assessing quality of care. This is especially applicable for some well-defined services (e.g. coronary artery bypass grafting) where there are validated risk-adjustment algorithms [17] [18] [19] [20] and dedicated registries to measure risk factors and outcomes, so that these comparisons are probably meaningful. In general, however, we must weigh potential harm to patients from poor quality care against the harm to skilled and committed health workers and fine institutions caused by poorly-founded inferences about inferior outcomes.
Given the relatively weak inferences possible from most observational studies of outcomes, we should always consider alternative strategies for ensuring the quality of medical care. For some processes of care (though certainly not all, as we cautioned in the previous article in this series), we can accurately document what went on and make confident judgements about its appropriateness. For example, we know from randomized trials that pre-operative antibiotic and antithrombotic prophylaxis before a variety of forms of surgery improves patients' outcomes. If these measures are omitted, we know that patients will suffer and that practitioners and institutions can improve their quality of care. We suggest that in most instances it is most efficient to use randomized trials or meta-analyses of trials to establish optimal management strategies, and then to maintain quality of care by monitoring the process of care to ensure that well-proven practices are consistently applied to eligible patients.
What, then, of your patient? Perhaps predictably, given what we know about the limitations of observational studies, your exploration has been inconclusive. Indeed, had you used MEDLINE on CD-ROM for the years prior to 1990, the relevant literature would not have moved you much further. Related work [21] [22] on elevated mortality after TURP as opposed to open prostatectomy has incorporated extra detail on differences among patients drawn from chart reviews, and failed to eliminate the excess mortality seen with TURP; however, the adjustments were arguably less detailed than those used by Concato et al [2]. One very small randomized trial has also shown a trend to excess mortality with TURP [23]. On the other hand, there has been no definitive trial comparing the two forms of surgery, and TURP remains the predominant procedure for benign prostatic hypertrophy.
The retired internist returns in four weeks as planned. "Was I right about the risks of the keyhole method?" he asks. You admit that the abandonment of open prostatectomy may have been premature, but caution that his age and medical status make him a poor candidate for the more extensive procedure, even if you could find a urologist competent to do it. Hearing your own advice, you again appreciate that similar selection biases may be the real reasons for the apparently higher mortality after TURP. Fortunately, your patient has had an excellent response to the alpha-blocker, and the issue of prostatectomy can be set aside for some time. As you usher him from the office, he grumbles: "By the way, did you see that the operative mortalities for all the local heart surgeons are on the front page of the newspaper? Thank heavens I retired."
Table 1: Three core questions to ask about a study using an observational design to examine sources of difference in patients' outcomes
Table 2: Factors that may systematically affect outcomes
Table 3: Determining whether differences in prognosis, rather than differences in the intervention, explain differences in outcomes
References
1. Andersen TF, Bronnum-Hansen H, Sejr T, Roepstorff C. Elevated mortality following transurethral resection of the prostate for benign hypertrophy: But why? Med Care 28. 870-81 (1990).
2. Concato J, Horwitz RI, Feinstein AR, Elmore JG, Schiff SF. Problems of comorbidity in mortality after prostatectomy. JAMA 267. 1077-82 (1992).
3. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the medical literature: II. How to use an article about therapy or prevention: A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 270. 2598-601 (1993).
4. Levine MS, Walter SS, Lee HN, Haines T, Holbrook A, Moyer V. Users' guides to the medical literature: IV. How to use an article about harm. Evidence-Based Medicine Working Group. JAMA 271. 1615-9 (1994).
5. Dans PE. Looking for answers in all the wrong places. Ann Intern Med 119. 855-7 (1993).
6. Hannan EL, Kilburn H, Jr., O'Donnell JF, Bernard HR, Shields EP, Lindsey ML, Yazici A. A longitudinal analysis of the relationship between in-hospital mortality in New York State and the volume of abdominal aortic aneurysm surgeries performed. Health Serv Res 27. 517-42 (1992).
7. Jollis JG, Peterson ED, Delong ER, Mark DB, Collins SR, Muhlbaier LH, Pryor DB. The relation between the volume of coronary angioplasty procedures at hospitals treating Medicare beneficiaries and short-term mortality. N Engl J Med 331. 1625-9 (1994).
8. Showstack JA, Rosenfeld KE, Garnick DW, Luft HS, Schaffarzick RW, Fowles J. Association of volume with outcome of coronary artery bypass graft surgery. Scheduled vs nonscheduled operations. JAMA 257. 785-9 (1987).
9. Hannan EL, Kilburn H, Jr., Bernard H, O'Donnell JF, Lukacik G, Shields EP. Coronary artery bypass surgery: the relationship between inhospital mortality rate and surgical volume after controlling for clinical risk factors. Med Care 29. 1094-107 (1991).
10. Mark DB, Naylor CD, Hlatky MA, Califf RM, Topol EJ, Granger CB, Knight JD, Nelson CL, Lee KL, Clapp-Channing NE. Use of medical resources and quality of life after acute myocardial infarction in Canada and the United States. N Engl J Med 331. 1130-5 (1994).
11. Daley J, Shwartz M. Developing risk-adjustment methods. In: Iezzoni LI, ed. Risk adjustment for measuring health care outcomes. Ann Arbor: Health Administration Press. 199-238 (1994).
12. Jollis JG, Ancukiewicz M, Delong ER, Pryor DB, Muhlbaier LH, Mark DB. Discordance of databases designed for claims payment versus clinical information systems: implications for outcomes research. Ann Intern Med 119. 844-50 (1993).
13. Hannan EL, Kilburn H, Jr., Lindsey ML, Lewis R. Clinical versus administrative data bases for CABG surgery. Does it matter? Med Care 30. 892-907 (1992).
14. Malenka DJ, McLerran D, Roos N, Fisher ES, Wennberg JE. Using administrative data to describe casemix: a comparison with the medical record. J Clin Epidemiol 47. 1027-32 (1994).
15. Localio AR, Hamory BH, Sharp TJ, Weaver SL, TenHave TR, Landis JR. Comparing hospital mortality in adult patients with pneumonia. A case study of statistical methods in a managed care program. Ann Intern Med 122. 125-32 (1995).
16. Wu AW. The measure and mismeasure of hospital quality: appropriate risk-adjustment methods in comparing hospitals. Ann Intern Med 122. 149-50 (1995).
17. Tu JV, Jaglal SB, Naylor CD. Multicenter validation of a risk index for mortality, intensive care unit stay, and overall hospital length of stay after cardiac surgery. Steering Committee of the Provincial Adult Cardiac Care Network of Ontario. Circulation 91. 677-84 (1995).
18. O'Connor GT, Plume SK, Olmstead EM, Coffin LH, Morton JR, Maloney CT, Nowicki ER, Levy DG, Tryzelaar JF, Hernandez F. Multivariate prediction of in-hospital mortality associated with coronary artery bypass graft surgery. Northern New England Cardiovascular Disease Study Group. Circulation 85. 2110-8 (1992).
19. Higgins TL, Estafanous FG, Loop FD, Beck GJ, Blum JM, Paranandi L. Stratification of morbidity and mortality outcome by preoperative risk factors in coronary artery bypass patients. A clinical severity score. JAMA 267. 2344-8 (1992).
20. Edwards FH, Clark RE, Schwartz M. Coronary artery bypass grafting: the Society of Thoracic Surgeons National Database experience. Ann Thorac Surg 57. 12-9 (1994).
21. Roos NP, Wennberg JE, Malenka DJ, Fisher ES, McPherson K, Andersen TF, Cohen MM, Ramsey E. Mortality and reoperation after open and transurethral resection of the prostate for benign prostatic hyperplasia. N Engl J Med 320. 1120-4 (1989).
22. Malenka DJ, Roos N, Fisher ES, McLerran D, Whaley FS, Barry MJ, Bruskewitz R, Wennberg JE. Further study of the increased mortality following transurethral prostatectomy: a chart-based analysis. J Urol 144. 224-7; discussion 228 (1990).
23. Meyhoff HH. Transurethral versus transvesical prostatectomy. Clinical, urodynamic, renographic and economic aspects. A randomized study. Scand J Urol Nephrol Suppl 102. 1-26 (1987).
© 2001 Evidence-Based Medicine Informatics Project

