Andrew D. Oxman, Deborah J. Cook, Gordon H. Guyatt, for the Evidence Based Medicine Working Group
Based on the Users' Guides to Evidence-based Medicine and reproduced with permission from JAMA (1994 Nov 2;272(17):1367-71). Copyright 1995, American Medical Association.
A 55-year-old man had his serum cholesterol measured at a shopping mall two months ago. His cholesterol proved elevated and he comes to you, his primary care physician, for advice. He does not smoke, is not obese, and does not have hypertension, diabetes mellitus, or any first-degree relatives with hypercholesterolemia or premature coronary heart disease (CHD). You repeat his cholesterol test and schedule a follow-up appointment. The test confirms an elevated cholesterol (7.9 mmol/l), but before deciding on a treatment recommendation, you elect to find out just how big a reduction in the risk of CHD this patient could expect from a cholesterol-lowering diet or drug therapy.
There are a number of cholesterol-lowering trials, and instead of trying to find and review all of the original studies yourself, you use Grateful Med to find a recent overview. On the first subject line you select "hypercholesterolemia" and "cholesterol" from the list of Medical Subject Headings (MeSH) used to index articles. On the second subject line you use the MeSH term "Coronary Disease," which you explode so as to capture articles that are indexed with more specific terms that come under Coronary Disease, such as "Myocardial Infarction." You limit your search to English-language reviews. Since the 170 references that this search yields are too many, you substitute "meta-analysis" for "review" on the line for publication type in order to find a quantitative review. Titles and abstracts suggest four of the 11 references from this last search are on target. You decide to examine the most recent of these.
Systematic overviews of the medical literature that summarize scientific evidence (in contrast to unsystematic narrative reviews that mix together opinions and evidence) are becoming more prevalent, and are increasingly useful for clinicians trying to make optimal management decisions. These overviews may address questions of treatment, causation, diagnosis, or prognosis. In each case, the rules for deciding whether the overviews are credible, and for interpreting their results, are similar. In this article, we provide guidelines for distinguishing a good overview from a bad one, and for using the results. In doing so, we will ask the same key questions that we have suggested for original reports of research (Table 1). Are the results valid? If they are, what are the results, and will they be helpful in my patient care?
Table 1. Users' guides for how to use review articles
I. Are the results of the study valid?
II. What are the results?
III. Will the results help me in caring for my patients?
The terms literature review, overview, and meta-analysis are sometimes used interchangeably. We use overview as a term for any summary of the medical literature, and meta-analysis as a term for reviews that use quantitative methods to summarize the results. A number of authors have recently examined issues pertaining to the validity of overviews. In this section we will emphasize key points from the perspective of a clinician needing to make a decision about patient care.
Unless an overview clearly states the question it addresses, you can only guess whether it is pertinent to your patient care. Most clinical questions can be formulated in terms of a simple relationship between the patient, some exposure (to a treatment, a diagnostic test, a cause, etc), and one or more specific outcomes of interest. If the main question that an overview addresses is not clear from the title or abstract, it is probably a good idea to move on to the next article.
Many overviews address a number of such relationships. For example, a review article, or a chapter from a textbook, on asthma might include sections on the etiology, diagnosis, prognosis, treatment and prevention of asthma. While broad reviews such as this can be useful to someone seeking an introduction to an area, they usually provide limited support for the conclusions they put forward. Typically, only a declarative statement will be made, followed by one or more citations. Readers must then study the references in order to judge the validity of the authors' conclusions.
To determine if the investigators reviewed the appropriate research, the reader needs to know the specific criteria that they used to decide which research to select. These criteria should specify the patients, exposures and outcomes of interest. They should also specify the methodological standards used to select studies. Ideally, these should be similar to the primary validity criteria we have described for original reports of research.
In looking at the effectiveness of lowering cholesterol on CHD, investigators might restrict themselves to studies of patients who did not have clinically manifest CHD at the beginning of the study (primary prevention), to studies of patients who already had symptomatic CHD (secondary prevention), or include both. They might include only trials of diet therapy, only trials of drug therapy, or both. They might consider several different outcomes, such as non-fatal CHD, CHD mortality and total mortality. With respect to methodologic criteria, they might consider only randomized controlled trials (RCTs), only cohort and case-control studies, or both. Differences in the patients, exposures and outcomes can lead to different results among overviews that appear to address the same clinical question. The clinician must be sure the criteria used to select the studies correspond to the clinical question that led her to the article in the first place. The impact of cholesterol lowering strategies, for instance, differs in studies of primary versus secondary prevention.
If inclusion criteria are clearly stated, it is less likely that the authors will preferentially cite studies that support their own prior conclusion. Bias in choosing articles to cite is a problem for both overviews and original reports of research (in which the discussion section often includes comparisons with the results of other studies). Gotzsche, for example, reviewed citations in reports of trials of "new" non-steroidal anti-inflammatory drugs in rheumatoid arthritis. Among 77 articles where the authors could have referenced other trials with and without outcomes favouring the "new" drug, nearly 60% (44) cited a higher proportion of the trials with favourable outcomes. In 22 reports of controlled trials of cholesterol lowering, Ravnskov found a similar bias towards citing positive studies (in which the risk of CHD was lowered).
It is important that authors conduct a thorough search for studies that meet their inclusion criteria. This should include the use of bibliographic databases, such as MEDLINE and EMBASE, checking the reference lists of the articles that are retrieved, and personal contact with experts in the area. Unless the authors tell us what they did to locate relevant studies, it is difficult to know how likely it is that relevant studies were missed.
There are two important reasons why the review's authors should use personal contacts. The first is to identify published studies that might have been missed (including studies that are in press or not yet indexed or referenced). The second is to identify unpublished studies. Although the inclusion of unpublished studies is controversial, their omission increases the chances of "publication bias" - a higher likelihood for studies with "positive" results to be published - and the attendant risk for the review to overestimate efficacy or adverse effects.
If investigators include unpublished studies in an overview, they should obtain full written reports, and appraise the validity of both published and unpublished studies. They may go on to use statistical techniques to explore the possibility of publication bias. Overviews based on a small number of small studies with weakly positive effects are the most susceptible to publication bias.
Even if a review article includes only RCTs, it is important to know whether they were of good quality. Unfortunately, peer review does not guarantee the validity of published research, even in prestigious journals. For exactly the same reason that the guides for using original reports of research begin by asking if the results are valid, it is essential to consider the validity of research included in overviews.
Differences in study methods might explain important differences among the results. For example, less rigorous studies tend to over-estimate the effectiveness of therapeutic and preventive interventions. Even if the results of different studies are consistent, it is still important to know how valid they are. Consistent results are less compelling if they come from weak studies than if they come from strong studies.
There is no one correct way to assess validity. Some investigators use long check-lists to evaluate methodologic quality, while others focus on three or four key aspects of the study. You will remember that, in our previous articles about therapy, diagnosis, harm and prognosis in the Users' Guides series, we asked the question, "Is the study valid?", and presented criteria to help you answer these questions. When considering whether to believe the results of an overview, you should check whether the authors examined criteria similar to those we have presented in deciding on the credibility of their primary studies.
As we have seen, authors of review articles must decide which studies to include, how valid they are, and which data to extract from them. Each of these decisions requires judgement by the reviewers and each is subject to both mistakes (random errors) and bias (systematic errors). Having two or more people participate in each decision reduces such errors, and if there is good agreement between the reviewers, the clinician can have more confidence in the results of the overview.
Despite restrictive inclusion criteria, most systematic overviews document important differences in patients, exposures, outcome measures and research methods from study to study. Readers must decide when these factors are so different that it no longer makes sense to combine the study results.
One criterion for deciding to combine results is whether the studies seem to be measuring the same underlying magnitude of effect. Investigators can test the extent to which differences among the results of individual studies are greater than you would expect if all studies were measuring the same underlying effect and the differences observed were due only to chance. The statistical analyses that are used to do this are called "tests of homogeneity." The more significant the test, the less likely it is that the observed differences in the size of the effect are due to chance alone, and the more likely that differences in patients, exposures, outcomes, or study design are responsible for the varying treatment effect. Both the "average" effect and the confidence interval around the average effect need to be interpreted cautiously when there is significant heterogeneity (a low probability of the differences in results from study to study being due to chance alone, reflected in a low p-value).
Unfortunately, a "nonsignificant" test does not necessarily rule out important differences between the results of different studies. Hence, clinically important differences between study results still dictate some degree of caution in interpreting the overall findings, despite a "nonsignificant" test of homogeneity. Nevertheless, when there are large differences between the results of different studies, a summary measure from all of the best available studies may still provide a better estimate for clinical use than the results of any one study.
In clinical research, investigators collect data from individual patients. Because of the limited capacity of the human mind to handle large amounts of data, investigators use statistical methods to summarize and analyze them. In overviews, investigators collect data from individual studies. These data must also be summarized, and increasingly, investigators are using quantitative methods to do so.
Simply comparing the number of "positive" studies to the number of "negative" studies is not an adequate way to summarize the results. With this sort of "vote counting," large and small studies are given equal weights, and (unlikely as it may seem) one investigator may interpret a study as positive, while another investigator interprets the same study as negative. In addition, there is a tendency to overlook small but clinically important effects if studies with statistically "nonsignificant" (but potentially clinically important) results are counted as "negative". Moreover, a reader cannot tell anything about the magnitude of an effect from a vote count.
Typically, meta-analysts weight studies according to their size, with larger studies receiving more weight. Thus, the overall results represent a weighted average of the results of the individual studies. Occasionally studies are also given more or less weight depending on their quality, or poorer quality studies might be given a weight of zero (excluded) either in the primary analysis or in a "sensitivity analysis" to see if this makes an important difference in the overall results.
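In practice, this size-based weighting is usually implemented as an inverse-variance weighted average: larger studies yield smaller variances and so receive more weight. A minimal sketch of a fixed-effect pooled estimate, with invented trial data:

```python
import math

def pooled_estimate(effects, variances):
    """Fixed-effect meta-analysis: inverse-variance weighted average
    of the study effects, with a 95% confidence interval."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of pooled effect
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

# Three hypothetical trials: the large trial (smallest variance) dominates
est, ci = pooled_estimate([-0.5, -0.2, -0.3], [0.25, 0.01, 0.04])
```

With equal variances this reduces to a simple average; giving a poor-quality study a weight of zero is equivalent to excluding it, as in the sensitivity analysis described above.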
You should look at the overall results of an overview in the same way you look at the results of primary studies. In our articles concerning therapy, we described the relative risk and the absolute risk reduction, and how they can be interpreted. In the articles about diagnostic tests, we discussed likelihood ratios. In overviews of treatment, and of etiologic and prognostic factors, you will often see the ratio of the odds of an adverse outcome occurring in those exposed (to a treatment or risk factor) to the odds of an adverse outcome in those not exposed. This odds ratio, illustrated in Table 2, has desirable statistical properties when combining results across studies. Whatever method of analysis the investigators used, you should look for a summary measure (such as the number needed to treat) that clearly conveys the practical importance of the result.
Table 2. Odds ratio, relative risk, risk reduction and number needed to treat

                  Outcome present    Outcome absent
Exposed                 A                  B
Not exposed             C                  D

Odds ratio = [A/B]/[C/D]
Relative risk = [A/(A+B)]/[C/(C+D)]
Absolute risk reduction = [A/(A+B)] - [C/(C+D)] (a)
Number needed to treat = 1/|absolute risk reduction|

a) When the outcome is undesirable, a relative risk or odds ratio
less than one represents a beneficial treatment or exposure, with
zero representing 100% effectiveness. An absolute risk reduction
of less than zero represents a benefit, and at 100% effectiveness
its magnitude would equal the risk observed in the control group.
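All four quantities in the table can be computed directly from the cell counts. A minimal sketch (the trial counts are invented, and the sign convention for risk reduction follows footnote a):

```python
def two_by_two(a, b, c, d):
    """a, b: exposed patients with / without the outcome;
    c, d: unexposed patients with / without the outcome."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    odds_ratio = (a / b) / (c / d)
    relative_risk = risk_exposed / risk_unexposed
    # Negative values favour the exposure when the outcome is undesirable
    risk_reduction = risk_exposed - risk_unexposed
    nnt = 1.0 / abs(risk_reduction)  # number needed to treat
    return odds_ratio, relative_risk, risk_reduction, nnt

# Hypothetical trial: 15/100 treated vs 25/100 control patients had the outcome
or_, rr, arr, nnt = two_by_two(15, 85, 25, 75)
```

In this invented example the relative risk is 0.60, the absolute risk reduction is -10%, and 10 patients would need to be treated to prevent one outcome.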
Sometimes the outcome measures that are used in different studies are similar but not exactly the same. For example, different trials might measure functional status using different instruments -- a walk test versus a stair climbing test, for instance. If the patients in the different studies and the interventions are reasonably similar, it might still be worthwhile to estimate the average effect of the intervention on functional status across studies. One way of doing this is to summarize the results of each study as an "effect size". The effect size is the difference in outcomes between the intervention and control groups divided by the pooled standard deviation. Hence, the results of each study are summarized in terms of the number of standard deviations of difference between the intervention and control groups. Investigators can then calculate a weighted average of effect sizes from studies that measured an outcome of interest in different ways.
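That calculation can be sketched for a single hypothetical trial (the walk-test numbers below are invented for illustration):

```python
import math

def effect_size(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference: (treatment mean - control mean)
    divided by the pooled standard deviation of the two groups."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Hypothetical walk test: treated 400 m (SD 50, n=30), control 370 m (SD 50, n=30)
d = effect_size(400, 50, 30, 370, 50, 30)  # 0.6 standard deviations
```

Because the result is expressed in standard deviations rather than metres or stairs climbed, effect sizes from trials using different instruments can be averaged together.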
You are likely to find it difficult to interpret the clinical importance of an effect size (if the weighted average effect is one-half of a standard deviation, is this effect clinically trivial, or is it large?). Once again, you should look for a presentation of the results that conveys their practical importance (for example by translating the summary effect size back into natural units). For instance, if clinicians have become familiar with the significance of differences in walk test scores in patients with chronic lung disease, the effect size of a treatment on a number of measures of functional status (such as the walk test and stair climbing) can be converted back into differences in walk test scores.
In the same way that it is possible to estimate the average effect across studies, it is possible to estimate a confidence interval around that estimate; i.e., a range of values with a specified probability (typically 95%) of including the true effect. A previous article in this series provides a guide for understanding confidence intervals.
One of the advantages of an overview is that, since many studies are included, the results come from a very diverse range of patients. If the results are consistent across studies, they apply to this wide variety of patients. Even so, the clinician may still be left with doubts about the applicability of the results. Perhaps the patient is older than any of those included in the individual trials summarized by the overview. If studies using different members of a class of drug have been combined, one might question whether one of the drugs has a larger effect than the others.
These questions raise the issue of sub-group analysis. Detailed guides for deciding whether to believe these subgroup analyses are available. One of the most important guides is that conclusions drawn on the basis of between-study comparisons (comparing patients in one study with patients in another) should be viewed sceptically. For example, in a meta-analysis of the effectiveness of beta-blockers after myocardial infarction there was a statistically significant and clinically important difference in effect between trials of beta-blockers with and without intrinsic sympathomimetic activity (ISA). This resulted in clinical recommendations that only beta-blockers without ISA should be used. However, the addition of two subsequent trials eliminated this difference in the overall summary. In fact, a large number of subgroup analyses exploring differences in either patients or the beta-blocker regimen used have been conducted on these data. Most, if not all, of the differences that have been found are probably due to chance.
Other criteria that make a hypothesized difference in subgroups more credible include a big difference in treatment effect; a highly significant difference in treatment effect (the lower the p-value on the comparison of the different effect sizes in the subgroups, the more credible the difference); a hypothesis that was made before the study began and was one of only a few hypotheses that were tested; consistency across studies; and indirect evidence in support of the difference ("biological plausibility"). If these criteria are not met, the clinician is well advised to assume that the overall effect across all patients, and all treatments, applies to the patient at hand, and to the treatment under consideration.
While it is a good idea to look for focused review articles because they are more likely to provide valid results, this does not mean that you should ignore outcomes that are not included in a review. For example, the potential benefits and harms of hormone replacement therapy include reduced risk of fractures and CHD and increased risk of breast cancer and endometrial cancer. Focused reviews of the evidence for individual outcomes are more likely to provide valid results, but a clinical decision requires considering all of them.
Finally, when making a clinical decision, the expected benefits must be weighed, either explicitly or implicitly, against the potential harms and costs. While this is most obvious when deciding whether to use a therapeutic or preventive intervention, it is important to remember that providing patients with information about causes of disease or prognosis can also have both benefits and harms.
A valid review article provides the best possible basis for quantifying the expected outcomes, but it is also necessary to take into consideration your patients' preferences for the expected outcomes of a decision. In the next article in this series we will address this issue in the context of clinical practice guidelines.
The meta-analysis we identified in the clinical scenario at the beginning of this article indicates that the benefit we can expect from interventions to lower cholesterol depends on the baseline risk of death from CHD, and on whether we are considering diet or drug interventions. The higher the risk of dying from CHD, the greater the likelihood of benefit. Drug therapy, but not dietary therapy, is associated with a greater likelihood of death from causes other than CHD. The patient described in the scenario has a risk of death from CHD of approximately 1.0% over the next decade. You would have to treat approximately 1,000 such patients for ten years with a dietary intervention to save one life. If you were to treat such patients with drug therapy, there is no evidence that this would reduce total mortality, and it is possible that you would increase their risk of death from other causes by more than you would be reducing their risk of death from CHD. This analysis, which is consistent with other recently published overviews of lipid-lowering interventions, suggests that diet therapy is questionable in low-risk individuals (such as the patient in the scenario) and that drug therapy should be restricted to those at high risk, such as individuals who have already sustained a myocardial infarction.
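The arithmetic behind the "treat approximately 1,000 patients for ten years" figure can be checked directly from the two numbers given in the scenario:

```python
baseline_risk = 0.01   # 10-year risk of CHD death from the scenario (1.0%)
nnt = 1000             # number needed to treat, as quoted from the meta-analysis

# An NNT of 1,000 implies an absolute risk reduction of 1/1,000 = 0.1%
absolute_risk_reduction = 1.0 / nnt

# On a 1% baseline, a 0.1% absolute reduction is a 10% relative risk reduction
relative_risk_reduction = absolute_risk_reduction / baseline_risk
```

This makes concrete why the same relative risk reduction yields much greater absolute benefit, and a much smaller number needed to treat, in high-risk patients.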
© 2001 Centre for Health Evidence.