How to read a medical journal article (July 2003 version).
"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.
Reading medical research is hard work. I'm not talking about the medical terminology, though that is often quite bad (if I hear the word "emesis" one more time, I'm going to throw up!). The hard part is assessing the strength of the evidence. When you read a journal article, you have to decide if the authors present a case that is persuasive enough to get you to change your practice.
Some evidence is so strong that it stands on its own. Other evidence is weaker and requires support from other studies, from mechanistic arguments, and so forth. Still other evidence is so weak that you should not consider any changes in your practice until the study is replicated using a more rigorous approach.
What you should look for
When you are assessing the quality of the evidence, it's not how the data are analyzed that's important. Far more important is HOW THE DATA ARE COLLECTED. Don't agonize over whether the researchers should have used a non-parametric test or whether a random effects meta-analysis is appropriate (just to cite two obscure examples). These are important issues and they generate a lot of debate. But in most cases, the use of one statistical analysis or another is unlikely to make a substantial difference in the conclusions.
The more common and more important threat to the validity of the study relates to how the data are COLLECTED, not how they are ANALYZED. After all, if you collect the wrong data, it doesn't matter how fancy the analysis is. This is good news, because you don't need a lot of statistical training or a lot of mathematical sophistication to assess how the data are collected.
I don't want to imply that data analysis is irrelevant. There are good examples of where a better data analysis led to a different conclusion (Vickers 2001, Skegg 2000). Analysis errors are less frequent and less serious, however, than design errors.
In this presentation, I want to show you what to look for and why. Here are five questions you should ask yourself when reading a journal article.
Was there a good comparison group?
Was there a plan?
Who knew what when?
Who was left out?
How much did things change?
In this presentation, I will justify these questions using anecdotal evidence at times and solid empirical research at other times. I will also highlight real research articles and use them as examples.
Important Disclaimer.
This presentation will review several published journal articles. The intent is to gauge how much evidence each article presents in favor of the efficacy of a new therapy. Some articles will provide a greater level of evidence and some will provide a lesser level of evidence. But articles which provide lesser levels of evidence are still valuable and important.
Nothing stated in this presentation about a particular journal article should be construed as a statement about the quality of that article. The very nature of research requires a series of steps from very preliminary and speculative levels of evidence to more definitive levels of evidence.
Furthermore, when I point out limitations in the evidence presented in a journal article, more often than not, the authors of the article delineate these same limitations in their discussion. But in general, you need to be aware of these limitations because not every journal author is going to be open and honest about the limitations of their research.
Additional resources
Pitfalls of pharmacoepidemiology. D. C. Skegg. BMJ 2000: 321(7270); 1171-2.
Acupuncture for treatment of chronic neck pain. Andrew Vickers, Dominik Irnich, Martin Krauss. BMJ 2001: 323(7324); 1306-.
Chapter 1: Was there a good comparison group?
Introduction
Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?
When you make such a comparison between an exposure/treatment group and a control group, you want it to be a fair comparison. You want the control group to be identical to the exposure/treatment group in all respects, except for the exposure/treatment in question. You want an apples to apples comparison.
To ensure that the researchers made an apples to apples comparison, ask the following three questions:
Did the authors use randomization?
Did the authors use matching?
Did the authors use statistical adjustments?
Case study: Vitamin C and Cancer
Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Cameron and Pauling published an observational study of Vitamin C as a treatment for advanced cancer. For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).
Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."
Ten years later, the Mayo Clinic conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?
The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.
But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison. It doesn't matter how bad the prognosis was for a patient diagnosed with terminal cancer; it can't be as bad as the prognosis of a patient who has a death certificate.
Surgical trial without controls
There's another story, unfortunately fictional, which also highlights the importance of a good comparison group.
A prominent surgeon came to give a special lecture at the School of Medicine. He expounded about the great advance that he had made in a specific surgical procedure. At the end of the lecture he drew thunderous applause from the audience. At first it seemed like there would be no questions, but then a young student in the front row raised her hand. "Did you use any controls?" she asked. The surgeon seemed to be offended by this question. "Controls?" he asked. "Are you suggesting that I should have denied my surgical advance to half of my patients?" The rest of the audience grew very quiet. But the young woman was not intimidated. "Yes," she said, "that's exactly what I meant." The surgeon grew even angrier at this, slammed his fist on the podium and shouted "Why that would have condemned half of my patients to certain death!" There was silence for a few seconds. Then the entire auditorium burst out in laughter when the young woman asked "Which half?"
Covariate imbalance
If you want to judge how effective a new therapy is, you need a comparison group. The comparison group would be a group of subjects who receive either the standard therapy or, in some cases, no therapy (e.g., a placebo comparison).
The ideal comparison group should be similar in all respects to the new therapy group except for the therapy itself. For example, the two groups should have a similar range of ages and weights and should be composed of roughly the same proportions in gender and race/ethnicity. The groups should be evaluated concurrently.
Sometimes the groups are dissimilar on some important characteristics. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.
In a yet to be published research study here at Children's Mercy Hospital, pre-term infants were randomized either to a group that received normal bottle feeding while they were in the hospital or to a nasogastric (ng) tube feeding group. The researchers wanted to see if the latter group of infants, because they had not become habituated to bottle feeding, would be more likely to breastfeed after discharge from the hospital.
The randomization was only partially effective at preventing covariate imbalance. The infants had comparable birth weights, gestational ages, and Apgar scores. There were similar proportions of caesarian section and vaginal births in both groups. But the mothers in the ng tube group were older on average than the mothers in the bottle fed group.
Since older mothers are more likely to breastfeed than younger mothers, we had to include mother's age in an analysis of covariance model so that the effect of ng tube feeding could be estimated independently of mother's age.
Beware of situations where the two treatment groups are handled differently. An example of this would be the study of women who use oral contraceptives. These women visit a doctor at least every six months to get their prescriptions renewed. If these women are compared to women who do not use oral contraceptives, then the former group will probably be evaluated by a doctor more frequently. An increase in the prevalence of certain diseases may actually reflect the fact that these diseases are diagnosed earlier because of the frequency of doctor visits.
Similarly, if a certain drug is suspected to have certain side effects, doctors may question more closely those patients who are on that medication, creating a self-fulfilling prophecy.
Concurrent controls versus historical controls.
Sometimes researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy. This type of study is sometimes called a historical controls study. The very nature of a historical controls study guarantees that there will be a major discrepancy in timing. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially?
The one exception is when a disease has close to 100% mortality (Silverman 1998, page 67). In that situation, there is no need for a concurrent control group, since any therapy that is remotely effective can be detected readily.
Did the authors use randomization?
If the authors of the study decided who would get the new therapy and who would get the standard therapy, we have an experimental design. When the authors of the study do have this level of control, they will almost always assign patients randomly.
If the patient did the choosing, if the patient’s doctor did the choosing, or if the groups were intact prior to the start of the research, then we have an observational design. In an observational design, it is impossible to assign patients randomly.
Here are some examples of experimental designs and observational studies.
In Adkinson (1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo." Since the researchers generated the sequence of random assignment, this is an experimental design.
In Bullock (1989), "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)." Since the researchers controlled the nature of the acupuncture, this is an experimental design.
In Cardo (1997), 33 health care workers who became seropositive to HIV after percutaneous exposure to HIV-infected blood were compared to 665 health care workers with similar exposure who did not become seropositive. Since the researchers did not control who became seropositive, this is an observational study.
In Hu (1997), 80,082 women between the ages of 34 and 59 years were followed for 14 years to look for instances of non-fatal myocardial infarction or death from coronary heart disease. These women were divided into low, intermediate, and high groups on the basis of their consumption of dietary fat. Since the women themselves controlled their diets, rather than having a diet imposed on them by the researchers, this represents an observational design.
Information from an experimental design is generally considered more authoritative than information from an observational design because the researchers can use randomization. Randomization provides some level of assurance that the two groups are comparable in every way except for the therapy received.
Randomization requires the use of a random device, such as a coin flip or a table of random numbers. Systematic allocation (i.e., alternating between treatments) is not the same as randomization.
The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each entry in the schedule, and then sort the schedule by the random number.
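Here is a minimal Python sketch of that sort-by-random-number recipe. The function name, the treatment labels, the group size, and the seed are all hypothetical choices made for illustration; a real trial would follow its own allocation procedure.

```python
import random


def randomize_schedule(treatments, n_per_treatment, seed=None):
    """Return a randomized treatment order with equal group sizes."""
    rng = random.Random(seed)
    # Step 1: lay out the schedule systematically (all of one treatment,
    # then all of the next).
    schedule = [t for t in treatments for _ in range(n_per_treatment)]
    # Step 2: attach a random number to each entry in the schedule.
    keyed = [(rng.random(), t) for t in schedule]
    # Step 3: sort by the random number; the treatment labels follow along.
    keyed.sort(key=lambda pair: pair[0])
    return [t for _, t in keyed]


# Hypothetical example: 20 subjects, 10 per arm.
order = randomize_schedule(["standard", "new"], n_per_treatment=10, seed=2003)
for subject_number, treatment in enumerate(order, start=1):
    print(subject_number, treatment)
```

Because the schedule is laid out first and only then shuffled, each treatment still ends up with exactly the planned number of subjects.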
Randomization ensures that both measurable and unmeasurable factors are balanced out, on average, across both the standard and the new therapy, assuring a fair comparison. It also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.
Randomization is not always possible or practical. When this is the case, we have to rely on observational data to draw any conclusions. But when randomization is possible, its use makes a research study more authoritative.
Although I do not have a bibliographic citation for this example, I heard an amusing story about a study of water toxicants on fish.
This research required that the fish be separated into five tanks, each of which would get a different level of the toxicant. The researchers caught one fifth of the fish and put them in one tank, then an additional one fifth and put them in a second tank, and so forth. The outcome measurements were related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes being in the first tank filled and the best outcomes in the last tank filled.
What happened was that the slow-moving, easy-to-catch fish were all allocated to the first tank. The fast-moving, hard-to-catch fish ended up in the last tank. It turned out that the sicker fish were also the slow-moving, easy-to-catch fish; the healthiest fish swam faster and avoided early capture.
A better way to design this experiment would have been to allocate the fish into tanks randomly. This would ensure that each tank got a fair share of the fast-and-healthy and the slow-and-sick fish.
Studies without randomization often require either matching or statistical adjustments. While both matching and adjustments can help to some extent with covariate imbalance, these approaches do not work as well as randomization. In particular, some of the covariate imbalance may be due to factors that are difficult to measure. For example, patients may differ on the basis of
psychological state
severity of disease, and/or
presence of comorbid conditions.
All of these factors can influence the outcome, but if you can't measure them easily, matching or adjustment is not possible.
So, all other things being equal, an experimental design with randomization is more persuasive than an observational design without randomization. Nevertheless, much can be learned from non-randomized studies. Almost everything we know about the risks of cigarette smoking came from observational designs (Gail 1996).
An editorial in the Journal of the American Medical Association (Sherwin 1997) tries to make sense of recent studies of the effect of dietary fat on obesity, heart disease, and stroke. After reviewing the results of numerous studies, the editorial comments:
"At present, most of this evidence in humans is observational and, consequently, an imperfect basis for causal inference. Large scale experimental studies that would provide more compelling data (such as the Women's Health Initiative) cost hundreds of millions of dollars and take decades to complete. Each study can only address the effects of a single nutritional change. Thus, it is still necessary to base advice to patients on dietary information that is less than certain and complete."
Randomized studies do have some weaknesses. These studies typically rely on the use of volunteers in a narrowly defined research setting. Such situations may not be reflective of how a typical patient behaves in a typical health care setting (Sackett 1997). In this particular aspect, a carefully planned observational design may provide a more relevant comparison.
Another problem with randomized designs is the limit to their size and scope. These limits may make it difficult to detect rare but important side effects. An observational approach like post marketing surveillance is more likely to be successful in these situations.
Studies of the potential harm caused by environmental exposures (such as lead based paint, second hand tobacco smoke, or electro-magnetic fields) are often impossible to randomize because of logistical and ethical issues.
These exceptions, however, do not diminish the value of experimental designs. In situations where observational and experimental studies can both be conducted, most researchers will give greater weight to the evidence in an experimental study.
Did the authors use matching?
Matching is the systematic selection, for every subject in the treatment/exposure group, of a control subject with similar characteristics. For example, in a study of fetal exposure to cocaine, you might select infants born to a mother who abused cocaine during pregnancy. For every such infant, you would select an infant who was unexposed to cocaine in utero, but who had the same sex, race, and socio-economic status.
Matching will prevent covariate imbalance for those variables used in matching. It will also reduce covariate imbalance for any variables closely related to the matching variables. It will not, however, protect against all covariate imbalance, especially for those covariates that are difficult to measure.
Matching often presents difficult logistical issues, because a matching control subject may not always be available. The logistics are especially difficult when there are several matching variables and when the pool of control subjects that you can draw from is not substantially larger than the pool of treatment/exposed subjects.
Matching is usually reserved for those variables that are known to be highly predictive of the outcome measure. In a cancer study, for example, matching is usually done on smoking. Many neonatology studies will match on gestational age.
Matching in a case control design
When you are selecting patients on the basis of disease and looking back at what exposure might have caused the disease, selection of matching control patients (patients without disease) can sometimes be tricky. You need to find a control that is similar to the case, except for the disease of interest. There are several possibilities, but none of them works perfectly.
If the cases are people hospitalized for disease, you could choose people who are hospitalized for conditions other than the disease.
You could ask each case to bring a friend with them. Their friend would be likely to be of similar age and socioeconomic status.
You could recruit controls from undiseased members of the same family.
You also have to be careful about the variable you use to match. If the matching variable is caused by the exposure or is a similar measure of exposure, then you might "over match" the data and remove the effect of the exposure. Marsh et al discuss an example of a study examining radiation exposure and the risk of leukemia at a nuclear reprocessing plant. In this study there were 37 workers diagnosed with leukemia (cases), and each was matched to four control workers. Each of the four control workers had to work at the same site, have the same gender, have the same job code, be born within two years of the case, and be hired within two years of the hire date of the case.
Unfortunately, there was a strong relationship between hire date and exposure. Exposures were highest early in the plant's history and declined over time. So both hire date and exposure were measuring the same thing. When the data were matched on hire date, the matching artefactually controlled for exposure and pretty much ensured that the average radiation exposure would be the same among both the cases and the controls. This led to an estimate of the effect of radiation exposure that was actually slightly negative and not statistically significant.
When the data were rematched using all the variables except for hire date, the effect of radiation dose was large and positive and came close to statistical significance.
Matching in a randomized design
In some randomized studies, matching will be used as well. Partly, this is a recognition that randomization will not totally remove covariate imbalance, just like a flip of 100 coins will not always result in exactly 50 heads and 50 tails.
More importantly, however, matching in a randomized study will provide extra precision. Matching creates pairs of subjects who will have greater homogeneity and therefore less variability.
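To put a number on the coin flip comparison above, here is a short Python calculation using exact binomial probabilities. The function names and the 60/40 threshold are my own illustrative choices.

```python
from math import comb


def prob_exact_heads(n_flips, n_heads):
    """Probability of exactly n_heads heads in n_flips fair coin flips."""
    return comb(n_flips, n_heads) / 2 ** n_flips


def prob_heads_between(n_flips, low, high):
    """Probability that the number of heads falls in low..high inclusive."""
    return sum(prob_exact_heads(n_flips, k) for k in range(low, high + 1))


# Chance of a perfect 50/50 split in 100 flips.
p_perfect_split = prob_exact_heads(100, 50)

# Chance of a split at least as lopsided as 60/40 in either direction.
p_60_40_or_worse = 1 - prob_heads_between(100, 41, 59)

print(round(p_perfect_split, 3))
print(round(p_60_40_or_worse, 3))
```

A perfect split turns up only about 8% of the time, so some imbalance after simple randomization is the rule rather than the exception.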
The crossover design
The crossover design represents a special type of matching. In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA).
Since the same subject receives both treatments, there is no possibility of covariate imbalance.
When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart so that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?
Learning and fatigue effects are also potential problems in a crossover design.
Special problems arise when each subject receives the standard therapy first and then the new therapy (or vice versa). Many factors other than the change in therapy can cause a shift in the health of patients over time. Unless the researchers can point to other evidence that shows stability of the condition over time, information from this type of study is worthless.
Sometimes difficult circumstances (such as a general failure to respond to the standard therapy) will force the use of this type of design. Further discussion of lack of randomization or other issues with crossover designs can be found in Louis (1992).
Did the authors use statistical adjustments?
Statistical adjustments represent one way of correcting for covariate imbalance. There are several ways to make statistical adjustments.
First, there are regression adjustments. In a study of breastfeeding, there was an imbalance between the two groups in that one group was much older than the other group. From a regression model, we discover that older mothers breastfeed for longer periods of time, on average, than younger mothers. In fact, for each year of age, the duration of breastfeeding increases by 0.25 weeks on average. So we would adjust the difference of the two groups by 0.25 weeks for every year in discrepancy between the average mothers' ages.
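The arithmetic behind this regression adjustment can be sketched as follows. Only the slope of 0.25 weeks per year comes from the example above; the raw difference and the age gap are hypothetical numbers chosen for illustration. The sketch also assumes a straight-line relationship with the same slope in both groups, whereas in practice the slope and the adjusted difference would be estimated together in a single model.

```python
def adjusted_difference(raw_difference, slope, covariate_gap):
    """Remove the part of a group difference that the covariate gap explains.

    raw_difference: mean outcome in group 1 minus mean outcome in group 2
    slope:          change in outcome per unit of the covariate
    covariate_gap:  mean covariate in group 1 minus mean covariate in group 2
    """
    return raw_difference - slope * covariate_gap


slope_weeks_per_year = 0.25  # from the breastfeeding example in the text
raw_gap_weeks = 3.0          # hypothetical: group 1 breastfeeds 3 weeks longer
age_gap_years = 4.0          # hypothetical: group 1 mothers are 4 years older

adjusted = adjusted_difference(raw_gap_weeks, slope_weeks_per_year, age_gap_years)
print(adjusted)  # 2.0
```

With these made-up numbers, one of the three weeks is attributed to the older age of the mothers, leaving an adjusted difference of two weeks.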
Second, there are weighting adjustments. Suppose a group includes 25 males and 75 females, but in the population we know that there should be a 50/50 split by gender. We could re-weight the data, so that each male has a weighting factor of 2.0 and each female has a weighting factor of 0.67. This artificially inflates the number of males to 50 and deflates the number of females to 50. A second group might have 40 males and 60 females. For this group, we would use weights of 1.25 and 0.83.
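The weights in this example are simply the target count divided by the observed count in each category. A small Python sketch, with a hypothetical function name, reproduces them:

```python
def reweight(counts, target_shares):
    """Weight for each category = target count / observed count."""
    total = sum(counts.values())
    return {
        category: target_shares[category] * total / observed
        for category, observed in counts.items()
    }


even_split = {"male": 0.5, "female": 0.5}

group_1 = reweight({"male": 25, "female": 75}, even_split)
group_2 = reweight({"male": 40, "female": 60}, even_split)

print({k: round(v, 2) for k, v in group_1.items()})  # {'male': 2.0, 'female': 0.67}
print({k: round(v, 2) for k, v in group_2.items()})  # {'male': 1.25, 'female': 0.83}
```

Multiplying each observed count by its weight gives 50 males and 50 females in both groups, which is the balance the re-weighting is meant to create.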
Both of these adjustments are imperfect, especially when the adjustment variable is imperfectly measured. And these adjustments are impossible if the researchers did not/could not measure the covariates.
Summary - Who did the choosing?
Did the authors use randomization? Randomization ensures balance between the two therapy groups with respect to both measurable and unmeasurable factors.
Did the authors use matching? Matching ensures comparable groups during the selection process.
Did the authors use statistical adjustments? Regression or weighting makes adjustments after the data are collected.
Bibliography
A controlled trial of immunotherapy for asthma in allergic children. N. F. Adkinson, Jr., P. A. Eggleston, D. Eney, E. O. Goldstein, K. C. Schuberth, J. R. Bacon, R. G. Hamilton, M. E. Weiss, H. Arshad, C. L. Meinert, J. Tonascia, B. Wheeler. N Engl J Med 1997: 336(5); 324-31.
Controlled trial of acupuncture for severe recidivist alcoholism. M. L. Bullock, P. D. Culliton, R. T. Olander. Lancet 1989: 1(8652); 1435-9.
The orthomolecular treatment of cancer. II. Clinical trial of high-dose ascorbic acid supplements in advanced human cancer. E. Cameron, A. Campbell. Chem Biol Interact 1974: 9(4); 285-315.
A case-control study of HIV seroconversion in health care workers after percutaneous exposure. Centers for Disease Control and Prevention Needlestick Surveillance Group. D. M. Cardo, D. H. Culver, C. A. Ciesielski, P. U. Srivastava, R. Marcus, D. Abiteboul, J. Heptonstall, G. Ippolito, F. Lot, P. S. McKibben, D. M. Bell. N Engl J Med 1997: 337(21); 1485-90.
Statistics in Action. M. H. Gail. Journal of the American Statistical Association 1996: 91(433); 1-13.
Dietary Fat Intake and the Risk of Coronary Heart Disease in Women. Frank B. Hu, Meir J. Stampfer, JoAnn E. Manson, Eric Rimm, Graham A. Colditz, Bernard A. Rosner, Charles H. Hennekens, Walter C. Willett. N Engl J Med 1997: 337(21); 1491-1499.
Removal of radiation dose response effects: an example of over-matching. J. L. Marsh, J. L. Hutton, K. Binks. BMJ 2002: 325(7359); 327-30.
Observational Studies. P. R. Rosenbaum (1995) New York: Springer-Verlag.
Evidence-based medicine and treatment choices. D. L. Sackett. Lancet 1997: 349(9051); 570; discussion 572-3.
Fat chance: diet and ischemic stroke [editorial; comment]. R. Sherwin, T. R. Price. JAMA 1997: 278(24); 2185-6.
Where's the Evidence? Debates in Modern Medicine. William A. Silverman (1998) New York: Oxford University Press.
Chapter 2: Was there a plan?
Introduction
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
Did the research have a narrow focus?
Did the authors deviate from the plan?
Case study: Meat consumption and childhood cancer
Studies of the effects of diet on health often have difficulties with multiple endpoints. An example is a 1994 study of the effect of cured and broiled meat consumption on childhood cancer.
This study examined two types of cancer (acute lymphocytic leukemia and brain tumor). The authors examined five types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods). Finally, the authors looked at food consumption both of the child and of the mother during pregnancy.
In the analysis, the researchers used a cut-off to compare low meat consumption to high meat consumption. For example, they compared one or more hamburgers consumed per week to less than one per week. In the text, however, they went further and discussed results with a different cut-off: children who ate two or more hamburgers per week compared to children who ate one or fewer per week.
This study came under a lot of criticism for its scattershot approach to investigation, though it also had its share of defenders. There's a saying in statistics: "if you torture your data long enough, it will confess to something." When a research study has a plan with a limited number of precisely defined hypotheses, the results are more persuasive. When the research has no pre-planned hypotheses, then the results should be considered preliminary and exploratory in nature.
Did the research have a narrow focus?
A good research study has limited objectives that are specified in advance. Failure to limit the scope of a study leads to problems with multiple testing.
When a large number of comparisons are being made, the study is considered a fishing expedition. There is a saying in statistics circles: "If you torture your data long enough, it will confess to something."
Swaen et al (2001) provide empirical evidence that specifying a hypothesis prior to data collection reduced the chances of a false positive finding by a factor of three.
Pollex et al also show a similar finding in a more light hearted research project. They established a statistically significant association between certain astrological signs and winning the Nobel prize (Geminis were more likely, Leos were less likely). The authors conclude that
"foraging through databases using contrived study designs in the absence of biological mechanistic data sometimes yields spurious results."
When is multiple testing likely to occur?
Multiple testing often occurs when a researcher examines a large number of subgroups or a large number of endpoints (Howel 1994). Multiple testing problems also occur when a study examines multiple side effects.
When multiple tests are done simultaneously within a paper, there is an increase in the overall Type I error. If 100 tests were performed at alpha=.05, you would expect that 5 of those tests would be significant, even if there was nothing at all going on. There are statistical adjustments for multiple comparisons, but these are controversial. Significant results from a large number of unplanned comparisons are useful mostly just for setting future research priorities.
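The arithmetic in this paragraph is easy to check. The sketch below computes the expected number of false positives and, under the added assumption that the tests are independent, the chance of at least one false positive. It also shows the Bonferroni threshold as one common example of an adjustment for multiple comparisons; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of significant results when no real effects exist."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


n_tests, alpha = 100, 0.05
expected = expected_false_positives(n_tests, alpha)
at_least_one = familywise_error_rate(n_tests, alpha)
threshold = bonferroni_threshold(n_tests, alpha)

print(expected)                # about 5 false positives expected
print(round(at_least_one, 3))  # close to 1: a false positive is nearly certain
print(threshold)               # each test would need p below this value
```

Real tests within one paper are rarely independent, so the second figure is only a rough guide, but it shows why a handful of significant results out of 100 unplanned comparisons is weak evidence.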
Optimal cut points and the problem with multiple comparisons.
Researchers will often simplify analysis of a continuous outcome measure by dividing that measure into two or more distinct groups on the basis of cut points. For example, a researcher might categorize his/her subjects as high or low blood pressure when they are above or below a certain value.
An abuse of this approach, called the minimum p-value approach, was noted by Altman (1994). Researchers would examine a variety of cut points and select the one that yielded the most favorable statistics.
For example, some researchers have chosen the cut point from among a large number of possible cut points so as to make the difference in survival times between those patients above the cut point and those patients below the cut point as large as possible.
By examining multiple cut points, the chance of drawing a false conclusion (Type I Error) is inflated from the traditional 5% value to a value as large as 40%.
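A small simulation can show this inflation in action. The sketch below is not the calculation from Altman (1994); it is a simplified stand-in that uses a yes/no outcome instead of survival times, a normal-approximation test of two proportions, and hypothetical choices for the sample size, the set of cut points, and the number of simulated studies. The marker is generated independently of the outcome, so every significant result is a false positive.

```python
import math
import random


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value, normal-approximation test of two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def smallest_p_value(marker, outcome, cut_points, min_group=5):
    """Try every cut point and keep the most favorable (smallest) p-value."""
    best = 1.0
    for cut in cut_points:
        high = [o for m, o in zip(marker, outcome) if m > cut]
        low = [o for m, o in zip(marker, outcome) if m <= cut]
        if len(high) < min_group or len(low) < min_group:
            continue  # skip cut points that leave a group too small to test
        p = two_proportion_p_value(sum(high), len(high), sum(low), len(low))
        best = min(best, p)
    return best


def false_positive_rate(cut_points, n_subjects=100, n_sims=1000, seed=1):
    """Share of simulated null studies declared significant at 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        marker = [rng.random() for _ in range(n_subjects)]
        # The outcome is a coin flip that ignores the marker entirely.
        outcome = [rng.random() < 0.5 for _ in range(n_subjects)]
        if smallest_p_value(marker, outcome, cut_points) < 0.05:
            hits += 1
    return hits / n_sims


single = false_positive_rate([0.5])                              # one pre-chosen cut point
many = false_positive_rate([i / 20 for i in range(2, 19)])       # 17 cut points, 0.10 to 0.90

print("one cut point:      ", single)
print("best of 17 cut points:", many)
```

The first rate should land near the nominal 5%, while the second should come out several times larger, even though nothing real is going on in the simulated data.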
There are several objective ways to select a cut point. Perhaps the best way is to select the cut point prior to looking at the data. This would involve the use of medical judgment.
After the data has been collected, there are some neutral ways of selecting a cut point. The simplest is a median split. If you wanted to create a median split for blood pressure, you would combine the blood pressure data from both groups, and select a value so that half of the blood pressures are larger and half are smaller.
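A median split takes only a few lines of Python. The blood pressure values below are hypothetical, and assigning a value that exactly equals the median to the "low" group is an arbitrary choice made for this sketch.

```python
from statistics import median


def median_split(group_a, group_b):
    """Pool both groups, find the median, and label each value high or low."""
    cut = median(list(group_a) + list(group_b))

    def label(value):
        # Values equal to the cut point are placed in the "low" group.
        return "high" if value > cut else "low"

    return cut, [label(v) for v in group_a], [label(v) for v in group_b]


# Hypothetical systolic blood pressures (mm Hg) for two treatment groups.
group_a = [118, 125, 131, 142, 150]
group_b = [110, 121, 128, 135, 139]

cut, labels_a, labels_b = median_split(group_a, group_b)
print(cut)       # 129.5
print(labels_a)  # ['low', 'low', 'high', 'high', 'high']
print(labels_b)  # ['low', 'low', 'low', 'high', 'high']
```

The cut point is computed from the pooled data without reference to the outcome or to group membership, which is what makes it a neutral choice.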
Subgroup analysis
Subgroup comparisons are a special case of multiple testing. Rather than looking at multiple endpoints, a subgroup analysis compares a single endpoint across several different subgroups within the data.
Subgroup comparisons suffer from three problems. First, the subgroup comparison is usually a non-randomized comparison. Second, the subgroup comparison has less precision because the sample size is smaller. Third, the number of subgroups that could be examined can swamp the sample size of the study.
If you find a subgroup that behaves differently, then you need to ask yourself a few questions. Is this a subgroup that I would have studied a priori if I had been more careful during the planning stage? Is there a plausible mechanism to explain why this subgroup behaves differently? Are there other studies that have similar findings for this subgroup?
There are some technical issues with subgroup comparisons. You wouldn't want to declare that a therapy is effective in only one subgroup if the p-value for that subgroup was 0.043 and the p-value for the others was 0.062; two results that close together do not show that the subgroups differ. The analysis of subgroups should be done as a formal test of interaction.
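One simple form of an interaction test compares the two subgroup estimates directly. The sketch below assumes two independent subgroups whose effect estimates are approximately normal, and it uses invented effect sizes and standard errors chosen so that the within-subgroup p-values come out near 0.043 and 0.062. It is one common approach, not necessarily the method any particular paper used.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def interaction_test(effect_a, se_a, effect_b, se_b):
    """Test whether two independent subgroup effects differ from each other."""
    difference = effect_a - effect_b
    se_difference = sqrt(se_a ** 2 + se_b ** 2)
    z = difference / se_difference
    return difference, z, two_sided_p(z)


# Hypothetical treatment effects and standard errors in two subgroups.
effect_a, se_a = 2.024, 1.0   # within-subgroup p is about 0.043
effect_b, se_b = 1.866, 1.0   # within-subgroup p is about 0.062

p_a = two_sided_p(effect_a / se_a)
p_b = two_sided_p(effect_b / se_b)
difference, z, p_interaction = interaction_test(effect_a, se_a, effect_b, se_b)
print(round(p_a, 3), round(p_b, 3), round(p_interaction, 3))
```

One subgroup falls just under 0.05 and the other just over, yet the test of interaction gives no hint that the two effects differ.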
A publication in the International Journal of Epidemiology (Swaen 2001) provides empirical evidence that post hoc analyses are more likely to lead to false positive findings.
Did the authors deviate from the plan?
Not all research is predictable, so deviations from a pre-designed plan are sometimes necessary. Nevertheless, be cautious about any major deviation from the original research protocol. Some examples of deviations from the plan include:
Investigating end-points other than those originally specified.
Developing new exclusion criteria after the study has started.
You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.
An example of a deviation from the research plan.
An interesting deviation from the research plan occurred in a randomized, double-blind, controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal cell carcinoma and squamous cell carcinoma of the skin as the primary end points. The researchers also looked for signs of selenium toxicity.
In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol.
Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found.
Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance.
There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia.
Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what impact the early termination of the study had on the statistics. Early termination of a study can cause serious biases, unless specific rules for early termination are established at the start of the study.
Fraudulent changes in the protocol
Detecting fraud in a research study is extremely difficult for anyone, but especially difficult for the reader. A thorough peer review provides a limited level of protection from fraud. Hawkey (2001) proposes that journals should see the original protocols for research studies as part of the peer review process. This practice, which has not yet been widely adopted, would provide some level of protection against fraud.
Sometimes a careful review of the numbers in a study can highlight the possibility of fraud. If a study used randomization, for example, watch out if there is an unexpected and unexplained deviation from a 50-50 split between treatment and control.
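A rough check on the split between treatment and control can be done with a normal approximation. The sketch below assumes simple one-to-one randomization with no blocking or stratification, and the group sizes are invented. A small p-value is only a prompt to look for an explanation in the paper, since a trial may have used unequal allocation by design.

```python
from math import sqrt
from statistics import NormalDist


def split_p_value(n_treatment, n_control):
    """Two-sided p-value for a departure from a 50-50 split (normal approximation)."""
    n = n_treatment + n_control
    z = (n_treatment - n / 2) / sqrt(n / 4)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trials with 200 randomized patients each.
print(round(split_p_value(104, 96), 3))   # a split that chance easily explains
print(round(split_p_value(120, 80), 4))   # a split that deserves an explanation
```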
Replication of research findings is also a good protection against fraud.
Did the authors discard outliers?
You should be skeptical of any study that removes outliers. Inappropriate removal of outliers can seriously bias the study results.
Sometimes the outliers are more interesting than the bulk of the data themselves. You may gain more insight by trying to uncover the cause of an outlying observation than you would by examining the relatively small effects that occur with the rest of the data.
It is generally a bad idea to remove data points on the basis of their data values alone. If an investigation of an outlier leads to a discovery of a typing error or the inclusion of a subject who did not meet the pre-specified inclusion criteria, then correction or removal of the outlier is appropriate.
If there is no such justification, then the best solution is to leave the outlier alone. Another alternative is reporting data analysis results both with and without the outlier.
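Reporting results both ways is easy to do. The sketch below uses invented measurements, with the last value playing the role of the suspected outlier, and summarizes the data with and without it so the reader can see how much the conclusion depends on that one point.

```python
from statistics import mean, median

# Hypothetical measurements; the last value is the suspected outlier.
values = [4.1, 3.8, 4.4, 4.0, 3.9, 9.7]


def summarize(data):
    """Return the sample size, mean, and median, rounded for reporting."""
    return {"n": len(data), "mean": round(mean(data), 2), "median": round(median(data), 2)}


with_outlier = summarize(values)
without_outlier = summarize(values[:-1])
print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

If the two summaries lead to the same conclusion, the outlier is not driving the result. If they disagree, both should appear in the report.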
An example of inappropriate outlier deletion.
The NASA web site has an interesting example of outlier deletion. Researchers in the 1980s first published information about the hole in the ozone layer above Antarctica. These researchers were nervous because the results from the British Antarctic Survey did not match results from earlier years taken by an American satellite. The authors discovered, however, that the American satellite had a built-in computer filter that automatically removed any large, sudden changes in ozone concentration, treating them as instrument errors. When this filter was removed, the authors were able to trace the development of the ozone hole all the way back to 1976.
Further details about the history of the ozone hole can be found at Ozone Depletion, History and Politics. Brien Sparling. (Accessed on January 11, 2001) http://www.nas.nasa.gov/About/Education/Ozone/history.html
This site explains how the ozone hole was first discovered. It mentions the (inaccurate) claim that a computer filter on a previous satellite had discarded outliers which masked the discovery of the ozone hole for eight years. Although this makes a wonderful teaching example, the actual story is not quite that good.
Summary - Was there a plan?
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
Did the research have a narrow focus? A large number of comparisons limits the confidence that you can place in any single conclusion. Results from a limited number of planned comparisons are considered more authoritative.
Did the authors deviate from the plan? While minor deviations are expected, be cautious about major deviations from the research plan, such as developing new exclusion criteria during the course of the study. In particular, removing outliers without a sound scientific reason is dangerous.
Further reading
Responsibilities of sponsors are limited in premature discontinuation of trials. Richard Ashcroft. BMJ 2001: 323(7303); 53-.
Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results From the Women's Health Initiative randomized controlled trial. Jacques E. Rossouw, Garnet L. Anderson, Ross L. Prentice, Andrea Z. LaCroix, Charles Kooperberg, Marcia L. Stefanick, Rebecca D. Jackson, Shirley A. A. Beresford, Barbara V. Howard, Karen C. Johnson, Jane Morley Kotchen, Judith Ockene. Jama 2002: 288(3); 321-33. CONTEXT: Despite decades of accumulated observational evidence, the balance of risks and benefits for hormone use in healthy postmenopausal women remains uncertain. OBJECTIVE: To assess the major health benefits and risks of the most commonly used combined hormone preparation in the United States. DESIGN: Estrogen plus progestin component of the Women's Health Initiative, a randomized controlled primary prevention trial (planned duration, 8.5 years) in which 16608 postmenopausal women aged 50-79 years with an intact uterus at baseline were recruited by 40 US clinical centers in 1993-1998. INTERVENTIONS: Participants received conjugated equine estrogens, 0.625 mg/d, plus medroxyprogesterone acetate, 2.5 mg/d, in 1 tablet (n = 8506) or placebo (n = 8102). MAIN OUTCOMES MEASURES: The primary outcome was coronary heart disease (CHD) (nonfatal myocardial infarction and CHD death), with invasive breast cancer as the primary adverse outcome. A global index summarizing the balance of risks and benefits included the 2 primary outcomes plus stroke, pulmonary embolism (PE), endometrial cancer, colorectal cancer, hip fracture, and death due to other causes. RESULTS: On May 31, 2002, after a mean of 5.2 years of follow-up, the data and safety monitoring board recommended stopping the trial of estrogen plus progestin vs placebo because the test statistic for invasive breast cancer exceeded the stopping boundary for this adverse effect and the global index statistic supported risks exceeding benefits. 
This report includes data on the major clinical outcomes through April 30, 2002. Estimated hazard ratios (HRs) (nominal 95% confidence intervals [CIs]) were as follows: CHD, 1.29 (1.02-1.63) with 286 cases; breast cancer, 1.26 (1.00-1.59) with 290 cases; stroke, 1.41 (1.07-1.85) with 212 cases; PE, 2.13 (1.39-3.25) with 101 cases; colorectal cancer, 0.63 (0.43-0.92) with 112 cases; endometrial cancer, 0.83 (0.47-1.47) with 47 cases; hip fracture, 0.66 (0.45-0.98) with 106 cases; and death due to other causes, 0.92 (0.74-1.14) with 331 cases. Corresponding HRs (nominal 95% CIs) for composite outcomes were 1.22 (1.09-1.36) for total cardiovascular disease (arterial and venous disease), 1.03 (0.90-1.17) for total cancer, 0.76 (0.69-0.85) for combined fractures, 0.98 (0.82-1.18) for total mortality, and 1.15 (1.03-1.28) for the global index. Absolute excess risks per 10 000 person-years attributable to estrogen plus progestin were 7 more CHD events, 8 more strokes, 8 more PEs, and 8 more invasive breast cancers, while absolute risk reductions per 10 000 person-years were 6 fewer colorectal cancers and 5 fewer hip fractures. The absolute excess risk of events included in the global index was 19 per 10 000 person-years. CONCLUSIONS: Overall health risks exceeded benefits from use of combined estrogen plus progestin for an average 5.2-year follow-up among healthy postmenopausal US women. All-cause mortality was not affected during the trial. The risk-benefit profile found in this trial is not consistent with the requirements for a viable intervention for primary prevention of chronic diseases, and the results indicate that this regimen should not be initiated or continued for primary prevention of CHD. [Abstract]
Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin. A randomized controlled trial. Nutritional Prevention of Cancer Study Group. L. C. Clark, G. F. Combs, Jr., B. W. Turnbull, E. H. Slate, D. K. Chalker, J. Chow, L. S. Davis, R. A. Glover, G. F. Graham, E. G. Gross, A. Krongrad, J. L. Lesher, Jr., H. K. Park, B. B. Sanders, Jr., C. L. Smith, J. R. Taylor. Jama 1996: 276(24); 1957-63. OBJECTIVE: To determine whether a nutritional supplement of selenium will decrease the incidence of cancer. DESIGN: A multicenter, double-blind, randomized, placebo-controlled cancer prevention trial. SETTING: Seven dermatology clinics in the eastern United States. PATIENTS: A total of 1312 patients (mean age, 63 years; range, 18-80 years) with a history of basal cell or squamous cell carcinomas of the skin were randomized from 1983 through 1991. Patients were treated for a mean (SD) of 4.5 (2.8) years and had a total follow-up of 6.4 (2.0) years. INTERVENTIONS: Oral administration of 200 microg of selenium per day or placebo. MAIN OUTCOME MEASURES: The primary end points for the trial were the incidences of basal and squamous cell carcinomas of the skin. The secondary end points, established in 1990, were all-cause mortality and total cancer mortality, total cancer incidence, and the incidences of lung, prostate, and colorectal cancers. RESULTS: After a total follow-up of 8271 person-years, selenium treatment did not significantly affect the incidence of basal cell or squamous cell skin cancer. There were 377 new cases of basal cell skin cancer among patients in the selenium group and 350 cases among the control group (relative risk [RR], 1.10; 95% confidence interval [CI], 0.95-1.28), and 218 new squamous cell skin cancers in the selenium group and 190 cases among the controls (RR, 1.14; 95% CI, 0.93-1.39). 
Analysis of secondary end points revealed that, compared with controls, patients treated with selenium had a nonsignificant reduction in all-cause mortality (108 deaths in the selenium group and 129 deaths in the control group [RR; 0.83; 95% CI, 0.63-1.08]) and significant reductions in total cancer mortality (29 deaths in the selenium treatment group and 57 deaths in controls [RR, 0.50; 95% CI, 0.31-0.80]), total cancer incidence (77 cancers in the selenium group and 119 in controls [RR, 0.63; 95% CI, 0.47-0.85]), and incidences of lung, colorectal, and prostate cancers. Primarily because of the apparent reductions in total cancer mortality and total cancer incidence in the selenium group, the blinded phase of the trial was stopped early. No cases of selenium toxicity occurred. CONCLUSIONS: Selenium treatment did not protect against development of basal or squamous cell carcinomas of the skin. However, results from secondary end-point analyses support the hypothesis that supplemental selenium may reduce the incidence of, and mortality from, carcinomas of several sites. These effects of selenium require confirmation in an independent trial of appropriate design before new public health recommendations regarding selenium supplementation can be made.
Societal responsibilities of clinical trial sponsors. Stephen Evans, Stuart Pocock. BMJ 2001: 322(7286); 569-570.
Journals should see original protocols for clinical trials. C J Hawkey. BMJ 2001: 323(7324); 1309-.
Assessing cause and effect from trials: a cautionary note. D. Howel, R. Bhopal. Control Clin Trials 1994: 15(5); 331-4.
Premature discontinuation of clinical trial for reasons not related to efficacy, safety, or feasibility. Commentary: Early discontinuation violates Helsinki principles. Michel Lievre, Joel Menard, Eric Bruckert, Joel Cogneau, Francois Delahaye, Philippe Giral, Eran Leitersdorf, Gerald Luc, Luis Masana, Philippe Moulin, Philippe Passa, Denis Pouchain, Gerard Siest, K Boyd. BMJ 2001: 322(7286); 603-606. When investigators embark on a clinical trial, they naturally expect that the journey will end with the completion of the scheduled patient follow up and publication of the results. Some trials may sink en route because of organisational or ethical reasons, and such misfortunes must be accepted. Sometimes, however, trials are scuttled by their sponsors. Such premature discontinuation not only is frustrating for investigators but may have important medical implications. In this article we analyse the case of a clinical trial that was recently stopped for financial reasons, discuss the consequences of such discontinuations, and make some proposals to avoid recurrence.
Randomised controlled trial of cardiotocography versus Doppler auscultation of fetal heart at admission in labour in low risk obstetric population. G. Mires, F. Williams, P. Howie. British Medical Journal 2001: 322(7300); 1457-60; discussion 1460-2. (See "Commentary: changes between protocol and manuscript should be declared at submission" at the end of this article.) OBJECTIVE: To compare the effect of admission cardiotocography and Doppler auscultation of the fetal heart on neonatal outcome and levels of obstetric intervention in a low risk obstetric population. DESIGN: Randomised controlled trial. SETTING: Obstetric unit of teaching hospital PARTICIPANTS: Pregnant women who had no obstetric complications that warranted continuous monitoring of fetal heart rate in labour. INTERVENTION: Women were randomised to receive either cardiotocography or Doppler auscultation of the fetal heart when they were admitted in spontaneous uncomplicated labour. MAIN OUTCOME MEASURES: The primary outcome measure was umbilical arterial metabolic acidosis. Secondary outcome measures included other measures of condition at birth and obstetric intervention. RESULTS: There were no significant differences in the incidence of metabolic acidosis or any other measure of neonatal outcome among women who remained at low risk when they were admitted in labour. However, compared with women who received Doppler auscultation, women who had admission cardiotocography were significantly more likely to have continuous fetal heart rate monitoring in labour (odds ratio 1.49, 95% confidence interval 1.26 to 1.76), augmentation of labour (1.26, 1.02 to 1.56), epidural analgesia (1.33, 1.10 to 1.61), and operative delivery (1.36, 1.12 to 1.65). CONCLUSIONS: Compared with Doppler auscultation of the fetal heart, admission cardiotocography does not benefit neonatal outcome in low risk women. Its use results in increased obstetric intervention, including operative delivery. 
Celestial determinants of success in research. R. Pollex, B. Hegele, M.R. Ban. CMAJ 2001: 165(12); 1584.
Ozone Depletion, History and politics. Brien Sparling. Accessed on 2002-11-27. "Ground based measurements of Ozone were first started in 1956, in at Halley Bay, Antarctica. Satellite measurements of ozone started in the early 70's, but the first comprehensive worldwide measurements started in 1978 with the Nimbus-7 satellite. Nimbus-7 carried a TOMS (total ozone mapping spectrometer, and a SBUV(solar backscatter UV meter). The TOMS finally broke on May 7th,1993, but today there are several different satellites measuring concentrations of ozone and other atmosheric gases. Gases in the troposphere and lower stratosphere are sampled by weather balloons or by airplanes such as the ER-2 managed by NASA." www.nas.nasa.gov/About/Education/Ozone/history.html
False positive outcomes and design characteristics in occupational cancer epidemiology studies. G. G. Swaen, O. Teggeler, L. G. van Amelsvoort. Int J Epidemiol 2001: 30(5); 948-54. BACKGROUND: Recently there has been considerable debate about possible false positive study outcomes. Several well-known epidemiologists have expressed their concern and the possibility that epidemiological research may loose credibility with policy makers as well as the general public. METHODS: We have identified 75 false positive studies and 150 true positive studies, all published reports and all epidemiological studies reporting results on substances or work processes generally recognized as being carcinogenic to humans. All studies were scored on a number of design characteristics and factors relating to the specificity of the research objective. These factors included type of study design, use of cancer registry data, adjustment for smoking and other factors, availability of exposure data, dose- and duration-effect relationship, magnitude of the reported relative risk, whether the study was considered a 'fishing expedition', affiliation and country of the first author. RESULTS: The strongest factor associated with the false positive or true positive study outcome was if the study had a specific a priori hypothesis. Fishing expeditions had an over threefold odds ratio of being false positive. Factors that decreased the odds ratio of a false positive outcome included observing a dose-effect relationship, adjusting for smoking and not using cancer registry data. CONCLUSION: The results of the analysis reported here clearly indicate that a study with a specific a priori study objective should be valued more highly in establishing a causal link between exposure and effect than a mere fishing expedition.
Who knew what when?
Introduction
Knowledge of group membership during data collection can cause problems. When possible, the treatment status should be hidden from the patients, anyone who interacts with the patients, anyone who evaluates the patients, and anyone who collects data from the patients. Even when this is not possible, the randomization list should stay concealed until the patient agrees to participate in the study and is shown to be eligible for the study.
Acupuncture
Acupuncture is an example of a therapy that is difficult to blind. One study examined the effect of acupuncture on the prevention of recidivism among alcohol and other drug abusers (Bullock et al 1989). This study used a placebo acupuncture that placed needles 5 mm away from the designated acupuncture point.
The use of placebo acupuncture was intended to keep information about the treatment groups hidden from the patients themselves. The patients knew that they were being "needled", but they did not know if the needles were placed correctly or incorrectly. The assumption for this study is that if acupuncture is effective, then correct application of acupuncture should show a greater effect than incorrect application of acupuncture. There is some controversy, however, over this assumption (Nahin and Straus 2001).
Because of the nature of acupuncture, the acupuncturists were aware of which patients were which, making this only a partially blinded study. A critique of this study (Sampson 1997) pointed out that there were significant interactions between the acupuncturists and the patients, with opportunities for indirect suggestion and nonverbal communication to occur. One indication that subjects became aware of who was in which group was the fact that there was a far greater tendency for control subjects to drop out of the study.
Definition of blinding.
In an experimental study, it is desirable (but not always possible) to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as "blinding" or "masking." Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study.
There is always some individual who knows which patients get which treatments, such as the pharmacy that prepares the pills and placebos. This is perfectly fine as long as these individuals do not interact with the patients or evaluate the patients.
There is a bit of ambiguity with respect to who is blinded (Devereaux et al 2001). For example, a survey of 25 textbooks produced nine different definitions of "double blind." Therefore, you should avoid using these terms and focus instead on which individuals are blinded. If you are evaluating an article, look for evidence of blinding for the following groups:
the patients themselves,
clinicians who have substantial interactions with the patients,
anyone who assesses outcomes in these patients, or
anyone who collects data from these patients.
If only some of the above are unaware of the treatment, then the study is partially blinded.
The effect of blinding on the patient.
Blinding prevents the placebo effect from distorting the research results. The placebo effect is a product of "belief, expectancy, cognitive reinterpretation, and diversion of attention" that can lead to psychological and sometimes physiological improvements in situations where the treatment is known to have no effect, such as sugar pills (Beyerstein 1997).
Johnson (1997) lists three specific situations where the placebo effect is of particular concern: when enthusiasm by the patient or the doctor for the new procedure is strong, when outcomes are based on the patient's self-assessment (e.g. quality of life studies), and when the treatment is primarily for symptoms. The placebo effect is less critical for objective outcomes like survival.
A recent study showed that the placebo effect might be overstated in some contexts (Hrobjartsson and Gotzsche 2001). Some of the effects attributed to the placebo are perhaps caused instead by statistical artefacts like regression to the mean or by the tendency of some conditions to resolve spontaneously.
Even without a placebo effect, blinding would still be important to ensure uniform rates of compliance. You want to avoid a situation where a patient thinks "I'm in the placebo arm, so it's not really important whether I show up for my follow-up evaluation."
The effect of blinding on the investigators.
The value of blinding also extends to the research team, and should include anyone who interacts with the patients. In a clinical trial of treatments for multiple sclerosis, a pair of neurologists assessed the outcome of each patient (Noseworthy et al 1994). One neurologist was blinded to the treatment status and one was unblinded. The unblinded neurologist gave substantially lower ratings to patients in the placebo group, which would have led to falsely concluding that one of the treatments was effective.
Researchers can also influence the outcome through their attitudes and through their differential use of other medications (Schulz et al 2002).
Those who collect data through an interview might probe harder for some patients if they are not blinded. Gail (1996) describes an observational study where the people asking questions about smoking and other risk factors were unaware of whether they were interviewing lung cancer patients or controls. Thus, the interviewers could not subconsciously prod more for smoking information among the lung cancer patients.
When blinding is impossible
Unfortunately, there are many situations where blinding is impossible. For example, if you are comparing oral versus rectal administration of a drug, that's pretty hard to conceal from the patient. In general, observational studies cannot be blinded, because the patient and/or their doctor selects the treatment group.
Surgical procedures are often difficult to completely blind. Nevertheless, Johnson (1997) suggests some partial steps at blinding that prevent some of the biases from creeping in. If two surgical procedures use different types of incisions, identical blood- or iodine-stained opaque dressings could be used to keep the patients unaware of which operation was performed. Also, although the surgeon cannot be blinded to the difference in surgery, those who evaluate the health of the patient after surgery could be kept unaware of the particular operation, so as to ensure that their evaluation of the patient is unbiased.
Even though the placebo may look the same, sometimes the doctor may infer which group a patient belongs to, perhaps through noting a characteristic set of side effects. If you are worried about this, ask the doctors to try to identify which treatment group they believe each patient belonged to. If the percentage of correct guesses is significantly larger than 50%, then the allocation scheme was not sufficiently blinded.
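The comparison against 50% can be done with an exact binomial calculation. The sketch below assumes two equally sized groups, so that a doctor guessing at random would be right half the time, and the counts are invented. The function name is made up for this example.

```python
from math import comb


def chance_of_this_many_correct(n_guesses, n_correct):
    """Probability of n_correct or more correct guesses if every guess is a coin flip."""
    favorable = sum(comb(n_guesses, k) for k in range(n_correct, n_guesses + 1))
    return favorable / 2 ** n_guesses


# Hypothetical results from asking doctors to guess the group of 100 patients.
print(round(chance_of_this_many_correct(100, 53), 3))  # consistent with pure guessing
print(round(chance_of_this_many_correct(100, 62), 3))  # hard to explain by guessing alone
```

A small probability suggests that something, such as a characteristic side effect, was giving the allocation away.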
Although unblinded studies are considered less authoritative than blinded studies, you should not use blinding as a surrogate marker for the quality of the research (Schulz et al 2002). For example, Rupert Sheldrake conducted a survey of various journals and showed that blinding was used in 85% of all parapsychology research. But it would be a mistake to claim, as Dr. Sheldrake does, that
"Parapsychologists ... have been constantly subjected to intense scrutiny by skeptics, and this has made them more rigorous." http://www.parascope.com/en/articles/blindScience.htm
Blinding is just one of many factors that combine to indicate a study's rigor and quality.
The problem with studies without blinding.
Two researchers have examined studies with and without blinding. These authors found that studies without blinding show an average bias of 11-17% (Schulz 1996; Colditz 1989). In other words, when an unblinded study was compared to a blinded study, the former study tended to estimate a treatment effect that was (on average) 11% to 17% higher than the latter.
Additional evidence of this problem appears in a meta-analysis of the association between intermittent sunlight exposure and melanoma (Nelemans 1995). When nine studies without blinding were combined, they showed an odds ratio of 1.84, which was statistically significant (95% confidence interval 1.52 to 2.25). When the seven studies with blinding were combined, they showed a much smaller odds ratio (1.17, 95% confidence interval 0.98 to 1.39), which was not statistically significant. This is further evidence that unblinded studies are more likely to show statistical significance than blinded studies.
Concealed allocation.
Another important aspect of research is concealed allocation, which is the concealment of the randomization list from those involved with recruiting subjects. The list stays concealed until after the subject agrees to participate and the recruiter determines that the subject is eligible for the study.
It is always possible to conceal the randomization list, even when the treatment itself cannot be blinded. The recruiter checks all the exclusion criteria and, only if the subject qualifies, opens a sealed envelope that identifies which group the patient belongs to. So, for example, it is impossible to use blinding when comparing a surgical to a non-surgical technique, but the selection of who gets the surgical technique could be hidden from both the patient and the surgeon until after all the selection and inclusion criteria are applied.
Knowledge of treatment order allows the doctors recruiting patients to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unblinded allocation schemes show an average bias of 30-40% (Schulz 1996).
There are many stories of physicians who have tried and succeeded in recruiting a patient into a preferred group. If the treatment allocation is hidden in sealed envelopes, they can hold it up to a strong light. If the sealed envelopes are not sequentially numbered, they can open several envelopes at once. If the allocation is controlled by a central operator, they can call and ask for the allocation of several patients at once.
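The safeguards implied by these stories can be sketched as a small central allocation service. Everything here is hypothetical, including the class name and its methods. The sketch uses simple randomization and is meant only to show the logic: one assignment is released at a time, in sequence, and only after consent and eligibility have been recorded.

```python
import random


class ConcealedAllocator:
    """Hypothetical central allocator that reveals one assignment per enrolled patient."""

    def __init__(self, n_patients, seed=None):
        rng = random.Random(seed)
        # The full list is generated up front and is meant to stay hidden from
        # recruiters (the leading underscore marks it private by convention only).
        self._assignments = [rng.choice(["treatment", "control"]) for _ in range(n_patients)]
        self._next = 0
        self.log = []  # (sequence number, patient id, arm): an audit trail

    def enroll(self, patient_id, consented, eligible):
        """Release the next assignment only for a consenting, eligible patient."""
        if not (consented and eligible):
            raise ValueError("assignment is not revealed before consent and eligibility")
        if self._next >= len(self._assignments):
            raise RuntimeError("randomization list is exhausted")
        arm = self._assignments[self._next]
        self._next += 1
        self.log.append((self._next, patient_id, arm))
        return arm


allocator = ConcealedAllocator(n_patients=10, seed=42)
first_arm = allocator.enroll("patient-001", consented=True, eligible=True)
print(allocator.log)
```

Because each call is tied to a named patient and logged in sequence, a recruiter cannot ask for several assignments at once or learn an assignment before deciding whether a patient is eligible.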
When a doctor has an overt preference to enroll a patient into one group over another, it raises ethical issues about equipoise and perhaps the doctor should not be participating in the trial.
Concealed allocation only makes sense for a truly randomized study. For convenience, some researchers will allocate in a systematic (non-random) fashion, such as alternating regularly between the two treatments. This is a bad idea. Systematic allocations allow the doctors to guess which group the next patient is going to be allocated to, leading to the same potential problems described above. Systematic assignment causes an average bias of 15% (Colditz 1989).
Summary - Who knew what when?
Knowledge of group membership, either before or during data collection, can bias the study. Ask yourself who knew what when. Ideally, information about the treatment should be hidden from the patients themselves, anyone interacting with the patients, anyone evaluating the patients, and anyone collecting data from the patients. The randomization list should be concealed and the treatment assignment should not be revealed until the patient agrees to participate in the study and the recruiting physician has verified that the patient is eligible for the study.
Further reading
Controlled trial of acupuncture for severe recidivist alcoholism. Bullock ML, Culliton PD and Olander RT. Lancet 1989:1(8652);1435-9.
How study design affects outcomes in comparisons of therapy. I: Medical. Colditz G, Miller J and Mosteller F. Stat Med 1989:8(4);441-454.
"Double blind, you are the weakest link- good-bye!" Devereaux PJ, Bhandari M, Montori VM, Manns BJ, Ghali WA and Guyatt GH. ACP Journal Club 2002:136;A11-A12.
Statistics in Action. Gail MH. Journal of the American Statistical Association 1996:91(433);1-13.
Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. Hrobjartsson A and Gotzsche PC. N Engl J Med 2001:344(21);1594-602.
Removing bias in surgical trials. Johnson AG and Dixon JM. British Medical Journal 1997:314(7085);916-7.
Research into complementary and alternative medicine: problems and potential. Nahin RL and Straus SE. British Medical Journal 2001:322(7279);161-4.
An addition to the controversy on sunlight exposure and melanoma risk: a meta-analytical approach. Nelemans PJ, Rampen FH, Ruiter DJ and Verbeek AL. J Clin Epidemiol 1995:48(11);1331-42.
The impact of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial. Noseworthy JH, Ebers GC, Vandervoort MK, Farquhar RE, Yetisir E and Roberts R. Neurology 1994:44(1);16-20.
Inconsistencies and Errors in Alternative Medicine Research. Sampson W. Skeptical Inquirer 1997 (September/October);21(5):35-38.
Empirical evidence of bias dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Schulz K, Chalmers I, Hayes R and Altman D. JAMA 1995:273(5);408-12.
Randomised trials, human nature, and reporting guidelines. Schulz KF. Lancet 1996:348(9027);596-8.
The Landscape and Lexicon of Blinding in Randomized Trials. Schulz KF, Chalmers I and Altman DG. Annals of Internal Medicine 2002:136(3);254-259.
Allocation concealment in randomised trials: defending against deciphering. Schulz KF and Grimes DA. Lancet 2002:359;614-618.
Blinding in randomised trials: hiding who got what. Schulz KF and Grimes DA. Lancet 2002:359;696-700.
Generation of allocation sequences in randomised trials: chance not choice. Schulz KF and Grimes DA. Lancet 2002:359;515-519.
Who was left out?
Introduction
Research studies often have a narrow focus, but sometimes it can be too narrow. When too many patients are left out, those who remain may not be representative of the types of patients you will encounter.
When you are trying to figure out who was left out and what impact this has, ask the following two questions:
4.1 Who was excluded at the start of the study?
4.2 Who dropped out during the study?
Nicotine patches
The Journal of Pediatrics published a study of adolescent smokers in 1996. The researchers recruited 22 volunteers from five public high schools in the Rochester, MN area for participation in a smoking cessation program involving behavioral counseling, group therapy, and nicotine patches. Researchers measured the number of cigarettes smoked, side effects, and blood levels of nicotine.
The purpose of the research was to evaluate "the safety, tolerance, and efficacy of 22 mg/d nicotine patch therapy in smokers younger than 18 years who were trying to stop smoking." The authors also listed a secondary goal, "to compare blood cotinine levels, nicotine withdrawal scores, and adverse experiences with those of adults obtained in previous patch studies." Cotinine is a metabolite of nicotine and provides a useful objective measure of cigarette smoking. It also allowed the authors to examine whether nicotine toxicity was an issue.
This study did not include major segments of the teenage smoking population. The study included only white subjects because there were too few minority students in the Rochester area. Subjects had to get parental permission, excluding smokers who wished to keep their habit secret from their parents. Subjects were also volunteers, and thus could be considered more motivated to quit than the typical teenage smoker.
The study also had a serious dropout rate. Of the presumably thousands of teenage smokers in the Rochester, Minnesota area, only 71 volunteers responded to the initial call for subjects. Of the 71 volunteers, 55% met inclusion criteria. Of the remaining 39, 44% declined to attend the initial meeting. Of the remaining 22, 14% were non-compliant. Of the remaining 18, 39% failed to respond to the one year survey. Only 11 completed the entire study (50% of those who started the study; 28% of those meeting inclusion criteria; 15% of the initial volunteers).
This study had a serious problem with who was left out. The large number of subjects who did not get into the study or who did not complete the study makes it hard to generalize the findings of this research.
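When an article reports a chain of exclusions and dropouts like this one, it helps to lay the counts out stage by stage and recompute the percentages yourself. The following sketch does that with the counts quoted above (71, 39, 22, 18, 11); the stage labels and the function name are my own shorthand, not the authors' terminology.

```python
# Counts quoted in the text for the nicotine patch study.
stages = [
    ("volunteered", 71),
    ("met inclusion criteria", 39),
    ("attended initial meeting", 22),
    ("compliant", 18),
    ("completed the study", 11),
]

def retention_table(stages):
    """Return (label, count, percent of previous stage, percent of first stage)."""
    first = stages[0][1]
    previous = first
    rows = []
    for label, count in stages:
        rows.append((label, count, 100 * count / previous, 100 * count / first))
        previous = count
    return rows

rows = retention_table(stages)
for label, count, pct_previous, pct_first in rows:
    print(f"{label:26s} {count:3d}  "
          f"{pct_previous:5.1f}% of previous  {pct_first:5.1f}% of volunteers")
```

Recomputing from the raw counts is also a useful check on the article. Here, for example, the drop from 22 to 18 subjects works out to about 18%, a little more than the 14% quoted for non-compliance; the counts alone do not explain the difference.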
4.1 Who was excluded at the start of the study?
Researchers, trying to minimize variation, will use exclusion criteria to create more homogeneous groups. While minimizing variability is good, too much homogeneity can backfire. It’s difficult to extrapolate results from a very tightly controlled and homogeneous clinical trial to the variation of patients seen in your practice. Ask yourself the question "How similar are my patients?"
For the study to be useful to us, we want the research subjects to be as similar as possible to the patients we see. Watch out for exclusion criteria that leave out large groups of patients. Also be aware that too many research studies exclude women unnecessarily.
Ask yourself whether the geographic location or the type of health care setting places restrictions on the type of patients seen. Tertiary care centers only see patients that are extremely ill. A study of Midwest hospitals will not have a representative number of Hispanic patients compared to the Southwest.
Exclusion of elderly patients
[To be added]
Exclusion of women
[To be added]
Exclusion of children
[To be added]
Volunteer bias
Quite often, the only patients we are able to study are those who volunteer to help out. The use of volunteers, however, may exclude important segments of the patient population.
Volunteers may differ from the normal population on several critical factors. Volunteers for a study involving cash payments may come more often from economically challenged environments. If a free health check-up is included, volunteers may come more often from people worried about their health status. Volunteers for lengthy studies are less likely to be employed.
Recruiting controls is especially troublesome in a study that involves a painful procedure. Gustavsson (1997) documents volunteer bias in a study of lumbar puncture to obtain cerebrospinal fluid.
In this study, subjects were asked to submit to a lumbar puncture in order to "examine the associations between personality traits and biochemical variables." Of the 87 subjects, 48 declined to participate. The authors were fortunate enough to have measures of personality on both those who participated in the study and those who did not participate.
Those who participated had scores roughly a half standard deviation higher on impulsiveness. They did not differ on other personality traits such as socialization and detachment.
The large difference in the impulsiveness measurement would obviously cloud any attempt to correlate personality traits and biochemical measurements in spinal fluids among those who volunteered.
Hughes et al (1997) point out the obvious fact that smokers who participate in smoking cessation studies are different from smokers in the general population.
Volunteers in survey study.
Volunteer bias can also occur in survey studies. People who volunteer to return a questionnaire are frequently quite different from those who refuse to fill out the survey. In particular, the non-responders tend to be more apathetic. Return rates for surveys vary by the type of survey, but if fewer than half of the subjects returned the survey, any results are of very limited value. Again, look for efforts to minimize non-response and/or efforts to characterize the demographics of non-responders.
Stocks and Grunnell (2000) examined general practitioners who routinely failed to return mail surveys. A follow-up telephone call assessed demographic characteristics of this group. They were older, less likely to have postgraduate qualifications, and less likely to be involved with a teaching practice.
The use of email and the Internet to recruit and/or survey subjects is problematic, because not everyone owns or uses a computer. Etter and Perneger (2001) recruited cigarette smokers both by the Internet and by regular mail. Those subjects recruited by the Internet differed in age, education, degree of smoking, and desire to quit. The authors of this report, however, argue that in spite of these demographic differences, the trends and associations found in the Internet recruited group matched those of the other group. For example, in both groups, light smokers were more likely than heavy smokers to adopt a "taking control" self-change strategy and less likely to adopt a "risk assessment" strategy.
In 1976, Shere Hite published a study on female sexual attitudes that represented the responses of 3,019 surveys. While that sounds impressive, it was a small fraction of the 100,000 surveys that were sent out.
One can speculate on the characteristics of those who failed to respond, but it is a pretty good bet that many of them felt uncomfortable discussing aspects of their sex lives in a survey format. This reluctance alone would likely affect many of the responses in the survey.
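A quick calculation shows how little a 3% return rate can tell you. The sketch below computes the response rate from the counts quoted above, then works out the widest range the true population figure could take if nothing is known about the non-responders. The 70% figure for respondents is purely hypothetical and is not a result from the Hite study; the function names are likewise my own.

```python
def response_rate(returned, sent):
    """Fraction of surveys that came back."""
    return returned / sent

def population_bounds(p_respondents, rate):
    """Worst-case bounds on a population proportion when nothing is
    known about the non-responders.

    Lower bound: no non-responder has the characteristic.
    Upper bound: every non-responder has it.
    """
    lower = p_respondents * rate
    upper = p_respondents * rate + (1 - rate)
    return lower, upper

rate = response_rate(3019, 100_000)  # counts quoted in the text

# Hypothetical: suppose 70% of respondents gave a particular answer.
lower, upper = population_bounds(0.70, rate)
print(f"response rate: {rate:.1%}")
print(f"population proportion could lie between {lower:.1%} and {upper:.1%}")
```

With a return rate this low, the bounds run from roughly 2% to 99%, so the survey by itself places almost no constraint on the population as a whole. Any narrower conclusion has to rest on an assumption that responders and non-responders are alike, which is exactly what is in doubt.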
What to look for in studies using volunteers.
Examine the incentives and disincentives for participation. Are any incentives or disincentives related to important prognostic factors?
Were the researchers able to characterize various aspects of those who did not volunteer? How similar were the volunteers and non-volunteers?
Do people volunteer themselves into specific treatment groups? If so, we have an observational study.
Some studies involve the use of volunteers who are subsequently randomized into two groups. In this case, some problems will diminish. Comparison between the two groups will be unbiased, but it may be difficult to generalize to a non-volunteer population.
4.2 Who dropped out during the study?
It is inevitable that some patients will drop out during the study. If the number is more than a few, this is a cause for concern. Dropouts often have a different prognosis than those who stay. Ignoring the dropouts will often paint a rosier picture of the outcome. Was there any effort (financial inducement, follow-up reminders) made to minimize dropouts? Were the authors able to characterize the demographics of the dropouts?
Were non-compliant patients excluded? Non-compliance is often associated with poor prognosis. Excluding these patients may also paint a rosier picture of the outcome. Patients should be analyzed in the groups they were randomized to. This is known as "intention to treat" analysis.
Consider a new surgical therapy which is being compared to a standard non-surgical therapy. Some patients randomized to the surgical therapy might die prior to receiving the therapy. This is the most extreme form of non-compliance. These patients should still be analyzed as part of the surgical therapy group. Otherwise the rapidly dying patients will be excluded from the treatment group, but not from the control group, leading to serious bias.
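The surgical example can be made concrete with a few made-up numbers. In the sketch below every count is hypothetical and chosen only to illustrate the bias: the two arms have identical mortality when analyzed as randomized, yet surgery looks better once the patients who died before their operation are dropped.

```python
def mortality(deaths, patients):
    """Proportion of patients who died."""
    return deaths / patients

# Hypothetical counts, chosen only to illustrate the bias.
randomized_surgery = 100
died_before_surgery = 10   # never received the assigned operation
died_after_surgery = 18
randomized_control = 100
died_control = 28

# Intention to treat: everyone is analyzed in the group they were randomized to.
itt_surgery = mortality(died_before_surgery + died_after_surgery,
                        randomized_surgery)
itt_control = mortality(died_control, randomized_control)

# Excluding early deaths from the surgical arm only (the flawed analysis).
pp_surgery = mortality(died_after_surgery,
                       randomized_surgery - died_before_surgery)

print(f"intention to treat: surgery {itt_surgery:.0%} vs control {itt_control:.0%}")
print(f"early deaths dropped: surgery {pp_surgery:.0%} vs control {itt_control:.0%}")
```

In this invented data set the intention to treat analysis shows 28% mortality in both arms, while the analysis that drops early deaths shows 20% for surgery against 28% for control. The apparent benefit comes entirely from removing the sickest patients from one group and not the other.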
Additional resources
Unjustified exclusion of elderly people from studies submitted to research ethics committee for approval: descriptive study. A. Bayer and W. Tadd. British Medical Journal 2000:321(7267);992-3.
Exclusion of elderly people from clinical research: a descriptive study of published reports. G. Bugeja, A. Kumar and A. K. Banerjee. British Medical Journal 1997:315(7115);1059.
Hold the Lard! The Atkins Diet still doesn't work. Michael Fumento. Accessed on 2002-12-06. A careful analysis of the recent research on the Atkins diet shows that there was a much higher drop out rate in that group, which could partially explain the promising results of this diet. www.reason.com/hod/mf120502.shtml
Participation in Research and Access to Experimental Treatments by HIV-Infected Patients. Allen L. Gifford, William E. Cunningham, Kevin C. Heslin, Ron M. Andersen, Terry Nakazono, Dale K. Lieu, Martin F. Shapiro, Samuel A. Bozzette and the HIV Cost and Services Utilization Study Consortium. N Engl J Med 2002:346(18);1373-1382. Background Although there is concern that minority groups and women are underrepresented in research involving patients with human immunodeficiency virus (HIV) infection, the available data are inconclusive. Methods We used nationally representative data from the HIV Cost and Services Utilization Study to determine the characteristics of the participants and nonparticipants in trials of medications for HIV infection and whether or not patients had access to experimental treatments. A probability sample of 2864 persons, representing all 231,400 adults with known HIV infection who are cared for in the contiguous United States, were interviewed on three occasions between 1996 and 1998. They were asked about participation in clinical research studies of medications and past receipt of experimental medications for HIV. Results We estimate that 14 percent of adults receiving care for HIV infection participated in a medication trial or study; 24 percent had received experimental medications; and 8 percent had tried and failed to obtain experimental treatments. According to multivariate models, non-Hispanic blacks and Hispanics were less likely to be participating in trials than non-Hispanic whites (odds ratio for participation among non-Hispanic blacks, 0.50 [95 percent confidence interval, 0.28 to 0.91]; odds ratio among Hispanics, 0.58 [95 percent confidence interval, 0.37 to 0.93]) and to have received experimental medications (odds ratios, 0.41 [95 percent confidence interval, 0.32 to 0.54] and 0.56 [95 percent confidence interval, 0.41 to 0.78], respectively). 
Patients who were cared for in private health maintenance organizations were less likely to participate in trials than those with fee-for-service insurance (odds ratio, 0.43 [95 percent confidence interval, 0.21 to 0.88]). Women were not underrepresented in research trials and had a similar likelihood of receiving experimental treatments. Conclusions Among patients with HIV infection, participation in research trials and access to experimental treatment is influenced by race or ethnic group and type of health insurance.
The exclusion of the elderly and women from clinical trials in acute myocardial infarction. J. H. Gurwitz, N. F. Col and J. Avorn. Jama 1992:268(11);1417-22. OBJECTIVE--To determine the extent to which the elderly have been excluded from trials of drug therapies used in the treatment of acute myocardial infarction, to identify factors associated with such exclusions, and to explore the relationship between the exclusion of elderly and the representation of women. DATA SOURCES--We conducted a systematic search of the English-language literature from January 1960 through September 1991 to identify all relevant studies of specific pharmacotherapies employed in the treatment of acute myocardial infarction. To accomplish this, we searched MEDLINE, major cardiology textbooks, meta-analyses, reviews, editorials, and the bibliographies of all identified articles. STUDY SELECTION--Only trials in which patients were randomly allocated to receive a specific therapeutic regimen or a placebo or nonplacebo control regimen were included for review. DATA EXTRACTION--Studies were abstracted for year of publication, source of support, performance location, drug therapies to which patients were randomized, use of invasive diagnostic tests or therapeutic procedures, exclusion criteria, size and demographic characteristics of the randomized study population, and principal outcome measures. DATA SYNTHESIS--A total of 214 trials met inclusion criteria, involving 150,920 study subjects. Over 60% of trials excluded persons over the age of 75 years. Studies published after 1980 were more likely to have age-based exclusions compared with studies published before 1980 (adjusted odds ratio, 4.92; 95% confidence interval, 2.33 to 10.54). Trials of thrombolytic therapy involving an invasive procedure were more likely to exclude elderly patients compared with other studies (adjusted odds ratio, 2.45; 95% confidence interval, 1.10 to 5.47). 
Studies with age-based exclusions had a smaller percentage of women compared with those without such exclusions (18% vs 23%; P = .0002), with the mean age of the study population significantly associated with the proportion of women participants (P = .0001, R2 = .29). CONCLUSIONS--Age-based exclusions are frequently used in clinical trials of medications used in the treatment of acute myocardial infarction. Such exclusions limit the ability to generalize study findings to the patient population that experiences the most morbidity and mortality from acute myocardial infarction.
Randomised study of long term outcome after epidural versus non-epidural analgesia during labour. C. J. Howell, T. Dean, L. Lucking, K. Dziedzic, P. W. Jones and R. B. Johanson. BMJ 2002:325(7360);357. OBJECTIVE: To determine whether epidural analgesia during labour is associated with long term backache. DESIGN: Follow up after randomised controlled trial. Analysis by intention to treat. SETTING: Department of obstetrics and gynaecology at one NHS trust. PARTICIPANTS: 369 women: 184 randomised to epidural group (treatment as allocated received by 123) and 185 randomised to non-epidural group (treatment as allocated received by 133). In the follow up study 151 women were from the epidural group and 155 from the non-epidural group. MAIN OUTCOME MEASURES: Self reported low back pain, disability, and limitation of movement assessed through one to one interviews with physiotherapist, questionnaire on back pain and disability, physical measurements of spinal mobility. RESULTS: There were no significant differences between groups in demographic details or other key characteristics. The mean time interval from delivery to interview was 26 months. There were no significant differences in the onset or duration of low back pain, with nearly a third of women in each group reporting pain in the week before interview. There were no differences in self reported measures of disability in activities of daily living and no significant differences in measurements of spinal mobility. CONCLUSIONS: After childbirth there are no differences in the incidence of long term low back pain, disability, or movement restriction between women who receive epidural pain relief and women who receive other forms of pain relief.
Do safety practices differ between responders and non-responders to a safety questionnaire? D. Kendrick, R. Hapgood and P. Marsh. Injury Prevention 2001:7(2);100-3. OBJECTIVE: To compare reported safety practices between responders and non-responders to a safety survey. DESIGN: Cross sectional survey at baseline compared with safety practices reported at subsequent child health surveillance checks. SUBJECTS: Parents of children aged 3-12 months registered with practices participating in a controlled trial of injury prevention in primary care that did, and did not, respond to the baseline survey and who subsequently attended child health surveillance checks. RESULTS: No difference in safety practices was found between responders and non-responders to the survey at the 6-9 month check. Responders were more likely to report owning a stair gate (odds ratio (OR) 2.75, 95% confidence interval (CI) 1.82 to 4.16) and socket covers (OR 2.16, 95% CI 1.53 to 3.04) at the 12-15 month check, and owning socket covers (OR 2.19, 95% CI 1.34 to 3.61) at the 18-24 month check. Responders were more likely to report greater than the median number of safety practices at the 18 month check. CONCLUSIONS: Non-responders to a safety survey appear to be less likely to report owning several items of safety equipment than responders. Further work is needed to confirm these findings. Extrapolating the results of safety surveys to the population as a whole may lead to over estimation of safety equipment possession.
Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. M. S. Lachs, I. Nachamkin, P. H. Edelstein, J. Goldman, A. R. Feinstein and J. S. Schwartz. Ann Intern Med 1992:117(2);135-40. OBJECTIVE: To determine if the leukocyte esterase and bacterial nitrite rapid dipstick test for urinary tract infection (UTI) is susceptible to spectrum bias (when a diagnostic test has different sensitivities or specificities in patients with different clinical manifestations of the disease for which the test is intended). DESIGN: Cross-sectional study. PATIENTS: A total of 366 consecutive adult patients in whom clinicians performed urinalysis to diagnose or exclude UTI. SETTING: An urban emergency department and walk-in clinic. MEASUREMENTS: After the patient encounter, but before dipstick test or culture was done, clinicians recorded the signs and symptoms that were the basis for suspecting UTI and for performing a urinalysis and an estimate of the probability of UTI based on the clinical evaluation. For all patients who received urinalysis, dipstick tests and culture were done in the clinical microbiology laboratory by medical technologists blinded to clinical evaluation. Sensitivity for the dipstick was calculated using a positive result in either leukocyte esterase or bacterial nitrite, or both, as the criterion for a positive dipstick, and greater than 10(5) CFU/mL for a positive culture. RESULTS: In the 107 patients with a high (greater than 50%) prior probability of UTI, who had many characteristic UTI symptoms, the sensitivity of the test was excellent (0.92; 95% CI, 0.82 to 0.98). In the 259 patients with a low (less than or equal to 50%) prior probability of UTI, the sensitivity of the test was poor (0.56; CI, 0.03 to 0.79). 
CONCLUSIONS: The leukocyte esterase and bacterial nitrite dipstick test for UTI is susceptible to spectrum bias, which may be responsible for differences in the test's sensitivity reported in previous studies. As a more general principle, diagnostic tests may have different sensitivities or specificities in different parts of the clinical spectrum of the disease they purport to identify or exclude, but studies evaluating such tests rarely report sensitivity and specificity in subgroups defined by clinical symptoms. When diagnostic tests are evaluated, information about symptoms in the patients recruited for study should be included, and analyses should be done within appropriate clinical subgroups so that clinicians may decide if reported sensitivities and specificities are applicable to their patients.
Comorbidity of chronic diseases in general practice. F. G. Schellevis, J. van der Velden, E. van de Lisdonk, J. T. van Eijk and C. van Weel. J Clin Epidemiol 1993:46(5);469-73. With the increasing number of elderly people in The Netherlands the prevalence of chronic diseases will rise in the next decades. It is recognized in general practice that many older patients suffer from more than one chronic disease (comorbidity). The aim of this study is to describe the extent of comorbidity for the following diseases: hypertension, chronic ischemic heart disease, diabetes mellitus, chronic nonspecific lung disease, osteoarthritis. In a general practice population of 23,534 persons, 1989 patients have been identified with one or more chronic diseases. Only diseases in agreement with diagnostic criteria were included. In persons of 65 and older 23% suffer from one or more of the chronic diseases under study. Within this group 15% suffer from more than one of the chronic diseases. Osteoarthritis and diabetes mellitus are the diseases with the highest rate of comorbidity. Comorbidity restricts the external validity of results from single-disease intervention studies and complicates the organization of care.
Sample size slippages in randomized trials: exclusions and the lost and wayward. K. F. Schulz and D. A. Grimes. Lancet 2002:359;781-785. Proper randomisation means little if investigators cannot include all randomised participants in the primary analysis. Participants might ignore follow-up, leave town, or take aspartame when instructed to take aspirin. Exclusions before randomisation do not bias the treatment comparison, but they can hurt generalisability. Eligibility criteria for a trial should be clear, specific, and applied before randomisation. Readers should assess whether any of the criteria make the trial sample atypical or unrepresentative of the people in which they are interested. In principle, assessment of exclusions after randomisation is simple: none are allowed. For the primary analysis, all participants enrolled should be included and analysed as part of the original group assigned (an intent-to-treat analysis). In reality, however, losses frequently occur. Investigators should, therefore, commit adequate resources to develop and implement procedures to maximise retention of participants. Moreover, researchers should provide clear, explicit information on the progress of all randomised participants through the trial by use of, for instance, a trial profile. Investigators can also do secondary analyses on, for instance, per-protocol or as-treated participants. Such analyses should be described as secondary and non-randomised comparisons. Mishandling of exclusions causes serious methodological difficulties. Unfortunately, some explanations for mishandling exclusions intuitively appeal to readers, disguising the seriousness of the issues. Creative mismanagement of exclusions can undermine trial validity.
Nonresponse bias and early versus all responders in mail and telephone surveys. J. Siemiatycki and S. Campbell. Am J Epidemiol 1984:120(2);p291-301. Mail and telephone survey methods, with or without follow-up by other methods, are cost-effective alternatives to the conventional home interview approach. However, it has long been thought that they are especially susceptible to nonresponse bias. The study addressed this issue in the context of parallel mail and telephone health surveys carried out in Montreal. The mail strategy among 1,555 adults achieved 68.5% response and follow-up by telephone and home interview increased response to 80.9%. Respondents were adequately representative of the entire sample with respect to socioeconomic status, number of adults in household, and ethnic distribution. The 68.5% initial stage respondents were similar to all respondents on the above variables as well as on age, sex, education and reported health status. Odds ratios of smoking and respiratory symptoms hardly differed between initial stage and all respondents. The telephone survey among 1,595 adults achieved 72.7% response and follow-up by mail and personal interview increased response to 88.2%. Comparisons between respondents and the entire sample and between initial stage respondents and all respondents gave similar results to those found in the mail strategy, although there was some change in a symptom-smoking odds ratio from the initial stage respondents to all respondents. In both survey strategies, there was no evidence of substantial nonresponse bias and estimates of morbidity and health care would not have differed much if the fieldwork had stopped at the initial mail or telephone stage.
Ozone Depletion, History and politics. Brien Sparling. Accessed on 2002-11-27. "Ground based measurements of Ozone were first started in 1956, in at Halley Bay, Antarctica. Satellite measurements of ozone started in the early 70's, but the first comprehensive worldwide measurements started in 1978 with the Nimbus-7 satellite. Nimbus-7 carried a TOMS (total ozone mapping spectrometer, and a SBUV(solar backscatter UV meter). The TOMS finally broke on May 7th,1993, but today there are several different satellites measuring concentrations of ozone and other atmosheric gases. Gases in the troposphere and lower stratosphere are sampled by weather balloons or by airplanes such as the ER-2 managed by NASA." www.nas.nasa.gov/About/Education/Ozone/history.html
Applying evidence to the individual patient. S. E. Straus and D. L. Sackett. Ann Oncol 1999:10(1);29-32.
The Effect of School Dropout Rates on Estimates of Adolescent Substance Use among Three Racial/Ethnic Groups. Randall C. Swaim, F Beauvais, EL Chavez and ER Oetting. American Journal of Public Health 1997:87(1);51-55. ABSTRACT: OBJECTIVES: This study examined, across three racial/ethnic groups, how the inclusion of data on drug use of dropouts can alter estimates of adolescent drug use rates. METHODS: Self-report rates of lifetime prevalence and use in the previous 30 days were obtained from Mexican American, White non-Hispanic, and Native American student (n = 738) and dropouts (n = 774). Rates for the age cohort (students and dropouts) were estimated with a weighted correction formula. RESULTS: Rates of use reported by dropouts were 1.2 to 6.4 times higher than those reported by students. Corrected rates resulted in changes in relative rates of use by different ethnic groups. CONCLUSIONS: When only in-school data are available, errors in estimating drug use among groups with high rates of school dropout can be substantial. Correction of student-based data to include drug use of dropouts leads to important changes in estimated levels of drug use and alters estimates of the relative rates of use for racial/ethnic minority groups with high dropout rates.
Physicians' reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. K. M. Taylor, R. G. Margolese and C. L. Soskolne. N Engl J Med 1984:310(21);p1363-7. We studied the reasons surgical principal investigators chose not to enter patients in a large, multicenter trial sponsored by a cooperative group. In 1976 the National Surgical Adjuvant Project for Breast and Bowel Cancers (NSABP) initiated a clinical trial to compare segmental mastectomy and postoperative radiation, or segmental mastectomy alone, with total mastectomy. Because the low rates of accrual were threatening to close the trial prematurely, we mailed a questionnaire to the 94 NSABP principal investigators, asking why they were not entering eligible patients in the trial. A response rate of 97 per cent was achieved. Physicians who did not enter all eligible patients offered the following explanations: (1) concern that the doctor-patient relationship would be affected by a randomized clinical trial (73 per cent), (2) difficulty with informed consent (38 per cent), (3) dislike of open discussions involving uncertainty (22 per cent), (4) perceived conflict between the roles of scientist and clinician (18 per cent), (5) practical difficulties in following procedures (9 per cent), and (6) feelings of personal responsibility if the treatments were found to be unequal (8 per cent). Further investigation into the behavioral aspects of the investigator-patient relationship is particularly pressing, since fear of change in this relationship was the most common reason given for not entering eligible patients in the trial.
Representation of older patients in cancer treatment trials. EL Trimble, CL Carter, D Cain, B Freidlin, RS Ungerleider and MA Friedman. Cancer 1994:74(7);2208-14. ABSTRACT: In 1990, the five leading causes of cancer death in men aged 65 and older were carcinomas of the lung, prostate, colon and rectum, and pancreas, and leukemia. For women in this age group, the five leading causes of cancer death were carcinomas of the lung, breast, colon and rectum, pancreas, and ovary. To determine the representation of the elderly in clinical trials, the 1992 accrual of the National Cancer Institute (NCI)-sponsored Clinical Cooperative Group treatment trials (which included more than 8000 elderly patients) for the aforementioned sites was compared with the 1990 incidence data from the NCI's Surveillance, Epidemiology, and End Results program. Of the male patients enrolled in the trials, an average of 39% were older than 65 (47.3% lung, 79.5% prostate, 47.5% colorectal, 45.6% pancreas, and 9.6% leukemia); whereas 25.9% of all women enrolled in trials were 65 or older (43.6% lung, 17.3% breast, 46.2% colorectal, 59.6% pancreas, and 35.4% ovary). With respect to incidence, older patients generally are underrepresented in cancer treatment trials. With the exception of the data on prostate cancer, each of the comparisons using the Z statistic gave probability values of less than 0.01. The most significant discrepancies between incidence and participation in cancer treatment protocols were noted for leukemia in males and breast cancer in females. 
Possible explanations for these findings include (1) a research focus on aggressive therapy, which may be unacceptably toxic to the elderly; (2) presence of comorbidity in the elderly; (3) fewer trials available specifically aimed at older patients; (4) limited expectations for long term benefits on the part of physicians, relatives, and the patients themselves; and (5) a lack of financial, logistic, and social support for the participation of elderly patients in clinical trials. Recognizing this situation, NCI recently sponsored a number of trials that specifically target the elderly. This paper describes the status of all major Phase II and III clinical trials that recently were closed, still are active, or now are in review that address the clinical care of this important segment of the U.S. population.
Are Subjects in Pharmacological Treatment Trials of Depression Representative of Patients in Routine Clinical Practice? M. Zimmerman, J.I. Mattia and Michael A. Posternak. American Journal of Psychiatry 2002:159(3);469-473. OBJECTIVE: The methods used to evaluate the efficacy of antidepressants differ from treatment for depression in routine clinical practice. The rigorous inclusion/exclusion criteria used to select subjects for participation in efficacy studies potentially limit the generalizability of these trials' results. It is unknown how much impact these criteria have on the representativeness of subjects in efficacy trials. This study estimated the proportion of depressed patients treated in routine clinical practice who would meet standard inclusion/exclusion criteria for an efficacy trial. METHOD: A total of 803 individuals, aged 16-65 years, who were seen at intake at an outpatient practice underwent a thorough diagnostic evaluation, including the administration of semistructured diagnostic interviews; 346 patients had current major depression. Common inclusion/exclusion criteria used in efficacy studies of antidepressants were applied to the depressed patients to determine how many would have qualified for an efficacy trial. RESULTS: Approximately one-sixth of the 346 depressed patients would have been excluded from an efficacy trial because they had a bipolar or psychotic subtype of depression. The presence of a comorbid anxiety or substance use disorder, insufficient severity of depressive symptoms, or current suicidal ideation would have excluded 86.0% (N=252) of the remaining 293 outpatients with nonpsychotic unipolar major depressive disorder from an antidepressant efficacy trial. CONCLUSIONS: Subjects treated in antidepressant trials represent a minority of patients treated for major depression in routine clinical practice.
These results show that antidepressant efficacy trials tend to evaluate a subset of depressed individuals with a specific clinical profile.
What are the characteristics of general practitioners who routinely do not return postal questionnaires: a cross sectional study. Nigel Stocks, David Gunnell. J Epidemiol Community Health 2000; 54:940-941.
Assessing the generalizability of smoking studies. Hughes JR, Giovino RM, Flore MC. Addiction 1997; 92:469-472.
Intention-to-treat principle. Victor M. Montori, Gordon H. Guyatt. CMAJ 2001;165(10):1339-41. http://www.cma.ca/cmaj/vol-165/issue-10/1339.asp
A comparison of cigarette smokers recruited through the Internet or by mail. Jean-Francois Etter and Thomas V Perneger. International Journal of Epidemiology 2001; 30:521-525.
Summary - Who was left out?
Exclusion of subjects can make the study biased or less generalizable.
4.1 Who was excluded at the start of the study? Excessively strict entry criteria in a research study can make it difficult to extrapolate to the types of patients that you normally see.
4.2 Who dropped out during the study? A large number of drop-outs during the course of a research study can bias the final conclusions.
How much did things change?
Introduction
It's not enough just to assess statistical significance in a study. You also need to make sure that the difference has a practical impact, that it represents a clinically relevant outcome, and that there were enough patients to provide reasonable precision.
When you are looking at how much things changed, ask yourself the following questions:
- Did the authors measure the right thing?
- Did the authors measure the outcome well?
- Was the change clinically significant?
- Were there enough subjects?
Case study: Non-steroidal anti-inflammatory drugs
A 1987 study of non-steroidal anti-inflammatory drugs (NSAIDs) showed that patients who took these drugs were 50% more likely to develop upper gastrointestinal (UGI) bleeding. This increase was statistically significant at alpha=.05. UGI bleeding, however, was rare in both groups: only 1 case per thousand person years in the controls and 1.5 in the NSAID group. If you see 100 patients a year, you would have to wait two decades, more or less, in order to see one excess event of bleeding, on average.
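The arithmetic behind the two-decade figure can be sketched in a few lines of Python. The rates are the ones quoted above; the function name, and the simplification that every patient you see contributes one person-year of exposure, are illustrative assumptions rather than anything taken from the study.

```python
def years_until_one_excess_event(rate_control, rate_exposed, patients_per_year):
    """Expected years of practice before one excess event appears.

    Rates are events per person-year. Assumes each patient seen
    contributes one person-year of exposure (an illustrative simplification).
    """
    excess_rate = rate_exposed - rate_control          # excess events per person-year
    excess_per_year = excess_rate * patients_per_year  # excess events per year of practice
    return 1 / excess_per_year


relative_risk = 1.5 / 1.0  # the "50% more likely" reported by the study
years = years_until_one_excess_event(1.0 / 1000, 1.5 / 1000, 100)
print(f"Relative risk: {relative_risk:.1f}")
print(f"Years until one excess bleed, on average: {years:.0f}")
```

The same relative risk applied to a common event would give a very different answer, which is why the absolute rates matter as much as the ratio.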
In this article, the authors were up front about the very small increase in risk. Most authors, however, are so relieved to achieve statistical significance that they forget to consider whether the size of the difference will improve clinical practice.
This is summarized well in the following Gertrude Stein quote: "For a difference to be a difference it has to make a difference."
Did the authors measure the right outcome?
There is a tendency to focus on intermediate measures that are easy to assess, but which may or may not be predictive of more important endpoints. Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks. If an intermediate endpoint is used, ask yourself whether there is an adequate link between this endpoint and something that is relevant to your patients.
Consider, for example, a study (Leeson et al 2001) that showed an association between duration of breast feeding and brachial artery distensibility at 20 to 28 years of age. This is a measure of stiffness, and could be considered a surrogate marker for cardiovascular disease in mid and later life. Such a link is tenuous, and the authors themselves, as well as an accompanying editorial (Booth 2001), admit that no cause and effect relationship between breast feeding and heart disease has been established.
Typically patients are interested in only three things: morbidity, mortality, and quality of life. They don't care about the concentration of homocysteine in their blood, or what their CD4 cell count is. They want answers to more fundamental questions like "will I die?" or "will I be able to walk up a flight of stairs unassisted?"
Unvalidated measures
Jadad and Gagliardi (1998) criticize instruments used to rate web sites for the quality of health information. There were 47 such instruments but only 14 discussed how they were created. None of them included measures of validity, which caused these authors to conclude that
"Many incompletely developed instruments to evaluate health information exist on the Internet. It is unclear, however, whether they should exist in the first place, whether they measure what they claim to measure, or whether they lead to more good than harm."
Validity is a loaded word that means different things to different people. A general consensus, though, is that a measure is valid to the extent that it measures the thing that it claims to measure and does not mix in things that are unrelated. There are several ways to measure validity, but most of these involve comparison to an external standard.
Short term measures
As noted in the introduction, a good measure of the effectiveness of an intervention for schizophrenia should wait at least six months from the start of therapy. Unfortunately, the typical study lasted 6 weeks or less.
This is a problem for many studies where budgetary limitations force the researchers to focus on short term outcomes. The problem with this is that it is usually easier to get a short term change, especially with interventions that involve behavioral changes (e.g., weight loss through the use of diet and exercise). It is the long term change, however, which is relevant in most cases.
Other issues
Be careful that you don’t focus solely on the outcomes mentioned in the abstract. There is a tendency to report only in the abstract the outcome measures that were statistically significant, rather than the outcome measures most of interest to health care professionals.
Also, always consider whether the researchers adequately assessed side effects.
Did the authors measure the outcome well?
Research is messy and difficult, so it is not always possible to obtain careful and precise measurements. To what extent are the measurements imprecise and subjective?
Measurement error
Measurement error is simply the inability to measure an important variable accurately. Measurement error in the outcome variable does not ordinarily cause bias, but measurement error in factors that can predict the outcome is of serious concern.
There are several ways to assess dietary fat intake. The most accurate (and also the most costly) way is through the use of prospectively recorded food diaries.
Sometimes the cost limitations or the retrospective nature of a research study will require a less accurate assessment of dietary fat, such as through an interview. Shapiro (1997) points out that estimation of dietary fat using interviews tends to correlate poorly with estimation using prospective diaries. This would cast doubt, for example, on retrospective studies that tried to associate dietary fat intake with the risk of breast cancer.
Retrospective data
Retrospective data are data that are collected by looking backwards in time. We obtain these data by asking subjects to recall events that occurred earlier in their lives. We also get retrospective data when we review medical records, birth certificates, death certificates, or other sources of historical data. In contrast, data collected during the course of the study are known as prospective data.
Retrospective data are often inexpensive to collect, but you should be concerned about their accuracy. The ability of a subject to recall information is sometimes affected by which group they are in.
Women who have experienced miscarriages, for example, are more likely to search for and remember events that they feel might "explain" their miscarriage, much more so than a group of comparable control subjects. This differential level of reporting is known as recall bias.
In addition, historical data are often incomplete and it is sometimes difficult to verify their accuracy. Therefore, retrospective data are considered less authoritative than prospective data.
An example of recall bias.
An interesting review of the research process that helped establish that smoking causes lung cancer can be found in Gail (1996). One aspect of the research process was addressing the issue of recall bias.
Doll (1950) studied the association between tobacco smoking and cancer. They selected 709 patients with lung cancer and an equal number of matched controls. The authors were concerned about the retrospective assessment of smoking among patients in both groups. Would patients with lung cancer exaggerate the amount of smoking? Would the interviewers press harder for information about smoking among the cancer patients?
While it would be impossible to totally rule out recall bias, the authors did examine a third group, patients who were diagnosed with lung cancer and who later found out that they suffered from a different disease (false cases). If recall bias were the sole explanation of the difference in reported smoking, then the group of false cases should have reported a level of smoking similar to that of the lung cancer patients. Instead they reported a lower level of smoking. This helped to rule out the possibility that recall bias alone accounted for the higher reported smoking levels in the lung cancer patients.
Confusing causes and effects
Another difficulty with retrospective data is that you may not be able to identify which was the cause and which was the effect. Causes have to occur before and effects have to occur after, but when you examine causes and effects retrospectively, you may end up losing information about timing.
There's an old joke about a statistician who was examining the fire department records, including information about how much damage the fire caused, and how many fire engines responded to the blaze. The statistician noticed a strong relationship between the two variables and concluded that the more fire engines you send, the more damage they cause.
The British Medical Journal highlighted a research study where speech patterns were recorded in two groups of surgeons. The first group had two or more malpractice claims filed against them and the second group had none. There was a large difference between the two groups, with the first group having a dominant tone with less concern for the patient. The news report of this research suggested that
"dominance coupled with a lack of anxiety in the voice may imply surgeon indifference and lead a patient to launch a malpractice suit when poor outcomes occur." -- bmj.com/cgi/content/full/325/7359/297/a
One reader, however, pointed out that perhaps
"being sued is a brutalising and demoralising experience and that this experience fundamentally changes the attitude of doctors towards their patients." -- bmj.com/cgi/eletters/325/7359/297/a#24658
Measurements without established reliability
Reliability means different things in different fields, but the general concept is that a reliable measurement is one that would stay about the same if it were repeated under similar circumstances. Depending on the context, you would establish reliability differently. For example, one way to establish reliability is to have two people make independent assessments and show a good level of agreement. If you are measuring something that is stable over time, then you could take two measurements on different days or weeks and see how well they agree.
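One way to put a number on the agreement between two independent assessors is Cohen's kappa, which corrects the raw percentage of agreement for the agreement you would expect by chance alone. The text above does not prescribe a particular statistic, so treat kappa as one illustrative choice; the function name and the ratings below are made up for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected if the raters assigned categories independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten X-rays by two readers (made-up data).
reader_1 = ["abnormal", "normal", "normal", "abnormal", "normal",
            "normal", "abnormal", "normal", "normal", "normal"]
reader_2 = ["abnormal", "normal", "abnormal", "abnormal", "normal",
            "normal", "normal", "normal", "normal", "normal"]
print(f"kappa = {cohens_kappa(reader_1, reader_2):.2f}")
```

In this made-up example the readers agree on 8 of 10 films, but because both call most films normal, a good part of that agreement would occur by chance, and kappa comes out well below 0.8.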
Be especially careful about measurements that have some level of subjectivity. If there is no establishment of reliability for these measures, then you have no assurance that the research is repeatable.
Was the change clinically significant?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science." William Thomson Kelvin (Lord Kelvin)
Knowing that a new therapy is better is not enough information. You need to quantify how much better the new therapy is. In this respect, confidence intervals are better than p-values. A p-value tells you whether the new therapy is better. A confidence interval tells you whether the new therapy is better and by how much. A confidence interval allows you to balance the size of the improvement against the possibility of greater cost or more side effects. Many journals now require confidence intervals instead of p-values.
Statistical methods are sometimes able to detect differences that are so small as to be meaningless from any practical perspective. This is known as statistical significance without clinical significance. Always put the numbers into the perspective of your practice. Try to estimate how many of the patients you see within a year are likely to perform better under the new therapy.
Murray and Teasdale (2000) and Roberts et al (2000) debate the clinical relevance of a (theoretical) intervention that helps an additional one person out of 10. Does helping "only" one out of every ten patients justify the extra time or money involved? Does it justify an increase in the risk of side effects?
Assessing clinical significance requires clinical judgment. It also needs to factor in preferences of individual patients. It's not easy, and the authors of the research paper should (but usually don't) provide you with their thoughts on clinical significance.
In some studies, however, clinical significance is not important. When you are trying to see if a certain physiologic mechanism can explain why a new therapy works, you just want to know if the mechanism exists or not.
Were there enough subjects?
Every research study, especially negative studies, should justify the sample size chosen. It is unethical to perform research on humans or animals without first demonstrating that the sample size you have chosen is appropriate.
Justification of sample size is particularly important for a negative study (one where no difference between the standard and new therapies was found) and in studies assessing the equivalence of two therapies.
How can you tell if the sample size is too small?
Ideally, the authors should provide justification of the sample size in the paper itself. The justification is considered better if it is made a priori (prior to the start of the data collection). If no justification of sample size (e.g., power calculations) is given, examine the width of the confidence intervals. Very wide intervals indicate an inadequate sample size.
There are many examples of studies with inadequate sample sizes.
A revealing study of inadequate sample size appears in Freiman 1992. In a series of 71 publications appearing between 1960 and 1977, the outcome was either percent mortality, percent complications, or a similar outcome that could be measured as a percentage. The authors examined power, the ability of the study to detect either a moderate improvement (25% relative reduction in the outcome) or a large improvement (50% relative reduction in the outcome). For example, if a study showed a 40% mortality in the controls, then a 30% mortality rate in the treated group would be considered a moderate improvement and a 20% mortality rate would be considered a large improvement.
The results of the Freiman study were very disappointing.
Of the 71 papers, 57 had greater than a 50% chance of missing a moderate improvement and 31 had a 50% or greater chance of missing a large improvement.
One wonders why anyone would undertake a study when there is such a high probability of failure. You should never initiate a study unless you know that the chance of missing a reasonable improvement is less than 20%.
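To get a feel for what these power figures mean, here is a minimal sketch of a power calculation for comparing two proportions, using the 40% versus 30% (moderate) and 40% versus 20% (large) mortality example. It relies on a textbook normal approximation with equal group sizes; this is an assumption for illustration and not necessarily the method Freiman used, and the group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Normal approximation with equal group sizes; the negligible chance of
    a significant result in the wrong direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treated) / 2
    # Standard error of the difference under the null and the alternative.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p_control * (1 - p_control)
                   + p_treated * (1 - p_treated)) / n_per_group)
    return z.cdf((abs(p_control - p_treated) - z_crit * se_null) / se_alt)


for n in (50, 100, 356):  # hypothetical numbers of patients per group
    moderate = power_two_proportions(0.40, 0.30, n)
    large = power_two_proportions(0.40, 0.20, n)
    print(f"n={n:3d} per group: power {moderate:.0%} (moderate), {large:.0%} (large)")
```

Under these assumptions, a trial with 50 patients per group would miss the moderate improvement far more often than it would detect it, and it takes roughly 350 patients per group to bring the chance of a miss down to about 20%.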
Special issues in a study of equivalency.
Some studies attempt to show not that a new therapy is superior to the standard therapy, but that it is equivalent. Showing equivalence requires a very careful assessment of sample size.
An example of an equivalence study is when a drug company tests a generic drug and wishes to show equivalence with the (presumably more expensive) brand name drug.
If we applied the traditional testing approach, the company would have a strong disincentive to design the study with an adequate sample size. A small sample size is more likely to show equivalency under the traditional testing framework.
There are several modifications to the traditional testing framework for equivalency studies. The simplest approach uses a confidence interval for the ratio of the outcome under the new therapy to the outcome under the standard therapy. If both limits of the confidence interval are reasonably close to 1 (e.g., no less than 0.8 and no more than 1.25) then the two therapies are considered equivalent.
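The confidence interval rule is simple enough to write down directly. The sketch below uses the 0.8 and 1.25 limits mentioned above as defaults; the function name and the two example intervals are hypothetical.

```python
def is_equivalent(ci_lower, ci_upper, low_limit=0.8, high_limit=1.25):
    """True when the whole confidence interval for the ratio
    (new therapy / standard therapy) lies inside the equivalence limits."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return low_limit <= ci_lower and ci_upper <= high_limit


# Hypothetical confidence intervals for the ratio of outcomes (made-up numbers).
print(is_equivalent(0.92, 1.10))  # narrow interval from a large study
print(is_equivalent(0.60, 1.60))  # wide interval from a small study
```

Notice how this reverses the incentive: a small study produces a wide interval that spills outside the limits, so it can no longer "show" equivalence simply by failing to find a difference.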
Summary - How much did things change?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
Was there a quantitative measure of the size of the effect? Look for a confidence interval and compare the size of the effect to what you would expect to see in your practice.
Could other factors account for this effect? Look for differences in demographics between the two groups and ask if these differences could explain the results of the research.
Were any important outcomes forgotten? Research results should focus on endpoints that are of interest to your patients.
Additional resources
Does the duration of breast feeding matter? Ian Booth. BMJ 2001; 322: 625-626.
Duration of breast feeding and arterial distensibility in early adult life: population based study. C P M Leeson, M Kattenhorn, J E Deanfield, and A Lucas. BMJ 2001; 322: 643-647.
Rating health information on the Internet: navigating to knowledge or to Babel? A. R. Jadad and A. Gagliardi. JAMA 1998; 279(8): 611-4.
How well is the clinical importance of study results reported? An assessment of randomised controlled trials. Chan KBY, Man-Son-Hing M et al. CMAJ 2001;165:197-202.
Trials in head injury are more complex than review suggests. Gordon D Murray, Graham M Teasdale. BMJ 2000; 321: 1223.
Authors' reply. Ian Roberts, Frances Bunn, Reinhard Wentz, Phil Edwards. BMJ 2000; 321: 1223.
Dulcet tones of a surgeon's voice may have a hidden meaning. Roger Dobson. BMJ 2002; 325: 297. bmj.com/cgi/content/full/325/7359/297/a
Cause and effect 2. Douglas N. Salmon. Electronic response to: Dulcet tones of a surgeon's voice may have a hidden meaning. Roger Dobson. BMJ 2002; 325: 297. Accessed on 2002-11-29. bmj.com/cgi/eletters/325/7359/297/a#24658
Chapter 6. Case studies.
In this section, we will apply the techniques discussed in the previous five sections to two research papers. The first paper is a study of Vitamin C as a treatment for advanced cancer. The second is a study of nicotine patch therapy in adolescent smokers.
Case study #1. Vitamin C therapy and cancer.
This example is highlighted in Chapter 1 of Observational Studies by Paul R. Rosenbaum. Cameron and Pauling published an observational study of Vitamin C as a treatment for advanced cancer. For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).
Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital." (Cameron 1976).
Ten years later, the Mayo Clinic conducted a randomized experiment which showed no statistically significant effect of Vitamin C (Moertel 1985).
What went wrong with the Cameron and Pauling study?
The problem with the Cameron and Pauling paper becomes obvious when you ask "Who did the choosing?" Controls were recruited from death certificates. The authors estimated survival time by retrospectively estimating the date at which a prognosis of terminal cancer was made. The Vitamin C group was recruited from people freshly diagnosed with terminal cancer. The authors estimated survival time prospectively from the time therapy was started.
The Cameron and Pauling study had two major flaws. First, the controls and the Vitamin C group differed on a major prognostic factor. No matter how grim the prognosis was in the Vitamin C group, it can’t compare to the prognosis of someone who has already died. Second, the controls were evaluated differently. Their survival times were estimated retrospectively; the Vitamin C survival times were estimated prospectively.
Case Study #2: Nicotine Patch Therapy in Adolescent Smokers.
The Children's Mercy Hospital Journal Club discussed a paper on nicotine patches (Smith 1996).
The authors recruited 22 volunteers from five public high schools in the Rochester, MN area for participation in a smoking cessation program involving behavioral counseling, group therapy, and nicotine patches. Researchers measured the number of cigarettes smoked, side effects, and blood levels of nicotine.
The purpose of the research was to evaluate "the safety, tolerance, and efficacy of 22 mg/d nicotine patch therapy in smokers younger than 18 years who were trying to stop smoking." The authors also listed a secondary goal, "to compare blood cotinine levels, nicotine withdrawal scores, and adverse experiences with those of adults obtained in previous patch studies." Cotinine is a metabolite of nicotine and provides a useful objective measure of cigarette smoking. It also allowed the authors to examine whether nicotine toxicity was an issue.
Who did the choosing?
Was there a good comparison group? If a comparison of smoking rates with historical controls had been done, it would have been problematic because of the timing issue.
Did the authors create the groups? There was not a well defined control group in this study. Smoking cessation rates could have been compared to historical smoking cessation rates, but this was not done. Perhaps the cessation rates in this study were so poor that no comparison was necessary. Blood cotinine levels were compared to those of adults in an inpatient smoking cessation program. The groups failed to overlap on age and on patient status (all the adults were in-patient and all the teenagers were out-patient).
Was the assignment randomized? This design did not allow for randomization.
Was there a plan?
Were there enough subjects? Unfortunately, there was no assessment of adequacy of sample size, although the authors claimed there were no major side effects. Was this because the patch is safe, or because they did not study enough subjects to find major side effects?
The authors did provide confidence intervals for smoking cessation rates. At eight weeks, they computed a 14% success rate (95% CI 2.9 to 34.9). At six months, they computed a 4.5% success rate (95% CI 0.1 to 22.8). While these are not the narrowest intervals, they are sufficiently narrow to rule out the possibility of any large rates of smoking cessation. Thus, from the perspective of efficacy, the sample size was probably sufficient.
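If you want to see where intervals like these come from, the sketch below computes an exact (Clopper-Pearson) binomial confidence interval by bisection. Two things here are assumptions: the counts of 3 of 22 and 1 of 22 are inferred from the reported percentages, and the exact method is a guess at how the intervals were computed, since the method is not stated here.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_threshold(condition, steps=60):
    """Find the point in (0, 1) where condition(p) switches from False to True."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if condition(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def exact_binomial_ci(successes, n, conf=0.95):
    """Clopper-Pearson confidence interval for a binomial proportion."""
    alpha = 1 - conf
    if successes == 0:
        lower = 0.0
    else:
        # Smallest p for which seeing this many successes or more is not too unlikely.
        lower = bisect_threshold(
            lambda p: 1 - binom_cdf(successes - 1, n, p) >= alpha / 2)
    if successes == n:
        upper = 1.0
    else:
        # Largest p for which seeing this few successes or fewer is not too unlikely.
        upper = bisect_threshold(
            lambda p: binom_cdf(successes, n, p) <= alpha / 2)
    return lower, upper


# Counts inferred from the reported 14% and 4.5% of 22 subjects (an assumption).
for label, quitters in (("eight weeks", 3), ("six months", 1)):
    low, high = exact_binomial_ci(quitters, 22)
    print(f"{label}: {quitters}/22 = {quitters / 22:.1%}, "
          f"95% CI {low:.1%} to {high:.1%}")
```

With these inferred counts the results should come out close to the intervals quoted above, which is a useful check that you understand how the published numbers were derived.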
Did the research have a narrow focus? In general, the authors did well to keep a narrow focus. The assessment of side effects is always troublesome, and here the authors noted three types of skin reactions (erythema only, erythema and edema, and erythema and vesicles), headaches, nausea and vomiting, tiredness, dizziness, arm pain, shortness of breath, pyelonephritis (kidney infection), abdominal pain, back pain, fever, cough, flu, diarrhea, shakiness, and depression. While this list of side effects is very long, it would be difficult to shorten it much. The authors did note that none of the reported side effects were serious.
Did the authors deviate from the plan? There are no stated deviations from the protocol.
Did the authors discard outliers? There were no efforts to exclude outliers from any statistical analysis.
Who knew what when?
Was the new therapy indistinguishable from the standard therapy? The study was not blinded, even though a blinded study (using a placebo patch) was possible. This is a major disappointment, but perhaps reflects the preliminary nature of this research. The lack of blinding implies that results presented on safety, tolerance, and efficacy could be accounted for by a placebo effect.
Was the randomization plan known prior to selecting patients? This design did not allow for randomization.
Did the authors rely on retrospective data? The authors did ask students each week to recall their cigarette smoking over the previous week. This time frame was short enough to avoid problems with recall bias. Furthermore, the authors also included exhaled carbon monoxide levels as an objective measure to validate the self-reported data.
Who was left out?
Who was excluded at the start of the study? The Rochester MN location excluded minority students. All the subjects in this study were white. Subjects had to get parental permission, excluding smokers who wished to keep their habit secret from their parents. Subjects were also volunteers, and thus could be considered more motivated to quit than the typical teenage smoker.
Who dropped out during the study? The study had a serious drop out rate. Of the presumably thousands of teenage smokers in the Rochester, Minnesota area, only 71 volunteers responded to the initial call for subjects. Of the 71 volunteers, 55% met inclusion criteria. Of the remaining 39, 44% declined to attend the initial meeting. Of the remaining 22, 14% were non-compliant. Of the remaining 18, 39% failed to respond to the one year survey. Only 11 completed the entire study (50% of those who started the study; 28% of those meeting inclusion criteria; 15% of the initial volunteers).
Fortunately, noncompliant subjects were treated as if they were still smoking (intention to treat). The researchers also took the trouble to characterize the noncompliant subjects and showed that they did not drop out because of any side effects of the nicotine patch.
Were volunteers used? The subjects of the research study were all volunteers. Volunteers could be expected to be more motivated to quit than typical adolescent smokers.
How much did things change?
Was there a quantitative measure of the size of the effect? To their credit, the authors did provide confidence intervals. At eight weeks, they computed a 14% success rate (95% CI 2.9 to 34.9). At six months, they computed a 4.5% success rate (95% CI 0.1 to 22.8). But no attempt was made to compare these cessation rates with historical rates.
Could other factors account for this effect? No direct comparisons were made for efficacy. Indirect comparisons were made between side effects experienced by the teenagers and the side effects experienced by adults. If there were an age effect (e.g., the older you are, the more likely you are to report side effects), then this could be a problem.
Were any important outcomes forgotten? While the major outcome (smoking cessation rate) was not totally overlooked, it was subordinated to the study of side effects. Ninety percent of the paper focused on the side effects data. Nowhere did the authors mention that the long term success rate (4.5%) was substantially less than what could be hoped for.
Summary of all five sections
The following questions will help you assess the quality of a journal article about a new therapy.
1. Who did the choosing?
1.1 Was there a good comparison group?
1.2 Did the authors create the groups?
1.3 Was the assignment randomized?
2. Was there a plan?
2.1 Were there enough subjects?
2.2 Did the research have a narrow focus?
2.3 Did the authors deviate from the plan?
2.4 Did the authors discard outliers?
3. Who knew what when?
3.1 During the study, did the patients know what group they were in?
3.2 At the start of the study, did the patients know what group they were going to be in?
3.3 Did the authors rely on retrospective data?
4. Who was left out?
4.1 Who was excluded at the start of the study?
4.2 Who dropped out during the study?
4.3 Were volunteers used?
5. How much did things change?
5.1 Was there a quantitative measure of the size of the effect?
5.2 Could other factors account for this effect?
5.3 Were any important outcomes forgotten?
Special guidelines for overviews and meta-analyses
Introduction
Meta-analysis is the quantitative pooling of data from two or more studies. When you are examining the results of a meta-analysis, you should ask the following questions:
Were apples combined with oranges? Heterogeneity among studies may make any pooled estimate meaningless.
Were all of the apples rotten? The quality of a meta-analysis cannot be any better than the quality of the studies it is summarizing.
Were some apples left on the tree? An incomplete search of the literature can bias the findings of a meta-analysis.
Did the pile of apples amount to more than just a hill of beans? Make sure that the meta-analysis quantifies the size of the effect in units that you can understand.
Declining sperm counts
In 1992, the British Medical Journal published a controversial meta-analysis. This study (BMJ 1992: 305(6854); 609-13) reviewed 61 papers published between 1938 and 1991 and showed that there was a significant decrease in sperm count and in seminal volume over this period of time. For example, a linear regression model on the pooled data provided an estimated average count of 113 million per ml in 1940 and 66 million per ml in 1990.
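To get a feel for what those two fitted values imply, here is a small arithmetic sketch. Only the two fitted counts and their years come from the text; the function name is hypothetical and the calculation is just the slope of a straight line through two points.

```python
def linear_trend(year_start, value_start, year_end, value_end):
    """Slope and relative change implied by two points on a fitted straight line."""
    slope = (value_end - value_start) / (year_end - year_start)
    relative_change = (value_end - value_start) / value_start
    return slope, relative_change

# Fitted values quoted above: 113 million per ml in 1940, 66 million per ml in 1990.
slope, relative_change = linear_trend(1940, 113.0, 1990, 66.0)
print(f"Change per year: {slope:.2f} million per ml")
print(f"Change over the whole period: {relative_change:.0%}")
```

A drop of a bit less than one million per ml per year sounds modest, but it adds up to a decline of roughly two fifths over the half century.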
Several researchers (Fertil Steril 1996: 65(5); 1044-6 and Fertil Steril 1995: 63(4); 887-93) noted heterogeneity in this meta-analysis, a mixing of apples and oranges. Studies before 1970 were dominated by studies in the United States and particularly studies in New York. Studies after 1970 included many other locations including third world countries. Thus the early studies were United States apples. The later studies were international oranges. There was also substantial variation in collection methods, especially in the extent to which the subjects adhered to a minimum abstinence period.
The original meta-analysis and the criticisms of it highlight both the greatest weakness and the greatest strength of meta-analysis.
Meta-analysis is the quantitative pooling of data from studies with sometimes small and sometimes large disparities. Think of it as a multi-center trial where each center gets to use its own protocol and where some of the centers are left out.
On the other hand, a meta-analysis lays all the cards on the table. Sitting out in the open are all the methods for selecting studies, abstracting information, and combining the findings. Meta-analysis allows objective criticism of these overt methods and even allows replication of the research.
Contrast this to an invited editorial or commentary that provides a subjective summary of a research area. Even when the subjective summary is done well, you cannot effectively replicate the findings. Since a subjective review is a black box, the only way, it seems, to repudiate a subjective summary is to attack the messenger.
Meta-analysis is used in a variety of different areas. Vine et al (Fertil Steril 1994: 61(1); 35-43) used meta-analysis to study the relationship between smoking and sperm concentration. Oehninger et al (Hum Reprod Update 2000: 6(2); 160-8) assessed the utility of sperm function assays in predicting successful outcomes in IVF. Goldberg et al (Fertil Steril 1999: 72(5); 792-5) compared intrauterine and intracervical insemination with frozen donor sperm. Evers et al (Cochrane 2001: 1CD000479) reviewed the effectiveness of varicocelectomy in subfertile men.
Were apples combined with oranges?
Meta-analyses should not have inclusion criteria that are too broad. Including too many studies can lead to problems with "apples-to-oranges" comparisons. For example, when you are studying the effect of cholesterol lowering drugs, it makes no sense to combine a study of patients with recent heart attacks with another study of patients with high cholesterol but no previous heart attacks.
There is a lot of variability in how research is conducted. Even in carefully controlled randomized control trials, researchers have tremendous discretion (Am J Med 1987: 82(3); 498-510). Sometimes this discretion creates heterogeneity among studies, making it difficult to combine the studies.
Heterogeneity in the composition of the treatment and control groups
Researchers can differ in the inclusion and exclusion criteria.
Even if these criteria do not differ, there may still be differences in the baseline levels of health in the patients, due to geographical differences in the patient population.
The controls could be selected independently, or they could be matched to the treatment group subjects.
The control subjects could be given no treatment, a placebo, or a standard treatment.
The treatment could differ, such as differences in dose or timing of a drug.
Heterogeneity in the design of the study
The length of follow-up for the patients could differ.
The proportion of patients who drop out could differ as well as the proposed statistical treatment of these dropouts.
Heterogeneity in the management of the patients and in the outcome
How comorbid conditions are treated.
How complications are handled.
How much discretion the patient's physician has in controlling patient care.
The outcome measure itself could differ. For example, Abramson (Public Health Rev 1990: 18(1); 1-47) discusses a meta-analysis of hypertension treatment in the elderly. Some of the studies examined cardiovascular deaths and others examined cardiovascular events. Other studies examined cerebrovascular deaths, cerebrovascular events, cardiac deaths, coronary heart disease deaths, and/or total deaths.
Examples of heterogeneity
In a meta-analysis (BMJ 2002; 324(7340): 757) looking at antiretroviral combination therapy, a plot of duration of trial versus the log odds ratio showed that shorter duration trials of zidovudine had substantial evidence of effect (odds ratios much smaller than 1) but that the longest duration studies had little or no evidence of effect (odds ratios very close to 1).
In a meta-analysis (BMJ 1998: 317(7166); 1105-1110) looking at dust mite control measures to help asthmatic patients, the studies exhibited heterogeneity across several factors. Six studies examined chemical interventions, thirteen examined physical interventions, and four examined a combination approach. Nine of these trials were crossovers, and in the remaining fourteen, there was a parallel control group. Seven studies had no blinding, three studies had partial blinding, and the remaining thirteen studies used a double blind. In nine studies the average age of the patients was only 9 or 10 years, but nine other studies had an average age of 30 or more. Eleven studies lasted eight weeks or less and five studies lasted a full year. You can find a table summarizing these studies on the web.
How to handle heterogeneity
Some level of heterogeneity is acceptable. After all, the purpose of research is to generalize results to large groups of patients. Furthermore, demonstrating that a treatment shows consistent results across a variety of conditions strengthens our confidence in that treatment.
Nevertheless, you should be aware of the problems that excessive heterogeneity can cause. Mixing apples and oranges may not be so bad; you get a fruit salad this way. But when heterogeneity becomes too large, you might end up combining not apples and oranges but apples and onions.
Subgroup analysis
When there is substantial heterogeneity, you can look at and compare subgroups of the studies. In a meta-analysis (BMJ. 2000; 321(7273): 1371-6) studying atypical antipsychotics, the dose of the comparison drug (haloperidol or an equivalent) varied substantially. Among those studies where the dose of haloperidol was greater than 12 mg/day, atypical antipsychotics showed advantages in efficacy or tolerability. When the dose was less than or equal to 12 mg/day, the atypical antipsychotics showed no advantages in these areas.
Meta-regression
You can try to adjust for heterogeneity in a meta-analysis. This would work very similarly to the adjustment for covariates in a regression model. For example, Derry et al (BMJ 2000: 321(7270); 1183-7) used meta-analysis to see if long term aspirin therapy was associated with problems with gastrointestinal hemorrhage. They identified 24 studies that looked at aspirin as a preventive measure against heart attacks. In each of these studies, the rate of gastrointestinal hemorrhage was recorded for both the aspirin group and the placebo or no treatment group. There was substantial heterogeneity in the dosage of aspirin used in the studies, however, with some studies giving as little as 50 mg/day and some as much as 1500 mg/day.
This was actually good news in a way, because the researchers wanted to see if the risk of gastrointestinal hemorrhage was dependent on the dose of aspirin. A plot of the dose versus the risk showed that there was indeed an increased risk but that this risk seemed to be unrelated to the dosage.
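A meta-regression of this kind can be sketched as a weighted regression of each study's log odds ratio on its dose, with more precise studies getting more weight. The numbers below are hypothetical and are not the Derry et al data; the function name and the choice of inverse-variance weights are also my assumptions, not a description of what those authors did.

```python
import math

def weighted_regression(x, y, weights):
    """Weighted least squares fit of y = intercept + slope * x."""
    total_w = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total_w
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total_w
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical studies: (dose in mg/day, odds ratio, standard error of the log odds ratio)
studies = [(50, 1.6, 0.40), (75, 1.7, 0.30), (300, 1.6, 0.25),
           (500, 1.8, 0.35), (1500, 1.7, 0.45)]

doses = [dose for dose, _, _ in studies]
log_odds_ratios = [math.log(odds_ratio) for _, odds_ratio, _ in studies]
weights = [1 / se ** 2 for _, _, se in studies]  # inverse-variance weights

intercept, slope = weighted_regression(doses, log_odds_ratios, weights)
print(f"Odds ratio at zero dose (extrapolated): {math.exp(intercept):.2f}")
print(f"Change in log odds ratio per 100 mg/day: {100 * slope:.4f}")
```

In this made-up data set the intercept is well above zero (an increased risk) while the slope is close to zero (little relationship with dose), which is the same pattern described above.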
Inclusion of very old studies
Dear Professor Mean: When conducting a systematic review how far back should you look? Do you set your exclusion criteria judging on the amount of literature available, or do you limit your search to, say the last 10 years? Hunting Heather
That depends a lot on the topic, don't you think? Anything in the field of neonatology would have to have a very narrow time window because the field has changed so much so rapidly.
Other areas where the practice of medicine has been much more stable could have wider time windows. I've seen several reviews that have covered half a century of studies.
If you do select a wide time window be sure to see if your results are similar if you restrict yourself to just the most recent studies.
Ask yourself if there was a sudden change in technology that makes any comparisons before and after that technology an apples-to-oranges comparison. So, for example, a meta-analysis involving AIDS patients should restrict itself to the years following the use of AZT.
Also, ask yourself if researchers in your area tend to discount any research that is more than X years old. If so, then your meta-analysis would lose credibility among those researchers if it included studies older than X.
Sensitivity analysis
A good approach to heterogeneity is to include a wide range of studies, but then examine the sensitivity of the results by looking at more narrowly drawn subsets of the studies.
The authors have two other options. First, they can weight studies by a quality factor and give greater emphasis to randomized studies, which are less likely to have bias. Second, they can perform sensitivity analyses. Would the results change if we changed the entry criteria?
In general, heterogeneity increases uncertainty, but this uncertainty cannot be reflected in the width of the confidence limits in the meta-analysis results. When there is heterogeneity, the most information may reside not in a single estimate of how effective the treatment is, but in a careful examination of the variation in the treatment under different conditions.
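The idea of a sensitivity analysis can be sketched in a few lines: pool everything, then pool again under narrower entry criteria and see whether the answer moves. The studies below are hypothetical, and the fixed effect (inverse-variance) pooling is just one simple choice I am assuming for illustration.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed effect) pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

# Hypothetical studies: log relative risk, its standard error, design, and follow-up.
studies = [
    {"log_rr": -0.40, "se": 0.20, "randomized": False, "weeks": 6},
    {"log_rr": -0.35, "se": 0.25, "randomized": False, "weeks": 8},
    {"log_rr": -0.10, "se": 0.15, "randomized": True, "weeks": 52},
    {"log_rr": -0.05, "se": 0.10, "randomized": True, "weeks": 26},
    {"log_rr": -0.30, "se": 0.30, "randomized": True, "weeks": 8},
]

def pooled_rr(subset):
    """Pooled relative risk with an approximate 95% confidence interval."""
    pooled, se = pool_fixed_effect([s["log_rr"] for s in subset],
                                   [s["se"] for s in subset])
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))

# Alternative entry criteria to try, from broadest to narrowest.
criteria = {
    "all studies": lambda s: True,
    "randomized only": lambda s: s["randomized"],
    "follow-up over 8 weeks": lambda s: s["weeks"] > 8,
}

results = {name: pooled_rr([s for s in studies if keep(s)])
           for name, keep in criteria.items()}
for name, (rr, low, high) in results.items():
    print(f"{name}: RR {rr:.2f} (95% CI {low:.2f} to {high:.2f})")
```

If the pooled relative risk drifts toward 1 as the criteria tighten, as it does in this made-up example, the broad estimate deserves less trust.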
Were all of the apples rotten?
The quality of a meta-analysis is constrained by the quality of articles that are used in a meta-analysis. Meta-analysis cannot correct or compensate for methodologically flawed studies. In fact, meta-analysis may reinforce or amplify the flaws of the original studies.
Observational studies in a meta-analysis
The use of meta-analysis on observational studies is very controversial. Some experts have argued that the biases inherent in observational studies make a meta-analysis an exercise in mega-silliness. But even those experts who do not take such an extreme viewpoint warn that the current statistical methods for summarizing the results of observational studies may grossly understate the amount of uncertainty in the final result (BMJ 1998: 316(7125); 140-4).
Sensitivity analysis may be a useful way of highlighting the uncertainties in a meta-analysis of observational studies. Restricting the meta-analysis to selective subgroups of the data can yield insight into the size and direction of biases in observational studies. For example, the researchers could contrast case-control designs with cohort designs, with the latter expected to show less bias, in general. Or the researchers could compare retrospective studies to prospective studies, where again, the latter is expected to show less bias in general. Another possibility is to compare studies by the extent to which measurement error is expected to cause problems. In general, researchers should try to stratify the observational studies by known sources of bias.
Meta-analyses of randomized trials
Some meta-analyses restrict their attention to randomized trials because these studies are less likely to have problems with bias. In other words, they wish to avoid mixing bad observational apples with good randomized trial apples. Sometimes further restrictions can be made on the basis of partial or full blinding of results or on the proper accounting of dropouts.
Concato et al (NEJM 2000: 342(25); 1887-1892) evaluated clinical topics where there were publications of both randomized controlled trials and observational studies. In this review, the observational studies produced results quite similar to the randomized studies.
Sensitivity analysis
Even for randomized trials, sensitivity analysis may help. Researchers can use "quality scores" to rate individual studies and then see what happens when studies are restricted to those of highest quality only.
For example, Lucassen et al (BMJ 1998; 316(7144): 1563-9) looked at interventions for infant colic. Although substituting soy milk for cow's milk appeared to have an effect, this effect disappeared when only studies of high methodological quality were considered.
Quality Scores
Many times, the reporting of a study will be inadequate, and this will make it impossible to assess the quality of a study. There is indeed empirical evidence that incomplete reporting is associated with poor quality (JAMA 1995: 273(5); 408-12). In such a case, a "guilty until proven innocent" approach may make sense (BMJ 2001: 323(7303); 42-6). For example, if the authors fail to mention whether their study was blinded, assume that it was not. You might expect that authors are quick to report strengths of their study, but may (perhaps unconsciously) forget to mention their weaknesses. On the other hand, Liberati (J Clin Oncol 1986: 4(6); 942-51) rated the quality of 63 randomized trials, and found that the quality scores increased by seven points on average on a 100 point scale after talking to the researchers over the telephone. So some small amount of ambiguity may relate to carelessness in reporting rather than quality problems.
Another approach is to look at subgroups of studies of a similar design and see if the results are consistent across subgroups. For example, Etminan et al (BMJ. 2003; 327(7407): 128) examined the risk of Alzheimer's disease in users of non-steroidal anti-inflammatory drugs. They identified six cohort studies which showed a combined relative risk of 0.84 (95% CI 0.54 to 1.05) and three case-control studies which showed a much lower combined relative risk, 0.62 (95% CI 0.45 to 0.82).
Meta-analysis of studies with small sample sizes
Some experts advocate great caution in the assessment of meta-analyses where all of the trials consist of small sample size studies. The effect of publication bias can be far more pronounced here than in situations where some medium and large size trials are included.
Were some apples left on the tree?
One of the greatest concerns in a meta-analysis is whether all the relevant studies have been identified. If some studies are missed, this could lead to serious biases.
Intentional exclusion of studies
In any meta-analysis, you have to draw a line somewhere. Studies that fail to meet your criteria will not be included in the results. But this can lead to serious controversy. In a Cochrane Review of mammography (Cochrane 2001: (4); CD001877), seven studies were identified, but only two were of sufficient quality to be used. The Cochrane Review of these two studies reached a negative conclusion, but would have reached an opposite conclusion if the other five studies were added back in (BMJ. 2001; 323(7319): 956).
Publication bias
Many important studies are never published; these studies are more likely to be negative (Dickersin 1990). This is known as publication bias. The inclusion of unpublished studies, however, is controversial (Cook 1993).
Publication bias is the tendency on the parts of investigators, reviewers, and editors to submit or accept manuscripts for publication based on the direction or strength of the study findings. Much of what has been learned about publication bias comes from the social sciences, less from the field of medicine. In medicine, three studies have provided direct evidence for this bias. Prevention of publication bias is important both from the scientific perspective (complete dissemination of knowledge) and from the perspective of those who combine results from a number of similar studies (meta-analysis). If treatment decisions are based on the published literature, then the literature must include all available data that is of acceptable quality. Currently, obtaining information regarding all studies undertaken in a given field is difficult, even impossible. Registration of clinical trials, and perhaps other types of studies, is the direction in which the scientific community should move.
Another aspect of publication bias is that the delay in publication of negative results is likely to be longer than that for positive studies. For example, Stern and Simes 1997 showed that among 130 clinical trials, the median time to publication was 4.7 years among the positive studies and 8.0 years among the negative studies. So a meta-analysis restricted to a certain time window may be more likely to exclude published research that is negative.
Many experts are advocating the registration of trials as a way of avoiding publication bias. If trials are registered prospectively (i.e., prior to data collection and analysis) then they can be included in any appropriate meta-analysis without worry about publication bias.
Duplicate publication
Duplicate publication is the flip side of the publication bias coin. Studies which are positive are more likely to appear more than once in publication. This is especially problematic for multi-center trials where individual centers may publish results specific to their site. Tramer et al (1997) found 84 studies of the effect of ondansetron on postoperative emesis. Unfortunately, 14 of these studies (17%) were second or even third time publications of the same data set. The duplicate studies had much larger effects and adding the duplicates to the originals produced an overestimation of treatment efficacy of 23%. Tracking down the duplicate publications was quite difficult. More than 90% of the duplicate publications did not cross-reference the other studies. Four pairs of identical trials were published by completely different authors without any common authorship.
The limitations of a Medline search
While a Medline search is the most convenient way to identify published research, it should not be the only source of publications for a meta-analysis. Medline searches cover only 3,000 of some 13,000 medical journals (Halvorsen 1992). The studies missed by Medline and other databases are more likely to be negative studies.
Furthermore, these databases may fail to index major journals in the third world that can provide important trials. Egger (1997) cites an interesting example of how Medline excludes most Indian journals, even though these journals are published in English and India produces a significant amount of medical research.
Foreign language publications
Some meta-analyses restrict their attention to English language publications only. While this may seem like a convenience, in some situations, researchers might tend to publish in an English language journal for those trials which are positive, and publish in a (presumably less prestigious) native language journal for those trials which are negative. Interestingly, some studies have shown that the quality of studies published in other languages is comparable to the quality of studies published in English.
Picking the low hanging fruit
In an informal meta-analysis, you should also worry about the tendency for people to preferentially choose articles that are convenient. For example, there is a natural tendency to rely on articles where the full text is available on the Internet or where the abstract is available for review (Wentz 2002).
How to avoid bias from exclusion of publications
The search for studies should involve several bibliographic databases, registries for clinical trials, examination of bibliographies of all articles found, the so-called gray literature (presentation abstracts, dissertations, theses, etc.) and a letter calling for unpublished papers to be sent out to key researchers.
Consider the search strategy adopted in Evers et al 2001.
Relevant trials were identified in the Cochrane Menstrual Disorders and Subfertility Group's specialised register of controlled trials. A MEDLINE search, using the group's search strategy, was performed for the period 1966-2000. Also, hand searching was performed of 22 specialist journals in the field from their first issue till 2000. Cross references and references from review articles were checked.
Subjectivity
"Blinding," a common tool in other research areas, should also be used in meta-analyses. Blinding prevents the differential application of inclusion/exclusion criteria. The people deciding whether a paper meets the inclusion/exclusion criteria should be unaware of the authors of that paper and the journal. They should also include or exclude the paper on the basis of the methods section only; they should not see the results section until later.
There is empirical evidence, however, that blinding does not affect the conclusions of a meta-analysis (Jadad et al 1996, Berlin et al 1997). Furthermore, blinding takes substantial time and energy.
Data should be extracted from papers by multiple sources and their level of agreement should be assessed. Researchers have found disagreements even on such fundamental questions as whether a study was positive or negative (Glass 1981).
Like any other research project, an overview or meta-analysis needs a protocol. Unfortunately, many published meta-analyses do not state whether a protocol was used (Sacks 1992). The protocol should specify: the inclusion/exclusion criteria for studies; a detailed description of the process used to identify studies; and the statistical methods used to combine results. Without a protocol, the meta-analysis research is not reproducible.
Authors have been shown to be biased in the articles that they cite in the bibliographies of their research papers (Gotzsche 1987; Ravnskov 1992). This same bias could potentially affect the selection of articles in a meta-analysis.
If the authors do not present objective criteria for the selection of articles in their overview or meta-analysis, then you should be concerned about possible conscious or sub-conscious bias in the selection process.
Researchers should also list all of the articles found in the original search, not just the articles used. This allows others to examine whether the inclusion/exclusion criteria were applied appropriately.
Preventing publication bias
[Registry]
Detecting and correcting for publication bias
Sensitivity analysis is also useful here. If the results from published studies are comparable to the results from unpublished studies, for example, then publication bias is less of a concern. Along the same lines, the authors can estimate the number of undiscovered negative studies that would be required to overturn the results of this meta-analysis.
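One well known way to estimate that number is Rosenthal's "fail-safe N." The text above does not say which method to use, so treat this as one possible sketch: it combines the z-scores of the studies you found and asks how many additional studies averaging a z-score of zero would pull the combined result below one-tailed significance at the 5% level. The z-scores shown are hypothetical.

```python
import math

def fail_safe_n(z_scores, z_critical=1.645):
    """Rosenthal-style fail-safe N.

    Smallest number of extra studies with an average z-score of zero that
    would bring the combined z (sum of z divided by the square root of the
    number of studies) down to z_critical or below.
    """
    k = len(z_scores)
    needed = sum(z_scores) ** 2 / z_critical ** 2 - k
    return max(0, math.ceil(needed))

# Hypothetical z-scores from five published studies.
published_z = [2.1, 1.8, 2.5, 1.2, 2.9]
print(f"Unpublished null studies needed: {fail_safe_n(published_z)}")
```

If the answer is only a handful of studies, a few file drawers could overturn the conclusion. If it is in the hundreds, publication bias alone is a much less plausible explanation.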
Publication bias is also more likely to occur for studies with small sample sizes. If the results of a meta-analysis are stratified by the sample sizes in the studies, a shift away from the null hypothesis in the smaller studies would be a warning flag about the possibility of publication bias. Statistical and graphical methods have been proposed to examine this further but you should be cautious, however, because sometimes there are other explanations. For example, smaller studies may tend to use less rigorous designs and these designs may be associated with exaggerated effects (Sterne et al 2001).
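One of the statistical methods alluded to above is a regression in the spirit of Egger's test: divide each effect by its standard error, regress that on the precision (one over the standard error), and look at the intercept. With no small-study effect the intercept should sit near zero. The sketch below uses hypothetical data and computes only the intercept and slope, not the standard error or p-value that a formal test would need.

```python
def egger_regression(effects, std_errors):
    """Least squares fit of (effect / SE) on (1 / SE); returns (intercept, slope)."""
    precision = [1 / se for se in std_errors]
    standardized = [e / se for e, se in zip(effects, std_errors)]
    n = len(effects)
    mean_p = sum(precision) / n
    mean_z = sum(standardized) / n
    sxy = sum((p - mean_p) * (z - mean_z)
              for p, z in zip(precision, standardized))
    sxx = sum((p - mean_p) ** 2 for p in precision)
    slope = sxy / sxx
    return mean_z - slope * mean_p, slope

# Hypothetical standard errors, from large studies (small SE) to small studies.
std_errors = [0.10, 0.15, 0.25, 0.40, 0.50]

# Case 1: every study estimates the same effect, whatever its size.
same_effect = [-0.20] * 5
# Case 2: the smaller the study, the more extreme the reported effect.
small_study_shift = [-0.20, -0.25, -0.45, -0.70, -0.90]

intercept_same, slope_same = egger_regression(same_effect, std_errors)
intercept_shift, slope_shift = egger_regression(small_study_shift, std_errors)
print(f"Intercept with no small-study effect: {intercept_same:.2f}")
print(f"Intercept with a small-study shift: {intercept_shift:.2f}")
```

Remember the caution above: an intercept far from zero tells you the small studies look different, not why they look different.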
McManus et al (1998) highlight the importance of consulting experts in the area. They were trying to identify all publications associated with near patient testing, tests where the results are available without sending materials to a lab. The authors used a search of electronic databases, a survey of experts in the area, and hand searching of specific journals. The electronic databases yielded the largest number of publications, 50, but still missed 52 publications found by the other two methods.
Copas and Shi (2000) present a re-analysis of a meta-analysis on lung cancer that adjusts for publication bias, but this adjustment is controversial (Johnson et al 2000).
Reanalysis of Epidemiological Evidence on Lung Cancer and Passive Smoking. J B Copas and J Q Shi. BMJ 2000; 320: 417-418.
Lung Cancer and Passive Smoking. Kenneth C Johnson, James Repace, Allan Hackshaw, Malcolm Law, Nicholas Wald, Stanton A Glantz, Christopher Cates, John Copas, and Jian Qing Shi. BMJ 2000; 321: 1221.
Did the pile of apples amount to more than just a hill of beans?
It’s not enough to know that the overall effect of a therapy is positive. You have to balance the magnitude of the effect versus the added cost and/or the side effects of the new therapy. Unfortunately, most meta-analyses use an effect size (the improvement due to the therapy divided by the standard deviation). The effect size is unitless, allowing the combination of results from studies where slightly different outcomes with slightly different measurement units might have been used.
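Here is a minimal sketch of that calculation, using hypothetical birth weights. It shows why the effect size is convenient for pooling (the same study expressed in grams or in ounces gives the same number) and also why it is hard to interpret (the units have vanished).

```python
def effect_size(mean_treatment, mean_control, standard_deviation):
    """Improvement due to the therapy divided by the standard deviation."""
    return (mean_treatment - mean_control) / standard_deviation

GRAMS_PER_OUNCE = 28.349523125

# Hypothetical study reported in grams.
es_grams = effect_size(3400.0, 3200.0, 500.0)

# The same hypothetical study reported in ounces.
es_ounces = effect_size(3400.0 / GRAMS_PER_OUNCE,
                        3200.0 / GRAMS_PER_OUNCE,
                        500.0 / GRAMS_PER_OUNCE)

print(f"Effect size from grams:  {es_grams:.2f}")
print(f"Effect size from ounces: {es_ounces:.2f}")
```

To turn an effect size back into something you can weigh against costs and side effects, you have to multiply it by a standard deviation in real units, and the meta-analysis may not tell you which one to use.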
Vote counting
Avoid "vote counting" or the tallying of positive versus negative studies. Vote counts ignore the possibility that some studies are negative solely because of their sample size. Abramson (1990) notes, for example, a meta-analysis of parenteral nutrition in cancer patients undergoing chemotherapy. Although each of the seven randomized control trials in the meta-analysis failed to achieve statistical significance, the pooled results were highly significant.
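The sketch below shows how this can happen. The seven trials are hypothetical (they are not the parenteral nutrition trials Abramson describes), and the pooling uses a simple inverse-variance weighted average, which is my assumption for illustration.

```python
import math

CRITICAL_Z = 1.96  # two-sided 5% significance level

# Hypothetical trials: (log odds ratio, standard error).
trials = [(-0.30, 0.20), (-0.25, 0.18), (-0.35, 0.22), (-0.20, 0.15),
          (-0.40, 0.25), (-0.28, 0.19), (-0.32, 0.21)]

# Vote count: how many trials are individually significant?
significant_votes = sum(1 for estimate, se in trials
                        if abs(estimate / se) > CRITICAL_Z)

# Pooled result: inverse-variance weighted average of the log odds ratios.
weights = [1 / se ** 2 for _, se in trials]
pooled = sum(w * estimate for w, (estimate, _) in zip(weights, trials)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_z = pooled / pooled_se

print(f"Trials significant on their own: {significant_votes} of {len(trials)}")
print(f"Pooled odds ratio: {math.exp(pooled):.2f}, z = {pooled_z:.2f}")
```

Every trial points in the same direction but is too small to reach significance alone. A vote count scores this as seven failures; the pooled estimate sees seven consistent pieces of evidence.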
Unitless measures
When you are examining a continuous outcome measure, you should be sure that the results are presented in interpretable units. A measure of effect size does not help you much because it is unitless and impossible to interpret. Consider a store that is offering a sale and announces boldly:
"All prices reduced by 0.8 standard deviations!"
One meta-analysis shows how important it is to express measurements in interpretable units. Lumley et al (2001) studied the effect of smoking cessation programs on the health of the fetus and infant. One of the outcome measures was birth weight, and the study showed that the typical program can improve birth weight by a statistically significant amount. The researchers then quantified the amount: 28g (95% confidence interval 9 to 49).
Keep in mind that this is measuring the effectiveness of the smoking cessation program, and not the effect of smoking cessation directly. Typically, you would have to send about 12 to 16 women to these programs in order to get one extra woman to quit smoking. So the effect seen here reflects, in part, how difficult it is to get people to change their behavior.
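A back-of-the-envelope sketch makes the dilution concrete. It rests on a strong assumption that is mine and not the authors': that the entire 28g average gain comes from the one extra woman in every 12 to 16 who quits because of the program. The function name is hypothetical.

```python
def implied_gain_per_quitter(program_effect_grams, women_per_extra_quitter):
    """Average gain per extra quitter, ASSUMING the whole program effect
    comes only from the women who quit because of the program."""
    return program_effect_grams * women_per_extra_quitter

# 28g average effect of the program; 12 to 16 women enrolled per extra quitter.
low = implied_gain_per_quitter(28, 12)
high = implied_gain_per_quitter(28, 16)
print(f"Implied gain per extra quitter: {low}g to {high}g")
```

Under that assumption, a small average effect across everyone enrolled is compatible with a much larger effect for the few women whose behavior actually changes.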
Still, the small size of the effect is important. If you want to assess the costs and benefits of smoking cessation programs, it helps to know that the impact of the typical smoking cessation program on birth weight is quite small. This provides a useful yardstick for comparison to other prenatal interventions.
Where does meta-analysis sit on the hierarchy of evidence?
[Meta-analysis] possesses certain flaws and limitations that preclude its use as a broad-based methodologic approach for formulating definitive therapeutic recommendations. -- Boden 1992.
Bibliography
Meta-Analysis: A Review of Pros and Cons. Abramson J. Public Health Reviews 1990 18(1): 1-47.
Does Blinding of Readers Affect the Results of Meta-Analyses? Jesse A Berlin, on behalf of University of Pennsylvania Meta-analysis Blinding Study Group. Lancet 1997; 350: 185-186.
Evidence for Decreasing Quality of Semen During Past 50 Years. Carlsen E, Giwercman A, Keiding N, Skakkebaek NE. BMJ 1992; 305(6854): 609-13.
The Existence of Publication Bias and Risk Factors for its Occurrence. Dickersin, K. (1990). JAMA 263(10): 1385-9.
Egger (1997)
Surgery or Embolisation for Varicocele in Subfertile Men (Cochrane Review). Evers JL, Collins JA, Vandekerckhove P. Cochrane Database Syst Rev 2001; 1: CD000479.
Should Unpublished Data Be Included in Meta-Analyses? Cook DJ, Guyatt GH, Ryan E, Clifton J, Buckingham L, Willan A, McIlroy W, Oxman AD. Journal of the American Medical Association, 269: 2749-2753 (1993).
Geographic Variations in Sperm Counts: A Potential Cause of Bias in Studies of Semen Quality. Fisch H; Goluboff ET. Fertil Steril (United States), May 1996, 65(5) p1044-6.
Meta-analysis in Social Research. Glass GV, McGaw B, Smith ML. pp.18-20. Newbury Park CA: Sage (1981).
Comparison of Intrauterine and Intracervical Insemination with Frozen Donor Sperm: A Meta-Analysis. Goldberg JM, Mascha E, Falcone T, Attaran M. Fertil Steril 1999 Nov; 72(5): 792-5.
Reference Bias in Reports of Drug Trials. Gotzsche PC. BMJ 1987 295(6599): 654-6.
Combining Results from Independent Investigations: Meta-Analysis in Clinical Research. Halvorsen KT, Burdick E, Colditz GA, Frazier HS, Mosteller F. pp. 413-426, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
Systematic Reviews in Health Care: Assessing the Quality of Controlled Clinical Trials. Peter Jüni, Douglas G Altman, and Matthias Egger. BMJ 2001; 323: 42-46.
A Quality Assessment of Randomized Control Trials of Primary Treatment for Breast Cancer. Liberati A, Himel HN, Chalmers TC. J Clin Oncol 1986; 4: 942-951.
Interventions for Promoting Smoking Cessation During Pregnancy (Cochrane Review). Lumley J, Oliver S, Waters E. In: The Cochrane Library, 4, 2001. Oxford: Update Software. http://www.update-software.com/abstracts/ab001055.htm
Review of the Usefulness of Contacting Other Experts When Conducting A Literature Search for Systematic Reviews. R J McManus, S Wilson, B C Delaney, D A Fitzmaurice, C J Hyde, R S Tobias, S Jowett, and F D R Hobbs. BMJ 1998; 317: 1562-1563.
Sperm Function Assays and Their Predictive Value for Fertilization Outcome in IVF Therapy: A Meta-Analysis. Oehninger S, Franken DR, Sayed E, Barroso G, Kolm P. Hum Reprod Update 2000 Mar-Apr; 6(2): 160-8.
Have Sperm Counts Been Reduced 50 Percent in 50 Years? A Statistical Model Revisited. Olsen GW; Bodner KM; Ramlow JM; Ross CE; Lipshultz LI. Fertil Steril (United States), Apr 1995, 63(4) p887-93.
Frequency of Citation and Outcome of Cholesterol Lowering Trials. Ravnskov, U. BMJ 1992 305(6855): 717.
Meta-Analyses of Randomized Control Trials: An Update of the Quality and Methodology. Sacks HS, Berrier J, Reitman D, Pagano D, Chalmers TC. pp. 427-442, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
Schulz et al 1995 JAMA
Publication Bias: Evidence of Delayed Publication in a Cohort Study of Clinical Research Projects. Jerome M Stern and R John Simes. BMJ 1997; 315: 640-645.
Systematic Reviews in Health Care: Investigating and Dealing with Publication and Other Biases in Meta-Analysis. Jonathan A C Sterne, Matthias Egger, and George Davey Smith. BMJ 2001; 323: 101-105.
Meta-Analysis of Observational Studies in Epidemiology: A Proposal for Reporting. Donna F. Stroup, PhD, MSc; Jesse A. Berlin, ScD; Sally C. Morton, PhD; Ingram Olkin, PhD; G. David Williamson, PhD; Drummond Rennie, MD; David Moher, MSc; Betsy J. Becker, PhD; Theresa Ann Sipe, PhD; Stephen B. Thacker, MD, MSc; for the Meta-analysis Of Observational Studies in Epidemiology (MOOSE) Group. JAMA. April 19, 2000; 283: 2008-2012. Also available at http://www.consort-statement.org/MOOSE.pdf
Impact of Covert Duplicate Publication on Meta-Analysis: A Case Study. Martin R Tramèr, D John M Reynolds, R Andrew Moore, and Henry J McQuay. BMJ 1997; 315: 635-640.
Cigarette Smoking and Sperm Density: A Meta-Analysis. Vine MF, Margolin BH, Morrison HI, Hulka BS. Fertil Steril 1994 Jan; 61(1): 35-43.
Visibility of Research: FUTON Bias. Wentz R. Lancet 2002 (October 19): 360 (9341); 1256.
Additional Resources and Materials
The Cochrane Library. www.update-software.com/cochrane/cochrane-frame.html
"The Cochrane Library is an electronic publication designed to supply high quality evidence to inform people providing and receiving care, and those responsible for research, teaching, funding and administration at all levels."
Meta-Analysis in Clinical Trials Reporting: Has a Tool Become a Weapon? [Editorial]. Boden, W. E. (1992). Am J Cardiol 69(6): 681-6.
A New System for Grading Recommendations in Evidence-Based Guidelines. Robin Harbour and Juliet Miller. BMJ 2001; 323: 334-336.
Rating the Quality of Evidence for Clinical Practice Guidelines. Hadorn DC, Baker D, Hodges JS, Hicks N. J Clin Epidemiol 1996 Jul;49(7):749-54.
This article describes the system for rating the quality of medical evidence developed and used during creation of the Agency for Health Care Policy and Research-sponsored heart failure guideline. Previous approaches to rating evidence were not designed for use in the setting of clinical practice guidelines. The present system is based on the tenet that flaws in research design are serious to the extent they threaten the validity of the results of studies. A taxonomy of major and minor flaws based on that tenet was developed for randomized controlled trials and for cohort and medical registry studies. The use of the system is described in the context of two difficult clinical issues considered by the Panel: the role of coronary artery revascularization and the use of metoprolol.
PMID: 8691224 [PubMed - indexed for MEDLINE]
"Is Meta-Analysis a Valid Approach to the Evaluation of Small Effects in Observational Studies?" Shapiro S. Journal of Clinical Epidemiology. 50(3): 223-229 (1997).
Assessment Criteria http://www.jr2.ox.ac.uk/bandolier/band6/b6-5.html
Evidence-Based Everything http://www.jr2.ox.ac.uk/bandolier/band12/b12-1.html
Ioannidis et al 1998. [comparing meta-analyses to large trials]
Odds ratio versus relative risk
Dear Professor Mean, There is some confusion about the use of the odds ratio versus the relative risk. Can you explain the difference between these two numbers?
Both the odds ratio and the relative risk compare the likelihood of an event between two groups. Consider the following data on survival of passengers on the Titanic. There were 462 female passengers: 308 survived and 154 died. There were 851 male passengers: 142 survived and 709 died (see table below).
         Alive   Dead   Total
Female   308     154    462
Male     142     709    851
Total    450     863    1,313
Clearly, a male passenger on the Titanic was more likely to die than a female passenger. But how much more likely? You can compute the odds ratio or the relative risk to answer this question.
The odds ratio compares the relative odds of death in each group. For females, the odds were exactly 2 to 1 against dying (154/308=0.5). For males, the odds were almost 5 to 1 in favor of death (709/142=4.993). The odds ratio is 9.986 (4.993/0.5). The odds of death were about ten times greater for males than for females.
The relative risk (sometimes called the risk ratio) compares the probability of death in each group rather than the odds. For females, the probability of death is 33% (154/462=0.3333). For males the probability is 83% (709/851=0.8331). The relative risk of death is 2.5 (0.8331/0.3333). The probability of death was 2.5 times as high for males as for females.
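If you want to check these two calculations yourself, here is a minimal Python sketch. The function name and its argument order are my own choices for illustration; the counts are the Titanic figures given above.

```python
def odds_ratio_and_relative_risk(events_1, total_1, events_2, total_2):
    """Compare group 1 to group 2; returns (odds_ratio, relative_risk)."""
    risk_1 = events_1 / total_1
    risk_2 = events_2 / total_2
    odds_1 = events_1 / (total_1 - events_1)
    odds_2 = events_2 / (total_2 - events_2)
    return odds_1 / odds_2, risk_1 / risk_2


# Deaths and totals for males (group 1), then females (group 2).
titanic_or, titanic_rr = odds_ratio_and_relative_risk(709, 851, 154, 462)
print(f"odds ratio    = {titanic_or:.3f}")
print(f"relative risk = {titanic_rr:.3f}")
```

Swapping the two groups in the call returns the reciprocal of each number.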
There is quite a difference. Both measurements show that men were more likely to die. But the odds ratio implies that men were much worse off than the relative risk does. Which number is a fairer comparison?
There are three issues here: The relative risk measures events in a way that is interpretable and consistent with the way people really think. The relative risk, though, cannot always be computed in a research design. Also, the relative risk can sometimes lead to ambiguous and confusing situations. But first, we need to remember that fractions are funny.
Fractions are funny.
Suppose you invested money in a stock. On the first day, the value of the stock decreased by 20%. On the second day it increased by 20%. You would think that you have broken even, but that's not true.
Take the value of the stock and multiply by 0.8 to get the price after the first day. Then multiply by 1.2 to get the price after the second day. The successive multiplications do not cancel out because 0.8 * 1.2 = 0.96. A 20% decrease followed by a 20% increase leaves you slightly worse off.
It turns out that to counteract a 20% decrease, you need a 25% increase. That is because 0.8 and 1.25 are reciprocal. This is easier to see if you express them as simple fractions: 4/5 and 5/4 are reciprocal fractions. Listed below is a table of common reciprocal fractions.
0.8 (4/5)    1.25 (5/4)
0.75 (3/4)   1.33 (4/3)
0.67 (2/3)   1.50 (3/2)
0.50 (1/2)   2.00 (2/1)
Sometimes when we are comparing two groups, we'll put the first group in the numerator and the other in the denominator. Sometimes we will reverse ourselves and put the second group in the numerator. The numbers may look quite different (e.g., 0.67 and 1.5) but as long as you remember what the reciprocal fraction is, you shouldn't get too confused.
For example, we computed 2.5 as the relative risk in the example above. In this calculation we divided the male probability by the female probability. If we had divided the female probability by the male probability, we would have gotten a relative risk of 0.4. This is fine because 0.4 (2/5) and 2.5 (5/2) are reciprocal fractions.
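The reciprocal arithmetic is easy to verify with exact fractions. This is just a small sketch using Python's standard fractions module; the variable names are mine.

```python
from fractions import Fraction

down_20 = Fraction(4, 5)  # a 20% decrease multiplies the price by 0.8
up_20 = Fraction(6, 5)    # a 20% increase multiplies the price by 1.2
up_25 = Fraction(5, 4)    # a 25% increase multiplies the price by 1.25

print(down_20 * up_20)  # 24/25, so you end up 4% short of breaking even
print(down_20 * up_25)  # 1, so a 25% increase undoes a 20% decrease

# Reversing numerator and denominator gives the reciprocal relative risk.
relative_risk_male_vs_female = Fraction(5, 2)
relative_risk_female_vs_male = 1 / relative_risk_male_vs_female
print(relative_risk_female_vs_male)  # 2/5
```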
Interpretability
The most commonly cited advantage of the relative risk over the odds ratio is that the former has the more natural interpretation.
The relative risk comes closer to what most people think of when they compare the relative likelihood of events. Suppose there are two groups, one with a 25% chance of mortality and the other with a 50% chance of mortality. Most people would say that the latter group has it twice as bad. But the odds ratio is 3, which seems too big. The latter odds are even (1 to 1) and the former odds are 3 to 1 against.
Even more extreme examples are possible. A change from 25% to 75% mortality represents a relative risk of 3, but an odds ratio of 9.
A change from 10% to 90% mortality represents a relative risk of 9 but an odds ratio of 81.
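The three comparisons above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper names are assumptions of mine, not part of any statistical package.

```python
def odds(probability):
    """Convert a probability to odds in favor of the event."""
    return probability / (1 - probability)


def compare_mortality(p_low, p_high):
    """Return (relative_risk, odds_ratio) for the higher-risk group versus the lower."""
    return p_high / p_low, odds(p_high) / odds(p_low)


for p_low, p_high in [(0.25, 0.50), (0.25, 0.75), (0.10, 0.90)]:
    rr, odds_ratio = compare_mortality(p_low, p_high)
    print(f"{p_low:.0%} to {p_high:.0%}: relative risk {rr:.1f}, odds ratio {odds_ratio:.1f}")
```

The gap between the two measures widens as the probabilities move away from zero, which is why the odds ratio looks so extreme in the last example.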
There are some additional issues about interpretability that are beyond the scope of this paper. In particular, both the odds ratio and the relative risk are computed by division and are relative measures. In contrast, absolute measures, computed as a difference rather than a ratio, produce estimates with quite different interpretations (Fahey et al 1995, Naylor et al 1992).
Designs that rule out the use of the relative risk
Some research designs, particularly the case-control design, prevent you from computing a relative risk. A case-control design involves the selection of research subjects on the basis of the outcome measurement rather than on the basis of the exposure.
Consider a case-control study of prostate cancer risk and male pattern balding. The goal of this research was to examine whether men with certain hair patterns were at greater risk of prostate cancer. In that study, roughly equal numbers of prostate cancer patients and controls were selected. Among the cancer patients, 72 out of 129 had either vertex or frontal baldness compared to 82 out of 139 among the controls (see table below).
          Cancer cases   Controls   Total
Balding   72             82         154
Hairy     57             57         114
Total     129            139        268
In this type of study, you can estimate the probability of balding for cancer patients, but you can't calculate the probability of cancer for bald patients. The prevalence of prostate cancer was artificially inflated to almost 50% by the nature of the case-control design.
So you would need additional information or a different type of research design to estimate the relative risk of prostate cancer for patients with different types of male pattern balding. Contrast this with data from a cohort study of male physicians (Lotufo et al 2000). In this study of the association between male pattern baldness and coronary heart disease, the researchers could estimate relative risks, since 1,446 physicians had coronary heart disease events during the 11-year follow-up period.
For example, among the 8,159 doctors with hair, 548 (6.7%) developed coronary heart disease during the 11 years of the study. Among the 1,351 doctors with severe vertex balding, 127 (9.4%) developed coronary heart disease (see table below). The relative risk is 1.4 = 9.4% / 6.7%.
          Heart disease   Healthy         Total
Balding   127 (9.4%)      1,224 (90.6%)   1,351
Hairy     548 (6.7%)      7,611 (93.3%)   8,159
Total     675             8,835           9,510
You can always calculate and interpret the odds ratio in a case-control study. It has a reasonable interpretation as long as the outcome event is rare (Breslow and Day 1980, page 70). The interpretation of the odds ratio in a case-control design is, however, also dependent on how the controls were recruited (Pearce 1993).
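Here is a rough sketch of why the odds ratio survives a case-control design while the relative risk does not. It takes the cohort counts for the physicians above and then pretends, purely as a thought experiment, that we kept every heart disease case but only 10% of the healthy doctors. The 10% sampling fraction and the use of expected (fractional) counts are my assumptions, not anything done in the original study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Row 1 risk divided by row 2 risk, treating the first column as the event."""
    return (a / (a + b)) / (c / (c + d))


def sample_non_cases(a, b, c, d, fraction):
    """Keep every case but only `fraction` of the non-cases (expected counts)."""
    return a, b * fraction, c, d * fraction


# Rows: balding, hairy. Columns: heart disease, healthy.
cohort = (127, 1224, 548, 7611)
case_control = sample_non_cases(*cohort, fraction=0.10)

print(f"cohort:       OR {odds_ratio(*cohort):.2f}, RR {risk_ratio(*cohort):.2f}")
print(f"case-control: OR {odds_ratio(*case_control):.2f}, "
      f"naive RR {risk_ratio(*case_control):.2f}")
```

The odds ratio is the same in both tables because the sampling fraction cancels out of ad/bc. The naive relative risk changes with the sampling fraction, so it tells you about the design rather than about the disease. In the cohort the odds ratio and the relative risk are fairly close, which is what you would expect when the outcome is uncommon.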
Another situation which calls for the use of odds ratio is covariate adjustment. It is easy to adjust an odds ratio for confounding variables; the adjustments for a relative risk are much trickier.
In a study on the likelihood of pregnancy among people with epilepsy (Schupf and Ottman 1994), 232 out of 586 males with idiopathic/cryptogenic epilepsy had fathered one or more children. In the control group, the respective counts were 79 out of 109 (see table below).
           Children    No children   Total
Epilepsy   232 (40%)   354 (60%)     586
Control    79 (72%)    30 (28%)      109
Total      311         384           695
The simple relative risk is 0.55 and the simple odds ratio is 0.25. Clearly the probability of fathering a child is strongly dependent on a variety of demographic variables, especially age (the issue of marital status was dealt with by a separate analysis). The control group was 8.4 years older on average (43.5 years versus 35.1), showing the need to adjust for this variable. With a multivariate logistic regression model that included age, education, ethnicity and sibship size, the adjusted odds ratio for epilepsy status was 0.36. Although this ratio was closer to 1.0 than the crude odds ratio, it was still highly significant. A comparable adjusted relative risk would be more difficult to compute (although it can be done as in Lotufo et al 2000).
Ambiguous and confusing situations
The relative risk can sometimes produce ambiguous and confusing situations. Part of this is due to the fact that relative measurements are often counter-intuitive. Consider an interesting case of relative comparison that comes from a puzzle on the Car Talk radio show. You have a hundred pound sack of potatoes. Let's assume that these potatoes are 99% water. That means 99 parts water and 1 part potato. These are soggier potatoes than I am used to seeing, but it makes the problem more interesting.
If you dried out the potatoes completely, they would only weigh one pound. But let's suppose you only wanted to dry out the potatoes partially, until they were 98% water. How much would they weigh then?
The counter-intuitive answer is 50 pounds. 98% water means 49 parts water and 1 part potato. An alternative way of thinking about the problem is that in order to double the concentration of potato (from 1% to 2%), you have to remove about half of the water.
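The puzzle boils down to one line of arithmetic: the dry potato stays fixed at one pound, so the total weight is that one pound divided by the non-water fraction. A small sketch (the function name is mine):

```python
def total_weight(dry_pounds, water_fraction):
    """Total weight when `dry_pounds` of solids make up the non-water share."""
    return dry_pounds / (1 - water_fraction)


for water_fraction in (0.99, 0.98, 0.95, 0.90):
    pounds = total_weight(1, water_fraction)
    print(f"{water_fraction:.0%} water: {pounds:.0f} pounds")
```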
Relative risks have the same sort of counter-intuitive behavior. A small relative change in the probability of a common event's occurrence can be associated with a large relative change in the opposite probability (the probability of the event not occurring).
Consider a recent study on physician recommendations for patients with chest pain (Schulman et al 1999). This study found that when doctors viewed videotape of hypothetical patients, race and sex influenced their recommendations. One of the findings was that doctors were more likely to recommend cardiac catheterization for men than for women. 326 out of 360 (90.6%) doctors viewing the videotape of male hypothetical patients recommended cardiac catheterization, while only 305 out of 360 (84.7%) of the doctors who viewed tapes of female hypothetical patients made this recommendation.
                 No cath      Cath          Total
Male patient     34 (9.4%)    326 (90.6%)   360
Female patient   55 (15.3%)   305 (84.7%)   360
Total            89           631           720
The odds ratio is either 0.57 or 1.74, depending on which group you place in the numerator. The authors reported the odds ratio in the original paper and concluded that physicians make different recommendations for male patients than for female patients.
A critique of this study (Schwarz et al 1999) noted among other things that the relative risk was only 0.93 (reciprocal 1.07). Since 0.93 is so much closer to 1 than 0.57 is, the critics claimed that the odds ratio overstated the tendency for physicians to make different recommendations for male and female patients. In this study, however, it is not entirely clear that 0.93 is the appropriate risk ratio.
Although the relative change from 90.6% to 84.7% is modest, consider the opposite perspective. The rate for recommending a less aggressive intervention than catheterization was 15.3% for doctors viewing the female patients and 9.4% for doctors viewing the male patients, a relative risk of 1.63 (reciprocal 0.61).
This is the same thing that we just saw in the Car Talk puzzler: a small relative change in the water content implies a large relative change in the potato content. In the physician recommendation study, a small relative change in the probability of a recommendation in favor of catheterization corresponds to a large relative change in the probability of recommending against catheterization.
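To see both relative risks side by side, here is a short sketch that works directly from the counts in the catheterization table. The function name is my own. Because it uses the raw counts rather than the rounded percentages, the last digit can differ slightly from the figures quoted in the text.

```python
def relative_risks_both_ways(events_1, total_1, events_2, total_2):
    """Return (RR for the event, RR for the non-event), group 1 versus group 2."""
    rr_event = (events_1 / total_1) / (events_2 / total_2)
    rr_non_event = ((total_1 - events_1) / total_1) / ((total_2 - events_2) / total_2)
    return rr_event, rr_non_event


# Catheterization recommended: female-patient tapes versus male-patient tapes.
rr_cath, rr_no_cath = relative_risks_both_ways(305, 360, 326, 360)
print(f"relative risk of recommending catheterization:     {rr_cath:.2f}")
print(f"relative risk of not recommending catheterization: {rr_no_cath:.2f}")
```

The same table gives a relative risk close to 1 from one perspective and a much larger one from the other.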
Thus, for every problem, there are two possible ways to compute relative risk. Sometimes, it is obvious which relative risk is appropriate. For the Titanic passenger, the appropriate risk is for death rather than survival. But what about a breast feeding study? Are we trying to measure how much an intervention increases the probability of breast feeding success or are we trying to see how much the intervention decreases the probability of breast feeding failure? For example, Deeks 1998 expresses concern about an odds ratio calculation in a study aimed at increasing the duration of breast feeding. At three months, 32/51 (63%) of the mothers in the treatment group had stopped breast feeding compared to 52/57 (91%) in the control group.
            Continued bf   Stopped bf   Total
Treatment   19 (37.3%)     32 (62.7%)   51
Control     5 (8.8%)       52 (91.2%)   57
Total       24             84           108
While the relative risk of 0.69 (reciprocal 1.45) for these data is much less extreme than the odds ratio of 0.16 (reciprocal 6.2), one has to wonder why Deeks chose to compare probabilities of breast feeding failures rather than successes. The rate of successful breast feeding at three months was 4.2 times higher in the treatment group than the control group. This is still not as extreme as the odds ratio; the odds ratio for successful breast feeding is 6.2, which is simply the reciprocal of the odds ratio for breast feeding failure.
One advantage of the odds ratio is that it is not dependent on whether we focus on the event's occurrence or its failure to occur. If the odds ratio for an event deviates substantially from 1.0, the odds ratio for the event's failure to occur will also deviate substantially from 1.0, though in the opposite direction.
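A quick sketch with the breast feeding counts shows this symmetry. The two odds ratios are exact reciprocals of each other, while the two relative risks are not. The helper names below are mine.

```python
def odds_ratio(events_1, total_1, events_2, total_2):
    """Odds of the event in group 1 divided by the odds in group 2."""
    odds_1 = events_1 / (total_1 - events_1)
    odds_2 = events_2 / (total_2 - events_2)
    return odds_1 / odds_2


def relative_risk(events_1, total_1, events_2, total_2):
    """Probability of the event in group 1 divided by the probability in group 2."""
    return (events_1 / total_1) / (events_2 / total_2)


# Treatment versus control, first counting mothers who stopped, then those who continued.
or_stopped = odds_ratio(32, 51, 52, 57)
or_continued = odds_ratio(19, 51, 5, 57)
rr_stopped = relative_risk(32, 51, 52, 57)
rr_continued = relative_risk(19, 51, 5, 57)

print(f"odds ratios:    {or_stopped:.2f} and {or_continued:.2f} "
      f"(product {or_stopped * or_continued:.2f})")
print(f"relative risks: {rr_stopped:.2f} and {rr_continued:.2f} "
      f"(product {rr_stopped * rr_continued:.2f})")
```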
Summary
Both the odds ratio and the relative risk compare the relative likelihood of an event occurring between two distinct groups. The relative risk is easier to interpret and consistent with the general intuition. Some designs, however, prevent the calculation of the relative risk. Also there is some ambiguity as to which relative risk you are comparing. When you are reading research that summarizes the data using odds ratios, or relative risks, you need to be aware of the limitations of both of these measures.
Bibliography
When can odds ratios mislead? Odds ratios should be used only in case-control studies and logistic regression analyses [letter]. Deeks J. British Medical Journal 1998:317(7166);1155-6; discussion 1156-7.
Evidence based purchasing: understanding results of clinical trials and systematic reviews. Fahey T, Griffiths S and Peters TJ. British Medical Journal 1995:311(7012);1056-9; discussion 1059-60. OBJECTIVE--To assess whether the way in which the results of a randomised controlled trial and a systematic review are presented influences health policy decisions. DESIGN--A postal questionnaire to all members of a health authority within one regional health authority. SETTING--Anglia and Oxford regional health authorities. SUBJECTS--182 executive and non-executive members of 13 health authorities, family health services authorities, or health commissions. MAIN OUTCOME MEASURES--The average score from all health authority members in terms of their willingness to fund a mammography programme or cardiac rehabilitation programme according to four different ways of presenting the same results of research evidence--namely, as a relative risk reduction, absolute risk reduction, proportion of event free patients, or as the number of patients needed to be treated to prevent an adverse event. RESULTS--The willingness to fund either programme was significantly influenced by the way in which data were presented. Results of both programmes when expressed as relative risk reductions produced significantly higher scores when compared with other methods (P < 0.05). The difference was more extreme for mammography, for which the outcome condition is rarer. CONCLUSIONS--The method of reporting trial results has a considerable influence on the health policy decisions made by health authority members.
Interpretation and Choice of Effect Measures in Epidemiologic Analyses. Greenland S. American Journal of Epidemiology 1987:125(5);761-767.
Male Pattern Baldness and Coronary Heart Disease: The Physician's Health Study. Lotufo PA. Archives of Internal Medicine 2000:160;165-171.
Measured enthusiasm: does the method of reporting trials results alter perceptions of therapeutic effectiveness? Naylor C, Chen E and Strauss B. American College of Physicians 1992:117(11);916-21. ABSTRACT: OBJECTIVE: To compare clinicians' ratings of therapeutic effectiveness when different trial end points were presented as percent reductions in relative compared with absolute risk and as numbers of patients treated to avoid one adverse outcome. DESIGN: Survey, with random allocation of two questionnaires. SETTING: Toronto teaching hospitals. RESPONDENTS: Convenience sample of 100 faculty and housestaff in internal medicine and family medicine. INTERVENTION: One questionnaire presented results for three end points of the Helsinki Heart Study as separate drug trials using only absolute differences in events; the other showed the same end points as relative differences. Both questionnaires included a fourth "trial," showing person-years of treatment needed to prevent one myocardial infarction. MAIN OUTCOME MEASURE: The "trials" were each rated on an 11-point scale, from treatment "harmful" to "very effective." RESULTS: Respondents' ratings of effectiveness varied with the end point. Controlling for end point, ratings of effectiveness by the 50 participants receiving absolute event data were lower than those by 50 participants responding to relative risk reductions (P < 0.001); however, no end-point difference was more than 0.6 scale points. For a "trial" reporting that 77 persons were treated for 5 years to prevent one myocardial infarction, mean ratings were 2.3 or 1.8 scale points lower, respectively (both P < 0.001), than when the same data were shown as relative or absolute risk reductions. 
CONCLUSIONS: Clinicians' views of drug therapies are affected by the common use of relative risk reductions in both trial reports and advertisements, by end-point emphasis, and, above all, by underuse of summary measures that relate treatment burden to therapeutic yields in a clinically relevant manner.
What does the odds ratio estimate in a case-control study? Pearce N. Int J Epidemiol 1993:22(6);1189-92. The use of the term 'odds ratio' in reporting the findings of case-control studies is technically correct, but is often misleading. The meaning of the odds ratio estimates obtained in a case-control study differs according to whether controls are selected from person-time at risk (the study base), persons at risk (the base-population at risk at the beginning of follow-up), or survivors (the population at risk at the end of follow-up). These three methods of control selection correspond to estimating the rate ratio, risk ratio, or the odds ratio respectively, by means of calculating the odds ratio in the subjects actually studied. None of these estimation procedures depends on any rare disease assumption. Where the rare disease assumption is relevant is whether the effect which is estimated (e.g. the odds ratio) is approximately equal to some other effect measure of interest (e.g. the risk ratio or rate ratio) in the underlying study base. To avoid confusion on this issue, authors should be encouraged to not only specify the manner in which controls have been selected (e.g. by density sampling) but also the corresponding effect measure which is being estimated (e.g. the rate ratio) by the 'odds ratio' which is obtained in a case-control analysis.
Likelihood of pregnancy in individuals with idiopathic/cryptogenic epilepsy: social and biologic influences. Schupf N. Epilepsia 1994:35(4);750-756.
A Haircut in Horse Town: And Other Great Car Talk Puzzlers. (1999) Tom Magliozzi, Ray Magliozzi, Douglas Berman. New York NY: Berkley Publishing Group.
Confidence Intervals
Dear Professor Mean: Can you give me a simple explanation of what a confidence interval is?
We statisticians have a habit of hedging our bets. We always insert qualifiers into our reports, warn about all sorts of assumptions, and never admit to anything more extreme than probable. There's a famous saying: "Statistics means never having to say you're certain."
We qualify our statements, of course, because we are always dealing with imperfect information. In particular, we are often asked to make statements about a population (a large group of subjects) using information from a sample (a small, but carefully selected subset of this population). No matter how carefully this sample is selected to be a fair and unbiased representation of the population, relying on information from a sample will always lead to some level of uncertainty.
Short Explanation
A confidence interval is a range of values that tries to quantify this uncertainty. Consider it as a range of plausible values. A narrow confidence interval implies high precision; we can specify plausible values to within a tiny range. A wide interval implies poor precision; we can only specify plausible values to a broad and uninformative range.
Consider a recent study of homoeopathic treatment of pain and swelling after oral surgery (Lokken 1995). When examining swelling 3 days after the operation, they showed that homoeopathy led to 1 mm less swelling on average. The 95% confidence interval, however, ranged from -5.5 to 7.5 mm. From what little I know about oral surgery, this appears to be a very wide interval. This interval implies that neither a large improvement due to homoeopathy nor a large decrement could be ruled out.
Generally when a confidence interval is very wide like this one, it is an indication of an inadequate sample size, an issue that the authors mention in the discussion section of this paper.
How to Interpret a Confidence Interval
When you see a confidence interval in a published medical report, you should look for two things. First, does the interval contain a value that implies no change or no effect? For example, with a confidence interval for a difference look to see whether that interval includes zero. With a confidence interval for a ratio, look to see whether that interval contains one.
Here's an example of a confidence interval that contains the null value. The interval shown below implies no statistically significant change.
Here's an example of a confidence interval that excludes the null value. If we assume that larger implies better, then the interval shown below would imply a statistically significant improvement.
Here's a different example of a confidence interval that excludes the null value. The interval shown below implies a statistically significant decline.
Practical Significance
You should also see whether the confidence interval lies partly or entirely within a range of clinical indifference. Clinical indifference represents values of such a trivial size that you would not want to change your current practice. For example, you would not recommend a special diet that showed a one year weight loss of only five pounds. You would not order a diagnostic test that had a predictive value of less than 50%.
Clinical indifference is a medical judgement, and not a statistical judgement. It depends on your knowledge of the range of possible treatments, their costs, and their side effects. As a statistician, I can only speculate on what a range of clinical indifference is. I do want to emphasize, however, that if a confidence interval is contained entirely within your range of clinical indifference, then you have clear and convincing evidence to keep doing things the same way (see below).
On the other hand, if part of the confidence interval lies outside the range of clinical indifference, then you should consider the possibility that the sample size is too small (see below).
Some studies have sample sizes that are so large that even trivial differences are declared statistically significant. If your confidence interval excludes the null value but still lies entirely within the range of clinical indifference, then you have a result with statistical significance, but no practical significance (see below).
Finally, if your confidence interval excludes the null value and lies outside the range of clinical indifference, then you have both statistical and practical significance (see below).
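The four situations just described can be written down as a small decision rule. What follows is only a sketch of my reading of them. The function name and the wording of the labels are mine, and it assumes that the null value sits inside the range of clinical indifference. The last branch covers a case the discussion above does not spell out: an interval that excludes the null value but straddles the edge of the indifference range.

```python
def interpret_interval(lower, upper, null_value, indiff_low, indiff_high):
    """Classify a confidence interval against a null value and a range of
    clinical indifference [indiff_low, indiff_high] chosen by a clinician."""
    contains_null = lower <= null_value <= upper
    entirely_inside = indiff_low <= lower and upper <= indiff_high
    # Assumes the null value lies inside the indifference range, so an interval
    # entirely outside that range necessarily excludes the null value.
    entirely_outside = upper < indiff_low or lower > indiff_high

    if entirely_inside and contains_null:
        return "no meaningful effect: keep doing things the same way"
    if entirely_inside:
        return "statistically significant but not practically significant"
    if entirely_outside:
        return "statistically and practically significant"
    if contains_null:
        return "inconclusive: the sample size may be too small"
    # Not covered explicitly in the text: excludes the null, straddles the boundary.
    return "statistically significant; practical significance uncertain"


# Homoeopathy example: interval of -5.5 to 7.5 mm, with a hypothetical
# indifference range of plus or minus 2 mm (my assumption, not the authors').
print(interpret_interval(-5.5, 7.5, 0, -2, 2))
```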
The Standard Error
In many situations, the width of a confidence interval is proportional to the standard error. The standard error measures the variability of a statistical estimate. You can compute a crude confidence interval by taking the estimate plus or minus twice the standard error.
Confidence Interval for a Simple Average
There are lots of different formulas for the confidence interval and the standard error, depending on the context of the problem. The simplest formula appears when you estimate an average from a single sample. In this situation, the standard error would be
$$\mathrm{SE} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ (sigma) represents the variability of the original data and $n$ represents the size of the sample. The crude confidence interval would be the sample mean plus or minus two standard errors.
The width of your confidence interval goes down as the sample size goes up, since you are placing a larger value in the denominator. This is a classic and intuitive relationship in statistics: larger sample sizes provide greater precision (that is, narrower confidence intervals).
One way of planning a sample size for your study is to try to make sure your confidence interval has an adequate amount of precision. Although larger sample sizes mean narrower confidence intervals, there is usually a point of diminishing returns. This occurs when further shrinking of the interval is not worth the cost of additional subjects.
An often overlooked strategy for gaining precision is by finding a way to shrink sigma, the variability in your original data set. For example, use of calibration and quality control checks in a laboratory can often provide substantially smaller values for sigma.
Confidence Interval for a Difference Between Two Averages
If we were interested in estimating the difference in averages between two independent samples of data, the standard error of the estimated difference would be
$$\mathrm{SE} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where the subscripts 1 and 2 indicate whether the values come from the first or the second group. Notice that the standard error, and hence the width of the confidence interval, goes down as either or both sample sizes go up.
When you are planning a research study comparing two groups, it is often helpful to consider different allocations of samples to the two groups. For example, if your first group is much more variable than the second group, you might be better off trying for a larger sample size in that group, rather than trying to get equal numbers in each group.
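Here is a small sketch of that allocation idea. It assumes the usual standard error for a difference between two independent averages, the square root of sigma1 squared over n1 plus sigma2 squared over n2. The standard deviations (20 and 10) and the total of 120 subjects are hypothetical numbers chosen only for illustration.

```python
import math


def se_difference(sigma_1, n_1, sigma_2, n_2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sigma_1**2 / n_1 + sigma_2**2 / n_2)


# Hypothetical planning values: group 1 is twice as variable as group 2.
sigma_1, sigma_2, total_subjects = 20, 10, 120

for n_1 in (40, 60, 80, 100):
    n_2 = total_subjects - n_1
    se = se_difference(sigma_1, n_1, sigma_2, n_2)
    print(f"n1 = {n_1:3d}, n2 = {n_2:3d}: standard error {se:.2f}")
```

With these made-up numbers, putting 80 subjects in the more variable group and 40 in the other gives a smaller standard error than the equal split of 60 and 60.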
Confidence Interval for a Proportion
If we compute a proportion, p, from a sample, the standard error of that proportion would be
$$\mathrm{SE} = \sqrt{\frac{p(1-p)}{n}}$$
Just like the previous examples, larger sample sizes lead to smaller standard errors and narrower confidence intervals.
Did you notice in this formula that the width of the confidence interval is related to the estimate itself? A bit of work with calculus will show you that, assuming the sample size stays the same, the widest confidence interval occurs when p=0.5. Both rarer and more frequent events than 50% will produce narrower intervals.
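If you would rather skip the calculus, a few lines of Python show the same thing numerically. This sketch assumes the usual standard error for a proportion, the square root of p(1-p)/n, and an arbitrary sample size of 100 that I picked for illustration.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)


sample_size = 100  # hypothetical; any fixed n shows the same pattern
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}: standard error {se_proportion(p, sample_size):.3f}")
```

The standard error peaks at p=0.5 and is symmetric around it, so p=0.1 and p=0.9 give the same value.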
Confidence Interval for an Odds Ratio
The final example involves computing an odds ratio. We often use the odds ratio to summarize data in a two by two table. The rows of the table might represent disease status (healthy/diseased) and the columns might represent exposure status (exposed/unexposed). In this case, the odds ratio would represent the relative change in the odds of disease between exposed and unexposed patients.
Or possibly the rows might represent treatment status (active drug/placebo) and the columns might represent health outcome (improvement/no improvement). Here, the odds ratio represents the relative change in the odds of improvement between drug and placebo.
If we let the letters a, b, c, and d represent the frequency counts in a two by two table
a   b
c   d
then the odds ratio would be ad/bc. The odds ratio is skewed, so we cannot easily compute a standard error for the odds ratio itself. We can, however, find a standard error for the natural logarithm of the odds ratio. It is simply
$$\mathrm{SE}(\ln \mathrm{OR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
We see that as any or all of the counts in the two by two table increase, the confidence interval for the log odds ratio shrinks. Also, it turns out that the smallest count in the two by two table plays the largest role in determining the size of the standard error.
Example of a Confidence Interval For a Mean
In a study of immunotherapy in children with asthma, 61 patients showed an average improvement of 2.5% in peak expiratory flow rate with a standard deviation of 11%. We divide the standard deviation by the square root of 61 to get a standard error of 1.4. A crude confidence interval would be 2.5% plus or minus 2.8%, which equals -0.3% to 5.3%. I'm not an expert on asthma, but if we defined a range of clinical indifference to be an improvement of less than 5%, then this confidence interval lies almost entirely within the range of clinical indifference.
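The arithmetic for this example is short enough to check by machine. This is a minimal sketch of the crude "estimate plus or minus two standard errors" interval, with a function name of my own choosing and the numbers taken from the asthma example.

```python
import math


def crude_ci_mean(mean, standard_deviation, n):
    """Return (standard_error, lower, upper) using mean +/- 2 standard errors."""
    standard_error = standard_deviation / math.sqrt(n)
    return standard_error, mean - 2 * standard_error, mean + 2 * standard_error


se, lower, upper = crude_ci_mean(mean=2.5, standard_deviation=11, n=61)
print(f"standard error = {se:.1f}")
print(f"crude interval = {lower:.1f}% to {upper:.1f}%")
```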
Example of a Confidence Interval for An Odds Ratio
In the same study, the author noted that 15 out of 53 immunotherapy patients showed partial remission on their need for medication. This sample size is smaller because of a small number of dropouts. In the placebo group, 12 out of 57 showed partial remission. The two by two table for these data looks like
                Partial remission   No remission   Total
Immunotherapy   15                  38             53
Placebo         12                  45             57
Total           27                  83             110
The odds ratio is 1.5, which shows that the immunotherapy treatment increases the odds of partial remission. The natural log of the odds ratio is 0.4. For this calculation, be sure that you use a natural logarithm and not a base 10 logarithm.
The standard error of the log odds ratio is
$$\sqrt{\frac{1}{15} + \frac{1}{38} + \frac{1}{12} + \frac{1}{45}} \approx 0.45$$
So a crude confidence interval for the log odds ratio is 0.4 plus or minus 0.9 which equals -0.5 to 1.3. We can exponentiate (use the exp button on your scientific calculator) to convert back to the original measurement scale. This gives us a confidence interval of 0.6 to 3.6 for the odds ratio itself. Even though this interval contains 1, we still have to allow for the possibility that the improvement might be as large as two-fold or three-fold.
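Here is the whole odds ratio calculation as a short sketch, assuming the usual standard error for the log odds ratio, the square root of 1/a + 1/b + 1/c + 1/d. The function name is mine. The counts are 15 and 38 (remission and no remission on immunotherapy) and 12 and 45 (placebo).

```python
import math


def crude_ci_odds_ratio(a, b, c, d):
    """Return (odds_ratio, log_odds_ratio, standard_error, lower, upper)
    using log odds ratio +/- 2 standard errors, then exponentiating."""
    odds_ratio = (a * d) / (b * c)
    log_odds_ratio = math.log(odds_ratio)  # natural logarithm, not base 10
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(log_odds_ratio - 2 * standard_error)
    upper = math.exp(log_odds_ratio + 2 * standard_error)
    return odds_ratio, log_odds_ratio, standard_error, lower, upper


odds_ratio, log_or, se_log_or, ci_lower, ci_upper = crude_ci_odds_ratio(15, 38, 12, 45)
print(f"odds ratio = {odds_ratio:.1f}, log odds ratio = {log_or:.1f}")
print(f"standard error of log odds ratio = {se_log_or:.2f}")
print(f"crude interval for the odds ratio = {ci_lower:.1f} to {ci_upper:.1f}")
```

Notice that the interval is not symmetric around the odds ratio. It is symmetric on the log scale and becomes lopsided when you exponentiate.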
Summary
A confidence interval is a range of plausible values that accounts for uncertainty in a statistical estimate. A narrow confidence interval implies high precision; a wide interval implies poor precision.
When you see a confidence interval in a published medical report, you should look for two things.
- Does the interval contain a value that implies no change or no effect?
- Does the confidence interval lie partly or entirely within a range of clinical indifference?
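These two questions can be written down as a small check. The sketch below is only an illustration: the function name `assess_interval`, the example intervals, and the symmetric range of clinical indifference (-5 to 5) are all hypothetical choices, and where to draw the range of clinical indifference is a clinical judgment rather than a statistical one.

```python
def assess_interval(lower, upper, null_value, indiff_low, indiff_high):
    """Answer the two questions for a confidence interval (lower, upper).

    Returns a pair: whether the interval contains the no-effect value, and
    whether it lies 'entirely', 'partly', or 'not at all' inside the range
    of clinical indifference.
    """
    contains_null = lower <= null_value <= upper
    if indiff_low <= lower and upper <= indiff_high:
        overlap = "entirely"
    elif upper < indiff_low or lower > indiff_high:
        overlap = "not at all"
    else:
        overlap = "partly"
    return contains_null, overlap

# Hypothetical intervals for a difference in means (no effect = 0).
first = assess_interval(1.0, 4.0, null_value=0.0, indiff_low=-5.0, indiff_high=5.0)
second = assess_interval(-2.0, 8.0, null_value=0.0, indiff_low=-5.0, indiff_high=5.0)
third = assess_interval(6.0, 9.0, null_value=0.0, indiff_low=-5.0, indiff_high=5.0)
print(first, second, third)
```

For a ratio measure such as an odds ratio, the no-effect value would be 1 rather than 0.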
Further Reading
General references
Where's the Evidence? Debates in Modern Medicine. William A. Silverman (1998) New York: Oxford University Press.
Exposing Flawed Science. Rick Groleau, Nova. Accessed on 2003-04-30. "Science is a human endeavor, subject to human imperfections. At one end of a spectrum covering what would generally be considered poor science we have those who intentionally deceive. At the other end are those who have the best of intentions but, for some reason, produce flawed results. Somewhere in the middle are those who have some knowledge of the topic they are investigating, but not enough to produce results that will stand up to scrutiny." www.pbs.org/wgbh/nova/holocaust/pseudoscience.html
Evidence-based medicine and treatment choices. D. L. Sackett. Lancet 1997: 349(9051); 570; discussion 572-3.
Design and analysis of prostate cancer trials. R. Sylvester. Acta Urologica Belgica 1994: 62(1); 23-29. ABSTRACT: This paper presents an overview of various statistical concepts related to the design and analysis of prostate cancer trials: the need for randomization, stratification for prognostic factors, sample size determination, trial objectives, the choice of a control group, patient entry criteria, the number of treatments to be compared, the choice of endpoints, analysis by the intent to treat principle, interim statistical analysis and early stopping rule, and subgroup analyses.
Content and quality of 2000 controlled trials in schizophrenia over 50 years. Ben Thornley, C Adams. British Medical Journal 1998: 317(7167); 1181-1184. ABSTRACT: OBJECTIVE: To provide a comprehensive survey of the content and quality of intervention studies relevant to the treatment of schizophrenia. DESIGN: Data were extracted from 2000 trials on the Cochrane Schizophrenia Group's register. MAIN OUTCOME MEASURES: Type and date of publication, country of origin, language, size of study, treatment setting, participant group, interventions, outcomes, and quality of study. RESULTS: Hospital based drug trials undertaken in the United States were dominant in the sample (54%). Generally, studies were short (54% <6 weeks), small (mean number of patients 65), and poorly reported (64% had a quality score of <=2 (maximum score 5)). Over 600 different interventions were studied in these trials, and 640 different rating scales were used to measure outcome. CONCLUSIONS: Half a century of studies of limited quality, duration, and clinical utility leave much scope for well planned, conducted, and reported trials. The drug regulatory authorities should stipulate that the results of both explanatory and pragmatic trials are necessary before a compound is given a licence for everyday use.
Biological mechanisms
Unconventional therapies for cancer: a refuge from the rules of evidence? I. F. Tannock, D. G. Warr. CMAJ 1998: 159(7); 801-2.
Biologic plausibility in causal inference: current method and practice. D. L. Weed, S. D. Hursting. Am J Epidemiol 1998: 147(5); 415-25.
Proof versus plausibility: rules of engagement for the struggle to evaluate alternative cancer therapies. L. J. Hoffer. CMAJ 2001: 164(3); 351-3.
Blinding and concealed allocation lists
How study design affects outcomes in comparisons of therapy. I: Medical. GA Colditz, JN Miller, F. Mosteller. Stat Med 1989: 8(4); 441-454. ABSTRACT: We analysed 113 reports published in 1980 in a sample of medical journals to relate features of study design to the magnitude of gains attributed to new therapies over old. Overall we rated 87 per cent of new therapies as improvements over standard therapies. The mean gain (measured by the Mann-Whitney statistic) was relatively constant across study designs, except for non-randomized controlled trials with sequential assignment to therapy, which showed a significantly higher likelihood that a patient would do better on the innovation than on standard therapy (p = 0.004). Randomized controlled trials that did not use a double-blind design had a higher likelihood of showing a gain for the innovation than did double-blind trials (p = 0.02). Any evaluation of an innovation may include both bias and the true efficacy of the new therapy, therefore we may consider making adjustments for the average bias associated with a study design. When interpreting an evaluation of a new therapy, readers should consider the impact of the following average adjustments to the Mann-Whitney statistic: for trials with non-random sequential assignment a decrease of 0.15, for non-double-blind randomized controlled trials a decrease of 0.11.
Randomised trials, human nature, and reporting guidelines. K. F. Schulz. Lancet 1996: 348(9027); 596-8.
Empirical evidence of bias dimensions of methodological quality associated with estimates of treatment effects in controlled trials. KF Schulz, I Chalmers, RJ Hayes, DG Altman. JAMA 1995: 273(5); 408-12. ABSTRACT: OBJECTIVE--To determine if inadequate approaches to randomized controlled trial design and execution are associated with evidence of bias in estimating treatment effects. DESIGN--An observational study in which we assessed the methodological quality of 250 controlled trials from 33 meta-analyses and then analyzed, using multiple logistic regression models, the associations between those assessments and estimated treatment effects. DATA SOURCES--Meta-analyses from the Cochrane Pregnancy and Childbirth Database. MAIN OUTCOME MEASURES--The associations between estimates of treatment effects and inadequate allocation concealment, exclusions after randomization, and lack of double-blinding. RESULTS--Compared with trials in which authors reported adequately concealed treatment allocation, trials in which concealment was either inadequate or unclear (did not report or incompletely reported a concealment approach) yielded larger estimates of treatment effects (P < .001). Odds ratios were exaggerated by 41% for inadequately concealed trials and by 30% for unclearly concealed trials (adjusted for other aspects of quality). Trials in which participants had been excluded after randomization did not yield larger estimates of effects, but that lack of association may be due to incomplete reporting. Trials that were not double-blind also yielded larger estimates of effects (P = .01), with odds ratios being exaggerated by 17%. CONCLUSIONS--This study provides empirical evidence that inadequate methodological approaches in controlled trials, particularly those representing poor allocation concealment, are associated with bias. Readers of trial reports should be wary of these pitfalls, and investigators must improve their design, execution, and reporting of trials.
A comparison of active and simulated chiropractic manipulation as adjunctive treatment for childhood asthma. J. Balon, P. D. Aker, E. R. Crowther, C. Danielson, P. G. Cox, D. O'Shaughnessy, C. Walker, C. H. Goldsmith, E. Duku, M. R. Sears. New England Journal of Medicine 1998: 339(15); 1013-20. BACKGROUND: Chiropractic spinal manipulation has been reported to be of benefit in nonmusculoskeletal conditions, including asthma. METHODS: We conducted a randomized, controlled trial of chiropractic spinal manipulation for children with mild or moderate asthma. After a three-week base-line evaluation period, 91 children who had continuing symptoms of asthma despite usual medical therapy were randomly assigned to receive either active or simulated chiropractic manipulation for four months. None had previously received chiropractic care. Each subject was treated by 1 of 11 participating chiropractors, selected by the family according to location. The primary outcome measure was the change from base line in the peak expiratory flow, measured in the morning, before the use of a bronchodilator, at two and four months. Except for the treating chiropractor and one investigator (who was not involved in assessing outcomes), all participants remained fully blinded to treatment assignment throughout the study. RESULTS: Eighty children (38 in the active-treatment group and 42 in the simulated-treatment group) had outcome data that could be evaluated. There were small increases (7 to 12 liters per minute) in peak expiratory flow in the morning and the evening in both treatment groups, with no significant differences between the groups in the degree of change from base line (morning peak expiratory flow, P=0.49 at two months and P=0.82 at four months). Symptoms of asthma and use of β-agonists decreased and the quality of life increased in both groups, with no significant differences between the groups. There were no significant changes in spirometric measurements or airway responsiveness. CONCLUSIONS: In children with mild or moderate asthma, the addition of chiropractic spinal manipulation to usual medical care provided no benefit.
Why Bogus Therapies Seem to Work. Barry L. Beyerstein. Skeptical Inquirer 1997: 21(5); At least ten kinds of errors and biases can convince intelligent, honest people that cures have been achieved when they have not.
Controlled trial of acupuncture for severe recidivist alcoholism. M. L. Bullock, P. D. Culliton, R. T. Olander. Lancet 1989: 1(8652); 1435-9. In a placebo-controlled study, 80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group). 21 of 40 patients in the treatment group completed the programme compared with 1 of 40 controls. Significant treatment effects persisted at the end of the six-month follow-up: by comparison with treatment patients more control patients expressed a moderate to strong need for alcohol, and had more than twice the number of both drinking episodes and admissions to a detoxification centre.
Physician interpretations and textbook definitions of blinding terminology in randomized controlled trials. P. J. Devereaux, B. J. Manns, W. A. Ghali, H. Quan, C. Lacchetti, V. M. Montori, M. Bhandari, G. H. Guyatt. Jama 2001: 285(15); 2000-3. CONTEXT: When clinicians assess the validity of randomized controlled trials (RCTs), they commonly evaluate the blinding status of individuals in the RCT. The terminology authors often use to convey blinding status (single, double, and triple blinding) may be open to various interpretations. OBJECTIVE: To determine physician interpretations and textbook definitions of RCT blinding terms. DESIGN AND SETTING: Observational study undertaken at 3 Canadian university tertiary care centers between February and May 1999. PARTICIPANTS: Ninety-one internal medicine physicians who responded to a survey. MAIN OUTCOME MEASURES: Respondents identified which of the following groups they thought were blinded in single-, double-, and triple-blinded RCTs: participants, health care providers, data collectors, judicial assessors of outcomes, data analysts, and personnel who write the article. Definitions from 25 systematically identified textbooks published since 1990 providing definitions for single, double, or triple blinding. RESULTS: Physician respondents identified 10, 17, and 15 unique interpretations of single, double, and triple blinding, respectively, and textbooks provided 5, 9, and 7 different definitions of each. The frequencies of the most common physician interpretation and textbook definition were 75% (95% confidence interval [CI], 65%-83%) and 74% (95% CI, 52%-90%) for single blinding, 38% (95% CI, 28%-49%) and 43% (95% CI, 24%-63%) for double blinding, and 18% (95% CI, 10%-28%) and 14% (95% CI, 0%-58%) for triple blinding, respectively. CONCLUSIONS: Our study suggests that both physicians and textbooks vary greatly in their interpretations and definitions of single, double, and triple blinding. 
Explicit statements about the blinding status of specific groups involved in RCTs should replace the current ambiguous terminology.
"Double blind, you are the weakest link- good-bye!" P.J. Devereaux, M. Bhandari, V. M. Montori, B.J. Manns, W.A. Ghali, G. H. Guyatt. ACP Journal Club 2002: 136A11-A12. Abstract not available.
Removing bias in surgical trials. A. G. Johnson, J. M. Dixon. British Medical Journal 1997: 314(7085); 916-7.
An addition to the controversy on sunlight exposure and melanoma risk: a meta-analytical approach. P. J. Nelemans, F. H. Rampen, D. J. Ruiter, A. L. Verbeek. J Clin Epidemiol 1995: 48(11); 1331-42. Case control studies on the association between sunlight exposure and melanoma risk show considerable differences in design; this could be responsible for the variation in study results. In an attempt to resolve the controversy between study results, the results of 25 publications on case control studies were evaluated using meta-analytical techniques. Comparison of odds ratios between subgroups of studies revealed that the range of odds ratios was far greater for hospital-based studies than for population-based studies. For the latter type of studies, the odds ratios were homogeneous and the pooled odds ratios were 1.57 (95% confidence interval [CI], 1.29-1.91) for intermittent sunlight exposure and 0.73 (95% CI, 0.60-0.89) for chronic exposure. However, among other problems, the lack of standardized measures for sunlight exposure warrants cautious interpretation of these results. It is concluded that evidence to support the intermittent sunlight theory is still far from complete.
The impact of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial. J. H. Noseworthy, G. C. Ebers, M. K. Vandervoort, R. E. Farquhar, E. Yetisir, R. Roberts. Neurology 1994: 44(1); p16-20. In the randomized, placebo-controlled, physician-blinded Canadian cooperative trial of cyclophosphamide and plasma exchange, neither active treatment regimens (group I: i.v. cyclophosphamide and prednisone; group II: weekly plasma exchange, oral cyclophosphamide, and prednisone) were superior to placebo (group III: sham plasma exchange and placebo medications) using the blinded, evaluating neurologists' assessments of disease course (primary analysis). All patients were examined by both a blinded and an unblinded neurologist at each assessment in this trial. We compared the blinded and unblinded neurologists' judgment of treatment response and analyzed the clinical behavior of patients who correctly guessed their treatment. The unblinded (but not the blinded) neurologists' scores demonstrated an apparent treatment benefit at 6, 12, and 24 months for the group II patients (not group I or placebo; p < 0.05, two-tailed). There were no significant differences in the time to treatment failure or in the proportions of patients improved, stable, or worse between the group II and group III patients who correctly guessed their treatment assignments and those who did not. Physician blinding prevented an erroneous conclusion about treatment efficacy (false positive, type 1 error).
Inconsistencies and Errors in Alternative Medicine Research. W Sampson. Skeptical Inquirer 1997: 21(5); 35-38.
The Landscape and Lexicon of Blinding in Randomized Trials. K.F. Schulz, I. Chalmers, D.G. Altman. Annals of Internal Medicine 2002: 136(3); 254-259.
Blinding in randomised trials: hiding who got what. K. F. Schulz, D.A. Grimes. Lancet 2002: 359; 696-700. Blinding embodies a rich history spanning over two centuries. Most researchers worldwide understand blinding terminology, but confusion lurks beyond a general comprehension. Terms such as single blind, double blind, and triple blind mean different things to different people. Moreover, many medical researchers confuse blinding with allocation concealment. Such confusion indicates misunderstandings of both. The term blinding refers to keeping trial participants, investigators (usually health-care providers), or assessors (those collecting outcome data) unaware of the assigned intervention, so that they will not be influenced by that knowledge. Blinding usually reduces differential assessment of outcomes (information bias), but can also improve compliance and retention of trial participants while reducing biased supplemental care or treatment (sometimes called co-intervention). Many investigators and readers naively consider a randomised trial as high quality simply because it is double blind, as if double-blinding is the sine qua non of a randomised controlled trial. Although double blinding (blinding investigators, participants, and outcome assessors) indicates a strong design, trials that are not double blinded should not automatically be deemed inferior. Rather than solely relying on terminology like double blinding, researchers should explicitly state who was blinded, and how. We recommend placing greater credence in results when investigators at least blind outcome assessments, except with objective outcomes, such as death, which leave little room for bias. If investigators properly report their blinding efforts, readers can judge them. Unfortunately, many articles do not contain proper reporting. If an article claims blinding without any accompanying clarification, readers should remain sceptical about its effect on bias reduction.
Assessing Allocation Concealment and Blinding in Randomised Controlled Trials: Why bother? KF Schulz. Evid Based Nurs 2001: 4; 4-6.
Empirical evidence of design-related bias in studies of diagnostic tests. JG Lijmer, BW Mol, S Heisterkamp, GJ Bonsel, MH Prins, JH van der Meulen, PM Bossuyt. JAMA 1999: 282(11); 1061-1066. ABSTRACT: CONTEXT: The literature contains a large number of potential biases in the evaluation of diagnostic tests. Strict application of appropriate methodological criteria would invalidate the clinical application of most study results. OBJECTIVE: To empirically determine the quantitative effect of study design shortcomings on estimates of diagnostic accuracy. DESIGN AND SETTING: Observational study of the methodological features of 184 original studies evaluating 218 diagnostic tests. Meta-analyses on diagnostic tests were identified through a systematic search of the literature using MEDLINE, EMBASE, and DARE databases and the Cochrane Library (1996-1997). Associations between study characteristics and estimates of diagnostic accuracy were evaluated with a regression model. MAIN OUTCOME MEASURES: Relative diagnostic odds ratio (RDOR), which compared the diagnostic odds ratios of studies of a given test that lacked a particular methodological feature with those without the corresponding shortcomings in design. RESULTS: Fifteen (6.8%) of 218 evaluations met all 8 criteria; 64 (30%) met 6 or more. Studies evaluating tests in a diseased population and a separate control group overestimated the diagnostic performance compared with studies that used a clinical population (RDOR, 3.0; 95% confidence interval [CI], 2.0-4.5). Studies in which different reference tests were used for positive and negative results of the test under study overestimated the diagnostic performance compared with studies using a single reference test for all patients (RDOR, 2.2; 95% CI, 1.5-3.3). 
Diagnostic performance was also overestimated when the reference test was interpreted with knowledge of the test result (RDOR, 1.3; 95% CI, 1.0-1.9), when no criteria for the test were described (RDOR, 1.7; 95% CI, 1.1-2.5), and when no description of the population under study was provided (RDOR, 1.4; 95% CI, 1.1-1.7). CONCLUSION: These data provide empirical evidence that diagnostic studies with methodological shortcomings may overestimate the accuracy of a diagnostic test, particularly those including nonrepresentative patients or applying different reference standards.
Bias in treatment assignment in controlled clinical trials. TC Chalmers, P Celano, HS Sacks, H Jr Smith. N Engl J Med 1983: 309(22); 1358-61. ABSTRACT: Controlled clinical trials of the treatment of acute myocardial infarction offer a unique opportunity for the study of the potential influence on outcome of bias in treatment assignment. A group of 145 papers was divided into those in which the randomization process was blinded (57 papers), those in which it may have been unblinded (45 papers), and those in which the controls were selected by a nonrandom process (43 papers). At least one prognostic variable was maldistributed (P less than 0.05) in 14.0 per cent of the blinded-randomization studies, in 26.7 per cent of the unblinded-randomization studies, and in 58.1 per cent of the nonrandomized studies. Differences in case-fatality rates between treatment and control groups (P less than 0.05) were found in 8.8 per cent of the blinded-randomization studies, 24.4 per cent of the unblinded-randomization studies, and 58.1 per cent of the nonrandomized studies. These data emphasize the importance of keeping those who recruit patients for clinical trials from suspecting which treatment will be assigned to the patient under consideration.
Allocation concealment in randomised trials: defending against deciphering. K. F. Schulz, D.A. Grimes. Lancet 2002: 359; 614-618. Proper randomisation rests on adequate allocation concealment. An allocation concealment process keeps clinicians and participants unaware of upcoming assignments. Without it, even properly developed random allocation sequences can be subverted. Within this concealment process, the crucial unbiased nature of randomised controlled trials collides with their most vexing implementation problems. Proper allocation concealment frequently frustrates clinical inclinations, which annoys those who do the trials. Randomised controlled trials are anathema to clinicians. Many involved with trials will be tempted to decipher assignments, which subverts randomisation. For some implementing a trial, deciphering the allocation scheme might frequently become too great an intellectual challenge to resist. Whether their motives indicate innocent or pernicious intents, such tampering undermines the validity of a trial. Indeed, inadequate allocation concealment leads to exaggerated estimates of treatment effect, on average, but with scope for bias in either direction. Trial investigators will be crafty in any potential efforts to decipher the allocation sequence, so trial designers must be just as clever in their design efforts to prevent deciphering. Investigators must effectively immunise trials against selection and confounding biases with proper allocation concealment. Furthermore, investigators should report baseline comparisons on important prognostic variables. Hypothesis tests of baseline characteristics, however, are superfluous and could be harmful if they lead investigators to suppress reporting any baseline imbalances.
Generation of allocation sequences in randomised trials: chance not choice. K. F. Schulz, D.A. Grimes. Lancet 2002: 359; 515-519. The randomised controlled trial sets the gold standard of clinical research. However, randomisation persists as perhaps the least-understood aspect of a trial. Moreover, anything short of proper randomisation courts selection and confounding biases. Researchers should spurn all systematic, non-random methods of allocation. Trial participants should be assigned to comparison groups based on a random process. Simple (unrestricted) randomisation, analogous to repeated fair coin-tossing, is the most basic of sequence generation approaches. Furthermore, no other approach, irrespective of its complexity and sophistication, surpasses simple randomisation for prevention of bias. Investigators should, therefore, use this method more often than they do, and readers should expect and accept disparities in group sizes. Several other complicated restricted randomisation procedures limit the likelihood of undesirable sample size imbalances in the intervention groups. The most frequently used restricted sequence generation procedure is blocked randomisation. If this method is used, investigators should randomly vary the block sizes and use larger block sizes, particularly in an unblinded trial. Other restricted procedures, such as urn randomisation, combine beneficial attributes of simple and restricted randomisation by preserving most of the unpredictability while achieving some balance. The effectiveness of stratified randomisation depends on use of a restricted randomisation approach to balance the allocation sequences for each stratum. Generation of a proper randomisation sequence takes little time and effort but affords big rewards in scientific accuracy and credibility. Investigators should devote appropriate resources to the generation of properly randomised trials and reporting their methods clearly.
Case-control designs
A case-control study of HIV seroconversion in health care workers after percutaneous exposure. Centers for Disease Control and Prevention Needlestick Surveillance Group. D. M. Cardo, D. H. Culver, C. A. Ciesielski, P. U. Srivastava, R. Marcus, D. Abiteboul, J. Heptonstall, G. Ippolito, F. Lot, P. S. McKibben, D. M. Bell. N Engl J Med 1997: 337(21); 1485-90. BACKGROUND: The average risk of human immunodeficiency virus (HIV) infection after percutaneous exposure to HIV-infected blood is 0.3 percent, but the factors that influence this risk are not well understood. METHODS: We conducted a case-control study of health care workers with occupational, percutaneous exposure to HIV-infected blood. The case patients were those who became seropositive after exposure to HIV, as reported by national surveillance systems in France, Italy, the United Kingdom, and the United States. The controls were health care workers in a prospective surveillance project who were exposed to HIV but did not seroconvert. RESULTS: Logistic-regression analysis based on 33 case patients and 665 controls showed that significant risk factors for seroconversion were deep injury (odds ratio= 15; 95 percent confidence interval, 6.0 to 41), injury with a device that was visibly contaminated with the source patient's blood (odds ratio= 6.2; 95 percent confidence interval, 2.2 to 21), a procedure involving a needle placed in the source patient's artery or vein (odds ratio=4.3; 95 percent confidence interval, 1.7 to 12), and exposure to a source patient who died of the acquired immunodeficiency syndrome within two months afterward (odds ratio=5.6; 95 percent confidence interval, 2.0 to 16). The case patients were significantly less likely than the controls to have taken zidovudine after the exposure (odds ratio=0.19; 95 percent confidence interval, 0.06 to 0.52). 
CONCLUSIONS: The risk of HIV infection after percutaneous exposure increases with a larger volume of blood and, probably, a higher titer of HIV in the source patient's blood. Postexposure prophylaxis with zidovudine appears to be protective.
Obstetric care and proneness of offspring to suicide as adults: case-control study. Bertil Jacobson, Marc Bygdeman. British Medical Journal 1998: 317(7169); 1346-1349. ABSTRACT: OBJECTIVE: To investigate any long term effects of traumatic birth and obstetric procedures in relation to suicide by violent means in offspring as adults. DESIGN: Prospective case-control study. SETTING: Stockholm, Sweden. SUBJECTS: 242 adults who committed suicide by violent means from 1978 to 1995, and who were born in one of seven hospitals in Stockholm during 1945-80, matched with 403 biological siblings born during the same period and at the same group of hospitals. MAIN OUTCOME MEASURES: Adverse and beneficial perinatal factors expressed as relative risks (odds ratios) and 95% confidence intervals, derived from logistic regression of cases matched with their siblings. RESULTS: For multiple birth trauma the estimated relative risks of offspring subsequently committing suicide by violent means were 4.9 (95% confidence interval 1.8 to 13) for men and 1.04 (0.2 to 4.6) for women. In mothers who received multiple opiate treatment during delivery, the estimated relative risk of offspring subsequently committing suicide was equal for both sexes (0.26, 0.09 to 0.69). CONCLUSION: Minimising pain and discomfort to the infant during birth seems to be of importance in reducing the risk of committing suicide by violent means as an adult.
Risk of testicular cancer in subfertile men: case-control study. H. Moller, N. E. Skakkebaek. British Medical Journal 1999: 318(7183); 559-62. OBJECTIVE: To evaluate the association between subfertility in men and the subsequent risk of testicular cancer. DESIGN: Population based case-control study. SETTING: The Danish population. PARTICIPANTS: Cases were identified in the Danish Cancer Registry; controls were randomly selected from the Danish population with the computerised Danish Central Population Register. Men were interviewed by telephone; 514 men with cancer and 720 controls participated. OUTCOME MEASURE: Occurrence of testicular cancer. RESULTS: A reduced risk of testicular cancer was associated with paternity (relative risk 0.63; 95% confidence interval 0.47 to 0.85). In men who before the diagnosis of testicular cancer had a lower number of children than expected on the basis of their age, the relative risk was 1.98 (1.43 to 2.75). There was no corresponding protective effect associated with a higher number of children than expected. The associations were similar for seminoma and non-seminoma and were not influenced by adjustment for potential confounding factors. CONCLUSION: These data are consistent with the hypothesis that male subfertility and testicular cancer share important aetiological factors.
Testicular cancer risk in relation to use of disposable nappies. H. Moller. Arch Dis Child 2002: 86(1); 28-9. Information on the use of disposable nappies in childhood was available for 296 testicular cancer cases and 287 population controls in Denmark. No association was found between disposable nappy use and the subsequent risk of testicular cancer in adulthood.
Are risk factors for sudden infant death syndrome different at night? S. M. Williams, E. A. Mitchell, B. J. Taylor. Arch Dis Child 2002: 87(4); 274-8. AIMS: To determine whether the risk factors for SIDS occurring at night were different from those occurring during the day. METHODS: Large, nationwide case-control study, with data for 369 cases and 1558 controls in New Zealand. RESULTS: Two thirds of SIDS deaths occurred at night (between 10 pm and 7 30 am). The odds ratio (95% CI) for prone sleep position was 3.86 (2.67 to 5.59) for deaths occurring at night and 7.25 (4.52 to 11.63) for deaths occurring during the day; the difference was significant. The odds ratio for maternal smoking for deaths occurring at night was 2.28 (1.52 to 3.42) and that for the day 1.27 (0.79 to 2.03); that for the mother being single was 2.69 (1.29 to 3.99) for a night time death and 1.25 (0.76 to 2.04) for a daytime death. Both interactions were significant. The interactions between time of death and bed sharing, not sleeping in a cot or bassinet, Maori ethnicity, late timing of antenatal care, binge drinking, cannabis use, and illness in the baby were also significant, or almost so. All were more strongly associated with SIDS occurring at night. CONCLUSIONS: Prone sleep position was more strongly associated with SIDS occurring during the day, whereas night time deaths were more strongly associated with maternal smoking and measures of social deprivation.
Reye's syndrome in the United States from 1981 through 1997. E. D. Belay, J. S. Bresee, R. C. Holman, A. S. Khan, A. Shahriari, L. B. Schonberger. New England Journal of Medicine 1999: 340(18); 1377-82. BACKGROUND: Reye's syndrome is characterized by encephalopathy and fatty degeneration of the liver, usually after influenza or varicella. Beginning in 1980, warnings were issued about the use of salicylates in children with those viral infections because of the risk of Reye's syndrome. METHODS: To describe the pattern of Reye's syndrome in the United States, characteristics of the patients, and risk factors for poor outcomes, we analyzed national surveillance data collected from December 1980 through November 1997. The surveillance system is based on voluntary reporting with the use of a standard case-report form. RESULTS: From December 1980 through November 1997 (surveillance years 1981 through 1997), 1207 cases of Reye's syndrome were reported in patients less than 18 years of age. Among those for whom data on race and sex were available, 93 percent were white and 52 percent were girls. The number of reported cases of Reye's syndrome declined sharply after the association of Reye's syndrome with aspirin was reported. After a peak of 555 cases in children reported in 1980, there have been no more than 36 cases per year since 1987. Antecedent illnesses were reported in 93 percent of the children, and detectable blood salicylate levels in 82 percent. The overall case fatality rate was 31 percent. The case fatality rate was highest in children under five years of age (relative risk, 1.8; 95 percent confidence interval, 1.5 to 2.1) and in those with a serum ammonia level above 45 microg per deciliter (26 micromol per liter) (relative risk, 3.4; 95 percent confidence interval, 1.9 to 6.2). 
CONCLUSIONS: Since 1980, when the association between Reye's syndrome and the use of aspirin during varicella or influenza-like illness was first reported, there has been a sharp decline in the number of infants and children reported to have Reye's syndrome. Because Reye's syndrome is now very rare, any infant or child suspected of having this disorder should undergo extensive investigation to rule out the treatable inborn metabolic disorders that can mimic Reye's syndrome.
Reye's syndrome. M. Casteels-Van Daele, C. Van Geet, C. Wouters, E. Eggermont. Lancet 2001: 358(9278); 334. Abstract not available yet.
The disappearance of Reye's syndrome--a public health triumph. A. S. Monto. N Engl J Med 1999: 340(18); 1423-4. Abstract not available.
Hospital controls versus community controls: differences in inferences regarding risk factors for hip fracture. D. J. Moritz, J. L. Kelsey, J. A. Grisso. Am J Epidemiol 1997: 145(7); 653-60. In case-control studies using cases identified from persons admitted to hospitals, two types of controls are most often used: persons from the communities served by the hospitals and persons admitted to the same hospitals as those to which the cases were admitted. It is often unclear which is the more appropriate choice, and whether the use of one or the other type of control group will lead to biased conclusions. The purpose of the present analysis was to determine whether the choice of hospital controls versus community controls would influence conclusions regarding risk factors for hip fracture. Cases (n = 425), hospital controls (n = 312) and community controls (n = 454) were drawn from a case-control study of risk factors for hip fracture in women. Study participants were white and black women aged 45 years or older and living in New York City or Philadelphia, Pennsylvania, who were selected between September 1987 and July 1989. Using community controls but not hospital controls, investigators would have concluded that having a fall during the previous 6 months, current smoking, and moving during the previous year were associated with an increased risk of hip fracture. Associations of hip fracture risk with stroke and prior use of ambulatory aids were stronger using community controls, but associations with estrogen use and body mass index were not influenced by choice of control group. Community controls were quite similar to representative samples of community-dwelling elderly women, whereas hospital controls were somewhat sicker and more likely to be current smokers. The authors conclude that community controls comprise the more appropriate control group in case-control studies of hip fracture in the elderly.
Case-control studies: research in reverse. K. F. Schulz, D. A. Grimes. Lancet 2002: 359; 431-434. Epidemiologists benefit greatly from having case-control study designs in their research armamentarium. Case-control studies can yield important scientific findings with relatively little time, money, and effort compared with other study designs. This seemingly quick road to research results entices many newly trained epidemiologists. Indeed, investigators implement case-control studies more frequently than any other analytical epidemiological study. Unfortunately, case-control designs also tend to be more susceptible to biases than other comparative studies. Although easier to do, they are also easier to do wrong. Five main notions guide investigators who do, or readers who assess, case-control studies. First, investigators must explicitly define the criteria for diagnosis of a case and any eligibility criteria used for selection. Second, controls should come from the same population as the cases, and their selection should be independent of the exposures of interest. Third, investigators should blind the data gatherers to the case or control status of participants or, if impossible, at least blind them to the main hypothesis of the study. Fourth, data gatherers need to be thoroughly trained to elicit exposure in a similar manner from cases and controls; they should use memory aids to facilitate and balance recall between cases and controls. Finally, investigators should address confounding in case-control studies, either in the design stage or with analytical techniques. Devotion of meticulous attention to these points enhances the validity of the results and bolsters the reader's confidence in the findings.
Selection of controls in case-control studies. I. Principles. S. Wacholder, J. K. McLaughlin, D. T. Silverman, J. S. Mandel. Am J Epidemiol 1992: 135(9); 1019-28. A synthesis of classical and recent thinking on the issues involved in selecting controls for case-control studies is presented in this and two companion papers (S. Wacholder et al. Am J Epidemiol 1992; 135:1029-50). In this paper, a theoretical framework for selecting controls in case-control studies is developed. Three principles of comparability are described: 1) study base, that all comparisons be made within the study base; 2) deconfounding, that comparisons of the effects of the levels of exposure on disease risk not be distorted by the effects of other factors; and 3) comparable accuracy, that any errors in measurement of exposure be nondifferential between cases and controls. These principles, if adhered to in a study, can reduce selection, confounding, and information bias, respectively. The principles, however, are constrained by an additional efficiency principle regarding resources and time. Most problems and controversies in control selection reflect trade-offs among these four principles.
Selection of controls in case-control studies. II. Types of controls. S. Wacholder, D. T. Silverman, J. K. McLaughlin, J. S. Mandel. Am J Epidemiol 1992: 135(9); 1029-41. Types of control groups are evaluated using the principles described in paper 1 of the series, "Selection of Controls in Case-Control Studies" (S. Wacholder et al. Am J Epidemiol 1992; 135:1019-28). Advantages and disadvantages of population controls, neighborhood controls, hospital or registry controls, medical practice controls, friend controls, and relative controls are considered. Problems with the use of deceased controls and proxy respondents are discussed.
Selection of controls in case-control studies. III. Design options. S. Wacholder, D. T. Silverman, J. K. McLaughlin, J. S. Mandel. Am J Epidemiol 1992: 135(9); 1042-50. Several design options available in the planning stage of case-control studies are examined. Topics covered include matching, control/case ratio, choice of nested case-control or case-cohort design, two-stage sampling, and other methods that can be used for control selection. The effect of potential problems in obtaining comparable accuracy of exposure is also examined. A discussion of the difficulty in meeting the principles of study base, deconfounding, and comparable accuracy (S. Wacholder et al. Am J Epidemiol 1992; 135:1019-28) in a single study completes this series of papers.
Design issues in case-control studies. S. Wacholder. Stat Methods Med Res 1995: 4(4); 293-309. The most difficult and most important considerations in planning the protocol of a case-control study are ascertainment of cases, selection of controls and the quality of the exposure measurement. Plans to ensure careful field work are equally important; without attention to data collection, the protocol will be meaningless. In most case-control studies, the measurement problem is magnified because one cannot implement the collection of exposure information at the beginning of follow-up, and instead must rely on interviews, existing records or extrapolation into the past. Consideration of a case-control study as an efficient way to study a cohort helps to resolve some design issues.
Cause and effect
Association and Cause. Raymond Agius. Accessed on 2002-12-09. "Aims of this resource: To enable an understanding of the important concepts in determining causes of ill-health with emphasis on epidemiology and the environmental and occupational aspects of public health. To enable a distinction to be made between associations that are likely to be causal and those which probably have other explanations." www.agius.com/hew/resource/assoc.htm
Dulcet tones of a surgeon's voice may have a hidden meaning. R. Dobson. BMJ 2002: 325(7359); 297.
Assessing cause and effect from trials: a cautionary note. D. Howel, R. Bhopal. Control Clin Trials 1994: 15(5); 331-4. Abstract not available.
Minerva Review. Author Unknown. British Medical Journal 2000: 320(7243); 1218-1236. About a fifth of hip fractures in both men and women are caused by smoking (International Journal of Epidemiology 2000;29:253-9). An analysis of longitudinal data from over 30 000 Danish people shows that for men, the risk falls if they stop smoking, whereas women remain vulnerable to hip fracture for much longer after quitting. Fortunately, exercise reduces the risk of hip fracture in middle aged and older women (308-14), so stop smoking and start cycling, jogging, or (Minerva's favourite) bouncing up and down on a small trampoline in front of the telly.
Clinical importance
How well is the clinical importance of study results reported? An assessment of randomized controlled trials. K. B. Chan, M. Man-Son-Hing, F. J. Molnar, A. Laupacis. CMAJ 2001: 165(9); 1197-202. BACKGROUND: The interpretation of the results of randomized controlled trials (RCTs) has traditionally emphasized statistical significance rather than clinical importance. Our aim was to assess the quality of reporting of factors related to clinical importance in a sample of published RCTs. METHODS: A random sample of 27 (of a total of 266) RCTs published in 5 major medical journals over a 1-year period were reviewed by 4 independent reviewers for factors considered important in the interpretation of the clinical importance of study results: identification of a clearly defined primary outcome, reporting of the expected difference between groups used in the calculation of sample size (the delta value) and whether it was based on the minimal clinically important difference of the intervention, the statistical significance of the results, presentation of pertinent confidence intervals, and the authors' interpretation of the clinical importance of the results. RESULTS: Twenty-two of 27 (81%) articles explicitly reported a single primary outcome. Of the 20 articles that included a sample size calculation, 18 (90%) reported a delta value. Two of the 18 (11%) articles explicitly stated that the delta value was chosen to reflect the minimal clinically important difference of the intervention. For the primary outcomes, confidence intervals surrounding the point estimates of the efficacy of the interventions were reported in 11 of 27 (41%) studies. The study results were interpreted from the perspective of clinical importance in 20 of 27 (74%) of the articles. Of these 20 reports, 5 (25%) provided justification for their clinical interpretation of the results.
INTERPRETATION: Authors of RCTs published in major general medical and internal medicine journals do not consistently provide their own interpretation of the clinical importance of their results, and they often do not provide sufficient information to allow readers to make their own interpretation.
Compliance
Rules of evidence and clinical recommendations for the management of patients. D. L. Sackett. Can J Cardiol 1993: 9(6); 487-9. Abstract not available yet.
Randomised study of long term outcome after epidural versus non-epidural analgesia during labour. C. J. Howell, T. Dean, L. Lucking, K. Dziedzic, P. W. Jones, R. B. Johanson. BMJ 2002: 325(7360); 357. (This paper uses ITT analysis, but it may not be appropriate.) OBJECTIVE: To determine whether epidural analgesia during labour is associated with long term backache. DESIGN: Follow up after randomised controlled trial. Analysis by intention to treat. SETTING: Department of obstetrics and gynaecology at one NHS trust. PARTICIPANTS: 369 women: 184 randomised to epidural group (treatment as allocated received by 123) and 185 randomised to non-epidural group (treatment as allocated received by 133). In the follow up study 151 women were from the epidural group and 155 from the non-epidural group. MAIN OUTCOME MEASURES: Self reported low back pain, disability, and limitation of movement assessed through one to one interviews with physiotherapist, questionnaire on back pain and disability, physical measurements of spinal mobility. RESULTS: There were no significant differences between groups in demographic details or other key characteristics. The mean time interval from delivery to interview was 26 months. There were no significant differences in the onset or duration of low back pain, with nearly a third of women in each group reporting pain in the week before interview. There were no differences in self reported measures of disability in activities of daily living and no significant differences in measurements of spinal mobility. CONCLUSIONS: After childbirth there are no differences in the incidence of long term low back pain, disability, or movement restriction between women who receive epidural pain relief and women who receive other forms of pain relief.
Intention-to-treat principle. V. M. Montori, G. H. Guyatt. CMAJ 2001: 165(10); 1339-41. Abstract not available yet.
Covariate adjustment
Maternal smoking and Down syndrome: the confounding effect of maternal age. C. L. Chen, T. J. Gilbert, J. R. Daling. Am J Epidemiol 1999: 149(5); 442-6. Inconsistent results have been reported from studies evaluating the association of maternal smoking with birth of a Down syndrome child. Control of known risk factors, particularly maternal age, has also varied across studies. By using a population-based case-control design (775 Down syndrome cases and 7,750 normal controls) and Washington State birth record data for 1984-1994, the authors examined this hypothesized association and found a crude odds ratio of 0.80 (95% confidence interval 0.65-0.98). Controlling for broad categories of maternal age (<35 years, > or =35 years), as described in prior studies, resulted in a negative association (odds ratio = 0.87, 95% confidence interval 0.71-1.07). However, controlling for exact year of maternal age in conjunction with race and parity resulted in no association (odds ratio = 1.00, 95% confidence interval 0.82-1.24). In this study, the prevalence of Down syndrome births increased with increasing maternal age, whereas among controls the reported prevalence of smoking during pregnancy decreased with increasing maternal age. There is a substantial potential for residual confounding by maternal age in studies of maternal smoking and Down syndrome. After adequately controlling for maternal age in this study, the authors found no clear relation between maternal smoking and the risk of Down syndrome.
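The confounding pattern that the Chen abstract describes can be sketched with a small numeric example. The counts below are hypothetical and chosen only for illustration (they are not the Washington State data), and the Mantel-Haenszel estimator is just one common way to adjust for a stratifying variable; the abstract does not say which method the authors used.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts: (smoking cases, non-smoking cases,
#                       smoking controls, non-smoking controls)
strata = [
    (30, 70, 300, 700),   # younger mothers: more smoking, fewer cases
    (40, 360, 100, 900),  # older mothers: less smoking, more cases
]

# Collapsing over age gives the crude table (70, 430, 400, 1600).
crude_table = tuple(sum(column) for column in zip(*strata))
crude = odds_ratio(*crude_table)
adjusted = mantel_haenszel_or(strata)

print(f"crude OR    = {crude:.2f}")
print(f"adjusted OR = {adjusted:.2f}")
```

In this made-up example smoking has no association with the outcome within either age group, yet the crude odds ratio falls well below 1 because younger mothers smoke more and have fewer affected births.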
Look before You Leap: Stratify before You Standardize. Bernard C.K. Choi. American Journal of Epidemiology 1999: 149(12); 1087-1095. ABSTRACT: This paper presents a mathematical model to show the conditions in which age standardization can be used to summarize age-specific rates for comparison purposes over calendar time. It shows that the conditions for valid comparison depend on the type of measure used for comparison, that is, difference, ratio, or percent change. If the measure for comparison is a difference of the standardized rates at two time points, then the age-specific rates need to maintain a constant rate difference over time for the comparison to be valid. If the measure for comparison is a ratio or percent change of the standardized rates at two time points, then the age-specific rates need to maintain a constant rate ratio over time for the comparison to be valid. Since in reality, as shown by our Canadian empirical data, age-specific rates do not always maintain a consistent pattern over time, it is recommended that one should always stratify the data to look at patterns of age-specific rates before applying age standardization.
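The condition in the Choi abstract can be illustrated with a short sketch of direct age standardization. All rates and standard populations here are invented for illustration, and the function names are hypothetical; the point is only that the ratio of standardized rates is the same under any standard population when the age-specific rate ratio is constant, and depends on the choice of standard when it is not.

```python
def standardized_rate(age_specific_rates, standard_population):
    """Directly standardized rate: weighted average of age-specific rates."""
    total = sum(standard_population)
    weighted = sum(r * n for r, n in zip(age_specific_rates, standard_population))
    return weighted / total


def standardized_ratio(rates_time1, rates_time2, standard_population):
    """Ratio of standardized rates (time 2 over time 1) for one standard."""
    return (standardized_rate(rates_time2, standard_population)
            / standardized_rate(rates_time1, standard_population))


# Hypothetical age-specific rates (per 100,000) for three age groups.
rates_t1 = [10.0, 20.0, 40.0]
constant_t2 = [15.0, 30.0, 60.0]  # rate ratio 1.5 in every age group
varying_t2 = [20.0, 30.0, 40.0]   # rate ratios 2.0, 1.5, 1.0

# Two hypothetical standard populations (counts per age group).
young_standard = [60, 30, 10]
old_standard = [10, 30, 60]

for label, rates_t2 in [("constant", constant_t2), ("varying", varying_t2)]:
    for name, standard in [("young", young_standard), ("old", old_standard)]:
        ratio = standardized_ratio(rates_t1, rates_t2, standard)
        print(f"{label} ratios, {name} standard: {ratio:.3f}")
```

Looking at the age-specific rows before standardizing, as the paper recommends, is what reveals which of the two situations applies.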
Causal Knowledge as a Prerequisite for Confounding Evaluation: An Application to Birth Defects Epidemiology. Miguel A. Hernán, Sonia Hernández-Díaz, Martha M. Werler and Allen A. Mitchell. American Journal of Epidemiology 2002: 155(2); 176-184. Common strategies to decide whether a variable is a confounder that should be adjusted for in the analysis rely mostly on statistical criteria. The authors present findings from the Slone Epidemiology Unit Birth Defects Study, 1992–1997, a case-control study on folic acid supplementation and risk of neural tube defects. When statistical strategies for confounding evaluation are used, the adjusted odds ratio is 0.80 (95% confidence interval: 0.62, 1.21). However, the consideration of a priori causal knowledge suggests that the crude odds ratio of 0.65 (95% confidence interval: 0.46, 0.94) should be used because the adjusted odds ratio is invalid. Causal diagrams are used to encode qualitative a priori subject matter knowledge.
Modeling treatment effects on binary outcomes with grouped-treatment variables and individual covariates. S. C. Johnston, T. Henneman, C. E. McCulloch, M. van der Laan. Am J Epidemiol 2002: 156(8); 753-60. During evaluation of treatment effects in observational studies, confounding is a constant threat because it is always possible that patients with a better prognosis, not adequately characterized by measured covariates, are chosen for a specific therapy. Ecologic analyses may avoid confounding that would be present in analysis at the individual level because variations in regional or hospital practice may be unrelated to prognosis. The authors used simulated data with an excluded confounder to evaluate the reliability and limitations of the grouped-treatment approach, a method of incorporating an ecologic measure of treatment assignment into an individual-level multivariable model, similar to the instrumental variable approach. Estimates based on the grouped-treatment approach were closer to the true value than those of standard individual-level multivariable analysis in every simulation. Furthermore, confidence intervals based on the grouped-treatment approach achieved approximately their nominal coverage, whereas those based on individual-level analyses did not. The grouped-treatment approach appears to be more reliable than standard individual-level analysis in situations where the grouped-treatment variable is unassociated with the outcome except via the actual treatment assignment and measured covariates.
Socioeconomic status and health in blacks and whites: the problem of residual confounding and the resiliency of race. J. S. Kaufman, R. S. Cooper, D. L. McGee. Epidemiology 1997: 8(6); 621-8. A large number of epidemiologic studies have focused on racial/ethnic differences, particularly between blacks and whites. Because health endpoints and racial categorizations are associated with socioeconomic status, investigators generally adjust for socioeconomic indicators. The intention is usually to control for confounding, thereby making groups comparable and excluding socioeconomic status as an alternative explanation to hypotheses of innate physiologic differences. A threat to the validity of these analyses is therefore the presence of residual confounding. We identify four potential sources of residual confounding in this analytical design: categorization of socioeconomic status variables, measurement error in socioeconomic indicators, use of aggregated socioeconomic status measures, and incommensurate socioeconomic indicators. Using simulations and examples from the literature, we demonstrate that the effect of residual confounding is to bias interpretation of data toward the conclusion of independent racial/ethnic group effects. Investigators often refer to possible "genetic" differences on the basis of models that control for socioeconomic status. We propose that such conclusions on the basis of this analytical strategy are generally unwarranted. Racial/ethnic differences in disease are a pressing public health concern, but the current approach does not often provide a basis for inference about putative biological factors in the etiology of this disparity.
Dose-specific Meta-Analysis and Sensitivity Analysis of the Relation between Alcohol Consumption and Lung Cancer Risk. Jeffrey E. Korte, Paul Brennan, S. Jane Henley, Paolo Boffetta. American Journal of Epidemiology 2002: 155(6); 496-506. Alcohol drinking increases the risk of several types of cancer, but studies of the relation between alcohol and lung cancer risk are complicated by smoking. The authors carried out meta-analyses for four study designs and conducted sensitivity analyses to assess the results. Pooled smoking-unadjusted relative risks (RRs) for brewery workers and alcoholics were 1.17 (95% confidence interval (CI): 0.99, 1.39) and 1.99 (95% CI: 1.66, 2.39), respectively, relative to population rates. For cohort and case-control studies, the authors conducted dose-specific meta-analyses for ethanol consumption of 1–499, 500–999, 1,000–1,999, and ≥2,000 g/month, relative to nondrinking. Smoking-adjusted RRs for ascending dose groups in cohort studies were 0.98 (95% CI: 0.79, 1.21), 0.92 (95% CI: 0.81, 1.04), 1.04 (95% CI: 0.88, 1.22), and 1.53 (95% CI: 1.04, 2.25), respectively. Smoking-adjusted odds ratios for ascending groups in case-control studies were 0.63 (95% CI: 0.51, 0.78), 1.30 (95% CI: 0.98, 1.70), 1.13 (95% CI: 0.46, 2.75), and 1.86 (95% CI: 1.39, 2.49), respectively. Elevated odds ratios were seen for hospital-based case-control studies but not for population-based case-control studies. Sensitivity analyses indicated that smoking explained the elevated RRs in studies of alcoholics and that strong misclassification of smoking status could produce an elevated smoking-adjusted RR in cohort and case-control studies. Overall, evidence for a smoking-adjusted association between alcohol and lung cancer risk is limited to very high consumption groups in cohort and hospital-based case-control studies. At lower levels, any associations observed appear to be explained by confounding.
How do risk factors work together? Mediators, moderators, and independent, overlapping, and proxy risk factors. H. C. Kraemer, E. Stice, A. Kazdin, D. Offord, D. Kupfer. Am J Psychiatry 2001: 158(6); 848-56. OBJECTIVE: The authors developed a methodological basis for investigating how risk factors work together. Better methods are needed for understanding the etiology of disorders, such as psychiatric syndromes, that presumably are the result of complex causal chains. METHOD: Approaches from psychology, epidemiology, clinical trials, and basic sciences were synthesized. RESULTS: The authors define conceptually and operationally five different clinically important ways in which two risk factors may work together to influence an outcome: as proxy, overlapping, and independent risk factors and as mediators and moderators. CONCLUSIONS: Classifying putative risk factors into these qualitatively different types can help identify high-risk individuals in need of preventive interventions and can help inform the content of such interventions. These methods may also help bridge the gaps between theory, the basic and clinical sciences, and clinical and policy applications and thus aid the search for early diagnoses and for highly effective preventive and treatment interventions.
Mediators and moderators of treatment effects in randomized clinical trials. H. C. Kraemer, G. T. Wilson, C. G. Fairburn, W. S. Agras. Arch Gen Psychiatry 2002: 59(10); 877-83. (Covariate adjustment is important, even in randomized trials and can identify important subgroups and mechanisms of action.) Randomized clinical trials (RCTs) not only are the gold standard for evaluating the efficacy and effectiveness of psychiatric treatments but also can be valuable in revealing moderators and mediators of therapeutic change. Conceptually, moderators identify on whom and under what circumstances treatments have different effects. Mediators identify why and how treatments have effects. We describe an analytic framework to identify and distinguish between moderators and mediators in RCTs when outcomes are measured dimensionally. Rapid progress in identifying the most effective treatments and understanding on whom treatments work and do not work and why treatments work or do not work depends on efforts to identify moderators and mediators of treatment outcome. We recommend that RCTs routinely include and report such analyses.
Clinical trials in acute myocardial infarction: Should we adjust for baseline characteristics? Ewout W. Steyerberg, Patrick M.M. Bossuyt, Kerry L. Lee. American Heart Journal 2000: 139(5); 745-751. ABSTRACT: BACKGROUND: Clinical trials concerning acute myocardial infarction often evaluate short-term death. Several baseline characteristics are predictors of death, most notably age. Adjustment for one or more predictors in a multivariable analysis may be considered to correct the estimate of the treatment effect for any imbalance that by chance may have occurred between the randomized groups. Moreover, adjustment results in a stratified estimate of the effect of treatment. METHODS AND RESULTS: The effects of adjustment (correction for imbalance and stratification) were studied with logistic regression analysis in the Global Use of Strategies to Open Occluded Coronary Arteries (GUSTO)-I trial. The primary end point was 30-day death, which occurred in 6.3% of 10,348 patients randomly assigned to tissue plasminogen activator and 7.3% of 20,162 patients randomly assigned to streptokinase thrombolytic therapy. This is equivalent to an unadjusted odds ratio of 0.853. No significant imbalance had occurred for any of 17 baseline characteristics considered, including well-known demographic, presenting, and history characteristics. Adjusted for age, the odds ratio was 0.829, which is an 18% increase in estimated effect on the logistic scale. When adjusted for 17 characteristics, the odds ratio was 0.820, an increase of 25%. The increase in effect estimate was largely explained by the stratification effect and only partly by imbalance of predictors. CONCLUSIONS: Adjustment for predictive baseline characteristics, even when largely balanced, may lead to clearly different estimates of the treatment effect on mortality rates. Adjustment for important predictors such as age is recommended in clinical trials studying patients with acute myocardial infarction.
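The arithmetic in the Steyerberg abstract can be checked with a few lines of code. This is a minimal sketch that uses only the rounded figures quoted in the abstract; the function names are hypothetical, and "increase on the logistic scale" is read here as the relative change in the log odds ratio, which is an interpretation rather than something the abstract spells out.

```python
import math


def odds_ratio_from_risks(risk_treated, risk_control):
    """Odds ratio computed from two event proportions."""
    odds_treated = risk_treated / (1 - risk_treated)
    odds_control = risk_control / (1 - risk_control)
    return odds_treated / odds_control


def log_scale_increase(or_unadjusted, or_adjusted):
    """Relative change in the log odds ratio after adjustment."""
    return math.log(or_adjusted) / math.log(or_unadjusted) - 1


# 30-day death: 6.3% with tissue plasminogen activator, 7.3% with streptokinase.
unadjusted = odds_ratio_from_risks(0.063, 0.073)
print(f"unadjusted OR from rounded percentages: {unadjusted:.3f}")

# Odds ratios as reported in the abstract.
age_adjusted_change = log_scale_increase(0.853, 0.829)
fully_adjusted_change = log_scale_increase(0.853, 0.820)
print(f"age-adjusted:   {100 * age_adjusted_change:.0f}% larger log odds ratio")
print(f"fully adjusted: {100 * fully_adjusted_change:.0f}% larger log odds ratio")
```

The rounded percentages give an odds ratio that differs from the reported 0.853 only in the third decimal place, and the two percentage increases round to the 18% and 25% stated in the abstract.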
Research Methods: Why Covariance? A Rationale for Using Analysis of Covariance Procedures in Randomized Studies. Matthew J. Taylor. Journal of Early Intervention 1993: 17(4); 455-466. Abstract not available yet.
A comparison of direct adjustment and regression adjustment of epidemiologic measures. T. C. Wilcosky, L. E. Chambless. J Chronic Dis 1985: 38(10); 849-56. Although regression adjustment can provide a useful alternative to direct adjustment, especially when data are sparse, many researchers are unaware that adjusted summary measures can be easily derived from regression coefficients. In a non-technical discussion with examples, the direct adjustment procedure is compared with three methods of regression adjustment based on analysis of covariance models: the conditional prediction method, the stratified prediction method, and the marginal prediction method. Both the stratified prediction and direct adjustment methods yield summary measures that are weighted averages of stratum-specific measures, while adjusted measures from the conditional prediction method are similar to stratum-specific estimates. In contrast to the other adjustment procedures, which can use internal or external weights, the marginal prediction method always gives an internally adjusted measure. Under certain conditions, the three regression adjustment procedures produce identical results. Major advantages of direct adjustment include computational simplicity and relatively few statistical assumptions. Regression adjustment, however, is more convenient for statistical tests for interactions and group differences, and often precludes the need to categorize continuous variables, so that problems with empty strata are avoided.
Dropouts
Hold the Lard! The Atkins Diet still doesn't work. Michael Fumento. Accessed on 2002-12-06. A careful analysis of the recent research on the Atkins diet shows that there was a much higher drop out rate in that group, which could partially explain the promising results of this diet. www.reason.com/hod/mf120502.shtml
Article makes simple errors and could cause unnecessary deaths. C. Baigent, R. Collins, R. Peto. British Medical Journal 2002: 324(7330); 167. (An interesting critical review of a large randomized study and a meta-analysis.) "The worldwide meta-analysis of antiplatelet trials shows that low dose aspirin (or some other effective antiplatelet regimen) reduces non-fatal myocardial infarction, non-fatal stroke, and vascular death in a wide range of patients who are at high risk of occlusive vascular disease. A paper disputing this was published concurrently in the For Debate section of the journal, but the arguments in it (some of which the author also published on the same date in an editorial in the Lancet) depend strongly on quite simple mistakes about the randomised evidence and could cause unnecessary deaths."
The Effect of School Dropout Rates on Estimates of Adolescent Substance Use among Three Racial/Ethnic Groups. Randall C. Swaim, F. Beauvais, E. L. Chavez, E. R. Oetting. American Journal of Public Health 1997: 87(1); 51-55. ABSTRACT: OBJECTIVES: This study examined, across three racial/ethnic groups, how the inclusion of data on drug use of dropouts can alter estimates of adolescent drug use rates. METHODS: Self-report rates of lifetime prevalence and use in the previous 30 days were obtained from Mexican American, White non-Hispanic, and Native American students (n = 738) and dropouts (n = 774). Rates for the age cohort (students and dropouts) were estimated with a weighted correction formula. RESULTS: Rates of use reported by dropouts were 1.2 to 6.4 times higher than those reported by students. Corrected rates resulted in changes in relative rates of use by different ethnic groups. CONCLUSIONS: When only in-school data are available, errors in estimating drug use among groups with high rates of school dropout can be substantial. Correction of student-based data to include drug use of dropouts leads to important changes in estimated levels of drug use and alters estimates of the relative rates of use for racial/ethnic minority groups with high dropout rates.
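The Swaim abstract mentions a weighted correction formula without stating it. One plausible form, shown here purely as an assumption, weights the student rate and the dropout rate by each group's share of the age cohort. The function name and all the numbers are hypothetical and are not taken from the paper.

```python
def cohort_rate(student_rate, dropout_rate, dropout_proportion):
    """Weighted estimate of a rate for the whole age cohort.

    Assumed form (not confirmed by the abstract):
    (1 - d) * student_rate + d * dropout_rate,
    where d is the proportion of the cohort that has dropped out.
    """
    return ((1 - dropout_proportion) * student_rate
            + dropout_proportion * dropout_rate)


# Hypothetical example: 20% of students and 40% of dropouts report use,
# and a quarter of the age cohort has left school.
student_only = 0.20
corrected = cohort_rate(student_rate=0.20, dropout_rate=0.40,
                        dropout_proportion=0.25)

print(f"student-only estimate: {student_only:.2f}")
print(f"corrected estimate:    {corrected:.2f}")
```

Under this assumed formula the size of the correction grows with both the dropout proportion and the gap between the two rates, which fits the abstract's remark that errors are largest for groups with high dropout rates.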
Ecologic studies
Medicine and the Media: Did Monica really say that? Hugh Tunstall-Pedoe. British Medical Journal 1998: 317; 1023. Abstract not available yet. [Full text]
The Semi-individual Study in Air Pollution Epidemiology: A Valid Design as Compared to Ecologic Studies. Nino Kunzli, Ira B. Tager. Environmental Health Perspectives 1997: 105(10); 1078-1083. ABSTRACT: The assessment of long-term effects of air pollution in humans relies on epidemiologic studies. A widely used design consists of cross-sectional or cohort studies in which ecologic assignment of exposure, based on a fixed-site ambient monitor, is employed. Although health outcome and usually a large number of covariates are measured in individuals, these studies are often called ecological. We will introduce the term semi-individual design for these studies. We review the major properties and limitations with regard to causal inference of truly ecologic studies, in which outcome, exposure, and covariates are available on an aggregate level only. Misclassification problems and issues related to confounding and model specification in truly ecologic studies limit etiologic inference to individuals. In contrast, the semi-individual study shares its methodological and inferential properties with typical individual-level study designs. The major caveat relates to the case where too few study areas, e.g., two or three, are used, which render control of aggregate level confounding impossible. The issue of exposure misclassification is of general concern in epidemiology and not an exclusive problem of the semi-individual design. In a multicenter setting, the semi-individual study is a valuable tool to approach long-term effects of air pollution. Knowledge about the error structure of the ecologically assigned exposure allows consideration of the impact of ecologically assigned exposure on effect estimation. Semi-individual studies, i.e., individual level air pollution studies with ecologic exposure assignment, more readily permit valid inference to individuals and should not be labeled as ecologic studies.
Ecologic studies in epidemiology: concepts, principles, and methods. H. Morgenstern. Annu Rev Public Health 1995: 16; 61-81. An ecologic study focuses on the comparison of groups, rather than individuals; thus, individual-level data are missing on the joint distribution of variables within groups. Variables in an ecologic analysis may be aggregate measures, environmental measures, or global measures. The purpose of an ecologic analysis may be to make biologic inferences about effects on individual risks or to make ecologic inferences about effects on group rates. Ecologic study designs may be classified on two dimensions: (a) whether the primary group is measured (exploratory vs analytic study); and (b) whether subjects are grouped by place (multiple-group study), by time (time-trend study), or by place and time (mixed study). Despite several practical advantages of ecologic studies, there are many methodologic problems that severely limit causal inference, including ecologic and cross-level bias, problems of confounder control, within-group misclassification, lack of adequate data, temporal ambiguity, collinearity, and migration across groups.
Ecological study for reasons for sharp decline in mortality from ischaemic heart disease in Poland since 1991. WA Zatonski, AJ McMichael, JW Powles. British Medical Journal 1998: 316(7137); 1047-1051. ABSTRACT: OBJECTIVE: To investigate the reasons for the decline in deaths attributed to ischaemic heart disease in Poland since 1991 after two decades of rising rates. DESIGN: Recent changes in mortality were measured as percentage deviations in 1994 from rates predicted by extrapolation of sex and age specific death rates for 1980-91 for diseases of the circulatory system and selected other categories. Available data on national and household food availability, alcohol consumption, cigarette smoking, socioeconomic indices, and medical services over time were reviewed. MAIN OUTCOME MEASURES: Age specific and age standardised rates of death attributed to ischaemic heart disease and related causes. RESULTS: The change in trend in mortality attributed to diseases of the circulatory system was similar in men and women and most marked (> 20%) in early middle age. For ages 45 to 64 the decrease was greatest for deaths attributed to ischaemic heart disease and atherosclerosis (around 25%) and less for stroke (< 10%). For most of the potentially explanatory variables considered, there were no corresponding changes in trend. However, between 1986-90 and 1994 there was a marked switch from animal fats (estimated availability down 23%) to vegetable fats (up 48%) and increased imports of fruit. CONCLUSION: Reporting biases are unlikely to have exaggerated the true fall in ischaemic heart disease; neither is it likely to be mainly due to changes in smoking, drinking, stress, or medical care. Changes in type of dietary fat and increased supplies of fresh fruit and vegetables seem to be the best candidates. [Medline] [Abstract] [PDF]
Exclusions
Papers and Programs. Joop Hox. Accessed on 2003-02-17. "This section includes papers that are being worked on, programs that I find useful, and data sets that I have published about, for re-analysis." www.fss.uu.nl/ms/jh/papers/papers.htm
Representativeness and response rates from the Domestic/International Gastroenterology Surveillance Study (DIGEST). J. G. Tijssen. Scand J Gastroenterol Suppl 1999: 231; 15-9. BACKGROUND: The Domestic/international Gastroenterology Surveillance Study (DIGEST) examined the prevalence of upper gastrointestinal symptoms among the general population in 10 countries, and the impact of these symptoms on healthcare usage and quality of life. This report discusses the validation of the DIGEST sample and reviews the response rates from the survey. METHODS: External validation of the DIGEST sample was conducted by comparing the age, age by gender and annual household incomes of the sample with census-derived data. A comparison was also made between Psychological General Well-Being Index (PGWBI) scores from study subjects in the Scandinavian countries and the USA and the total sample population norms. RESULTS: Under- and oversampling, defined as > or =5% difference from the population norms, was evident in eight out of 10 countries, but no systematic bias was evident. The final distribution of the sample by gender was 51% female and 49% male. Although differences in PGWBI scores were noted between DIGEST subjects and population norms, these differences were <0.30 standard deviations--markedly below the difference considered as relevant for the PGWBI. Response for the survey in individual countries ranged from 17% in the USA to 61% in Norway, with a survey-wide rate of 27%. The overall response rate, including primary non-respondents, was 13.4%. The majority of nonresponse (51.4%) was attributed to failure to establish contact with the subjects, with 41.7% of subjects declining to be interviewed and the remaining 6.9% of subjects not meeting the age and sex criteria used for the survey. CONCLUSIONS: The DIGEST sample exhibited good external validity, providing a foundation for comparison between data derived from individual countries in the survey.
Sample size slippages in randomised trials: exclusions and the lost and wayward. K. F. Schulz, D. A. Grimes. Lancet 2002: 359(9308); 781-5. Proper randomisation means little if investigators cannot include all randomised participants in the primary analysis. Participants might ignore follow-up, leave town, or take aspartame when instructed to take aspirin. Exclusions before randomisation do not bias the treatment comparison, but they can hurt generalisability. Eligibility criteria for a trial should be clear, specific, and applied before randomisation. Readers should assess whether any of the criteria make the trial sample atypical or unrepresentative of the people in which they are interested. In principle, assessment of exclusions after randomisation is simple: none are allowed. For the primary analysis, all participants enrolled should be included and analysed as part of the original group assigned (an intent-to-treat analysis). In reality, however, losses frequently occur. Investigators should, therefore, commit adequate resources to develop and implement procedures to maximise retention of participants. Moreover, researchers should provide clear, explicit information on the progress of all randomised participants through the trial by use of, for instance, a trial profile. Investigators can also do secondary analyses on, for instance, per-protocol or as-treated participants. Such analyses should be described as secondary and non-randomised comparisons. Mishandling of exclusions causes serious methodological difficulties. Unfortunately, some explanations for mishandling exclusions intuitively appeal to readers, disguising the seriousness of the issues. Creative mismanagement of exclusions can undermine trial validity.
A controlled trial of immunotherapy for asthma in allergic children. N. F. Adkinson, Jr., P. A. Eggleston, D. Eney, E. O. Goldstein, K. C. Schuberth, J. R. Bacon, R. G. Hamilton, M. E. Weiss, H. Arshad, C. L. Meinert, J. Tonascia, B. Wheeler. New England Journal of Medicine 1997: 336(5); 324-31. (Noncompliant patients were excluded prior to the start of the trial) BACKGROUND: Injections of allergens are widely prescribed for patients with asthma, but little is known about the effectiveness of immunotherapy. METHODS: We conducted a double-blind, placebo-controlled trial of multiple-allergen immunotherapy in 121 allergic children with moderate-to-severe, perennial asthma. The children, who required daily medication for their asthma, were randomly assigned to receive subcutaneous injections of either a mixture of up to seven aeroallergen extracts or a placebo. Maintenance injections were continued for 18 months or longer. Medications were adjusted every two to three weeks on the basis of peak flow rates and symptoms. The principal outcome was the daily medication score. Bronchial sensitivity to methacholine (the concentration provoking a 20 percent decrease in the forced expiratory volume in one second [PC20]) was measured twice yearly. RESULTS: The median medication score declined from 5.4 to 4.9 in the immunotherapy group (P<0.001) and from 5.2 to 5.0 in the placebo group (P<0.001), but there was no significant difference between the groups (P>0.6). The number of days on which oral corticosteroids were used was similar in the two groups. Partial or complete remission of asthma occurred in 31 percent of the immunotherapy group and in 28 percent of the placebo group (P>0.5). There was no difference between the groups in the use of medical care, symptoms, or peak flow rates. The median PC20 increased significantly in both groups, but again with no difference between the two groups. 
CONCLUSIONS: Immunotherapy with injections of allergens for over two years was of no discernible benefit in allergic children with perennial asthma who were receiving appropriate medical treatment.
Unjustified exclusion of elderly people from studies submitted to research ethics committee for approval: descriptive study. A. Bayer, W. Tadd. British Medical Journal 2000: 321(7267); 992-3. Abstract not available yet. [Full text] [PDF]
Exclusion of elderly people from clinical research: a descriptive study of published reports. G. Bugeja, A. Kumar, A. K. Banerjee. British Medical Journal 1997: 315(7115); 1059. Abstract not available yet. [Full text]
Participation in Research and Access to Experimental Treatments by HIV-Infected Patients. Allen L. Gifford, William E. Cunningham, Kevin C. Heslin, Ron M. Andersen, Terry Nakazono, Dale K. Lieu, Martin F. Shapiro, Samuel A. Bozzette, the HIV Cost and Services Utilization Study Consortium. N Engl J Med 2002: 346(18); 1373-1382. Background Although there is concern that minority groups and women are underrepresented in research involving patients with human immunodeficiency virus (HIV) infection, the available data are inconclusive. Methods We used nationally representative data from the HIV Cost and Services Utilization Study to determine the characteristics of the participants and nonparticipants in trials of medications for HIV infection and whether or not patients had access to experimental treatments. A probability sample of 2864 persons, representing all 231,400 adults with known HIV infection who are cared for in the contiguous United States, were interviewed on three occasions between 1996 and 1998. They were asked about participation in clinical research studies of medications and past receipt of experimental medications for HIV. Results We estimate that 14 percent of adults receiving care for HIV infection participated in a medication trial or study; 24 percent had received experimental medications; and 8 percent had tried and failed to obtain experimental treatments. According to multivariate models, non-Hispanic blacks and Hispanics were less likely to be participating in trials than non-Hispanic whites (odds ratio for participation among non-Hispanic blacks, 0.50 [95 percent confidence interval, 0.28 to 0.91]; odds ratio among Hispanics, 0.58 [95 percent confidence interval, 0.37 to 0.93]) and to have received experimental medications (odds ratios, 0.41 [95 percent confidence interval, 0.32 to 0.54] and 0.56 [95 percent confidence interval, 0.41 to 0.78], respectively). 
Patients who were cared for in private health maintenance organizations were less likely to participate in trials than those with fee-for-service insurance (odds ratio, 0.43 [95 percent confidence interval, 0.21 to 0.88]). Women were not underrepresented in research trials and had a similar likelihood of receiving experimental treatments. Conclusions Among patients with HIV infection, participation in research trials and access to experimental treatment is influenced by race or ethnic group and type of health insurance. [Abstract]
The exclusion of the elderly and women from clinical trials in acute myocardial infarction. J. H. Gurwitz, N. F. Col, J. Avorn. JAMA 1992: 268(11); 1417-22. OBJECTIVE--To determine the extent to which the elderly have been excluded from trials of drug therapies used in the treatment of acute myocardial infarction, to identify factors associated with such exclusions, and to explore the relationship between the exclusion of elderly and the representation of women. DATA SOURCES--We conducted a systematic search of the English-language literature from January 1960 through September 1991 to identify all relevant studies of specific pharmacotherapies employed in the treatment of acute myocardial infarction. To accomplish this, we searched MEDLINE, major cardiology textbooks, meta-analyses, reviews, editorials, and the bibliographies of all identified articles. STUDY SELECTION--Only trials in which patients were randomly allocated to receive a specific therapeutic regimen or a placebo or nonplacebo control regimen were included for review. DATA EXTRACTION--Studies were abstracted for year of publication, source of support, performance location, drug therapies to which patients were randomized, use of invasive diagnostic tests or therapeutic procedures, exclusion criteria, size and demographic characteristics of the randomized study population, and principal outcome measures. DATA SYNTHESIS--A total of 214 trials met inclusion criteria, involving 150,920 study subjects. Over 60% of trials excluded persons over the age of 75 years. Studies published after 1980 were more likely to have age-based exclusions compared with studies published before 1980 (adjusted odds ratio, 4.92; 95% confidence interval, 2.33 to 10.54). Trials of thrombolytic therapy involving an invasive procedure were more likely to exclude elderly patients compared with other studies (adjusted odds ratio, 2.45; 95% confidence interval, 1.10 to 5.47). 
Studies with age-based exclusions had a smaller percentage of women compared with those without such exclusions (18% vs 23%; P = .0002), with the mean age of the study population significantly associated with the proportion of women participants (P = .0001, R2 = .29). CONCLUSIONS--Age-based exclusions are frequently used in clinical trials of medications used in the treatment of acute myocardial infarction. Such exclusions limit the ability to generalize study findings to the patient population that experiences the most morbidity and mortality from acute myocardial infarction.
Do safety practices differ between responders and non-responders to a safety questionnaire? D. Kendrick, R. Hapgood, P. Marsh. Injury Prevention 2001: 7(2); 100-3. OBJECTIVE: To compare reported safety practices between responders and non-responders to a safety survey. DESIGN: Cross sectional survey at baseline compared with safety practices reported at subsequent child health surveillance checks. SUBJECTS: Parents of children aged 3-12 months registered with practices participating in a controlled trial of injury prevention in primary care that did, and did not, respond to the baseline survey and who subsequently attended child health surveillance checks. RESULTS: No difference in safety practices was found between responders and non-responders to the survey at the 6-9 month check. Responders were more likely to report owning a stair gate (odds ratio (OR) 2.75, 95% confidence interval (CI) 1.82 to 4.16) and socket covers (OR 2.16, 95% CI 1.53 to 3.04) at the 12-15 month check, and owning socket covers (OR 2.19, 95% CI 1.34 to 3.61) at the 18-24 month check. Responders were more likely to report greater than the median number of safety practices at the 18 month check. CONCLUSIONS: Non-responders to a safety survey appear to be less likely to report owning several items of safety equipment than responders. Further work is needed to confirm these findings. Extrapolating the results of safety surveys to the population as a whole may lead to over estimation of safety equipment possession.
Comorbidity of chronic diseases in general practice. F. G. Schellevis, J. van der Velden, E. van de Lisdonk, J. T. van Eijk, C. van Weel. J Clin Epidemiol 1993: 46(5); 469-73. With the increasing number of elderly people in The Netherlands the prevalence of chronic diseases will rise in the next decades. It is recognized in general practice that many older patients suffer from more than one chronic disease (comorbidity). The aim of this study is to describe the extent of comorbidity for the following diseases: hypertension, chronic ischemic heart disease, diabetes mellitus, chronic nonspecific lung disease, osteoarthritis. In a general practice population of 23,534 persons, 1989 patients have been identified with one or more chronic diseases. Only diseases in agreement with diagnostic criteria were included. In persons of 65 and older 23% suffer from one or more of the chronic diseases under study. Within this group 15% suffer from more than one of the chronic diseases. Osteoarthritis and diabetes mellitus are the diseases with the highest rate of comorbidity. Comorbidity restricts the external validity of results from single-disease intervention studies and complicates the organization of care.
Nonresponse bias and early versus all responders in mail and telephone surveys. J. Siemiatycki, S. Campbell. Am J Epidemiol 1984: 120(2); p291-301. Mail and telephone survey methods, with or without follow-up by other methods, are cost-effective alternatives to the conventional home interview approach. However, it has long been thought that they are especially susceptible to nonresponse bias. The study addressed this issue in the context of parallel mail and telephone health surveys carried out in Montreal. The mail strategy among 1,555 adults achieved 68.5% response and follow-up by telephone and home interview increased response to 80.9%. Respondents were adequately representative of the entire sample with respect to socioeconomic status, number of adults in household, and ethnic distribution. The 68.5% initial stage respondents were similar to all respondents on the above variables as well as on age, sex, education and reported health status. Odds ratios of smoking and respiratory symptoms hardly differed between initial stage and all respondents. The telephone survey among 1,595 adults achieved 72.7% response and follow-up by mail and personal interview increased response to 88.2%. Comparisons between respondents and the entire sample and between initial stage respondents and all respondents gave similar results to those found in the mail strategy, although there was some change in a symptom-smoking odds ratio from the initial stage respondents to all respondents. In both survey strategies, there was no evidence of substantial nonresponse bias and estimates of morbidity and health care would not have differed much if the fieldwork had stopped at the initial mail or telephone stage.
Physicians' reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. K. M. Taylor, R. G. Margolese, C. L. Soskolne. N Engl J Med 1984: 310(21); p1363-7. We studied the reasons surgical principal investigators chose not to enter patients in a large, multicenter trial sponsored by a cooperative group. In 1976 the National Surgical Adjuvant Project for Breast and Bowel Cancers (NSABP) initiated a clinical trial to compare segmental mastectomy and postoperative radiation, or segmental mastectomy alone, with total mastectomy. Because the low rates of accrual were threatening to close the trial prematurely, we mailed a questionnaire to the 94 NSABP principal investigators, asking why they were not entering eligible patients in the trial. A response rate of 97 per cent was achieved. Physicians who did not enter all eligible patients offered the following explanations: (1) concern that the doctor-patient relationship would be affected by a randomized clinical trial (73 per cent), (2) difficulty with informed consent (38 per cent), (3) dislike of open discussions involving uncertainty (22 per cent), (4) perceived conflict between the roles of scientist and clinician (18 per cent), (5) practical difficulties in following procedures (9 per cent), and (6) feelings of personal responsibility if the treatments were found to be unequal (8 per cent). Further investigation into the behavioral aspects of the investigator-patient relationship is particularly pressing, since fear of change in this relationship was the most common reason given for not entering eligible patients in the trial.
Representation of older patients in cancer treatment trials. EL Trimble, CL Carter, D Cain, B Freidlin, RS Ungerleider, MA Friedman. Cancer 1994: 74(7); 2208-14. ABSTRACT: In 1990, the five leading causes of cancer death in men aged 65 and older were carcinomas of the lung, prostate, colon and rectum, and pancreas, and leukemia. For women in this age group, the five leading causes of cancer death were carcinomas of the lung, breast, colon and rectum, pancreas, and ovary. To determine the representation of the elderly in clinical trials, the 1992 accrual of the National Cancer Institute (NCI)-sponsored Clinical Cooperative Group treatment trials (which included more than 8000 elderly patients) for the aforementioned sites was compared with the 1990 incidence data from the NCI's Surveillance, Epidemiology, and End Results program. Of the male patients enrolled in the trials, an average of 39% were older than 65 (47.3% lung, 79.5% prostate, 47.5% colorectal, 45.6% pancreas, and 9.6% leukemia); whereas 25.9% of all women enrolled in trials were 65 or older (43.6% lung, 17.3% breast, 46.2% colorectal, 59.6% pancreas, and 35.4% ovary). With respect to incidence, older patients generally are underrepresented in cancer treatment trials. With the exception of the data on prostate cancer, each of the comparisons using the Z statistic gave probability values of less than 0.01. The most significant discrepancies between incidence and participation in cancer treatment protocols were noted for leukemia in males and breast cancer in females. 
Possible explanations for these findings include (1) a research focus on aggressive therapy, which may be unacceptably toxic to the elderly; (2) presence of comorbidity in the elderly; (3) fewer trials available specifically aimed at older patients; (4) limited expectations for long term benefits on the part of physicians, relatives, and the patients themselves; and (5) a lack of financial, logistic, and social support for the participation of elderly patients in clinical trials. Recognizing this situation, NCI recently sponsored a number of trials that specifically target the elderly. This paper describes the status of all major Phase II and III clinical trials that recently were closed, still are active, or now are in review that address the clinical care of this important segment of the U.S. population.
Are Subjects in Pharmacological Treatment Trials of Depression Representative of Patients in Routine Clinical Practice? M. Zimmerman, J.I. Mattia, Michael A. Posternak. American Journal of Psychiatry 2002: 159(3); 469-473. OBJECTIVE: The methods used to evaluate the efficacy of antidepressants differ from treatment for depression in routine clinical practice. The rigorous inclusion/exclusion criteria used to select subjects for participation in efficacy studies potentially limit the generalizability of these trials' results. It is unknown how much impact these criteria have on the representativeness of subjects in efficacy trials. This study estimated the proportion of depressed patients treated in routine clinical practice who would meet standard inclusion/exclusion criteria for an efficacy trial. METHOD: A total of 803 individuals, aged 16-65 years, who were seen at intake at an outpatient practice underwent a thorough diagnostic evaluation, including the administration of semistructured diagnostic interviews; 346 patients had current major depression. Common inclusion/exclusion criteria used in efficacy studies of antidepressants were applied to the depressed patients to determine how many would have qualified for an efficacy trial. RESULTS: Approximately one-sixth of the 346 depressed patients would have been excluded from an efficacy trial because they had a bipolar or psychotic subtype of depression. The presence of a comorbid anxiety or substance use disorder, insufficient severity of depressive symptoms, or current suicidal ideation would have excluded 86.0% (N=252) of the remaining 293 outpatients with nonpsychotic unipolar major depressive disorder from an antidepressant efficacy trial. CONCLUSIONS: Subjects treated in antidepressant trials represent a minority of patients treated for major depression in routine clinical practice. 
These results show that antidepressant efficacy trials tend to evaluate a subset of depressed individuals with a specific clinical profile.
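The counts quoted in this abstract are enough to work out how small the eligible group is. The sketch below only redoes the arithmetic from the three numbers given above (346, 293, and 252). The final "eligible" figures are derived here and are not quoted from the abstract, and the variable names are my own.

```python
# Counts taken from the abstract above.
depressed = 346              # patients with current major depression
nonpsychotic_unipolar = 293  # left after excluding bipolar/psychotic subtypes
excluded_second_step = 252   # excluded for comorbidity, severity, or suicidal ideation

# Derived quantities (not reported in the abstract itself).
excluded_first_step = depressed - nonpsychotic_unipolar
eligible = nonpsychotic_unipolar - excluded_second_step

print(f"excluded at step 1: {excluded_first_step} ({excluded_first_step / depressed:.1%})")
print(f"excluded at step 2: {excluded_second_step} ({excluded_second_step / nonpsychotic_unipolar:.1%})")
print(f"eligible: {eligible} ({eligible / nonpsychotic_unipolar:.1%} of 293, {eligible / depressed:.1%} of 346)")
```

The second line of output reproduces the 86.0% reported in the abstract, which is a useful check that the counts have been read correctly.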
Generalization and particularization
Statistical Assumptions as Empirical Commitments. Richard A. Berk, David A. Freedman. Accessed on 2001-August. "Researchers who study punishment and social control, like those who study other social phenomena, typically seek to generalize their findings from the data they have to some larger context: in statistical jargon, they generalize from a sample to a population. Generalizations are one important product of empirical inquiry. Of course, the process by which the data are selected introduces uncertainty. Indeed, any given dataset is but one of many that could have been studied. If the dataset had been different, the statistical summaries would have been different, and so would the conclusions, at least by a little." stat-www.berkeley.edu/~census/berk2.pdf
Applying evidence to the individual patient. S. E. Straus, D. L. Sackett. Ann Oncol 1999: 10(1); 29-32. (This paper provides practical guidance on the NNT/NNH tradeoffs.) Abstract not available yet.
Using research findings in clinical practice. S. E. Straus, D. L. Sackett. British Medical Journal 1998: 317(7154); 339-42. [Full text]
Guidelines
Improving the quality of reporting of randomized controlled trials. The CONSORT statement. C. Begg, M. Cho, S. Eastwood, R. Horton, D. Moher, I. Olkin, R. Pitkin, D. Rennie, K. F. Schulz, D. Simel, D. F. Stroup. JAMA 1996: 276(8); 637-9. Abstract not available yet.
Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. D. Moher, A. R. Jadad, G. Nichol, M. Penman, P. Tugwell, S. Walsh. Control Clin Trials 1995: 16(1); p62-73. Assessing the quality of randomized controlled trials (RCTs) is important and relatively new. Quality gives us an estimate of the likelihood that the results are a valid estimate of the truth. We present an annotated bibliography of scales and checklists developed to assess quality. Twenty-five scales and nine checklists have been developed to assess quality. The checklists are most useful in providing investigators with guidelines as to what information should be included in reporting RCTs. The scales give readers a quantitative index of the likelihood that the reported methodology and results are free of bias. There are several shortcomings with these scales. Future scale development is likely to be most beneficial if questions common to all trials are assessed, if the scale is easy to use, and if it is developed with sufficient rigor.
The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. D. Moher, K. F. Schulz, D. G. Altman, L. Lepage. Lancet 2001: 357(9263); p1191-4. To comprehend the results of a randomised controlled trial (RCT), readers must understand its design, conduct, analysis, and interpretation. That goal can be achieved only through total transparency from authors. Despite several decades of educational efforts, the reporting of RCTs needs improvement. Investigators and editors developed the original CONSORT (Consolidated Standards of Reporting Trials) statement to help authors improve reporting by use of a checklist and flow diagram. The revised CONSORT statement presented here incorporates new evidence and addresses some criticisms of the original statement. The checklist items pertain to the content of the Title, Abstract, Introduction, Methods, Results, and Discussion. The revised checklist includes 22 items selected because empirical evidence indicates that not reporting this information is associated with biased estimates of treatment effect, or because the information is essential to judge the reliability or relevance of the findings. We intended the flow diagram to depict the passage of participants through an RCT. The revised flow diagram depicts information from four stages of a trial (enrollment, intervention allocation, follow-up, and analysis). The diagram explicitly shows the number of participants, for each intervention group, included in the primary data analysis. Inclusion of these numbers allows the reader to judge whether the authors have done an intention-to-treat analysis. In sum, the CONSORT statement is intended to improve the reporting of an RCT, enabling readers to understand a trial's conduct and to assess the validity of its results.
Matching
Removal of radiation dose response effects: an example of over-matching. J. L. Marsh, J. L. Hutton, K. Binks. BMJ 2002: 325(7359); 327-30.
Paired versus Two-Sample Design for a Clinical Trial of Treatments with Dichotomous Outcome: Power Considerations. S Wacholder, CR Weinberg. Biometrics 1982: 38(3); 801-812. ABSTRACT: For the same number of observations in a small-sample clinical trial with dichotomous outcome, the statistical power associated with a two-sample design, analyzed by Fisher's exact test, is slightly greater than that associated with a matched design, analyzed by McNemar's test, when the within-pair correlation is zero. The power of the matched design is monotone increasing in the within-pair correlation between the treatment responses. Power curves are presented which demonstrate that positive within-pair correlation, even when quite small, can result in a superiority in power for the matched design. Conversely, in the rare situations where there is a negative within-pair correlation, choice of a two-sample design can result in a substantial gain in power.
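For readers who have not met McNemar's test, named in the abstract above, the following is a minimal sketch of its exact (binomial) form. It is an illustration only and not the authors' power calculation. The function name and the counts b and c (pairs where only the first or only the second treatment succeeded) are assumptions made up for this example.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b = pairs where only treatment A succeeded (hypothetical label)
    c = pairs where only treatment B succeeded (hypothetical label)
    Under the null hypothesis each discordant pair is equally likely to
    favor A or B, so min(b, c) follows a Binomial(b + c, 0.5) tail.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts: 9 discordant pairs favor one treatment, 1 the other.
    print(round(mcnemar_exact_p(1, 9), 4))
```

The test uses only the discordant pairs. One way to read the abstract's point is that positive within-pair correlation makes the two members of a pair tend to agree, so the pairs that still disagree carry the treatment comparison with less noise from between-subject differences.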
Matching in epidemiology as a paradigm for twin research on the Etiology of Disease. C White. Acta Geneticae Medicae Et Gemellologiae 1981: 30(1); 77-86. Abstract not available.
Multiple comparisons
Do multiple outcome measures require p-value adjustment? R. J. Feise. BMC Med Res Methodol 2002: 2(1); 8. BACKGROUND: Readers may question the interpretation of findings in clinical trials when multiple outcome measures are used without adjustment of the p-value. This question arises because of the increased risk of Type I errors (findings of false "significance") when multiple simultaneous hypotheses are tested at set p-values. The primary aim of this study was to estimate the need to make appropriate p-value adjustments in clinical trials to compensate for a possible increased risk in committing Type I errors when multiple outcome measures are used. DISCUSSION: The classicists believe that the chance of finding at least one test statistically significant due to chance and incorrectly declaring a difference increases as the number of comparisons increases. The rationalists have the following objections to that theory: 1) P-value adjustments are calculated based on how many tests are to be considered, and that number has been defined arbitrarily and variably; 2) P-value adjustments reduce the chance of making type I errors, but they increase the chance of making type II errors or needing to increase the sample size. SUMMARY: Readers should balance a study's statistical significance with the magnitude of effect, the quality of the study and with findings from other studies. Researchers facing multiple outcome measures might want to either select a primary outcome measure or use a global assessment measure, rather than adjusting the p-value.
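The "classicist" concern described in this abstract can be made concrete with a short calculation. The sketch below assumes the tests are independent and each is run at the same significance level, which real outcome measures in a single trial rarely satisfy exactly. The function names are made up for this illustration and do not come from the article.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        fwer = familywise_error_rate(0.05, k)
        cutoff = bonferroni_threshold(0.05, k)
        print(f"{k:2d} tests: familywise rate {fwer:.3f}, Bonferroni cutoff {cutoff:.4f}")
```

The shrinking cutoff also illustrates the "rationalist" objection: a stricter per-test threshold lowers power unless the sample size grows, and the answer depends on how many tests are counted.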
Quantitative Evaluation of Multiplicity in Epidemiology and Public Health Research. Kenneth J. Ottenbacher. American Journal of Epidemiology 1998: 147(7); 615-619. ABSTRACT: Epidemiologic and public health researchers frequently include several dependent variables, repeated assessments, or subgroup analyses in their investigations. These factors result in multiple tests of statistical significance and may produce type 1 experimental errors. This study examined the type 1 error rate in a sample of public health and epidemiologic research. A total of 173 articles chosen at random from 1996 issues of the American Journal of Public Health and the American Journal of Epidemiology were examined to determine the incidence of type 1 errors. Three different methods of computing type 1 error rates were used: experiment-wise error rate, error rate per experiment, and percent error rate. The results indicate a type 1 error rate substantially higher than the traditionally assumed level of 5% (p < 0.05). No practical or statistically significant difference was found between type 1 error rates across the two journals. Methods to determine and correct type 1 errors should be reported in epidemiologic and public health research investigations that include multiple statistical tests.
Cured and broiled meat consumption in relation to childhood cancer: Denver, Colorado (United States). S. Sarasua, D. A. Savitz. Cancer Causes Control 1994: 5(2); 141-8. The association between cured and broiled meat consumption by the mother during pregnancy and by the child was examined in relation to childhood cancer. Five meat groups (ham, bacon, or sausage; hot dogs; hamburgers; bologna, pastrami, corned beef, salami, or lunch meat; charcoal broiled foods) were assessed. Exposures among 234 cancer cases (including 56 acute lymphocytic leukemia [ALL], 45 brain tumor) and 206 controls selected by random-digit dialing in the Denver, Colorado (United States) standard metropolitan statistical area were compared, with adjustment for confounders. Maternal hot-dog consumption of one or more times per week was associated with childhood brain tumors (odds ratio [OR] = 2.3, 95 percent confidence interval [CI] = 1.0-5.4). Among children, eating hamburgers one or more times per week was associated with risk of ALL (OR = 2.0, CI = 0.9-4.6) and eating hot dogs one or more times per week was associated with brain tumors (OR = 2.1, CI = 0.7-6.1). Among children, the combination of no vitamins and eating meats was associated more strongly with both ALL and brain cancer than either no vitamins or meat consumption alone, producing ORs of two to seven. The results linking hot dogs and brain tumors (replicating an earlier study) and the apparent synergism between no vitamins and meat consumption suggest a possible adverse effect of dietary nitrites and nitrosamines.
False positive outcomes and design characteristics in occupational cancer epidemiology studies. G. G. Swaen, O. Teggeler, L. G. van Amelsvoort. Int J Epidemiol 2001: 30(5); 948-54. BACKGROUND: Recently there has been considerable debate about possible false positive study outcomes. Several well-known epidemiologists have expressed their concern and the possibility that epidemiological research may lose credibility with policy makers as well as the general public. METHODS: We have identified 75 false positive studies and 150 true positive studies, all published reports and all epidemiological studies reporting results on substances or work processes generally recognized as being carcinogenic to humans. All studies were scored on a number of design characteristics and factors relating to the specificity of the research objective. These factors included type of study design, use of cancer registry data, adjustment for smoking and other factors, availability of exposure data, dose- and duration-effect relationship, magnitude of the reported relative risk, whether the study was considered a 'fishing expedition', affiliation and country of the first author. RESULTS: The strongest factor associated with the false positive or true positive study outcome was if the study had a specific a priori hypothesis. Fishing expeditions had an over threefold odds ratio of being false positive. Factors that decreased the odds ratio of a false positive outcome included observing a dose-effect relationship, adjusting for smoking and not using cancer registry data. CONCLUSION: The results of the analysis reported here clearly indicate that a study with a specific a priori study objective should be valued more highly in establishing a causal link between exposure and effect than a mere fishing expedition.
Invited Commentary: Re: "Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data". John R. Thompson. American Journal of Epidemiology 1998: 147(9); 801-811. Abstract not available.
Observational studies
Observational Studies. PR Rosenbaum (1995) New York: Springer-Verlag.
Postmarketing surveillance study of a non-chlorofluorocarbon inhaler according to the safety assessment of marketed medicines guidelines. J. G. Ayres, C. D. Frost, W. F. Holmes, D. R. Williams, S. M. Ward. British Medical Journal 1998: 317(7163); 926-30. OBJECTIVE: To evaluate the safety of a non-chlorofluorocarbon metered dose salbutamol inhaler. DESIGN: This was a postmarketing surveillance study, conducted under formal guidelines for company sponsored safety assessment of marketed medicines (SAMM). A non-randomised, non-interventional, observational design compared patients prescribed metered doses of salbutamol delivered by inhalers using either hydrofluoroalkane or chlorofluorocarbon as the propellant. Follow up was three months. SETTING: 646 general practices throughout the United Kingdom. SUBJECTS: 6614 patients with obstructive airways disease (1667 patient years of exposure). MAIN OUTCOME MEASURES: Proportions of patients who were: admitted to hospital for respiratory diseases, reported adverse side effects, or withdrew because of adverse effects. RESULTS: There were no significant differences between the hydrofluoroalkane (HFA 134a) and chlorofluorocarbon inhaler groups in relation to the proportions of patients admitted to hospital for respiratory diseases (odds ratio 0.75; 95% confidence interval 0.51 to 1.08) or the proportions who reported adverse events (1.01; 0.88 to 1.17). However, more patients using the hydrofluoroalkane inhaler than the chlorofluorocarbon inhaler withdrew because of adverse events (3.8% and 0.9% respectively). CONCLUSION: The hydrofluoroalkane inhaler was as safe as the chlorofluorocarbon inhaler when judged by hospital admissions and adverse effects. The study design successfully fulfilled the recommendations of the guidelines. Differences between postmarketing surveillance studies and randomised clinical trials in assessing safety were identified. These may lead to difficulties in the design of postmarketing surveillance studies.
Statistical Inquiries into the Efficacy of Prayer. Sir Francis Galton. Fortnightly Review 1872: 12; 125-135. (This article was originally published in 1872 and is reproduced by the Pictures of Health Web Site.) An eminent authority has recently published a challenge to test the efficacy of prayer by actual experiment. I have been induced, through reading this, to prepare the following memoir for publication, nearly the whole of which I wrote and laid by many years ago, after completing a large collection of data, which I had undertaken for the satisfaction of my own conscience.
Dietary fat intake and the risk of coronary heart disease in women. F. B. Hu, M. J. Stampfer, J. E. Manson, E. Rimm, G. A. Colditz, B. A. Rosner, C. H. Hennekens, W. C. Willett. N Engl J Med 1997: 337(21); 1491-9. BACKGROUND: The relation between dietary intake of specific types of fat, particularly trans unsaturated fat, and the risk of coronary disease remains unclear. We therefore studied this relation in women enrolled in the Nurses' Health Study. METHODS: We prospectively studied 80,082 women who were 34 to 59 years of age and had no known coronary disease, stroke, cancer, hypercholesterolemia, or diabetes in 1980. Information on diet was obtained at base line and updated during follow-up by means of validated questionnaires. During 14 years of follow-up, we documented 939 cases of nonfatal myocardial infarction or death from coronary heart disease. Multivariate analyses included age, smoking status, total energy intake, dietary cholesterol intake, percentages of energy obtained from protein and specific types of fat, and other risk factors. RESULTS: Each increase of 5 percent of energy intake from saturated fat, as compared with equivalent energy intake from carbohydrates, was associated with a 17 percent increase in the risk of coronary disease (relative risk, 1.17; 95 percent confidence interval, 0.97 to 1.41; P=0.10). As compared with equivalent energy from carbohydrates, the relative risk for a 2 percent increment in energy intake from trans unsaturated fat was 1.93 (95 percent confidence interval, 1.43 to 2.61; P<0.001); that for a 5 percent increment in energy from monounsaturated fat was 0.81 (95 percent confidence interval, 0.65 to 1.00; P=0.05); and that for a 5 percent increment in energy from polyunsaturated fat was 0.62 (95 percent confidence interval, 0.46 to 0.85; P=0.003). Total fat intake was not significantly related to the risk of coronary disease (for a 5 percent increase in energy from fat, the relative risk was 1.02; 95 percent confidence interval, 0.97 to 1.07; P=0.55). We estimated that the replacement of 5 percent of energy from saturated fat with energy from unsaturated fats would reduce risk by 42 percent (95 percent confidence interval, 23 to 56; P<0.001) and that the replacement of 2 percent of energy from trans fat with energy from unhydrogenated, unsaturated fats would reduce risk by 53 percent (95 percent confidence interval, 34 to 67; P<0.001). CONCLUSIONS: Our findings suggest that replacing saturated and trans unsaturated fats with unhydrogenated monounsaturated and polyunsaturated fats is more effective in preventing coronary heart disease in women than reducing overall fat intake.
A comparison of observational studies and randomized, controlled trials. K. Benson, A. J. Hartz. New England Journal of Medicine 2000: 342(25); 1878-86. BACKGROUND: For many years it has been claimed that observational studies find stronger treatment effects than randomized, controlled trials. We compared the results of observational studies with those of randomized, controlled trials. METHODS: We searched the Abridged Index Medicus and Cochrane data bases to identify observational studies reported between 1985 and 1998 that compared two or more treatments or interventions for the same condition. We then searched the Medline and Cochrane data bases to identify all the randomized, controlled trials and observational studies comparing the same treatments for these conditions. For each treatment, the magnitudes of the effects in the various observational studies were combined by the Mantel-Haenszel or weighted analysis-of-variance procedure and then compared with the combined magnitude of the effects in the randomized, controlled trials that evaluated the same treatment. RESULTS: There were 136 reports about 19 diverse treatments, such as calcium-channel-blocker therapy for coronary artery disease, appendectomy, and interventions for subfertility. In most cases, the estimates of the treatment effects from observational studies and randomized, controlled trials were similar. In only 2 of the 19 analyses of treatment effects did the combined magnitude of the effect in observational studies lie outside the 95 percent confidence interval for the combined magnitude in the randomized, controlled trials. CONCLUSIONS: We found little evidence that estimates of treatment effects in observational studies reported after 1984 are either consistently larger than or qualitatively different from those obtained in randomized, controlled trials.
Interpreting the evidence: choosing between randomised and non-randomised studies. M McKee, A Britton, N Black, K McPherson, C Sanderson, C Bain. British Medical Journal 1999: 319(7205); 312-15. Abstract not available.
The arrogance of preventive medicine. D. L. Sackett. CMAJ 2002: 167(4); 363-4.
Fat chance: diet and ischemic stroke [editorial; comment]. R. Sherwin, T. R. Price. JAMA 1997: 278(24); 2185-6. Abstract not available.
Smoking as "independent" risk factor for suicide: illustration of an artifact from observational epidemiology? G. D. Smith, A. N. Phillips, J. D. Neaton. Lancet 1992: 340(8821); 709-12. Two widely used criteria for determining whether an association between a risk factor and a disease is causal are dose response and independence from other factors. Data from a large US risk factor study (MRFIT) throw up a relation between cigarette smoking and suicide that meets these criteria, yet appears to be biologically implausible. It is likely that many more such associations, for other exposures and other diseases, are equally spurious, but are protected by their lack of obvious implausibility.
Statistics in Action. M.H. Gail. Journal of the American Statistical Association 1996: 91(433); 1-13. Abstract not available.
Epidemiology faces its limits. G. Taubes. Science 1995: 269(5221); p164-9. Abstract not available.
Systematic reviews and lifelong diseases. H. E. Elphick, A. Tan, D. Ashby, R. L. Smyth. BMJ 2002: 325(7360); 381-4. Systematic reviews of randomised controlled trials provide an evidence base for treatment but too often fail to give adequate information on long term outcomes. Elphick and colleagues discuss the limitations of the systematic review of randomised controlled trials for patients with chronic or lifelong diseases and suggest that long term observational studies have a place in the evaluation of the benefits and risks of treatment.
Outcomes
Statistical issues in randomized trials of cancer screening. S. G. Baker, B. S. Kramer, P. C. Prorok. BMC Med Res Methodol 2002: 2(1); 11. BACKGROUND: The evaluation of randomized trials for cancer screening involves special statistical considerations not found in therapeutic trials. Although some of these issues have been discussed previously, we present important recent and new methodologies. METHODS: Our emphasis is on simple approaches. RESULTS: We make the following recommendations: (1) Use death from cancer as the primary endpoint, but review death records carefully and report all causes of death. (2) Use a simple "causal" estimate to adjust for nonattendance and contamination occurring immediately after randomization. (3) Use a simple adaptive estimate to adjust for dilution in follow-up after the last screen. CONCLUSION: The proposed guidelines combine recent methodological work on screening endpoints and noncompliance/contamination with a new adaptive method to adjust for dilution in a study where follow-up continues after the last screen. These guidelines ensure good practice in the design and analysis of randomized trials of cancer screening.
The influence of semen analysis parameters on the fertility potential of infertile couples. C. Ayala, E. Steinberger, D. P. Smith. Journal of Andrology 1996: 17(6); 718-25. The objective of this study was to investigate the relationship between couples' fertility potential and several parameters of semen analysis (from a single semen sample/male partner) in a cohort of 1,055 infertile couples seen at the Texas Institute for Reproductive Medicine and Endocrinology for a total of 9,409 follow-up months. The medians of sperm concentrations (SC), total sperm counts (TSC), percent motility (MOT), motile sperm concentrations (MSC), and total motile sperm counts (TMSC) were significantly higher (P < 0.0001) in the group that achieved pregnancy. When the entire group was divided into "high" and "low" groups on the basis of the various parameters of semen analysis, the relative risk ratios for conception for the "high" groups were as follows: SC, 1.5; MOT, 8.5; TSC, 8.1; MSC, 5.8; and TMSC, 6.1. Life table analysis showed a statistically significant difference (P < 0.0001) in the initial rise and overall slope of the conception rates between the two groups for a number of the semen analysis parameters (TSC, MOT, MSC, and TMSC). This study showed that certain semen analysis parameters are positively correlated, with a high degree of statistical probability, with the time required for the occurrence of conception. The quantitative impact of the male fertility potential on conception rates was shown to correlate not solely with the SC or MOT values, but even more so with their derivatives (i.e., MSC and TMSC). Therefore, in an in vivo environment it is not only the number of sperm and their motility but also their derivatives that provide a quantitative insight into the male fertility potential. The data may provide a quantitative expression of the relative risk ratio for conception to occur and the time required until conception is achieved. 
Further studies will be necessary to clarify the effect of the other semen analysis parameters (i.e., morphology, velocity, linearity, and "efficient" MSC) on conception rates, cumulative conception rates, relative risk ratio for conception, and time until conception in a large population of infertile couples.
Reporting on quality of life in randomised controlled trials: bibliographic study. C. Sanders, M. Egger, J. Donovan, D. Tallon, S. Frankel. BMJ 1998: 317(7167); 1191-4. OBJECTIVES: To examine the frequency and quality of reporting on quality of life in randomised controlled trials. DESIGN: Search of the Cochrane Controlled Trials Register 1980 to 1997 to identify trials from all disciplines, from oncology, and from cardiovascular medicine that reported on quality of life. Assessment of abstracts from articles published from 1993 to 1996. Assessment of a sample of full reports with a standardised instrument. MAIN OUTCOME MEASURES: Prevalence of reporting on quality of life. Conditions and interventions studied in trials reporting on quality of life. Quality of reporting on quality of life. RESULTS: During 1980-97 reporting on quality of life increased from 0.63% to 4.2% for trials from all disciplines, from 1.5% to 8.2% for cancer trials, and from 0.34% to 3.6% for cardiovascular trials. Of 364 abstracts, 65% reported on drug interventions. Of a sample of 67 full reports, authors of 48 (72%) used 62 established quality of life instruments. In 15 reports (22%) authors developed their own measures, and in 2 (3%) methods were unclear. Response rates were given in 38 (57%), and complete reporting on all items and scales occurred in 31 (46%). CONCLUSIONS: Less than 5% of all randomised controlled trials reported on quality of life, and this proportion was below 10% even for cancer trials. A plethora of instruments was used in different studies, and the reporting of methods and results was often inadequate. Standards for the measurement and reporting of quality of life in clinical trials research need to be developed.
Reporting on quality of life in RCTs. Susan P. Wright. British Medical Journal 1999: 318(7191); 1142.
Outliers
Ozone Depletion, History and politics. Brien Sparling. Accessed on 2002-11-27. "Ground based measurements of Ozone were first started in 1956, at Halley Bay, Antarctica. Satellite measurements of ozone started in the early 70's, but the first comprehensive worldwide measurements started in 1978 with the Nimbus-7 satellite. Nimbus-7 carried a TOMS (total ozone mapping spectrometer) and a SBUV (solar backscatter UV meter). The TOMS finally broke on May 7th, 1993, but today there are several different satellites measuring concentrations of ozone and other atmospheric gases. Gases in the troposphere and lower stratosphere are sampled by weather balloons or by airplanes such as the ER-2 managed by NASA." www.nas.nasa.gov/About/Education/Ozone/history.html
Post hoc changes
Cancer Clusters: Finding Vs. Feelings. David Robinson, Medscape. Accessed on 2003-05-09. "Several challenges bedevil any cancer cluster investigation and can result in ambiguous or misleading conclusions. This report discusses the potential cancer clusters in Toms River, New Jersey and Long Island, New York, because they contain many elements typical of cancer cluster investigations and have received considerable media attention." Posted 11/06/2002. www.medscape.com/viewarticle/442554_1
Things to know and do about cancer clusters. T. Aldrich, T. Sinks. Cancer Invest 2002: 20(5-6); 810-6. Perceived cancer clusters present difficulties and opportunities for clinicians and public health officials alike. Public health officials receive reports of perceived cancer clusters, evaluate the validity of these reports, and/or launch investigations to identify potential causes. Clinicians interact directly with the affected patients, families, or community representatives who question the occurrence of cancer and the underlying causes. Clinicians may identify cancer clusters when they question the unusual occurrence of a rare form of cancer within their practice or community. In addition, clinicians may be asked to discuss cancer clusters and inform local debates. In this paper, we describe the public health practice experience with cancer clusters and identify cancer prevention and control opportunities for clinicians and public health officials. Scientific investigations of cancer clusters rarely uncover new knowledge about the causes of cancer. However, a set of common characteristics, unique to etiologic cluster investigations have uncovered new information about the causes of cancer or demonstrated a preventable link to a known carcinogen. These characteristics may provide useful clues for sorting out the small number of clusters worthy of further scientific investigation. Public awareness of cancer clusters may promote an opportunity to inform and motivate people about the preventable causes of cancer and effective cancer screening methods.
Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. D. G. Altman, B. Lausen, W. Sauerbrei, M. Schumacher. Journal of the National Cancer Institute 1994: 86(11); 829-35. Abstract not available yet.
Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin. A randomized controlled trial. Nutritional Prevention of Cancer Study Group. L. C. Clark, G. F. Combs, Jr., B. W. Turnbull, E. H. Slate, D. K. Chalker, J. Chow, L. S. Davis, R. A. Glover, G. F. Graham, E. G. Gross, A. Krongrad, J. L. Lesher, Jr., H. K. Park, B. B. Sanders, Jr., C. L. Smith, J. R. Taylor. JAMA 1996: 276(24); 1957-63. OBJECTIVE: To determine whether a nutritional supplement of selenium will decrease the incidence of cancer. DESIGN: A multicenter, double-blind, randomized, placebo-controlled cancer prevention trial. SETTING: Seven dermatology clinics in the eastern United States. PATIENTS: A total of 1312 patients (mean age, 63 years; range, 18-80 years) with a history of basal cell or squamous cell carcinomas of the skin were randomized from 1983 through 1991. Patients were treated for a mean (SD) of 4.5 (2.8) years and had a total follow-up of 6.4 (2.0) years. INTERVENTIONS: Oral administration of 200 microg of selenium per day or placebo. MAIN OUTCOME MEASURES: The primary end points for the trial were the incidences of basal and squamous cell carcinomas of the skin. The secondary end points, established in 1990, were all-cause mortality and total cancer mortality, total cancer incidence, and the incidences of lung, prostate, and colorectal cancers. RESULTS: After a total follow-up of 8271 person-years, selenium treatment did not significantly affect the incidence of basal cell or squamous cell skin cancer. There were 377 new cases of basal cell skin cancer among patients in the selenium group and 350 cases among the control group (relative risk [RR], 1.10; 95% confidence interval [CI], 0.95-1.28), and 218 new squamous cell skin cancers in the selenium group and 190 cases among the controls (RR, 1.14; 95% CI, 0.93-1.39). Analysis of secondary end points revealed that, compared with controls, patients treated with selenium had a nonsignificant reduction in all-cause mortality (108 deaths in the selenium group and 129 deaths in the control group [RR, 0.83; 95% CI, 0.63-1.08]) and significant reductions in total cancer mortality (29 deaths in the selenium treatment group and 57 deaths in controls [RR, 0.50; 95% CI, 0.31-0.80]), total cancer incidence (77 cancers in the selenium group and 119 in controls [RR, 0.63; 95% CI, 0.47-0.85]), and incidences of lung, colorectal, and prostate cancers. Primarily because of the apparent reductions in total cancer mortality and total cancer incidence in the selenium group, the blinded phase of the trial was stopped early. No cases of selenium toxicity occurred. CONCLUSIONS: Selenium treatment did not protect against development of basal or squamous cell carcinomas of the skin. However, results from secondary end-point analyses support the hypothesis that supplemental selenium may reduce the incidence of, and mortality from, carcinomas of several sites. These effects of selenium require confirmation in an independent trial of appropriate design before new public health recommendations regarding selenium supplementation can be made.
Journals should see original protocols for clinical trials. C J Hawkey. BMJ 2001: 323(7324); 1309-. Abstract not available yet.
Randomised controlled trial of cardiotocography versus Doppler auscultation of fetal heart at admission in labour in low risk obstetric population. G. Mires, F. Williams, P. Howie. British Medical Journal 2001: 322(7300); 1457-60; discussion 1460-2. (See "Commentary: changes between protocol and manuscript should be declared at submission" at the end of this article.) OBJECTIVE: To compare the effect of admission cardiotocography and Doppler auscultation of the fetal heart on neonatal outcome and levels of obstetric intervention in a low risk obstetric population. DESIGN: Randomised controlled trial. SETTING: Obstetric unit of teaching hospital. PARTICIPANTS: Pregnant women who had no obstetric complications that warranted continuous monitoring of fetal heart rate in labour. INTERVENTION: Women were randomised to receive either cardiotocography or Doppler auscultation of the fetal heart when they were admitted in spontaneous uncomplicated labour. MAIN OUTCOME MEASURES: The primary outcome measure was umbilical arterial metabolic acidosis. Secondary outcome measures included other measures of condition at birth and obstetric intervention. RESULTS: There were no significant differences in the incidence of metabolic acidosis or any other measure of neonatal outcome among women who remained at low risk when they were admitted in labour. However, compared with women who received Doppler auscultation, women who had admission cardiotocography were significantly more likely to have continuous fetal heart rate monitoring in labour (odds ratio 1.49, 95% confidence interval 1.26 to 1.76), augmentation of labour (1.26, 1.02 to 1.56), epidural analgesia (1.33, 1.10 to 1.61), and operative delivery (1.36, 1.12 to 1.65). CONCLUSIONS: Compared with Doppler auscultation of the fetal heart, admission cardiotocography does not benefit neonatal outcome in low risk women. Its use results in increased obstetric intervention, including operative delivery.
Celestial determinants of success in research. R. Pollex, B. Hegele, M.R. Ban. CMAJ 2001: 165(12); 1584. Abstract not available.
Quality
Bias. Bandolier. Accessed on 2003-03-25. "Bandolier has been struck of late, 'many a time and oft', by the continuing and cavalier attitude towards bias in clinical trials. We know that the way that clinical trials are designed and conducted can influence their results. Yet people still ignore known sources of bias when making decisions about treatments at all levels." www.jr2.ox.ac.uk/bandolier/band80/b80-2.html
Poor-quality medical research: what can journals do? D. G. Altman. JAMA 2002: 287(21); 2765-7. The aim of medical research is to advance scientific knowledge and hence--directly or indirectly--lead to improvements in the treatment and prevention of disease. Each research project should continue systematically from previous research and feed into future research. Each project should contribute beneficially to a slowly evolving body of research. A study should not mislead; otherwise it could adversely affect clinical practice and future research. In 1994 I observed that research papers commonly contain methodological errors, report results selectively, and draw unjustified conclusions. Here I revisit the topic and suggest how journal editors can help.
Assessing the quality of randomized controlled trials. Current issues and future directions. D. Moher, A. R. Jadad, P. Tugwell. Int J Technol Assess Health Care 1996: 12(2); 195-208. Assessing the quality of randomized controlled trials is a relatively new and important development. Three approaches have been developed: component, checklist, and scale assessment. Component approaches evaluate selected aspects of trials, such as masking. Checklists and scales involve lists of items thought to be integral to study quality. Scales, unlike the other methods, provide a summary numeric score of quality, which can be formally incorporated into a systematic review. Most scales to date have not been developed with sufficient rigor, however. Empirical evidence indicates that differences in scale development can lead to important differences in quality assessment. Several methods for including quality scores in systematic reviews have been proposed, but since little empirical evidence supports any given method, results must be interpreted cautiously. Future efforts may be best focused on gathering more empirical evidence to identify trial characteristics directly related to bias in the estimates of intervention effects and on improving the way in which trials are reported.
Clinical trials in general surgical journals: are methods better reported? L. P. Schumm, J. S. Fisher, R. A. Thisted, J. Olak. Surgery 1999: 125(1); 41-5. BACKGROUND: Reports of clinical trials often lack adequate descriptions of their design and analysis. Thus readers cannot properly assess the strength of the findings and are limited in their ability to draw their own conclusions. A review of 6 surgical journals in 1984 revealed that the frequency of reporting 11 basic elements of design and analysis in clinical trials was only 59%. This study attempted to identify areas that still need improvement. METHODS: Eligible studies published from July 1995 through June 1996 included all reports of comparative clinical trials on human subjects that were prospective and had at least 2 treatment arms. A total of 68 articles published in 6 general surgery journals were reviewed. The frequency that the previously identified 11 basic elements of design and analysis were reported was determined. RESULTS: Seventy-four percent of all items were reported accurately (a 15% increase from the previous study), 4% were reported ambiguously, and 23% were not reported; improvement was seen in every journal. The reporting of eligibility criteria and statistical power improved the most. For 3 items, reporting was still not adequate; 32% of reports provided information about statistical power, 40% about the method of randomization, and 49% about whether the person assessing outcomes was blind to the treatment assignment. CONCLUSIONS: Improvements have been made in reporting surgical clinical trials, but in general methodologic questions poorly answered in the 1980s continue to be answered poorly in the 1990s. Editors of surgical journals are urged to provide authors with guidelines on how to report clinical trial design and analysis.
Many reports of RCTs give insufficient data for Cochrane reviewers. EH Walters, JA Walters. BMJ 1999: 319(7204); 257. ("Editors of journals should insist that, rather than giving the general statement that the design was randomised and double blind, reports should give a short description of the randomisation method used." and "In our series we have been able to extract fully all the data on reported outcomes in only six of the 30 papers; 15 yielded none, because what was presented was derivative (such as the change from baseline) or merely the P value for some statistical comparison.") Abstract not available.
Randomization
Minimisation: the platinum standard for trials? Randomisation doesn't guarantee similarity of groups; minimisation does [editorial] [see comments]. T Treasure, KD MacRae. BMJ 1998: 317(7155); 362-63. Abstract not available.
The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers. The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group. NEJM 1994: 330(15); 1029-35. ABSTRACT: BACKGROUND. Epidemiologic evidence indicates that diets high in carotenoid-rich fruits and vegetables, as well as high serum levels of vitamin E (alpha-tocopherol) and beta carotene, are associated with a reduced risk of lung cancer. METHODS. We performed a randomized, double-blind, placebo-controlled primary-prevention trial to determine whether daily supplementation with alpha-tocopherol, beta carotene, or both would reduce the incidence of lung cancer and other cancers. A total of 29,133 male smokers 50 to 69 years of age from southwestern Finland were randomly assigned to one of four regimens: alpha-tocopherol (50 mg per day) alone, beta carotene (20 mg per day) alone, both alpha-tocopherol and beta carotene, or placebo. Follow-up continued for five to eight years. RESULTS. Among the 876 new cases of lung cancer diagnosed during the trial, no reduction in incidence was observed among the men who received alpha-tocopherol (change in incidence as compared with those who did not, -2 percent; 95 percent confidence interval, -14 to 12 percent). Unexpectedly, we observed a higher incidence of lung cancer among the men who received beta carotene than among those who did not (change in incidence, 18 percent; 95 percent confidence interval, 3 to 36 percent). We found no evidence of an interaction between alpha-tocopherol and beta carotene with respect to the incidence of lung cancer. Fewer cases of prostate cancer were diagnosed among those who received alpha-tocopherol than among those who did not. Beta carotene had little or no effect on the incidence of cancer other than lung cancer.
Alpha-tocopherol had no apparent effect on total mortality, although more deaths from hemorrhagic stroke were observed among the men who received this supplement than among those who did not. Total mortality was 8 percent higher (95 percent confidence interval, 1 to 16 percent) among the participants who received beta carotene than among those who did not, primarily because there were more deaths from lung cancer and ischemic heart disease. CONCLUSIONS. We found no reduction in the incidence of lung cancer among male smokers after five to eight years of dietary supplementation with alpha-tocopherol or beta carotene. In fact, this trial raises the possibility that these supplements may actually have harmful as well as beneficial effects.
Randomized Controlled Trials: Evidence Biased Psychiatry. David Healy, Alliance for Human Research Protection. Accessed in 2002. Abstract not available. "A new drug gets introduced to the market. It has been approved after stringent scrutiny by the FDA, which requires ever more convincing evidence that it works and that its safe. The new treatment will always cost more than the old treatments, but even on the cost front, many would argue that we have entered an era where placebo controlled clinical trials demonstrate that new in contrast to older treatments actually do work, and if we just stick to treatments that really work costs should fall. Besides it always seems to happen these days that when new and costly antidepressants or antipsychotics are put through an economic model based on the figures from clinical trials and a range of assumptions provided by experts, the model demonstrates that these new drugs costing thousand of dollars a year are in fact cheaper than treatments costing $100 per year or less. So where could the problems lie? Why do we seem to be so slow in reaching the new medical utopia towards which companies and others assure us we are heading?" www.researchprotection.org/COI/healy0802.html
Issues to Consider When Designing RCTs for CAM Therapies. House of Lords United Kingdom Parliament. Accessed on 2002-12-23. www.parliament.the-stationery-office.co.uk/pa/ld199900/ldselect/ldsctech/123/12315.htm#a68
Difficulties of Randomised Controlled Trials. House of Lords United Kingdom Parliament. Accessed on 2002-12-31. "Concerns over RCTs distorting a therapy or disguising its efficacy are not the unique concerns of CAM practitioners. Vincent & Furnham suggest that as attempts to apply the RCT to a wider and wider range of treatments have occurred, more and more problems have been uncovered. They list 10 such problems." www.parliament.the-stationery-office.co.uk/pa/ld199900/ldselect/ldsctech/123/12323.htm
Coronary artery surgery study (CASS): a randomized trial of coronary artery bypass surgery. Comparability of entry characteristics and survival in randomized patients and nonrandomized patients meeting randomization criteria. CASS Principal Investigators and Their Associates. Journal of the American College of Cardiology 1984: 3(1); 114-28. The Coronary Artery Surgery Study (CASS) includes a randomized trial of coronary artery bypass surgery and medical therapy in the management of patients with mild or moderate stable angina pectoris or free of angina but with a documented history of myocardial infarction. While 780 patients at 11 participating institutions entered the randomized trial, 1,315 patients at the same institutions met randomization criteria but declined participation in the randomized study; they constitute the "randomizable" patients. Half the randomized patients were assigned to surgery and half to the medical group. Of the 1,315 randomizable patients, 43% started with surgical therapy and 57% constitute the medical group. Follow-up periods average 64 months (range 46 to 92). The only entry characteristic in which the randomized and randomizable medical groups differ importantly is the extent of coronary artery disease, which is less extensive in the latter. The two surgical groups also differ in this respect, but with more extensive disease in the randomizable group. At 5 year follow-up, 24% of the medically-assigned randomized patients and 22% of the medically-started randomizable patients have had coronary bypass surgery. Survival in the medically-randomized and randomizable patient groups is similar in the aggregate (both 92% at 5 years) and also in all subgroups based on clinical classification, the number of diseased vessels, the presence of proximal left anterior descending coronary artery disease and ejection fraction. 
Survival for the surgically-assigned randomized patients and the surgically-started randomizable patients is also similar in the aggregate (95 and 94%, respectively) and in all subgroups. It is concluded that the randomized patients in CASS are not a special or atypical subset of those eligible for randomization. The data from the randomizable patients thus support and extend the inference of the generally very good survival of both the medically- and surgically-assigned patients of the randomized trial.
Evidence from randomised trials on the long-term effects of hormone replacement therapy. V. Beral, E. Banks, G. Reeves. Lancet 2002: 360(9337); 942-4. CONTEXT: Over the past few decades hormone replacement therapy (HRT) has been used increasingly by post-menopausal women in western countries. The need for objective data on long-term effects prompted the setting up of randomised trials to compare cancer and cardiovascular disease endpoints in HRT users and non-users. With the early termination of part of the Women's Health Initiative trial (JAMA 2002; 288: 321-33), it is timely to review the evidence from such studies. STARTING POINT: Four randomised trials including over 20000 women followed up for 4.9 years, on average, have now reported on the effect of HRT for major, potentially fatal, conditions. Overall, HRT users had a significantly increased incidence of breast cancer, stroke, and pulmonary embolism; a significantly reduced incidence of colorectal cancer and fractured neck of femur; but no significant change in endometrial cancer or coronary heart disease. There was no significant variation across the trials in the results for any condition. Three trials had recruited women with previous cardiovascular disease and the fourth, the Women's Health Initiative, had recruited healthy women. Combined oestrogen/progestagen HRT was used in three trials and oestrogen alone in one. Use of HRT over a 5-year period by healthy postmenopausal women in western countries is estimated to cause an extra breast cancer, stroke, or pulmonary embolus in about 6 per 1000 users aged 50-59 and 12 per 1000 aged 60-69. Over the same period, the estimated reduction in incidence of colorectal cancer or fractured neck of femur is 1.7 per 1000 users aged 50-59 and 5.5 per 1000 aged 60-69. The increased incidence of any one of these conditions is greater than any reduction, the estimated net excess over 5 years being 1 per 230 users aged 50-59, and 1 per 150 aged 60-69.
WHERE NEXT: Substantial new data should soon be available from randomised trials of oestrogen-alone HRT versus placebo, whereas few additional trial data on combined HRT are expected for about a decade. Existing randomised trials are too small to describe reliably the effect of HRT on important but rarer conditions, such as ovarian cancer, or on cause-specific mortality. Nor will they provide information about other types of oestrogen or progestagen. Answers to such questions will require judicious analysis and interpretation of data from observational studies.
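The "1 per 230" and "1 per 150" figures in the Beral abstract follow from simple subtraction, and it can help to redo that arithmetic when reading a summary like this. The sketch below is only an illustration: the function names are my own, and the inputs are the per-1000 figures quoted in the abstract, nothing more.

```python
# Figures per 1000 HRT users over 5 years, as quoted in the abstract.
estimates = {
    "50-59": {"excess_harm": 6.0, "reduced_harm": 1.7},
    "60-69": {"excess_harm": 12.0, "reduced_harm": 5.5},
}


def net_excess_per_1000(excess_harm, reduced_harm):
    """Extra harmful events minus prevented events, per 1000 users."""
    return excess_harm - reduced_harm


def one_in_n(rate_per_1000):
    """Convert a rate per 1000 into the 'one event per N users' form."""
    return 1000.0 / rate_per_1000


for age_group, e in estimates.items():
    net = net_excess_per_1000(e["excess_harm"], e["reduced_harm"])
    print(f"{age_group}: net excess {net:.1f} per 1000, "
          f"roughly 1 per {one_in_n(net):.0f} users")
```

The results land close to the rounded figures the authors report, which suggests the net excess is a plain difference of the two rates rather than the output of a more elaborate model.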
Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments. I. Chalmers. Int J Epidemiol 2001: 30(5); 1156-64. Histories of clinical trials have recorded and analysed the development of quantification in therapeutic evaluation, the emergence of probabilistic thinking, the application of statistical methods and theory, and the sociology, ethics and politics of clinical trials; but it is surprising that they only rarely identify as a distinct theme the development of efforts to control biases. An exception is Kaptchuk's recent account of the history of blinding and placebos for reducing observer biases. In this complementary paper I introduce and discuss some milestones between 1662 and 1948 in the development of methods to control selection biases when assembling therapeutic comparison groups, to ensure, as far as possible, that 'like is compared with like'. In the paper I note (i) that treatment allocation based on strict alternation abolishes selection bias as effectively as treatment allocation based on strict random allocation; (ii) that use of schedules based on random numbers is more likely to prevent foreknowledge of allocation schedules, and thus the risk of introducing selection bias at the point of recruitment to trials; (iii) that a concern to conceal allocation schedules was the rationale for using schedules based on random numbers in the Medical Research Council trials of vaccination for whooping cough and streptomycin for pulmonary tuberculosis; and (iv) that the introduction of allocation concealment more than half a century ago remains the most recent substantive milestone in the history of efforts to control selection biases in therapeutic experiments.
Comparison of evidence of treatment effects in randomized and nonrandomized studies. J. P. Ioannidis, A. B. Haidich, M. Pappa, N. Pantazis, S. I. Kokori, M. G. Tektonidou, D. G. Contopoulos-Ioannidis, J. Lau. JAMA 2001: 286(7); 821-30. CONTEXT: There is substantial debate about whether the results of nonrandomized studies are consistent with the results of randomized controlled trials on the same topic. OBJECTIVES: To compare results of randomized and nonrandomized studies that evaluated medical interventions and to examine characteristics that may explain discrepancies between randomized and nonrandomized studies. DATA SOURCES: MEDLINE (1966-March 2000), the Cochrane Library (Issue 3, 2000), and major journals were searched. STUDY SELECTION: Forty-five diverse topics were identified for which both randomized trials (n = 240) and nonrandomized studies (n = 168) had been performed and had been considered in meta-analyses of binary outcomes. DATA EXTRACTION: Data on events per patient in each study arm and design and characteristics of each study considered in each meta-analysis were extracted and synthesized separately for randomized and nonrandomized studies. DATA SYNTHESIS: Very good correlation was observed between the summary odds ratios of randomized and nonrandomized studies (r = 0.75; P<.001); however, nonrandomized studies tended to show larger treatment effects (28 vs 11; P =.009). Between-study heterogeneity was frequent among randomized trials alone (23%) and very frequent among nonrandomized studies alone (41%). The summary results of the 2 types of designs differed beyond chance in 7 cases (16%). Discrepancies beyond chance were less common when only prospective studies were considered (8%). Occasional differences in sample size and timing of publication were also noted between discrepant randomized and nonrandomized studies.
In 28 cases (62%), the natural logarithm of the odds ratio differed by at least 50%, and in 15 cases (33%), the odds ratio varied at least 2-fold between nonrandomized studies and randomized trials. CONCLUSIONS: Despite good correlation between randomized trials and nonrandomized studies-in particular, prospective studies-discrepancies beyond chance do occur and differences in estimated magnitude of treatment effect are very common.
Evaluating complementary medicine: methodological challenges of randomised controlled trials. S. Mason, P. Tovey, A. F. Long. BMJ 2002: 325(7368); 832-4.
Research into complementary and alternative medicine: problems and potential. R. L. Nahin, S. E. Straus. British Medical Journal 2001: 322(7279); 161-4.
Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. S. J. Pocock, R. Simon. Biometrics 1975: 31(1); 103-15. In controlled clinical trials there are usually several prognostic factors known or thought to influence the patient's ability to respond to treatment. Therefore, the method of sequential treatment assignment needs to be designed so that treatment balance is simultaneously achieved across all such patient factors. Traditional methods of restricted randomization such as "permuted blocks within strata" prove inadequate once the number of strata, or combinations of factor levels, approaches the sample size. A new general procedure for treatment assignment is described which concentrates on minimizing imbalance in the distributions of treatment numbers within the levels of each individual prognostic factor. The improved treatment balance obtained by this approach is explored using simulation for a simple model of a clinical trial. Further discussion centers on the selection, predictability and practicability of such a procedure.
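The idea of "minimizing imbalance within the levels of each individual prognostic factor" is easier to see in a few lines of code. What follows is a deliberately simplified sketch, not the procedure as published: it scores imbalance as the range of arm counts summed over factors, always picks the lowest score, and breaks ties at random. The function name, the factor names, and the example patients are all made up for illustration, and the published method also allows weighting of factors and a random element in every assignment.

```python
import random


def minimisation_assign(new_patient, enrolled, arms=("A", "B"), rng=None):
    """Choose the arm giving the smallest total marginal imbalance.

    new_patient: dict mapping factor name -> level (hypothetical factors)
    enrolled: list of (patient_dict, arm) pairs already allocated
    """
    rng = rng or random.Random()
    scores = {}
    for candidate in arms:
        total = 0
        for factor, level in new_patient.items():
            # Count earlier patients in each arm who share this level,
            # then pretend the new patient joins `candidate`.
            counts = {arm: 0 for arm in arms}
            for patient, arm in enrolled:
                if patient.get(factor) == level:
                    counts[arm] += 1
            counts[candidate] += 1
            total += max(counts.values()) - min(counts.values())
        scores[candidate] = total
    best = min(scores.values())
    # Ties are broken at random so the allocation is not fully predictable.
    return rng.choice([arm for arm in arms if scores[arm] == best])


enrolled = [
    ({"sex": "F", "age": "<60"}, "A"),
    ({"sex": "F", "age": ">=60"}, "A"),
    ({"sex": "M", "age": "<60"}, "B"),
]
next_arm = minimisation_assign({"sex": "F", "age": "<60"}, enrolled)
print(next_arm)
```

In this toy example two women are already in arm A, so placing a third woman there would widen the imbalance on sex, and the sketch sends her to arm B. The predictability that several of the letters in this section argue about comes from exactly this step: anyone who knows the running totals can often guess the next assignment.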
Randomised block design is more powerful than minimisation [letter]. N. Ross. British Medical Journal 1999: 318(7178); 263-4. Abstract not available.
Patients' preferences and randomised trials. W. A. Silverman, D. G. Altman. Lancet 1996: 347(8995); 171-4. Abstract not available.
Casting and Drawing Lots. W.A. Silverman, I. Chalmers. In: Controlled Trials from History. Ed. I. Chalmers, I. Milne, and U. Trohler. 2001; Vol.
Patient Heterogeneity in Clinical Trials. Richard Simon. Cancer Treatment Reports 1980: 64(2-3); 405-410. (Valuable comments on stratification and generalizability.) ABSTRACT: Interpretation of therapeutic results is complicated by variability in response among patients. This paper reviews fundamental statistical principles for the design of clinical trials. These methods seek to evaluate relative therapeutic efficacy in the presence of patient heterogeneity. Statistical science has more to offer therapeutics than significance tests among "comparable" treatment groups. The role of randomization and stratification is reviewed. The importance of study design, including patient eligibility and therapeutic standardization, to the generalization of conclusions is discussed.
Minimization: A new method of assigning patients to treatment and control groups. Donald R. Taves. Clinical Pharmacology and Therapeutics 1974: 15(5); 443-453. Abstract not available.
Use of unequal randomisation to aid the economic efficiency of clinical trials. David J Torgerson, Marion K Campbell. BMJ 2000: 321; 759. Abstract not available yet.
Minimisation is much better than the randomised block design in certain cases. Tom Treasure, KD MacRae. British Medical Journal 1999: 318(7195); 1420. Abstract not available.
Investigating Therapies of Potentially Great Benefit: ECMO. J.H. Ware. Statistical Science 1989: 4(4); 298-317.
Mammography and the politics of randomised controlled trials. J. Wells. BMJ 1998: 317(7167); 1224-9.
Randomised controlled trial of laparoscopic versus open mesh repair for inguinal hernia: outcome and cost. J. Wellwood, M. J. Sculpher, D. Stoker, G. J. Nicholls, C. Geddes, A. Whitehead, R. Singh, D. Spiegelhalter. Bmj 1998: 317(7151); 103-10. OBJECTIVE: To compare tension-free open mesh hernioplasty under local anaesthetic with transabdominal preperitoneal laparoscopic hernia repair under general anaesthetic. DESIGN: A randomised controlled trial of 403 patients with inguinal hernias. SETTING: Two acute general hospitals in London between May 1995 and December 1996. SUBJECTS: 400 patients with a diagnosis of groin hernia, 200 in each group. Main outcome measures: Time until discharge, postoperative pain, and complications; patients' perceived health (SF-36), duration of convalescence, and patients' satisfaction with surgery; and health service costs. RESULTS: More patients in the open group (96%) than in the laparoscopic group (89%) were discharged on the same day as the operation (chi2 = 6.7; 1 df; P=0.01). Although pain scores were lower in the open group while the effect of the local anaesthetic persisted (proportional odds ratio at 2 hours 3.5 (2.3 to 5.1)), scores after open repair were significantly higher for each day of the first week (0.5 (0.3 to 0.7) on day 7) and during the second week (0.7 (0.5 to 0.9)). At 1 month there was a greater improvement (or less deterioration) in mean SF-36 scores over baseline in the laparoscopic group compared with the open group on seven of eight dimensions, reaching significance on five. For every activity considered the median time until return to normal was significantly shorter for the laparoscopic group. Patients randomised to laparoscopic repair were more satisfied with surgery at 1 month and 3 months after surgery. The mean cost per patient of laparoscopic repair was 335 pounds (95% confidence interval 228 pounds to 441 pounds) more than the cost of open repair. 
CONCLUSION: This study confirms that laparoscopic hernia repair has considerable short term clinical advantages after discharge compared with open mesh hernioplasty, although it was more expensive.
The protective effect of auto-immune buccal urine therapy (AIBUT) against the Raynaud phenomenon. C. W. Wilson. Med Hypotheses 1984: 13(1); 99-107. The efficacy of Auto-Immune Buccal Urine Therapy (AIBUT) against allergic symptoms depends upon sublingual administration of the correct dose of urine as determined by bio-assay in individual patients. Succeeding effective turn-off doses occur at the troughs of a sinusoidal dose-response curve. Efficacy of the administered dose is confirmed by reduction in the severity and duration of Cold-water-induced Raynaud symptoms after administration of effective doses of unboiled urine in AIBUT. Boiled urine does not affect the Raynaud phenomenon.
Randomised controlled trials in primary care: case study. Sue Wilson. British Medical Journal 2000: 321; 24-27. Abstract not available yet.
A new design for randomized clinical trials. M. Zelen. N Engl J Med 1979: 300(22); 1242-5. This paper proposes a new method for planning randomized clinical trials. This method is especially suited to comparison of a best standard or control treatment with an experimental treatment. Patients are allocated into two groups by a random or chance mechanism. Patients in the first group receive standard treatment; those in the second group are asked if they will accept the experimental therapy; if they decline, they receive the best standard treatment. In the analyses of results, all those in the second group, regardless of treatment, are compared with those in the first group. Any loss of statistical efficiency can be overcome by increased numbers. This experimental plan is indeed a randomized clinical trial and has the advantage that, before providing consent, a patient will know whether an experimental treatment is to be used.
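The remark that "any loss of statistical efficiency can be overcome by increased numbers" can be made concrete with a back-of-the-envelope calculation. The sketch below rests on assumptions that are mine, not stated in the abstract: patients who decline the experimental therapy respond exactly like patients in the standard group, and the required sample size scales with the inverse square of the difference to be detected. The function names are hypothetical.

```python
def diluted_difference(true_difference, accept_fraction):
    """Difference seen when the whole second group is compared with the first.

    Only the fraction who accept the experimental therapy can contribute
    to the difference (assumption: decliners respond like controls).
    """
    return true_difference * accept_fraction


def sample_size_inflation(accept_fraction):
    """Approximate factor by which the trial must grow to keep its power.

    Assumes required sample size is proportional to 1 / difference**2.
    """
    if not 0 < accept_fraction <= 1:
        raise ValueError("accept_fraction must be in (0, 1]")
    return 1.0 / accept_fraction ** 2


for fraction in (1.0, 0.8, 0.5):
    print(f"{fraction:.0%} accept: observed difference "
          f"{diluted_difference(0.10, fraction):.3f}, "
          f"sample size x{sample_size_inflation(fraction):.2f}")
```

Under these assumptions the cost of the design rises quickly as the acceptance rate falls, which is worth keeping in mind when a trial using this design reports a "negative" result.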
The randomization and stratification of patients to clinical trials. M. Zelen. Journal of Chronic Diseases 1974: 27(7-8); 365-75. Abstract not available yet.
The orthomolecular treatment of cancer. II. Clinical trial of high-dose ascorbic acid supplements in advanced human cancer. E. Cameron, A. Campbell. Chem Biol Interact 1974: 9(4); 285-315. Abstract not available yet.
Failure of high-dose vitamin C (ascorbic acid) therapy to benefit patients with advanced cancer. A controlled trial. E. T. Creagan, C. G. Moertel, J. R. O'Fallon, A. J. Schutt, M. J. O'Connell, J. Rubin, S. Frytak. New England Journal of Medicine 1979: 301(13); 687-90. One hundred and fifty patients with advanced cancer participated in a controlled double-blind study to evaluate the effects of high-dose vitamin C on symptoms and survival. Patients were divided randomly into a group that received vitamin C (10 g per day) and one that received a comparably flavored lactose placebo. Sixty evaluable patients received vitamin C and 63 received a placebo. Both groups were similar in age, sex, site of primary tumor, performance score, tumor grade and previous chemotherapy. The two groups showed no appreciable difference in changes in symptoms, performance status, appetite or weight. The median survival for all patients was about seven weeks, and the survival curves essentially overlapped. In this selected group of patients, we were unable to show a therapeutic benefit of high-dose vitamin C treatment.
Unconventional cancer therapies: What we need is rigorous research, not closed minds. E. Ernst. Chest 2000: 117(2); 307-8.
New insights into the physiology and pharmacology of vitamin C. S. J. Padayatty, M. Levine. CMAJ 2001: 164(3); 353-5.
Retrospective data
Recall bias in a case-control surveillance system on the use of medicine during pregnancy. M. Rockenbauer, J. Olsen, A. E. Czeizel, L. Pedersen, H. T. Sorensen. Epidemiology 2001: 12(4); 461-6. It is important to study possible teratogenic effects of drugs used during pregnancy. Many studies of this type rely upon case-control designs in which drug intake is recalled by the mothers after having given birth. Recall bias in this situation may lead to spurious associations. We looked for indicators of recall bias by comparing self-reported drug intake with medically notified intake for specific diseases in the Hungarian Case-Control Surveillance System of Congenital Abnormalities, which includes 22,865 cases with congenital abnormalities and 39,151 controls. Recall error was present, especially for drugs used for a short time period. Furthermore, the timing of drug intake was reported slightly closer to the time of interview for cases than for controls. Severe or visible congenital abnormalities did not appear to be more conducive to recall bias than other abnormalities under study. A case-control surveillance system of this type may frequently cause spurious associations, with biased odds ratios up to a factor of 1.9.
Sample size
Economic evaluation and clinical trials: size matters. A. Briggs. BMJ 2000: 321(7273); 1362-3.
Negative results of randomized clinical trials published in the surgical literature: equivalency or error? J. B. Dimick, M. Diener-West, P. A. Lipsett. Arch Surg 2001: 136(7); 796-800. HYPOTHESIS: We hypothesized that review of randomized controlled clinical trials (RCTs) with nonstatistically significant or "negative" results published in the surgical literature do not have appropriate statistical power to demonstrate equivalency between treatment arms. DATA SOURCES AND STUDY SELECTION: The MEDLINE database was searched to obtain reports of all RCTs with negative results published in 3 surgical journals from 1988 to 1998. Manual review of one year (1997) of publications for each journal was performed to validate our search strategy. Equivalency was evaluated using the Two One-Sided Tests Procedure and post hoc power calculations. DATA SYNTHESIS: Ninety reports of RCTs with negative results were identified in the surgical literature between 1988 and 1998. The manual review of 1997 showed a 100% retrieval rate for our search strategy. After applying the Two One-Sided Tests Procedure, 35 reports (39%) met the criteria for demonstrating equivalency. The other 55 reports (61%) contained at least a 10% absolute difference in the 90% confidence interval of Delta. Using the power calculation method, only 22 (24%) articles had a power greater than.80 to detect a 50% difference in therapeutic effect. Only 29% of the reports included a formal sample size calculation and these studies were more likely to demonstrate equivalency than those without a sample size estimate (P<.01). CONCLUSIONS: Many reports from negative RCTs published in the surgical literature lack sufficient statistical power to establish that clinically important differences are not present. Surgeons should perform appropriate sample size calculations when designing RCTs and recognize the utility of confidence intervals when reporting negative results.
Putting trials on trial--the costs and consequences of small trials in depression: a systematic review of methodology. M. Hotopf, G. Lewis, C. Normand. J Epidemiol Community Health 1997: 51(4); 354-8. STUDY OBJECTIVE: To determine why, despite 122 randomised controlled trials, there is no consensus about whether the selective serotonin reuptake inhibitors or tricyclic and related antidepressants should be used as first line treatment of depression. DESIGN: Systematic review of all RCTs comparing selective serotonin reuptake inhibitors and tricyclic or heterocyclic antidepressants. MAIN RESULTS: The shortcomings identified in the 122 trials were as follows: (1) there was inadequate description of randomisation, (2) the outcomes used were mainly observer rated measurements of depression, and studies failed to use quality of life measures or perform economic evaluations, (3) doses of tricyclic antidepressants were inadequate, (4) generalisability of studies was poor (including a reliance on secondary care settings and inadequate follow up), and (5) there were statistical shortcomings such as low statistical power, failure to use intention to treat analyses, and the tendency to make multiple comparisons. CONCLUSIONS: Future RCTs should be designed to inform policy makers and address these methodological shortcomings.
Distinguishing between "no evidence of effect" and "evidence of no effect" in randomised controlled trials and other comparisons. William Odita Tarnow-Mordi, MJ Healy. Arch Dis Child 1999: 80(3); 210-213. Abstract not available.
Cost effectiveness calculations and sample size. David J Torgerson, Marion K Campbell. BMJ 2000: 321; 697. Abstract not available yet.
Elevated blood lead levels in children of construction workers. EA Whelan, GM Piacitelli, B Gerwel, TM Schnorr, CA Mueller, J Gittleman, TD Matte. American Journal of Public Health 1997: 87(8); 1352-55. ABSTRACT: OBJECTIVES: This study examined whether children of lead-exposed construction workers had higher blood lead levels than neighborhood control children. METHODS: Twenty-nine construction workers were identified from the New Jersey Adult Blood Lead Epidemiology and Surveillance (ABLES) registry. Eighteen control families were referred by workers. Venous blood samples were collected from 50 children (31 exposed, 19 control subjects) under age 6. RESULTS: Twenty-six percent of workers' children had blood lead levels at or over the Centers for Disease Control and Prevention action level of 0.48 µmol/L (10 micrograms/dL), compared with 5% of control children (unadjusted odds ratio = 6.1; 95% confidence interval = 0.9, 147.2). CONCLUSIONS: Children of construction workers may be at risk for excessive lead exposure. Health care providers should assess parental occupation as a possible pathway for lead exposure of young children.
Selection bias
Effect of UK national guidelines on services to treat patients with acute low back pain: follow up questionnaire survey. A. G. Barnett, M. R. Underwood, M. R. Vickers. British Medical Journal 1999: 318(7188); 919-20. (This study obtained survey response rates of 87% and 85%. Larger practices were overrepresented.) Abstract not available yet.
Uptake of screening and prevention in women at very high risk of breast cancer. D. Evans, F. Lalloo, A. Shenton, C. Boggis, A. Howell. Lancet 2001: 358(9285); 889-90. Management of women at high lifetime risk of familial breast cancer is hampered because of limited data concerning the appropriateness of treatment options. Over the past 8 years women at very high (>40%) lifetime risk of breast cancer have had the option of entering two chemoprevention treatment trials, a magnetic resonance imaging (MRI) breast screening study, or a risk-reducing mastectomy (RRM) study. Only 10% of eligible women have entered one of the chemotherapy trials with a similar proportion opting for RRM (>50% in mutation carriers) compared with 60% opting for MRI screening. Future chemotherapy trials will have to be designed to address this poor recruitment.
Subgroup analysis
Analysis of clinical trial outcomes: some comments on subgroup analyses. M. E. Buyse. Controlled Clinical Trials 1989: 10(4 Suppl); 187S-194S. This article briefly discusses the various ways in which prognostic information can be included in the analysis of treatment effect in clinical trials. Adjustments in the treatment comparison are usually not warranted, as they do not substantially improve precision, but they may be useful, in addition to the unadjusted comparison, if a potent covariate is by chance maldistributed among the treatment groups. Estimation of interactions between treatment and covariates is usually plagued by insufficient statistical power. Estimation of treatment effect within individual subgroups is also subject to large random errors as well as to the problem of multiplicity, but with these caveats in mind it is an informative and needed complement to an analysis of overall treatment effect.
Analysis and Interpretation of Treatment Effects in Subgroups of Patients in Randomized clinical trials. S Yusuf. JAMA 1991: 266(1); 93-98. ABSTRACT: A key principle for interpretation of subgroup results is that quantitative interactions (differences in degree) are much more likely than qualitative interactions (differences in kind). Quantitative interactions are likely to be truly present whether or not they are apparent, whereas apparent qualitative interactions should generally be disbelieved as they have usually not been replicated consistently. Therefore, the overall trial result is usually a better guide to the direction of effect in subgroups than the apparent effect observed within a subgroup. Failure to specify prior hypotheses, to account for multiple comparisons, or to correct P values increases the chance of finding spurious subgroup effects. Conversely, inadequate sample size, classification of patients into the wrong subgroup, and low power of tests of interaction make finding true subgroup effects difficult. We recommend examining the architecture of the entire set of subgroups within a trial, analyzing similar subgroups across independent trials, and interpreting the evidence in the context of known biologic mechanisms and patient prognosis.
The miracle of DICE therapy for acute stroke: fact or fictional product of subgroup analysis? Carl E Counsell, Mike J Clarke, Jim Slattery, Peter A G Sandercock. British Medical Journal 1994: 309(6970); 1677-1681. ABSTRACT: OBJECTIVE--To determine whether inappropriate subgroup analysis together with chance could change the conclusion of a systematic review of several randomised trials of an ineffective treatment. DESIGN--44 randomised controlled trials of DICE therapy for stroke were performed (simulated by rolling different coloured dice; two trials per investigator). Each roll of the dice yielded the outcome (death or survival) for that "patient." Publication bias was also simulated. The results were combined in a systematic review. SETTING--Edinburgh. MAIN OUTCOME MEASURE--Mortality. RESULTS--The "hypothesis generating" trial suggested that DICE therapy provided complete protection against death from acute stroke. However, analysis of all the trials suggested a reduction of only 11% (SD 11) in the odds of death. A predefined subgroup analysis by colour of dice suggested that red dice therapy increased the odds by 9% (22). If the analysis excluded red dice trials and those of poor methodological quality the odds decreased by 22% (13, 2P = 0.09). Analysis of "published" trials showed a decrease of 23% (13, 2P = 0.07) while analysis of only those in which the trialist had become familiar with the intervention showed a decrease of 39% (17, 2P = 0.02). CONCLUSION--The early benefits of DICE therapy were not confirmed by subsequent trials. A plausible (but inappropriate) subset analysis of the effects of treatment led to the qualitatively different conclusion that DICE therapy reduced mortality, whereas in truth it was ineffective. Chance influences the outcome of clinical trials and systematic reviews of trials much more than many investigators realise, and its effects may lead to incorrect conclusions about the benefits of treatment.
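The idea behind the DICE study can be sketched in a few lines. This is not the authors' procedure: the death rule (a fair die showing 6), the 20 patients per arm, the three colours, and the crude pooling by summing counts are all assumptions chosen to keep the example short. The treatment has no effect by construction, so any difference between colour subgroups is chance.

```python
import random

def run_trial(rng, n_per_arm, colour):
    """One trial of a treatment that does nothing: in both arms a
    'patient' dies when a fair die shows 6 (hypothetical rule)."""
    deaths_treated = sum(rng.randint(1, 6) == 6 for _ in range(n_per_arm))
    deaths_control = sum(rng.randint(1, 6) == 6 for _ in range(n_per_arm))
    return {"colour": colour, "n": n_per_arm,
            "treated": deaths_treated, "control": deaths_control}

def pooled_odds_ratio(trials):
    """Crude odds ratio of death (treated vs control) after summing trials."""
    n = sum(t["n"] for t in trials)
    dt = sum(t["treated"] for t in trials)
    dc = sum(t["control"] for t in trials)
    return (dt / (n - dt)) / (dc / (n - dc))

rng = random.Random(2003)           # fixed seed so the run is repeatable
colours = ["red", "white", "green"]  # hypothetical subgroups
trials = [run_trial(rng, 20, colours[i % 3]) for i in range(44)]

overall = pooled_odds_ratio(trials)
by_colour = {c: pooled_odds_ratio([t for t in trials if t["colour"] == c])
             for c in colours}

print(f"all 44 trials: OR = {overall:.2f}")
for colour, value in by_colour.items():
    print(f"{colour:>5} dice only: OR = {value:.2f}")
```

Changing the seed and rerunning shows how much the subgroup odds ratios wander around 1 even though nothing is going on. Selecting the most favourable subgroup after the fact is exactly the trap the abstract describes.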
Surrogate analysis
Surrogate Endpoints and Neuromuscular Recovery. Aaron F. Kopman. Anesthesiology 1997: 87(5); 1027-1031. Abstract not available.
Validity
Prospective cohort study of routine use of risk assessment scales for prediction of pressure ulcers. L. Schoonhoven, J. R. Haalboom, M. T. Bousema, A. Algra, D. E. Grobbee, M. H. Grypdonck, E. Buskens. BMJ 2002: 325(7368); 797. OBJECTIVE: To evaluate whether risk assessment scales can be used to identify patients who are likely to get pressure ulcers. DESIGN: Prospective cohort study. SETTING: Two large hospitals in the Netherlands. PARTICIPANTS: 1229 patients admitted to the surgical, internal, neurological, or geriatric wards between January 1999 and June 2000. MAIN OUTCOME MEASURE: Occurrence of a pressure ulcer of grade 2 or worse while in hospital. RESULTS: 135 patients developed pressure ulcers during four weeks after admission. The weekly incidence of patients with pressure ulcers was 6.2% (95% confidence interval 5.2% to 7.2%). The area under the receiver operating characteristic curve was 0.56 (0.51 to 0.61) for the Norton scale, 0.55 (0.49 to 0.60) for the Braden scale, and 0.61 (0.56 to 0.66) for the Waterlow scale; the areas for the subpopulation, excluding patients who received preventive measures without developing pressure ulcers and excluding surgical patients, were 0.71 (0.65 to 0.77), 0.71 (0.64 to 0.78), and 0.68 (0.61 to 0.74), respectively. In this subpopulation, using the recommended cut-off points, the positive predictive value was 7.0% for the Norton, 7.8% for the Braden, and 5.3% for the Waterlow scale. CONCLUSION: Although risk assessment scales predict the occurrence of pressure ulcers to some extent, routine use of these scales leads to inefficient use of preventive measures. An accurate risk assessment scale based on prospectively gathered data should be developed.
Differential recall bias and spurious associations in case/control studies. D. Barry. Statistics in Medicine 1996: 15(23); 2603-16. Consider a case/control study designed to investigate a possible association between exposure to a putative risk factor and development of a particular disease. Let E denote the information required to specify a subject's exposure to the risk factor. We examine the effect that errors in the recorded values of E (which we denote by E*) have on inferences of an association between disease and the risk factor. We concentrate on situations where the errors in recorded exposure are such that exposure is underestimated for controls and overestimated for cases. This phenomenon is referred to as differential recall bias and may lead to spurious inferences of an association between exposure and disease. We describe how the standard inferential techniques used in the analysis of data from case/control studies may be adjusted to take account of specified mechanisms whereby E is distorted to produce E*. Such adjustments may be used to determine the sensitivity of an analysis to the phenomenon of differential recall bias and to quantify the extent of such bias that would be required to overturn the conclusions of the analysis. There remains the matter of judging whether a given distortion mechanism is reasonable in a particular context. This emphasizes the need for investigators to take account of differential recall bias in validation studies of exposure assessment techniques. The methodology developed here is applied to a recent major study investigating the possible association between lung cancer and exposure to environmental tobacco smoke. The log-odds ratio of 0.23 based on recorded exposure differs significantly from 0 (P < 0.02). However, the association is rendered non-significant by a very modest degree of differential recall bias. 
For example, if 3.8 per cent of exposed controls report no exposure, 3.8 per cent of unexposed cases report exposure, and all other subjects report exposure accurately, the log-odds ratio drops to 0.07 and the corresponding p-value increases to 0.49.
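A small numerical sketch shows why such a modest amount of differential recall matters. The counts below are hypothetical, not the study's data: a table with no true association (500 subjects in every cell) is distorted by moving 3.8 per cent of unexposed cases into the "exposed" column and 3.8 per cent of exposed controls into the "unexposed" column.

```python
import math

def log_odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Log odds ratio for exposure, cases versus controls."""
    return math.log((case_exp * ctrl_unexp) / (case_unexp * ctrl_exp))

def apply_recall_bias(case_exp, case_unexp, ctrl_exp, ctrl_unexp, rate):
    """Move a fraction `rate` of unexposed cases to 'exposed' and the same
    fraction of exposed controls to 'unexposed'; everyone else reports
    accurately."""
    moved_cases = rate * case_unexp
    moved_ctrls = rate * ctrl_exp
    return (case_exp + moved_cases, case_unexp - moved_cases,
            ctrl_exp - moved_ctrls, ctrl_unexp + moved_ctrls)

# Hypothetical true table with no association at all.
true_table = (500, 500, 500, 500)
recorded_table = apply_recall_bias(*true_table, rate=0.038)

true_log_or = log_odds_ratio(*true_table)
recorded_log_or = log_odds_ratio(*recorded_table)
print(f"true log-odds ratio:     {true_log_or:.3f}")
print(f"recorded log-odds ratio: {recorded_log_or:.3f}")
```

In this made-up table the recorded log-odds ratio comes out at about 0.15 even though the true value is zero, a shift of the same order as the drop from 0.23 to 0.07 reported in the abstract.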
Relation between tumour response to first-line chemotherapy and survival in advanced colorectal cancer: a meta-analysis. Meta-Analysis Group in Cancer. M. Buyse, P. Thirion, R. W. Carlson, T. Burzykowski, G. Molenberghs, P. Piedbois. Lancet 2000: 356(9227); 373-8. BACKGROUND: Treatment of advanced colorectal cancer has progressed substantially. However, improvements in response rates have not always translated into significant survival benefits. Doubts have therefore been raised about the usefulness of tumour response as a clinical endpoint. METHODS: This meta-analysis was done on individual data from 3791 patients enrolled in 25 randomised trials of first-line treatment with standard bolus intravenous fluoropyrimidines versus experimental treatments (fluorouracil plus leucovorin, fluorouracil plus methotrexate, fluorouracil continuous infusion, or hepatic-arterial infusion of floxuridine). Analyses were by intention to treat. FINDINGS: Compared with bolus fluoropyrimidines, experimental fluoropyrimidines led to significantly higher tumour response rates (454 responses among 2031 patients vs 209 among 1760; odds ratio 0.48 [95% CI 0.40-0.57], p<0.0001) and better survival (1808 deaths among 2031 vs 1580 among 1760; hazard ratio 0.90 [0.84-0.97], p=0.003). The survival benefits could be explained by the higher tumour response rates. However, a treatment that lowered the odds of failure to respond by 50% would be expected to decrease the odds of death by only 6%. In addition, less than half of the variability of the survival benefits in the 25 trials could be explained by the variability of the response benefits in these trials. INTERPRETATION: These analyses confirm that an increase in tumour response rate translates into an increase in overall survival for patients with advanced colorectal cancer. 
However, in the context of individual trials, knowledge that a treatment has benefits on tumour response does not allow accurate prediction of the ultimate benefit on survival.
Comparison of the Block and the Willett self-administered semiquantitative food frequency questionnaires with an interviewer-administered dietary history. BJ Caan, ML Slattery, J Potter, CP Jr Quesenberry, AO Coates, DM Schaffer. AJE 1998: 148(12); 1137-47. ABSTRACT: The performances of two commonly used diet instruments, the Block and the Willett food frequency questionnaires, were compared with a longer, interviewer-administered diet history. Participants in a case-control study on diet and colon cancer were interviewed between 1990 and 1994 in northern California, Utah, and Minnesota by trained nutritionists using a validated diet history. Two separate subsamples of participants were asked to complete either the Block or the Willett questionnaire exactly 5 days after they completed the original diet history. Data were analyzed separately by subsample comparing either the Block or the Willett questionnaire with the original diet history by using means, correlations, quintile agreement, and odds ratios for the relation between several nutrients and colon cancer. The Block and the Willett questionnaires generally provided lower absolute intake estimates than did the original diet history; however, the Block questionnaire underestimated more than did that by Willett. Both correlations and quintile agreement were slightly better for the Willett questionnaire than for that by Block when compared with the original diet history. In general, point estimates obtained from either the Block or the Willett questionnaire fell within the confidence intervals of the estimates of the odds ratios obtained from the original diet history, and no real difference in significance levels appeared. Although the Block and Willett questionnaires differed slightly from each other and from our original diet history in estimating absolute nutrients and ranking or classifying individuals, they were very similar in their ability to predict disease outcome.
The Forer effect (a.k.a. the P.T. Barnum effect and subjective validation). Robert Todd Carroll, The Skeptic's Dictionary. Accessed on 2003-03-10. "The Forer or Barnum effect is also known as the subjective validation effect or the personal validation effect. (The expression, 'the Barnum effect,' seems to have originated with psychologist Paul Meehl, in deference to circus man P.T. Barnum's reputation as a master psychological manipulator.) Psychologist B.R. Forer found that people tend to accept vague and general personality descriptions as uniquely applicable to themselves without realizing that the same description could be applied to just about anyone." www.skepdic.com/forer.html
Myers-Briggs Type Indicator®. Robert Todd Carroll, The Skeptic's Dictionary. Accessed on 2003-03-10. A critical look at the validity and reliability of the Myers-Briggs Type Indicator. www.skepdic.com/myersb.html
The visual analogue pain intensity scale: what is moderate pain in millimetres? S. L. Collins, R. A. Moore, H. J. McQuay. Pain 1997: 72(1-2); 95-7. One way to ensure adequate sensitivity for analgesic trials is to test the intervention on patients who have established pain of moderate to severe intensity. The usual criterion is at least moderate pain on a categorical pain intensity scale. When visual analogue scales (VAS) are the only pain measure in trials we need to know what point on a VAS represents moderate pain, so that these trials can be included in meta-analysis when baseline pain of at least moderate intensity is an inclusion criterion. To investigate this we used individual patient data from 1080 patients from randomised controlled trials of various analgesics. Baseline pain was measured using a 4-point categorical pain intensity scale and a pain intensity VAS under identical conditions. The distribution of the VAS scores was examined for 736 patients reporting moderate pain and for 344 reporting severe pain. The VAS scores corresponding to moderate or severe pain were also examined by gender. Baseline VAS scores recorded by patients reporting moderate pain were significantly different from those of patients reporting severe pain. Of the patients reporting moderate pain 85% scored over 30 mm on the corresponding VAS, with a mean score of 49 mm. For those reporting severe pain 85% scored over 54 mm with a mean score of 75 mm. There was no difference between the corresponding VAS scores of men and women. Our results indicate that if a patient records a baseline VAS score in excess of 30 mm they would probably have recorded at least moderate pain on a 4-point categorical scale.
Underascertainment of child maltreatment fatalities by death certificates, 1990-1998. T. L. Crume, C. DiGuiseppi, T. Byers, A. P. Sirotnak, C. J. Garrett. Pediatrics 2002: 110(2 Pt 1); e18. OBJECTIVE: Child fatality review teams have emerged across the United States in the past decade to address the concern that systems of child protection, law enforcement, criminal justice, and medicine do not adequately assess the circumstances surrounding child fatality as a result of maltreatment. METHODS: We compared data collected by a multidisciplinary child fatality review team with vital records for all children who were aged birth to 16 years and died in Colorado between January 1, 1990, and December 1, 1998. Odds ratios and 95% confidence intervals for ascertainment by the death certificate were estimated using logistic regression. RESULTS: Only half of the children who died as a result of maltreatment had death certificates that were coded consistently with maltreatment. Black race and female gender were associated with higher ascertainment, whereas death in a rural county was associated with lower ascertainment. Deaths resulting from violent causes (eg, shaking, blunt force trauma, striking) were more likely to be ascertained than those that involved acts of omission (eg, neglect and abandonment, drowning, fire). The most common perpetrators of maltreatment were parents. However, maltreatment by an unrelated perpetrator was 8.71 times (95% confidence interval: 3.52-21.55) more likely to be ascertained than maltreatment by a parent. CONCLUSIONS: The degree of underascertainment found in this study is of concern because most national estimates of child maltreatment fatality in the United States are derived from coding on death certificates. In addition, the patterns recognized in this study raise concern about systematic underascertainment that may affect children of specific sociodemographic groups.
Treatment of acute childhood diarrhea with homeopathic medicine: a randomized clinical trial in Nicaragua. J. Jacobs, L. M. Jimenez, S. S. Gloyd, J. L. Gale, D. Crothers. Pediatrics 1994: 93(5); 719-25. OBJECTIVE. Acute diarrhea is the leading cause of pediatric morbidity and mortality worldwide. Oral rehydration treatment can prevent death from dehydration, but does not reduce the duration of individual episodes. Homeopathic treatment for acute diarrhea is used in many parts of the world. This study was performed to determine whether homeopathy is useful in the treatment of acute childhood diarrhea. METHODOLOGY. A randomized double-blind clinical trial comparing homeopathic medicine with placebo in the treatment of acute childhood diarrhea was conducted in Leon, Nicaragua, in July 1991. Eighty-one children aged 6 months to 5 years of age were included in the study. An individualized homeopathic medicine was prescribed for each child and daily follow-up was performed for 5 days. Standard treatment with oral rehydration treatment was also given. RESULTS. The treatment group had a statistically significant (P < .05) decrease in duration of diarrhea, defined as the number of days until there were less than three unformed stools daily for 2 consecutive days. There was also a significant difference (P < .05) in the number of stools per day between the two groups after 72 hours of treatment. CONCLUSIONS. The statistically significant decrease in the duration of diarrhea in the treatment group suggests that homeopathic treatment might be useful in acute childhood diarrhea. Further study of this treatment deserves consideration.
Maternal nutrition, pregnancy outcome and public health policy. M. S. Kramer. CMAJ 1998: 159(6); 663-5. ("From a clinical, etiologic or prognostic perspective, however, low birth weight is not a very useful outcome. Birth weight is a function of 2 factors: duration of gestation and rate of fetal growth. Thus, the weight of newborns can be low either because they are born early (preterm birth) or because they are small for their gestational age or both.")
Psychological stress and cardiovascular disease: empirical demonstration of bias in a prospective observational study of Scottish men * Commentary: Psychosocial factors and health---strengthening the evidence base. John Macleod, George Davey Smith, Pauline Heslop, Chris Metcalfe, Douglas Carroll, Carole Hart, John Lynch. British Medical Journal 2002: 324(7348); 1247-. Objectives: To examine the association between self perceived psychological stress and cardiovascular disease in a population where stress was not associated with social disadvantage. Design: Prospective observational study with follow up of 21 years and repeat screening of half the cohort 5 years from baseline. Measures included perceived psychological stress, coronary risk factors, self reported angina, and ischaemia detected by electrocardiography. Setting: 27 workplaces in Scotland. Participants: 5606 men (mean age 48 years) at first screening and 2623 men at second screening with complete data on all measures. Main outcome measures: Prevalence of angina and ischaemia at baseline, odds ratio for incident angina and ischaemia at second screening, rate ratios for cause specific hospital admission, and hazard ratios for cause specific mortality. Results: Both prevalence and incidence of angina increased with increasing perceived stress (fully adjusted odds ratio for incident angina, high versus low stress 2.66, 95% confidence interval 1.61 to 4.41; P for trend <0.001). Prevalence and incidence of ischaemia showed weak trends in the opposite direction. High stress was associated with a higher rate of admissions to hospital generally and for admissions related to cardiovascular disease and psychiatric disorders (fully adjusted rate ratios for any general hospital admission 1.13, 1.01 to 1.27, cardiovascular disease 1.20, 1.00 to 1.45, and psychiatric disorders 2.34, 1.41 to 3.91). 
High stress was not associated with increased admission for coronary heart disease (1.00, 0.76-1.32) and showed an inverse relation with all cause mortality, mortality from cardiovascular disease, and mortality from coronary heart disease, that was attenuated by adjustment for occupational class (fully adjusted hazard ratio for all cause mortality 0.94, 0.81 to 1.11, cardiovascular mortality 0.91, 0.78 to 1.06, and mortality from coronary heart disease 0.98, 0.75 to 1.27). Conclusions: The relation between higher stress, angina, and some categories of hospital admissions probably resulted from the tendency of participants reporting higher stress to also report more symptoms. The lack of a corresponding relation with objective indices of heart disease suggests that these symptoms did not reflect physical disease. The data suggest that associations between psychosocial measures and disease outcomes reported from some other studies may be spurious.
Reliability of death certificate diagnoses. M. A. Moussa, M. Z. Shafie, M. M. Khogali, A. M. el-Sayed, T. N. Sugathan, G. Cherian, A. Z. Abdel-Khalik, M. T. Garada, D. Verma. J Clin Epidemiol 1990: 43(12); 1285-95. Consistency between death certificates and clinical records from 5 general hospitals in Kuwait was studied for 470 deaths with the following underlying or associated causes: hypertensive (HYP), ischaemic heart diseases (IHD), cerebrovascular diseases (CVD) and diabetes mellitus (DM). Direct causes were not considered since they are of little interest analytically. Only deaths with definite or most probable ascertainment were included. One cardiologist, who was provided with the WHO criteria and relevant documents on death certification, independently reviewed the records. To test the reviewer's bias and the reliability of his judgement, an adjudication process was effected by having one senior cardiologist re-review a random subsample of 140 records. The two reviewers showed good agreement. Specific diagnoses criteria for deciding the underlying cause of death in multiple morbid conditions by the reviewer were followed. Due to possible reviewer bias, we aimed at measuring the difference between initial certifiers and the reviewer rather than measuring the diagnostic accuracy of initial certifiers in reference to the reviewer. The agreement index kappa showed poor agreement between original and revised certificates. The original certificates under-estimated CVD as an underlying cause of death by 69.2%, DM by 60%, IHD by 33.5% and HYP by 31.8% in our sample. Associated causes were also consistently under-estimated by initial certifiers as compared with the reviewer. 
This bias calls for basing mortality statistics in Kuwait on hospital death committees' reports rather than on initial certifier death certificates, use of multiple-causes of death instead of one underlying cause and adequate training of the medical profession on the value and process of death certification.
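The abstract reports poor agreement by kappa without showing how kappa differs from raw percent agreement. The sketch below uses an entirely hypothetical 2x2 table (whether the original certificate and the reviewer each named a given condition as the underlying cause); none of these counts come from the paper.

```python
def cohens_kappa(both_yes, first_only, second_only, both_no):
    """Cohen's kappa for two raters making a yes/no judgement.

    first_only: rater 1 said yes, rater 2 said no
    second_only: rater 1 said no, rater 2 said yes
    Returns (observed agreement, chance-expected agreement, kappa).
    """
    total = both_yes + first_only + second_only + both_no
    observed = (both_yes + both_no) / total
    # Agreement expected by chance, from each rater's marginal totals.
    first_yes = both_yes + first_only
    second_yes = both_yes + second_only
    expected = (first_yes * second_yes
                + (total - first_yes) * (total - second_yes)) / total ** 2
    return observed, expected, (observed - expected) / (1 - expected)

# Hypothetical counts: original certificate (rater 1) vs reviewer (rater 2).
observed, expected, kappa = cohens_kappa(20, 5, 45, 400)
print(f"observed agreement {observed:.3f}, "
      f"expected by chance {expected:.3f}, kappa {kappa:.3f}")
```

In this made-up table the two sources agree on almost 90% of records, yet kappa is only about 0.4, because most of that agreement comes from the many records where neither source names the condition. This is one reason raw agreement figures can look reassuring when the measure is actually unreliable.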
Misclassification rates for current smokers misclassified as nonsmokers. AJ Wells, PB English, SF Posner, LE Wagenknecht, EJ Perez-Stable. American Journal of Public Health 1998: 88(10); 1503-09. ABSTRACT: OBJECTIVES: This paper provides misclassification rates for current cigarette smokers who report themselves as nonsmokers. Such rates are important in determining smoker misclassification bias in the estimation of relative risks in passive smoking studies. METHODS: True smoking status, either occasional or regular, was determined for individual current smokers in 3 existing studies of nonsmokers by inspecting the cotinine levels of body fluids. The new data, combined with an approximately equal amount in the 1992 Environmental Protection Agency (EPA) report on passive smoking and lung cancer, yielded misclassification rates that not only had lower standard errors but also were stratified by sex and US minority/majority status. RESULTS: The misclassification rates for the important category of female smokers misclassified as never smokers were, respectively, 0.8%, 6.0%, 2.8%, and 15.3% for majority regular, majority occasional, US minority regular, and US minority occasional smokers. Misclassification rates for males were mostly somewhat higher. CONCLUSIONS: The new information supports EPA's conclusion that smoker misclassification bias is small. Also, investigators are advised to pay attention to minority/majority status of cohorts when correcting for smoker misclassification bias.
Reporting accuracy among mothers of malformed and nonmalformed infants. M. M. Werler, B. R. Pober, K. Nelson, L. B. Holmes. Am J Epidemiol 1989: 129(2); p415-21. The potential for recall bias in case-control studies is a common concern. The authors assessed whether recall bias was present in exposure information reported at postpartum interview by mothers of malformed and nonmalformed infants who delivered at Brigham and Women's Hospital, Boston, during 1984. Accuracy of exposure reporting was measured by comparing interview data with exposure information documented during pregnancy in obstetric records. The authors' measure of recall bias, relative sensitivity (RS), is the ratio of exposure-reporting accuracy for mothers of malformed infants to that of mothers of nonmalformed infants. Relative sensitivity estimates that are greater than 1.0 indicate that mothers of malformed infants are more accurate reporters than mothers of nonmalformed infants. Relative sensitivity was estimated for eight exposure factors: antibiotic or antifungal drug use (RS = 1.2), urinary tract or yeast infection (RS = 2.7), history of infertility (RS = 1.4), use of birth control after conception (RS = 7.6), elective abortion history (RS = 1.1), any over-the-counter drug use (RS = 1.0), spotting or bleeding (RS = 1.2), and nausea or vomiting (RS = 0.8). These data suggest the presence of recall bias for some exposure factors. The authors advise the use of malformed controls to reduce potential recall bias in case-control studies of selected malformations and many etiologic factors.
Comparison of food frequency questionnaires: the reduced Block and Willett questionnaires differ in ranking on nutrient intakes. AK Wirfält, RW Jeffery, PJ Elmer. AJE 1998: 148(12); 1148-56. ABSTRACT: Food frequency questionnaires, major tools in epidemiologic studies, are often criticized for biased and imprecise intake estimates. The aim of this study was to compare the performance of two widely used food frequency questionnaires, a reduced 60-item Block questionnaire and a 153-item Willett food frequency questionnaire, relative to three 24-hour recalls administered by telephone. The dietary data were collected in 1991 from a group of healthy women age 25-49 years (n=101) during the baseline period of a weight-loss intervention study in Minneapolis, Minnesota. Total energy and macro- and micronutrient intakes were compared across methods by using four analytic approaches: comparison of means and correlation coefficients, regression analysis, and estimation of percent agreement between each questionnaire and recalls. The Block instrument showed an overall underestimation bias, but was more successful in categorizing individuals on percent energy from fat and carbohydrate intakes than was the Willett instrument. The Willett instrument showed no overall underestimation bias and was more successful in classifying individuals on vitamin A and calcium intakes. Diverging performance characteristics of diet assessment methods have an implication for the design of studies, interpretation of results, and comparison of findings across studies.
"Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer." Cameron E, Pauling L. Proceedings of the National Academy of Sciences (USA), 73: 3685-3689 (1976).
"Smoking and Carcinoma of the Lung: Preliminary Report" Doll R, Hill AB. British Medical Journal, 1: 1451-1455 (1950).
"Secondhand Smoke and Cholesterol in Children" Neufeld EJ, Mietus-Snyder M, Beisner AS, Baker AL, Newburger JW. Circulation, 96: 1403-1407 (1997).
"Meta-Analyses of Randomized Control Trials: An Update of the Quality and Methodology" Sacks HS, Berrier J, Reitman D, Pagano D, Chalmers TC. pp. 427-442, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
"Nicotine Patch Therapy in Adolescent Smokers." Smith TA, House Jr RF, Croghan IT, Gauvin TR, Colligan RC, Offord KP, Gomez-Dahl LC, Hurt RD. Pediatrics, 98(4): 659-667 (1996).
This webpage was written by Steve Simon on 2003-07-01, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu. Category: Statistical evidence