How to read a medical journal article (July 2003 version).

"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.

Reading medical research is hard work. I'm not talking about the medical terminology, though that is often quite bad (if I hear the word "emesis" one more time, I'm going to throw up!). The hard part is assessing the strength of the evidence. When you read a journal article, you have to decide if the authors present a case that is persuasive enough to get you to change your practice.

Some evidence is so strong that it stands on its own. Other evidence is weaker and requires support from other studies, from mechanistic arguments, and so forth. Still other evidence is so weak that you should not consider any changes in your practice until the study is replicated using a more rigorous approach.

What you should look for

When you are assessing the quality of the evidence, it's not how the data are analyzed that's important. Far more important is HOW THE DATA ARE COLLECTED. Don't agonize over whether the researchers should have used a non-parametric test or whether a random effects meta-analysis is appropriate (just to cite two obscure examples). These are important issues and they generate a lot of debate. But in most cases, the use of one statistical analysis or another is unlikely to make a substantial difference in the conclusions.

The more common and more important threat to the validity of the study relates to how the data are COLLECTED, not how they are ANALYZED. After all, if you collect the wrong data, it doesn't matter how fancy the analysis is. This is good news, because you don't need a lot of statistical training or a lot of mathematical sophistication to assess how the data are collected.

I don't want to imply that data analysis is irrelevant. There are good examples of where a better data analysis led to a different conclusion (Vickers 2001, Skegg 2000). Analysis errors are less frequent and less serious, however, than design errors.

In this presentation, I want to show you what to look for and why. Here are five questions you should ask yourself when reading a journal article.

  1. Was there a good comparison group?

  2. Was there a plan?

  3. Who knew what when?

  4. Who was left out?

  5. How much did things change?

In this presentation, I will justify these questions using anecdotal evidence at times and solid empirical research at other times. I will also highlight real research articles and use them as examples.

Important Disclaimer.

This presentation will review several published journal articles. The intent is to gauge how much evidence each article presents in favor of the efficacy of a new therapy. Some articles will provide a greater level of evidence and some will provide a lesser level of evidence. But articles which provide lesser levels of evidence are still valuable and important.

Nothing stated in this presentation about a particular journal article should be construed as a statement about the quality of that article. The very nature of research requires a series of steps from very preliminary and speculative levels of evidence to more definitive levels of evidence.

Furthermore, when I point out limitations in the evidence presented in a journal article, more often than not, the authors of the article delineate these same limitations in their discussion. But in general, you need to be aware of these limitations because not every journal author is going to be open and honest about the limitations of their research.

Additional resources

Pitfalls of pharmacoepidemiology. D. C. Skegg. BMJ 2000: 321(7270); 1171-2.

Acupuncture for treatment of chronic neck pain. Andrew Vickers, Dominik Irnich, Martin Krauss. BMJ 2001: 323(7324); 1306-.


Chapter 1: Was there a good comparison group?

Introduction

Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?

When you make such a comparison between an exposure/treatment group and a control group, you want it to be a fair comparison. You want the control group to be identical to the exposure/treatment group in all respects, except for the exposure/treatment in question. You want an apples to apples comparison.

To ensure that the researchers made an apples to apples comparison, ask the following three questions:

Did the authors use randomization?
Did the authors use matching?
Did the authors use statistical adjustments?

Case study: Vitamin C and Cancer

Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Cameron and Pauling published an observational study of Vitamin C as a treatment for advanced cancer. For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).

Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."

Ten years later, the Mayo Clinic conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?

The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.

But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison. It doesn't matter how bad the prognosis was for a patient diagnosed with terminal cancer; it can't be as bad as the prognosis of a patient who has a death certificate.

Surgical trial without controls

There's another story, unfortunately fictional, which also highlights the importance of a good comparison group.

A prominent surgeon came to give a special lecture at the School of Medicine. He expounded about the great advance that he had made in a specific surgical procedure. At the end of the lecture he drew thunderous applause from the audience. At first it seemed like there would be no questions, but then a young student in the front row raised her hand. "Did you use any controls?" she asked. The surgeon seemed to be offended by this question. "Controls?" he asked. "Are you suggesting that I should have denied my surgical advance to half of my patients?" The rest of the audience grew very quiet. But the young woman was not intimidated. "Yes," she said, "that's exactly what I meant." The surgeon grew even angrier at this, slammed his fist on the podium and shouted "Why that would have condemned half of my patients to certain death!" There was silence for a few seconds. Then the entire auditorium burst out in laughter when the young woman asked "Which half?"

Covariate imbalance

If you want to judge how effective a new therapy is, you need a comparison group. The comparison group would be a group of subjects who receive either the standard therapy or, in some cases, no therapy (e.g., a placebo comparison).

The ideal comparison group should be similar in all respects to the new therapy group except for the therapy itself. For example, the two groups should have a similar range of ages and weights and should be composed of roughly the same proportions in gender and race/ethnicity. The groups should be evaluated concurrently.

Sometimes the groups are dissimilar on some important characteristics. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.

In a yet to be published research study here at Children's Mercy Hospital, pre-term infants were randomized either to a group that received normal bottle feeding while they were in the hospital or to a nasogastric (ng) tube feeding group. The researchers wanted to see if the latter group of infants, because they had not become habituated to bottle feeding, would be more likely to breastfeed after discharge from the hospital.

The randomization was only partially effective at preventing covariate imbalance. The infants had comparable birth weights, gestational ages, and Apgar scores. There were similar proportions of caesarian section and vaginal births in both groups. But the mothers in the ng tube group were older on average than the mothers in the bottle fed group.

Since older mothers are more likely to breastfeed than younger mothers, we had to include mother's age in an analysis of covariance model so that the effect of ng tube feeding could be estimated independent of mother's age.

Beware of situations where the two treatment groups are handled differently. An example of this would be the study of women who use oral contraceptives. These women visit a doctor at least every six months to get their prescriptions renewed. If these women are compared to women who do not use oral contraceptives, then the former group will probably be evaluated by a doctor more frequently. An increase in the prevalence of certain diseases may actually reflect the fact that these diseases are diagnosed earlier because of the more frequent doctor visits.

Similarly, if a certain drug is suspected to have certain side effects, doctors may question more closely those patients who are on that medication, creating a self-fulfilling prophecy.

Concurrent controls versus historical controls.

Sometimes researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy. This type of study is sometimes called a historical controls study. The very nature of a historical controls study guarantees that there will be a major discrepancy in timing. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially?

The one exception is when a disease has close to 100% mortality (Silverman 1998, page 67). In that situation, there is no need for a concurrent control group, since any therapy that is remotely effective can be detected readily.

Did the authors use randomization?

If the authors of the study decided who would get the new therapy and who would get the standard therapy, we have an experimental design. When the authors of the study do have this level of control, they will almost always assign patients randomly.

If the patient did the choosing, if the patient’s doctor did the choosing, or if the groups were intact prior to the start of the research, then we have an observational design. In an observational design, it is impossible to assign patients randomly.

Here are some examples of experimental designs and observational studies.

In Adkinson (1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo." Since the researchers generated the sequence of random assignment, this is an experimental design.

In Bullock (1989), "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)." Since the researchers controlled the nature of the acupuncture, this is an experimental design.

In Cardo (1997), 33 health care workers who became seropositive to HIV after percutaneous exposure to HIV-infected blood were compared to 665 health care workers with similar exposure who did not become seropositive. Since the researchers did not control who became seropositive, this is an observational study.

In Hu (1997), 80,082 women between the ages of 34 and 59 years were followed for 14 years to look for instances of non-fatal myocardial infarction or death from coronary heart disease. These women were divided into low, intermediate, and high groups on the basis of their consumption of dietary fat. Since the women themselves controlled their diets, rather than having a diet imposed on them by the researchers, this represents an observational design.

Information from an experimental design is generally considered more authoritative than information from an observational design because the researchers can use randomization. Randomization provides some level of assurance that the two groups are comparable in every way except for the therapy received.

Randomization requires the use of a random device, such as a coin flip or a table of random numbers. Systematic allocation (i.e., alternating between treatments) is not the same as randomization.

The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each value in the schedule, and then sort the schedule by the random number.
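The sort-by-random-number procedure just described can be sketched in a few lines of Python. This is only a minimal illustration. The function name, the treatment labels, and the fixed seed are assumptions made for the example and do not come from any of the studies discussed here.

```python
import random

def randomize_schedule(n_per_group, treatments=("standard", "new"), seed=None):
    """Return a randomized treatment schedule with n_per_group slots per treatment."""
    rng = random.Random(seed)
    # Step 1: lay out the schedule systematically (non-random).
    schedule = [t for t in treatments for _ in range(n_per_group)]
    # Step 2: attach a random number to each slot in the schedule.
    keyed = [(rng.random(), t) for t in schedule]
    # Step 3: sort by the random number, which shuffles the treatment order.
    keyed.sort(key=lambda pair: pair[0])
    return [t for _, t in keyed]

schedule = randomize_schedule(10, seed=2003)  # seed chosen arbitrarily
print(schedule)
```

Because the schedule is laid out before it is shuffled, each treatment still ends up with exactly the planned number of subjects. Only the order is left to chance.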

Randomization tends to balance both measurable and unmeasurable factors across the standard and the new therapy, which supports a fair comparison. It also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.

Randomization is not always possible or practical. When this is the case, we have to rely on observational data to draw any conclusions. But when randomization is possible, its use makes a research study more authoritative.

Although I do not have a bibliographic citation for this example, I heard an amusing story about a study of water toxicants on fish.

This research required that the fish be separated into five tanks, each of which would get a different level of the toxicant. The researchers caught one fifth of the fish and put them in one tank, then an additional one fifth and put them in a second tank, and so forth. The outcome measurements were related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes being in the first tank filled and the best outcomes in the last tank filled.

What happened was that the slow-moving, easy-to-catch fish were all allocated to the first tank. The fast-moving, hard-to-catch fish ended up in the last tank. It turned out that the sicker fish were also the slow-moving, easy-to-catch fish; the healthiest fish swam faster and avoided early capture.

A better way to design this experiment would have been to allocate the fish into tanks randomly. This would give each tank a fair share of the fast-and-healthy and the slow-and-sick fish.

Studies without randomization often require either matching or statistical adjustments. While both matching and adjustments can help to some extent with covariate imbalance, these approaches do not work as well as randomization. In particular, some of the covariate imbalance may be due to factors that are difficult to measure. For example, patients may differ on the basis of

  1. psychological state

  2. severity of disease, and/or

  3. presence of comorbid conditions.

All of these factors can influence the outcome, but if you can't measure them easily, matching or adjustment is not possible.

So, all other things being equal, an experimental design with randomization is more persuasive than an observational design without randomization. Nevertheless, much can be learned from non-randomized studies. Almost everything we know about the risks of cigarette smoking came from observational designs (Gail 1996).

An editorial in the Journal of the American Medical Association (Sherwin 1997) tries to make sense of recent studies of the effect of dietary fat on obesity, heart disease, and stroke. After reviewing the results of numerous studies, the editorial comments:

"At present, most of this evidence in humans is observational and, consequently, an imperfect basis for causal inference. Large scale experimental studies that would provide more compelling data (such as the Women's Health Initiative) cost hundreds of millions of dollars and take decades to complete. Each study can only address the effects of a single nutritional change. Thus, it is still necessary to base advice to patients on dietary information that is less than certain and complete."

Randomized studies do have some weaknesses. These studies typically rely on the use of volunteers in a narrowly defined research setting. Such situations may not be reflective of how a typical patient behaves in a typical health care setting (Sackett 1997). In this particular aspect, a carefully planned observational design may provide a more relevant comparison.

Another problem with randomized designs is the limit to their size and scope. These limits may make it difficult to detect rare but important side effects. An observational approach like post marketing surveillance is more likely to be successful in these situations.

Studies of the potential harm caused by environmental exposures (such as lead based paint, second hand tobacco smoke, or electro-magnetic fields) are often impossible to randomize because of logistical and ethical issues.

These exceptions, however, do not diminish the value of experimental designs. In situations where observational and experimental studies can both be conducted, most researchers will give greater weight to the evidence in an experimental study.

Did the authors use matching?

Matching is the systematic selection, for every subject in the treatment/exposure group, of a control subject with similar characteristics. For example, in a study of fetal exposure to cocaine, you might select infants born to a mother who abused cocaine during pregnancy. For every such infant, you would select an infant who was unexposed to cocaine in utero but who had the same sex, race, and socio-economic status.

Matching will prevent covariate imbalance for those variables used in matching. It will also reduce covariate imbalance for any variables closely related to the matching variables. It will not, however, protect against all covariate imbalance, especially for those covariates that are difficult to measure.

Matching often presents difficult logistical issues, because a matching control subject may not always be available. The logistics are especially difficult when there are several matching variables and when the pool of control subjects that you can draw from is not substantially larger than the pool of treatment/exposed subjects.
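A small sketch may make the selection process, and the logistical problem, more concrete. The code below pairs each exposed subject with one unused control that agrees on every matching variable. All of the records, the variable names, and the function name are hypothetical and were invented for this illustration. Real studies often use looser rules (for example, age within two years) rather than exact agreement.

```python
def match_controls(exposed, pool, keys=("sex", "race", "ses")):
    """Pair each exposed subject with one unused control that agrees on every key."""
    available = list(pool)
    pairs, unmatched = [], []
    for subject in exposed:
        profile = tuple(subject[k] for k in keys)
        # Find the first still-available control with an identical profile.
        match = next(
            (c for c in available if tuple(c[k] for k in keys) == profile), None
        )
        if match is None:
            unmatched.append(subject)  # no suitable control left in the pool
        else:
            available.remove(match)    # each control is used at most once
            pairs.append((subject, match))
    return pairs, unmatched

# Hypothetical records, invented purely for illustration.
exposed = [
    {"id": "E1", "sex": "F", "race": "A", "ses": "low"},
    {"id": "E2", "sex": "M", "race": "B", "ses": "high"},
]
pool = [
    {"id": "C1", "sex": "M", "race": "A", "ses": "low"},
    {"id": "C2", "sex": "F", "race": "A", "ses": "low"},
    {"id": "C3", "sex": "F", "race": "A", "ses": "low"},
]
pairs, unmatched = match_controls(exposed, pool)
print([(e["id"], c["id"]) for e, c in pairs])
print([e["id"] for e in unmatched])
```

In this toy example one exposed subject finds a match and the other does not, even though the control pool is larger than the exposed group. Adding more matching variables only makes this harder.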

Matching is usually reserved for those variables that are known to be highly predictive of the outcome measure. In a cancer study, for example, matching is usually done on smoking. Many neonatology studies will match on gestational age.

Matching in a case control design

When you are selecting patients on the basis of disease and looking back at what exposure might have caused the disease, selection of matching control patients (patients without disease) can sometimes be tricky. You need to find a control that is similar to the case, except for the disease of interest. There are several possibilities, but none of them works perfectly.

  1. If the cases are people hospitalized for disease, you could choose people who are hospitalized for conditions other than the disease.

  2. You could ask each case to bring a friend with them. Their friend would be likely to be of similar age and socioeconomic status.

  3. You could recruit controls from undiseased members of the same family.

You also have to be careful about the variable you use to match. If the matching variable is caused by the exposure or is a similar measure of exposure, then you might "over match" the data and remove the effect of the exposure. Marsh et al (2002) discuss an example of a study examining radiation exposure and the risk of leukemia at a nuclear reprocessing plant. In this study there were 37 workers diagnosed with leukemia (cases), and each was matched to four control workers. Each of the four control workers had to work at the same site, have the same gender, have the same job code, be born within two years of the case, and be hired within two years of the hire date of the case.

Unfortunately, there was a strong trend between hire date and exposure. Exposures were highest early in the plant's history and declined over time. So both hire date and exposure were measuring the same thing. When the data was matched on hire date, it artefactually controlled the exposure and pretty much ensured that the average radiation exposure would be the same among both the cases and the controls. This led to an estimate of the effect of radiation exposure that was actually slightly negative and not statistically significant.

When the data was rematched using all the variables except for hire date, the effect of radiation dose was large and positive and came close to statistical significance.

Matching in a randomized design

In some randomized studies, matching will be used as well. Partly, this is a recognition that randomization will not totally remove covariate imbalance, just like a flip of 100 coins will not always result in exactly 50 heads and 50 tails.
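The coin flip analogy is easy to check with a direct calculation. The short sketch below uses the binomial formula to find how often 100 fair coin flips give exactly 50 heads, and how often the count lands in a nearby range. The range of 45 to 55 is an arbitrary choice made for illustration.

```python
from math import comb

n = 100
# Probability of exactly 50 heads in 100 fair coin flips.
p_exact = comb(n, n // 2) / 2**n
# Probability that the number of heads lands anywhere from 45 to 55.
p_near = sum(comb(n, k) for k in range(45, 56)) / 2**n
print(f"P(exactly 50 heads) = {p_exact:.3f}")
print(f"P(45 to 55 heads)   = {p_near:.3f}")
```

A perfect 50/50 split turns up in only about 8% of such experiments, so some imbalance after randomization is the rule rather than the exception.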

More importantly, however, matching in a randomized study will provide extra precision. Matching creates pairs of subjects who will have greater homogeneity and therefore less variability.

The crossover design

The crossover design represents a special type of matching. In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA).

Since the same subject receives both treatments, there is no possibility of covariate imbalance.

When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart so that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?

Learning and fatigue effects are also potential problems in a crossover design.

Special problems arise when each subject receives the standard therapy first and then the new therapy (or vice versa). Many factors other than the change in therapy can cause a shift in the health of patients over time. Unless the researchers can point to other evidence that shows stability of the condition over time, information from this type of study is worthless.

Sometimes difficult circumstances (such as a general failure to respond to the standard therapy) will force the use of this type of design. Further discussion of lack of randomization or other issues with crossover designs can be found in Louis (1992).

Did the authors use statistical adjustments?

Statistical adjustments represent one way of correcting for covariate imbalance. There are several ways to make statistical adjustments.

First, there are regression adjustments. In a study of breastfeeding, there was an imbalance between the two groups in that one group was much older than the other group. From a regression model, we discover that older mothers breastfeed for longer periods of time, on average, than younger mothers. In fact, for each year of age, the duration of breastfeeding increases by 0.25 weeks on average. So we would adjust the difference of the two groups by 0.25 weeks for every year in discrepancy between the average mothers' ages.
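Here is the arithmetic of that regression adjustment written out. The slope of 0.25 weeks per year comes from the example above, but the group means are hypothetical numbers that I made up so the calculation has something to work on.

```python
# From the text: each extra year of mother's age adds 0.25 weeks of breastfeeding.
slope_weeks_per_year = 0.25

# Hypothetical group summaries, invented for illustration only.
mean_age_a, mean_weeks_a = 29.0, 15.0   # older group
mean_age_b, mean_weeks_b = 25.0, 12.0   # younger group

unadjusted_diff = mean_weeks_a - mean_weeks_b
age_gap = mean_age_a - mean_age_b
# Remove the part of the difference that the age gap alone would predict.
adjusted_diff = unadjusted_diff - slope_weeks_per_year * age_gap
print(unadjusted_diff, adjusted_diff)
```

With these made-up numbers, a four year age gap accounts for one week of the three week difference, leaving an adjusted difference of two weeks.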

Second, there are weighting adjustments. Suppose a group includes 25 males and 75 females, but in the population we know that there should be a 50/50 split by gender. We could re-weight the data, so that each male has a weighting factor of 2.0 and each female has a weighting factor of 0.67. This artificially inflates the number of males to 50 and deflates the number of females to 50. A second group might have 40 males and 60 females. For this group, we would use weights of 1.25 and 0.83.
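The weights in this example are just the target count divided by the observed count. A minimal sketch, with a function name that I chose for illustration:

```python
def reweight(counts, target_share):
    """Weight for each category = target count / observed count."""
    total = sum(counts.values())
    return {g: target_share[g] * total / n for g, n in counts.items()}

target = {"male": 0.5, "female": 0.5}   # known 50/50 split in the population
w1 = reweight({"male": 25, "female": 75}, target)
w2 = reweight({"male": 40, "female": 60}, target)
print({g: round(w, 2) for g, w in w1.items()})
print({g: round(w, 2) for g, w in w2.items()})
```

Multiplying each observed count by its weight gives 50 males and 50 females in both groups, which is the balance the adjustment is meant to produce.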

Both of these adjustments are imperfect, especially when the adjustment variable is imperfectly measured. And these adjustments are impossible if the researchers did not/could not measure the covariates.

Summary - Who did the choosing?

Did the authors use randomization? Randomization promotes balance between the two therapy groups with respect to both measurable and unmeasurable factors.

Did the authors use matching? Matching ensures comparable groups during the selection process.

Did the authors use statistical adjustments? Regression or weighting makes adjustments after the data are collected.

Bibliography

A controlled trial of immunotherapy for asthma in allergic children. N. F. Adkinson, Jr., P. A. Eggleston, D. Eney, E. O. Goldstein, K. C. Schuberth, J. R. Bacon, R. G. Hamilton, M. E. Weiss, H. Arshad, C. L. Meinert, J. Tonascia, B. Wheeler. New England Journal of Medicine 1997: 336(5); 324-31.

Controlled trial of acupuncture for severe recidivist alcoholism. M. L. Bullock, P. D. Culliton, R. T. Olander. Lancet 1989: 1(8652); 1435-9.

The orthomolecular treatment of cancer. II. Clinical trial of high-dose ascorbic acid supplements in advanced human cancer. E. Cameron, A. Campbell. Chem Biol Interact 1974: 9(4); 285-315.

A case-control study of HIV seroconversion in health care workers after percutaneous exposure. Centers for Disease Control and Prevention Needlestick Surveillance Group. D. M. Cardo, D. H. Culver, C. A. Ciesielski, P. U. Srivastava, R. Marcus, D. Abiteboul, J. Heptonstall, G. Ippolito, F. Lot, P. S. McKibben, D. M. Bell. N Engl J Med 1997: 337(21); 1485-90.

Statistics in Action. M.H. Gail. Journal of the American Statistical Association 1996: 91(433); 1-13.

Dietary Fat Intake and the Risk of Coronary Heart Disease in Women. Frank B. Hu, Meir J. Stampfer, JoAnn E. Manson, Eric Rimm, Graham A. Colditz, Bernard A. Rosner, Charles H. Hennekens, Walter C. Willett. N Engl J Med 1997: 337(21); 1491-1499.

Removal of radiation dose response effects: an example of over-matching. J. L. Marsh, J. L. Hutton, K. Binks. BMJ 2002: 325(7359); 327-30.

Observational Studies. PR Rosenbaum (1995) New York: Springer-Verlag.

Evidence-based medicine and treatment choices. D. L. Sackett. Lancet 1997: 349(9051); 570; discussion 572-3.

Fat chance: diet and ischemic stroke [editorial; comment]. R. Sherwin, T. R. Price. JAMA 1997: 278(24); 2185-6.

Where's the Evidence? Debates in Modern Medicine. William A. Silverman (1998) New York: Oxford University Press.


Chapter 2: Was there a plan?

Introduction

The presence of a plan developed before data collection and analysis adds to the quality of a publication.

Case study: Meat consumption and childhood cancer

Studies of the effects of diet on health often have difficulties with multiple endpoints. An example is a 1994 study of the effect of cured and broiled meat consumption on childhood cancer.

This study examined two types of cancer (acute lymphocytic leukemia and brain tumor). The authors examined five types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods). Finally, the authors looked at food consumption both of the child and of the mother during pregnancy.

In the analysis, the researchers used a cut-off to compare low meat consumption to high meat consumption. For example, they compared one or more hamburgers consumed per week to less than one per week. In the text, however, they went further and discussed results with a different cut-off: children who ate two or more hamburgers per week compared to children who ate one or fewer per week.

This study came under a lot of criticism for its scattershot approach to investigation, though it also had its share of defenders. There's a saying in statistics: "if you torture your data long enough, it will confess to something." When a research study has a plan with a limited number of precisely defined hypotheses, the results are more persuasive. When the research has no pre-planned hypotheses, then the results should be considered preliminary and exploratory in nature.

Did the research have a narrow focus?

A good research study has limited objectives that are specified in advance. Failure to limit the scope of a study leads to problems with multiple testing.

When there are a large number of comparisons being made, the study is considered a fishing expedition. There is a saying in statistics circles: "If you torture your data long enough, it will confess to something."

Swaen et al (2001) provide empirical evidence that specifying a hypothesis prior to data collection reduced the chances of a false positive finding by a factor of three.

Pollex et al show a similar finding in a more light hearted research project. They established a statistically significant association between certain astrological signs and winning the Nobel prize (Geminis were more likely to win, Leos were less likely). The authors conclude that

foraging through databases using contrived study designs in the absence of biological mechanistic data sometimes yields spurious results.

When is multiple testing likely to occur?

Multiple testing often occurs when a researcher examines a large number of subgroups or a large number of endpoints (Howel 1994). Multiple testing problems also occur when a study examines multiple side effects.

When multiple tests are done simultaneously within a paper, there is an increase in the overall Type I error. If 100 tests were performed at alpha=.05, you would expect that 5 of those tests would be significant, even if there was nothing at all going on. There are statistical adjustments for multiple comparisons, but these are controversial. Significant results from a large number of unplanned comparisons are useful mostly just for setting future research priorities.
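The arithmetic behind this statement is short enough to write out. The sketch below computes the expected number of false positives and the chance of getting at least one. The second calculation assumes that the 100 tests are independent of one another, which is an assumption that I am adding and which real studies rarely satisfy exactly.

```python
alpha = 0.05
n_tests = 100

# Expected number of significant results when nothing at all is going on.
expected_false_positives = alpha * n_tests

# Chance that at least one test is significant, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
```

Under that independence assumption, at least one spurious finding is all but certain.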

Optimal cut points and the problem with multiple comparisons.

Researchers will often simplify analysis of a continuous outcome measure by dividing that measure into two or more distinct groups on the basis of cut points. For example, a researcher might categorize his/her subjects as high or low blood pressure when they are above or below a certain value.

An abuse of this approach, called the minimum p-value approach, was noted by Altman (1994). Researchers would examine a variety of cut points and select the one that yielded the most favorable statistics.

For example, some researchers have chosen the cut point from among a large number of possible cut points so as to make the difference in survival times between those patients above the cut point and those patients below the cut point as large as possible.

By examining multiple cut points, the chance of drawing a false conclusion (Type I error) is inflated from the traditional 5% value to a value as large as 40%.
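You can see this inflation in a small simulation. The sketch below generates a marker that has no relationship at all to the outcome, tries nine different cut points, and keeps the smallest p-value. Everything about the setup is my own assumption for illustration (100 subjects, normally distributed data, nine cut points at the deciles, a large-sample z-test, and an arbitrary seed), so it is not a reconstruction of the calculations in Altman (1994). The exact amount of inflation depends on how many cut points are tried.

```python
import math
import random

def two_sided_p(a, b):
    """Approximate two-sided p-value for a difference in means (large-sample z-test)."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    (m_a, v_a), (m_b, v_b) = mean_var(a), mean_var(b)
    z = (m_a - m_b) / math.sqrt(v_a / len(a) + v_b / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

def smallest_p(rng, n=100):
    """One simulated study: a marker with no true relation to the outcome."""
    subjects = sorted((rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n))
    outcome = [y for _, y in subjects]           # outcomes ordered by marker value
    cuts = [n * k // 10 for k in range(1, 10)]   # cut after 10%, 20%, ..., 90%
    return min(two_sided_p(outcome[:c], outcome[c:]) for c in cuts)

rng = random.Random(1994)  # fixed seed so the sketch is repeatable
n_sims = 1000
false_positive_rate = sum(smallest_p(rng) < 0.05 for _ in range(n_sims)) / n_sims
print(f"Share of null studies declared significant: {false_positive_rate:.2f}")
```

Even with only nine candidate cut points, far more than 5% of these simulated studies find a "significant" difference that is not there.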

There are several objective ways to select a cut point. Perhaps the best way is to select the cut point prior to looking at the data. This would involve the use of medical judgment.

After the data has been collected, there are some neutral ways of selecting a cut point. The simplest is a median split. If you wanted to create a median split for blood pressure, you would combine the blood pressure data from both groups, and select a value so that half of the blood pressures are larger and half are smaller.
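A median split takes only a few lines. The blood pressure values below are hypothetical and were invented for this example. The key step is that the two groups are pooled before the cut point is chosen, so the choice cannot favor either group.

```python
import statistics

# Hypothetical systolic blood pressures (mm Hg), invented for illustration.
group_a = [118, 125, 131, 142, 150]
group_b = [110, 122, 128, 135, 147]

combined = group_a + group_b              # pool both groups before choosing the cut point
cut_point = statistics.median(combined)   # half of the values fall on each side

def label(values):
    """Classify each value as high or low relative to the pooled median."""
    return ["high" if v > cut_point else "low" for v in values]

print(cut_point)
print(label(group_a))
print(label(group_b))
```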

Subgroup analysis

Subgroup comparisons are a special case of multiple testing. Rather than looking at multiple endpoints, a subgroup analysis compares a single endpoint across several different subgroups within the data.

Subgroup comparisons suffer from three problems. First, the subgroup comparison is usually a non-randomized comparison. Second, the subgroup comparison has less precision because the sample size is smaller. Third, the number of subgroups that could be examined is so large that it can swamp the sample size of the study.

If you find a subgroup that behaves differently, then you need to ask yourself a few questions. Is this a subgroup that I would have studied a priori if I had been more careful during the planning stage? Is there a plausible mechanism to explain why this subgroup behaves differently? Are there other studies that have similar findings for this subgroup?

There are some technical issues with subgroup comparisons. You wouldn't want to declare that a therapy is effective in one subgroup if the p-value for that subgroup was 0.043 and the p-value for the others was 0.062. The analysis of subgroups should be done as a formal test of interaction.
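To see why the two p-values quoted above tell you so little, convert them back to z statistics and compare the effects directly. This is only a rough sketch. It assumes, hypothetically, that both subgroup effects point in the same direction and were estimated with the same standard error; a real test of interaction would use the actual estimates and standard errors from the study.

```python
from statistics import NormalDist

normal = NormalDist()


def z_from_two_sided_p(p):
    """Size of the z statistic that produces a given two-sided p-value."""
    return normal.inv_cdf(1 - p / 2)


def interaction_p_value(p_first, p_second):
    """Two-sided p-value for the difference between two subgroup effects.

    Assumes (hypothetically) that both effects point in the same direction
    and were estimated with the same standard error.
    """
    z_diff = (z_from_two_sided_p(p_first) - z_from_two_sided_p(p_second)) / 2 ** 0.5
    return 2 * (1 - normal.cdf(abs(z_diff)))


print(round(z_from_two_sided_p(0.043), 2), round(z_from_two_sided_p(0.062), 2))
print(round(interaction_p_value(0.043, 0.062), 2))
```

The two z statistics are nearly identical, and under these assumptions the p-value for the difference between the subgroups is close to 0.9. One subgroup landing just below 0.05 and the other just above it is no evidence that the subgroups behave differently.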

A recent publication in the International Journal of Epidemiology (Swaen 2001) provides empirical evidence that post hoc analyses are more likely to lead to false positive findings.

Did the authors deviate from the plan?

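Not all research is predictable, so deviations from a pre-designed plan are sometimes necessary. Nevertheless, be cautious about any major deviation from the original research protocol. Some examples of deviations from the plan include adding new end points, developing new exclusion criteria during the course of the study, and terminating the study early.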

You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.

An example of a deviation from the research plan.

An interesting deviation from the research plan occurs in a randomized, double-blind, controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal cell carcinoma and squamous cell carcinoma of the skin as the primary end points. The researchers also looked for signs of selenium toxicity.

In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol.

Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found.

Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance.

There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia.

Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what impact the early termination of the study had on the statistics. Early termination of a study can cause serious biases, unless specific rules for early termination are established at the start of the study.

Fraudulent changes in the protocol

Detecting fraud in a research study is extremely difficult for anyone, but especially difficult for the reader. A thorough peer review provides a limited level of protection from fraud. Hawkey (2001) proposes that journals should see the original protocols for research studies as part of the peer review process. This practice, which has not yet been widely adopted, would provide some level of protection against fraud.

Sometimes a careful review of the numbers in a study can highlight the possibility of fraud. If a study used randomization, for example, watch out if there is an unexpected and unexplained deviation from a 50-50 split between treatment and control.
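If you want to know how surprising a particular split is, an exact binomial calculation will tell you. The sketch below assumes simple randomization with an intended 1:1 ratio, where each patient is in effect assigned by a fair coin flip. Trials that use blocked randomization will sit much closer to 50-50, and some trials plan an unequal ratio, so read the methods section first. The function name and the counts are hypothetical.

```python
from math import comb


def allocation_p_value(n_treatment, n_control):
    """Exact two-sided binomial p-value for departure from a 50-50 split."""
    n = n_treatment + n_control
    observed = comb(n, n_treatment)
    # Add up the probability of every split that is no more likely than the
    # observed one under fair, independent coin-flip allocation.
    as_extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return as_extreme / 2 ** n


print(allocation_p_value(100, 100))
print(allocation_p_value(130, 70))
```

A very small p-value does not prove fraud. It only tells you that the split needs an explanation.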

Replication of research findings is also a good protection against fraud.

Did the authors discard outliers?

You should be skeptical of any study that removes outliers. Inappropriate removal of outliers can seriously bias the study results.

Sometimes the outliers are more interesting than the bulk of the data themselves. You may gain more insight by trying to uncover the cause of an outlying observation than you would by examining the relatively small effects that occur with the rest of the data.

It is generally a bad idea to remove data points on the basis of their data values alone. If an investigation of an outlier leads to a discovery of a typing error or the inclusion of a subject who did not meet the pre-specified inclusion criteria, then correction or removal of the outlier is appropriate.

If there is no such justification, then the best solution is to leave the outlier alone. Another alternative is reporting data analysis results both with and without the outlier.
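Reporting both analyses is simple in practice. Here is a minimal sketch with made-up lengths of stay; the data and the function name are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical lengths of stay (days); the last value is the suspected outlier.
stays = [3, 4, 4, 5, 5, 6, 6, 7, 31]


def summarize(values):
    """Return the mean and standard deviation, rounded for reporting."""
    return round(mean(values), 2), round(stdev(values), 2)


with_outlier = summarize(stays)
without_outlier = summarize(stays[:-1])

# Report both results side by side rather than silently dropping the point.
print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

When the two sets of results lead to the same conclusion, the outlier is not driving the findings. When they disagree, the reader deserves to see both.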

An example of inappropriate outlier deletion.

The NASA web site has an interesting example of outlier deletion. Researchers in the 1980s first published information about the hole in the ozone layer above Antarctica. These researchers were nervous because the results from the British Antarctic Survey did not match results from earlier years taken by an American satellite. The authors discovered, however, that the American satellite had a built-in computer filter that automatically removed any large, sudden changes in ozone concentration, treating them as instrument errors. When this filter was removed, the authors were able to trace the development of the ozone hole all the way back to 1976.

Further details about the history of the ozone hole can be found at Ozone Depletion, History and Politics. Brien Sparling. (Accessed on January 11, 2001) http://www.nas.nasa.gov/About/Education/Ozone/history.html

This site explains how the ozone hole was first discovered. It mentions the (inaccurate) claim that a computer filter on a previous satellite had discarded outliers which masked the discovery of the ozone hole for eight years. Although this makes a wonderful teaching example, the actual story is not quite that good.

Summary - Was there a plan?

The presence of a plan developed before data collection and analysis adds to the quality of a publication.

Did the research have a narrow focus? A large number of comparisons limits the amount of confidence that you can place in any single conclusion. Results from a limited number of planned comparisons are considered more authoritative.

Did the authors deviate from the plan? While minor deviations are expected, be cautious about major deviations from the research plan, such as developing new exclusion criteria during the course of the study. In particular, removing outliers without a sound scientific reason is dangerous.

Further reading

Responsibilities of sponsors are limited in premature discontinuation of trials. Richard Ashcroft. BMJ 2001: 323(7303); 53-.

Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results From the Women's Health Initiative randomized controlled trial. Jacques E. Rossouw, Garnet L. Anderson, Ross L. Prentice, Andrea Z. LaCroix, Charles Kooperberg, Marcia L. Stefanick, Rebecca D. Jackson, Shirley A. A. Beresford, Barbara V. Howard, Karen C. Johnson, Jane Morley Kotchen, Judith Ockene. Jama 2002: 288(3); 321-33. CONTEXT: Despite decades of accumulated observational evidence, the balance of risks and benefits for hormone use in healthy postmenopausal women remains uncertain. OBJECTIVE: To assess the major health benefits and risks of the most commonly used combined hormone preparation in the United States. DESIGN: Estrogen plus progestin component of the Women's Health Initiative, a randomized controlled primary prevention trial (planned duration, 8.5 years) in which 16608 postmenopausal women aged 50-79 years with an intact uterus at baseline were recruited by 40 US clinical centers in 1993-1998. INTERVENTIONS: Participants received conjugated equine estrogens, 0.625 mg/d, plus medroxyprogesterone acetate, 2.5 mg/d, in 1 tablet (n = 8506) or placebo (n = 8102). MAIN OUTCOMES MEASURES: The primary outcome was coronary heart disease (CHD) (nonfatal myocardial infarction and CHD death), with invasive breast cancer as the primary adverse outcome. A global index summarizing the balance of risks and benefits included the 2 primary outcomes plus stroke, pulmonary embolism (PE), endometrial cancer, colorectal cancer, hip fracture, and death due to other causes. RESULTS: On May 31, 2002, after a mean of 5.2 years of follow-up, the data and safety monitoring board recommended stopping the trial of estrogen plus progestin vs placebo because the test statistic for invasive breast cancer exceeded the stopping boundary for this adverse effect and the global index statistic supported risks exceeding benefits. 
This report includes data on the major clinical outcomes through April 30, 2002. Estimated hazard ratios (HRs) (nominal 95% confidence intervals [CIs]) were as follows: CHD, 1.29 (1.02-1.63) with 286 cases; breast cancer, 1.26 (1.00-1.59) with 290 cases; stroke, 1.41 (1.07-1.85) with 212 cases; PE, 2.13 (1.39-3.25) with 101 cases; colorectal cancer, 0.63 (0.43-0.92) with 112 cases; endometrial cancer, 0.83 (0.47-1.47) with 47 cases; hip fracture, 0.66 (0.45-0.98) with 106 cases; and death due to other causes, 0.92 (0.74-1.14) with 331 cases. Corresponding HRs (nominal 95% CIs) for composite outcomes were 1.22 (1.09-1.36) for total cardiovascular disease (arterial and venous disease), 1.03 (0.90-1.17) for total cancer, 0.76 (0.69-0.85) for combined fractures, 0.98 (0.82-1.18) for total mortality, and 1.15 (1.03-1.28) for the global index. Absolute excess risks per 10 000 person-years attributable to estrogen plus progestin were 7 more CHD events, 8 more strokes, 8 more PEs, and 8 more invasive breast cancers, while absolute risk reductions per 10 000 person-years were 6 fewer colorectal cancers and 5 fewer hip fractures. The absolute excess risk of events included in the global index was 19 per 10 000 person-years. CONCLUSIONS: Overall health risks exceeded benefits from use of combined estrogen plus progestin for an average 5.2-year follow-up among healthy postmenopausal US women. All-cause mortality was not affected during the trial. The risk-benefit profile found in this trial is not consistent with the requirements for a viable intervention for primary prevention of chronic diseases, and the results indicate that this regimen should not be initiated or continued for primary prevention of CHD. [Abstract]

Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin. A randomized controlled trial. Nutritional Prevention of Cancer Study Group. L. C. Clark, G. F. Combs, Jr., B. W. Turnbull, E. H. Slate, D. K. Chalker, J. Chow, L. S. Davis, R. A. Glover, G. F. Graham, E. G. Gross, A. Krongrad, J. L. Lesher, Jr., H. K. Park, B. B. Sanders, Jr., C. L. Smith, J. R. Taylor. Jama 1996: 276(24); 1957-63. OBJECTIVE: To determine whether a nutritional supplement of selenium will decrease the incidence of cancer. DESIGN: A multicenter, double-blind, randomized, placebo-controlled cancer prevention trial. SETTING: Seven dermatology clinics in the eastern United States. PATIENTS: A total of 1312 patients (mean age, 63 years; range, 18-80 years) with a history of basal cell or squamous cell carcinomas of the skin were randomized from 1983 through 1991. Patients were treated for a mean (SD) of 4.5 (2.8) years and had a total follow-up of 6.4 (2.0) years. INTERVENTIONS: Oral administration of 200 microg of selenium per day or placebo. MAIN OUTCOME MEASURES: The primary end points for the trial were the incidences of basal and squamous cell carcinomas of the skin. The secondary end points, established in 1990, were all-cause mortality and total cancer mortality, total cancer incidence, and the incidences of lung, prostate, and colorectal cancers. RESULTS: After a total follow-up of 8271 person-years, selenium treatment did not significantly affect the incidence of basal cell or squamous cell skin cancer. There were 377 new cases of basal cell skin cancer among patients in the selenium group and 350 cases among the control group (relative risk [RR], 1.10; 95% confidence interval [CI], 0.95-1.28), and 218 new squamous cell skin cancers in the selenium group and 190 cases among the controls (RR, 1.14; 95% CI, 0.93-1.39). 
Analysis of secondary end points revealed that, compared with controls, patients treated with selenium had a nonsignificant reduction in all-cause mortality (108 deaths in the selenium group and 129 deaths in the control group [RR; 0.83; 95% CI, 0.63-1.08]) and significant reductions in total cancer mortality (29 deaths in the selenium treatment group and 57 deaths in controls [RR, 0.50; 95% CI, 0.31-0.80]), total cancer incidence (77 cancers in the selenium group and 119 in controls [RR, 0.63; 95% CI, 0.47-0.85]), and incidences of lung, colorectal, and prostate cancers. Primarily because of the apparent reductions in total cancer mortality and total cancer incidence in the selenium group, the blinded phase of the trial was stopped early. No cases of selenium toxicity occurred. CONCLUSIONS: Selenium treatment did not protect against development of basal or squamous cell carcinomas of the skin. However, results from secondary end-point analyses support the hypothesis that supplemental selenium may reduce the incidence of, and mortality from, carcinomas of several sites. These effects of selenium require confirmation in an independent trial of appropriate design before new public health recommendations regarding selenium supplementation can be made.

Societal responsibilities of clinical trial sponsors. Stephen Evans, Stuart Pocock. BMJ 2001: 322(7286); 569-570.

Journals should see original protocols for clinical trials. C J Hawkey. BMJ 2001: 323(7324); 1309-.

Assessing cause and effect from trials: a cautionary note. D. Howel, R. Bhopal. Control Clin Trials 1994: 15(5); 331-4.

Premature discontinuation of clinical trial for reasons not related to efficacy, safety, or feasibility. Commentary: Early discontinuation violates Helsinki principles. Michel Lievre, Joel Menard, Eric Bruckert, Joel Cogneau, Francois Delahaye, Philippe Giral, Eran Leitersdorf, Gerald Luc, Luis Masana, Philippe Moulin, Philippe Passa, Denis Pouchain, Gerard Siest, K Boyd. BMJ 2001: 322(7286); 603-606. When investigators embark on a clinical trial, they naturally expect that the journey will end with the completion of the scheduled patient follow up and publication of the results. Some trials may sink en route because of organisational or ethical reasons, and such misfortunes must be accepted. Sometimes, however, trials are scuttled by their sponsors. Such premature discontinuation not only is frustrating for investigators but may have important medical implications. In this article we analyse the case of a clinical trial that was recently stopped for financial reasons, discuss the consequences of such discontinuations, and make some proposals to avoid recurrence.

Randomised controlled trial of cardiotocography versus Doppler auscultation of fetal heart at admission in labour in low risk obstetric population. G. Mires, F. Williams, P. Howie. British Medical Journal 2001: 322(7300); 1457-60; discussion 1460-2. (See "Commentary: changes between protocol and manuscript should be declared at submission" at the end of this article.) OBJECTIVE: To compare the effect of admission cardiotocography and Doppler auscultation of the fetal heart on neonatal outcome and levels of obstetric intervention in a low risk obstetric population. DESIGN: Randomised controlled trial. SETTING: Obstetric unit of teaching hospital PARTICIPANTS: Pregnant women who had no obstetric complications that warranted continuous monitoring of fetal heart rate in labour. INTERVENTION: Women were randomised to receive either cardiotocography or Doppler auscultation of the fetal heart when they were admitted in spontaneous uncomplicated labour. MAIN OUTCOME MEASURES: The primary outcome measure was umbilical arterial metabolic acidosis. Secondary outcome measures included other measures of condition at birth and obstetric intervention. RESULTS: There were no significant differences in the incidence of metabolic acidosis or any other measure of neonatal outcome among women who remained at low risk when they were admitted in labour. However, compared with women who received Doppler auscultation, women who had admission cardiotocography were significantly more likely to have continuous fetal heart rate monitoring in labour (odds ratio 1.49, 95% confidence interval 1.26 to 1.76), augmentation of labour (1.26, 1.02 to 1.56), epidural analgesia (1.33, 1.10 to 1.61), and operative delivery (1.36, 1.12 to 1.65). CONCLUSIONS: Compared with Doppler auscultation of the fetal heart, admission cardiotocography does not benefit neonatal outcome in low risk women. Its use results in increased obstetric intervention, including operative delivery. 

Celestial determinants of success in research. R. Pollex, B. Hegele, M.R. Ban. CMAJ 2001: 165(12); 1584.

Ozone Depletion, History and politics. Brien Sparling. Accessed on 2002-11-27. "Ground based measurements of Ozone were first started in 1956, in at Halley Bay, Antarctica. Satellite measurements of ozone started in the early 70's, but the first comprehensive worldwide measurements started in 1978 with the Nimbus-7 satellite. Nimbus-7 carried a TOMS (total ozone mapping spectrometer, and a SBUV(solar backscatter UV meter). The TOMS finally broke on May 7th,1993, but today there are several different satellites measuring concentrations of ozone and other atmosheric gases. Gases in the troposphere and lower stratosphere are sampled by weather balloons or by airplanes such as the ER-2 managed by NASA." www.nas.nasa.gov/About/Education/Ozone/history.html

False positive outcomes and design characteristics in occupational cancer epidemiology studies. G. G. Swaen, O. Teggeler, L. G. van Amelsvoort. Int J Epidemiol 2001: 30(5); 948-54. BACKGROUND: Recently there has been considerable debate about possible false positive study outcomes. Several well-known epidemiologists have expressed their concern and the possibility that epidemiological research may loose credibility with policy makers as well as the general public. METHODS: We have identified 75 false positive studies and 150 true positive studies, all published reports and all epidemiological studies reporting results on substances or work processes generally recognized as being carcinogenic to humans. All studies were scored on a number of design characteristics and factors relating to the specificity of the research objective. These factors included type of study design, use of cancer registry data, adjustment for smoking and other factors, availability of exposure data, dose- and duration-effect relationship, magnitude of the reported relative risk, whether the study was considered a 'fishing expedition', affiliation and country of the first author. RESULTS: The strongest factor associated with the false positive or true positive study outcome was if the study had a specific a priori hypothesis. Fishing expeditions had an over threefold odds ratio of being false positive. Factors that decreased the odds ratio of a false positive outcome included observing a dose-effect relationship, adjusting for smoking and not using cancer registry data. CONCLUSION: The results of the analysis reported here clearly indicate that a study with a specific a priori study objective should be valued more highly in establishing a causal link between exposure and effect than a mere fishing expedition.


Who knew what when?

Introduction

Knowledge of group membership during data collection can cause problems. When possible, the treatment status should be hidden from the patients, anyone who interacts with the patients, anyone who evaluates the patients, and anyone who collects data from the patients. Even when this is not possible, the randomization list should stay concealed until the patient agrees to participate in the study and is shown to be eligible for the study.

Acupuncture

Acupuncture is an example of a therapy that is difficult to blind. One study examined the effect of acupuncture on the prevention of recidivism among alcohol and other drug abusers (Bullock et al 1989). This study used a placebo acupuncture that placed needles 5 mm away from the designated acupuncture point.

The use of placebo acupuncture was intended to keep information about the treatment groups hidden from the patients themselves. The patients knew that they were being "needled", but they did not know if the needles were placed correctly or incorrectly. The assumption for this study is that if acupuncture is effective, then correct application of acupuncture should show a greater effect than incorrect application of acupuncture. There is some controversy, however, over this assumption (Nahin and Straus 2001).

Because of the nature of acupuncture, the acupuncturists were aware of which patients were which, making this only a partially blinded study. A critique of this study (Sampson 1997) pointed out that there were significant interactions between the acupuncturists and the patients, with opportunities for indirect suggestion and nonverbal communication to occur. One indication that subjects became aware of who was in which group was the fact that there was a far greater tendency for control subjects to drop out of the study.

Definition of blinding.

In an experimental study, it is desirable (but not always possible) to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as "blinding" or "masking." Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study.

There is always some individual who knows which patients get which treatments, such as the pharmacy that prepares the pills and placebos. This is perfectly fine as long as these individuals do not interact with the patients or evaluate the patients.

There is a bit of ambiguity with respect to who is blinded (Devereaux et al 2001). For example, a survey of 25 textbooks produced nine different definitions of "double blind." Therefore, you should avoid using these terms and focus instead on which individuals are blinded. If you are evaluating an article, look for evidence of blinding for each of the following groups: the patients, anyone who interacts with the patients, anyone who evaluates the patients, and anyone who collects data from the patients.

If only some of the above are unaware of the treatment, then the study is partially blinded.

The effect of blinding on the patient.

Blinding prevents the placebo effect from distorting the research results. The placebo effect is a product of "belief, expectancy, cognitive reinterpretation, and diversion of attention" that can lead to psychological and sometimes physiological improvements in situations where the treatment is known to have no effect, such as sugar pills (Beyerstein 1997).

Johnson (1997) lists three specific situations where the placebo effect is of particular concern: when enthusiasm by the patient or the doctor for the new procedure is strong, when outcomes are based on the patient's self-assessment (e.g. quality of life studies), and when the treatment is primarily for symptoms. The placebo effect is less critical for objective outcomes like survival.

A recent study showed that the placebo effect might be overstated in some contexts (Hrobjartsson and Gotzsche 2001). Some of the effects attributed to the placebo are perhaps caused instead by statistical artefacts like regression to the mean or by the tendency of some conditions to resolve spontaneously.

Even without a placebo effect, blinding would still be important to ensure uniform rates of compliance. You want to avoid a situation where a patient thinks "I'm in the placebo arm, so it's not really important whether I show up for my follow-up evaluation."

The effect of blinding on the investigators.

The value of blinding also extends to the research team, and should include anyone who interacts with the patients. In a clinical trial of treatments for multiple sclerosis, a pair of neurologists assessed the outcome of each patient (Noseworthy et al 1994). One neurologist was blinded to the treatment status and one was unblinded. The unblinded neurologist gave substantially lower ratings to patients in the placebo group, which would have led to falsely concluding that one of the treatments was effective.

Researchers can also influence the outcome through their attitudes and through their differential use of other medications (Schulz et al 2002).

Those who collect data through an interview might probe harder for some patients if they are not blinded. Gail (1996) describes an observational study where the people asking questions about smoking and other risk factors were unaware of when they were interviewing lung cancer patients or controls. Thus, the interviewers could not subconsciously prod more for smoking information among the lung cancer patients.

When blinding is impossible

Unfortunately, there are many situations where blinding is impossible. For example, if you are comparing oral versus rectal administration of a drug, that's pretty hard to conceal from the patient. In general, observational studies cannot be blinded, because the patient and/or their doctor selects the treatment group.

Surgical procedures are often difficult to completely blind. Nevertheless, Johnson (1997) suggests some partial steps at blinding that prevent some of the biases from creeping in. If two surgical procedures use different types of incisions, identical blood- or iodine-stained opaque dressings could be used to keep the patients unaware of which operation was performed. Also, although the surgeon cannot be blinded to the difference in surgery, those who evaluate the health of the patient after surgery could be kept unaware of the particular operation, so as to ensure that their evaluation of the patient is unbiased.

Even though the placebo may look the same, sometimes the doctor may infer which group a patient belongs to, perhaps through noting a characteristic set of side effects. If you are worried about this, ask the doctors to try to identify which treatment group they believe each patient belonged to. If the percentage of correct guesses is significantly larger than 50%, then the allocation scheme was not sufficiently blinded.
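Checking the guesses is a one-sided binomial calculation. This sketch assumes two equally sized treatment groups, so that a doctor with no information would be right about half the time, and it treats each guess as independent. The function name and the counts are hypothetical.

```python
from math import comb


def chance_of_this_many_correct(n_correct, n_guesses):
    """One-sided exact binomial probability of n_correct or more lucky guesses."""
    # With two equally likely treatments, a blinded guesser should be right
    # about half the time, so each guess is treated as a fair coin flip.
    favourable = sum(comb(n_guesses, k) for k in range(n_correct, n_guesses + 1))
    return favourable / 2 ** n_guesses


# Hypothetical example: doctors guessed the treatment correctly for 70 of 100 patients.
print(chance_of_this_many_correct(70, 100))
```

A tiny probability here says that the doctors were doing better than chance, which means that something was giving the treatment away.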

Although unblinded studies are considered less authoritative than blinded studies, you should not use blinding as a surrogate marker for the quality of the research (Schulz et al 2002). For example, Rupert Sheldrake conducted a survey of various journals and showed that blinding was used in 85% of all parapsychology research. But it would be a mistake to claim, as Dr. Sheldrake does, that

"Parapsychologists ... have been constantly subjected to intense scrutiny by skeptics, and this has made them more rigorous." http://www.parascope.com/en/articles/blindScience.htm

Blinding is just one of many factors that combine to indicate a study's rigor and quality.

The problem with studies without blinding.

Two researchers have examined studies with and without blinding. These authors found that studies without blinding show an average bias of 11-17% (Schulz 1996; Colditz 1989). In other words, when an unblinded study was compared to a blinded study, the former study tended to estimate a treatment effect that was (on average) 11% to 17% higher than the latter.

Additional evidence of this problem appears in a meta-analysis of the effect of intermittent sunlight exposure and melanoma (Nelemans 1995). When nine studies without blinding were combined, they showed an odds ratio of 1.84 which was statistically significant (95% confidence interval 1.52 to 2.25). When the seven studies with blinding were combined, they showed a much smaller odds ratio (1.17, 95% confidence interval 0.98 to 1.39) which was not statistically significant. This is further evidence that unblinded studies are more likely to show statistical significance than blinded studies.
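You can also compare the two pooled odds ratios directly instead of just noting that one is significant and the other is not. The sketch below is my own back-of-the-envelope calculation, not an analysis taken from Nelemans (1995). It assumes that the two sets of studies are independent and that each confidence interval is symmetric on the log scale, so that the standard error can be recovered from the width of the interval.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def log_or_and_se(odds_ratio, lower, upper):
    """Log odds ratio and its standard error, recovered from a 95% confidence interval."""
    return log(odds_ratio), (log(upper) - log(lower)) / (2 * 1.96)


def compare_odds_ratios(first, second):
    """Ratio of two independent odds ratios with a z statistic and two-sided p-value."""
    log1, se1 = log_or_and_se(*first)
    log2, se2 = log_or_and_se(*second)
    z = (log1 - log2) / sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return exp(log1 - log2), z, p


unblinded = (1.84, 1.52, 2.25)  # pooled odds ratio and 95% CI, studies without blinding
blinded = (1.17, 0.98, 1.39)    # pooled odds ratio and 95% CI, studies with blinding
ratio, z, p = compare_odds_ratios(unblinded, blinded)
print(round(ratio, 2), round(z, 2), round(p, 4))
```

Under these assumptions the unblinded studies give an odds ratio roughly one and a half times as large as the blinded studies, and the difference is larger than you would expect from sampling error alone.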

Concealed allocation.

Another important aspect of research is concealed allocation, which is the concealment of the randomization list from those involved with recruiting subjects. This concealment occurs until after subjects agree to participate and the recruiter determines that the patient is eligible for the study.

It is always possible to conceal the randomization list, even when the treatment itself cannot be blinded. Check out all the exclusion criteria and if the subject qualifies, open a sealed envelope which identifies which group the patient belongs to. So, for example, it is impossible to use blinding when comparing a surgical to a non-surgical technique, but the selection of who gets the surgical technique could be hidden from both the patient and the surgeon until after all the selection and inclusion criteria are applied.

Knowledge of treatment order allows the doctors recruiting patients to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unblinded allocation schemes show an average bias of 30-40% (Schulz 1996).

There are many stories of physicians who have tried and succeeded in recruiting a patient into a preferred group. If the treatment allocation is hidden in sealed envelopes, they can hold it up to a strong light. If the sealed envelopes are not sequentially numbered, they can open several envelopes at once. If the allocation is controlled by a central operator, they can call and ask for the allocation of several patients at once.

When a doctor has an overt preference to enroll a patient into one group over another, it raises ethical issues about equipoise and perhaps the doctor should not be participating in the trial.

Concealed allocation only makes sense for a truly randomized study. For convenience, some researchers will allocate in a systematic (non-random) fashion, such as alternating regularly between the two treatments. This is a bad idea. Systematic allocations allow the doctors to guess which group the next patient is going to be allocated to, leading to the same potential problems described above. Systematic assignment causes an average bias of 15% (Colditz 1989).
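It is easy to show how predictable an alternating scheme is. In this sketch a hypothetical recruiter always guesses that the next patient will get the opposite of whatever the last patient got. The treatment labels, the sample size, and the function names are all made up for illustration.

```python
import random


def guess_next_from_previous(previous):
    """A recruiter's guess: assume the next assignment is the opposite of the last one."""
    return "B" if previous == "A" else "A"


def proportion_guessed(assignments):
    """Share of assignments (after the first) that the recruiter predicts correctly."""
    pairs = zip(assignments, assignments[1:])
    correct = sum(guess_next_from_previous(prev) == nxt for prev, nxt in pairs)
    return correct / (len(assignments) - 1)


alternating = ["A" if i % 2 == 0 else "B" for i in range(200)]
rng = random.Random(0)
randomized = [rng.choice("AB") for _ in range(200)]

print(proportion_guessed(alternating))
print(proportion_guessed(randomized))
```

With alternation the recruiter is right every single time. With true randomization the same strategy should do no better than a coin flip.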

Summary - Who knew what when?

Knowledge of group membership, either before or during the data collection, can bias the study. Ask yourself who knew what when. Ideally, information about the treatment should be hidden from the patients themselves, anyone interacting with the patients, anyone evaluating the patients, and anyone collecting data from the patients. The randomization list should be concealed and the treatment assignment should not be revealed until the patient agrees to participate in the study and the recruiting physician has verified that the patient is eligible for the study.

Further reading

Controlled trial of acupuncture for severe recidivist alcoholism. Bullock ML, Culliton PD and Olander RT. Lancet 1989:1(8652);1435-9.

How study design affects outcomes in comparisons of therapy. I: Medical. Colditz G, Miller J and Mosteller F. Stat Med 1989:8(4);441-454.

"Double blind, you are the weakest link- good-bye!" Devereaux PJ, Bhandari M, Montori VM, Manns BJ, Ghali WA and Guyatt GH. ACP Journal Club 2002:136;A11-A12.

Statistics in Action. Gail MH. Journal of the American Statistical Association 1996:91(433);1-13.

Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. Hrobjartsson A and Gotzsche PC. N Engl J Med 2001:344(21);1594-602.

Removing bias in surgical trials. Johnson AG and Dixon JM. British Medical Journal 1997:314(7085);916-7.

Research into complementary and alternative medicine: problems and potential. Nahin RL and Straus SE. British Medical Journal 2001:322(7279);161-4.

An addition to the controversy on sunlight exposure and melanoma risk: a meta-analytical approach. Nelemans PJ, Rampen FH, Ruiter DJ and Verbeek AL. J Clin Epidemiol 1995:48(11);1331-42.

The impact of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial. Noseworthy JH, Ebers GC, Vandervoort MK, Farquhar RE, Yetisir E and Roberts R. Neurology 1994:44(1);16-20.

Inconsistencies and Errors in Alternative Medicine Research. Sampson W. Skeptical Inquirer 1997 (September/October);21(5):35-38.

Empirical evidence of bias dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Schulz K, Chalmers I, Hayes R and Altman D. JAMA 1995:273(5);408-12.

Randomised trials, human nature, and reporting guidelines. Schulz KF. Lancet 1996:348(9027);596-8.

The Landscape and Lexicon of Blinding in Randomized Trials. Schulz KF, Chalmers I and Altman DG. Annals of Internal Medicine 2002:136(3);254-259.

Allocation concealment in randomised trials: defending against deciphering. Schulz KF and Grimes DA. Lancet 2002:359;614-618.

Blinding in randomised trials: hiding who got what. Schulz KF and Grimes DA. Lancet 2002:359;696-700.

Generation of allocation sequences in randomised trials: chance not choice. Schulz KF and Grimes DA. Lancet 2002:359;515-519.


Who was left out?

Introduction

Research studies often have a narrow focus, but sometimes it can be too narrow. When too many patients are left out, those who remain may not be representative of the types of patients you will encounter.

When you are trying to figure out who was left out and what impact this has, ask the following two questions:

4.1 Who was excluded at the start of the study?
4.2 Who dropped out during the study?

Nicotine patches

The Journal of Pediatrics published a study of adolescent smokers in 1996. The researchers recruited 22 volunteers from five public high schools in the Rochester, MN area for participation in a smoking cessation program involving behavioral counseling, group therapy, and nicotine patches. Researchers measured the number of cigarettes smoked, side effects, and blood levels of nicotine.

The purpose of the research was to evaluate "the safety, tolerance, and efficacy of 22 mg/d nicotine patch therapy in smokers younger than 18 years who were trying to stop smoking." The authors also listed a secondary goal, "to compare blood cotinine levels, nicotine withdrawal scores, and adverse experiences with those of adults obtained in previous patch studies." Cotinine is a metabolite of nicotine and provides a useful objective measure of cigarette smoking. It also allowed the authors to examine whether nicotine toxicity was an issue.

This study did not include major segments of the teenage smoking population. The study included only white subjects because there were too few minority students in the Rochester area. Subjects had to get parental permission, excluding smokers who wished to keep their habit secret from their parents. Subjects were also volunteers, and thus could be considered more motivated to quit than the typical teenage smoker.

The study also had a serious dropout rate. Of the presumably thousands of teenage smokers in the Rochester, Minnesota area, only 71 volunteers responded to the initial call for subjects. Of those 71 volunteers, 55% (39) met the inclusion criteria. Of those 39, 44% declined to attend the initial meeting, leaving 22. Of those 22, 14% were non-compliant, leaving 18. Of those 18, 39% failed to respond to the one year survey. Only 11 completed the entire study (50% of those who started the study; 28% of those meeting inclusion criteria; 15% of the initial volunteers).

This study had a serious problem with who was left out. The large number of subjects who did not get into the study or who did not complete the study makes it hard to generalize the findings of this research.
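The attrition in the nicotine patch study can be tabulated with a few lines of Python. This is a sketch for checking the arithmetic, not part of the original study: the counts (71, 39, 22, 18, 11) come from the description above, while the stage labels and function names are my own. The percentages are recomputed from the counts and can differ slightly from the figures quoted above (the step from 22 to 18 subjects, for instance, works out to about 18%).

```python
# Counts at each stage, taken from the description of the nicotine patch study.
# The stage labels are paraphrases chosen for this example.
stages = [
    ("initial volunteers", 71),
    ("met inclusion criteria", 39),
    ("attended the initial meeting", 22),
    ("compliant with the protocol", 18),
    ("answered the one year survey", 11),
]


def stage_losses(stages):
    """Percent lost between each consecutive pair of stages."""
    losses = []
    for (_, before), (label, after) in zip(stages, stages[1:]):
        losses.append((label, 100 * (before - after) / before))
    return losses


def retention(stages, baseline_label):
    """Percent of the baseline stage still present at the final stage."""
    counts = dict(stages)
    final_count = stages[-1][1]
    return 100 * final_count / counts[baseline_label]


for label, lost in stage_losses(stages):
    print(f"{label}: {lost:.0f}% of the previous stage lost")

for label in ("attended the initial meeting",
              "met inclusion criteria",
              "initial volunteers"):
    print(f"completers as a share of '{label}': {retention(stages, label):.0f}%")
```

When you read a paper, it is worth building the same small table from the trial profile, because the answer to "what fraction finished?" depends heavily on which denominator the authors chose to report.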

4.1 Who was excluded at the start of the study?

Researchers, trying to minimize variation, will use exclusion criteria to create more homogeneous groups. While minimizing variability is good, too much homogeneity can backfire. It’s difficult to extrapolate results from a very tightly controlled and homogeneous clinical trial to the variation of patients seen in your practice. Ask yourself the question "How similar are my patients?"

For the study to be useful to us, we want the research subjects to be as similar as possible to the patients we see. Watch out for exclusion criteria that leave out large groups of patients. Also be aware that too many research studies exclude women unnecessarily.

Ask yourself whether the geographic location or the type of health care setting places restrictions on the type of patients seen. Tertiary care centers tend to see the most seriously ill patients. A study set in Midwest hospitals will probably include proportionally fewer Hispanic patients than one set in the Southwest.

Exclusion of elderly patients

[To be added]

Exclusion of women

[To be added]

Exclusion of children

[To be added]

Volunteer bias

Quite often, the only patients we are able to study are those who volunteer to help out. The use of volunteers, however, may exclude important segments of the patient population.

Volunteers may differ from the normal population on several critical factors. Volunteers for a study involving cash payments may come more often from economically challenged environments. If a free health check-up is included, volunteers may come more often from people worried about their health status. Volunteers for lengthy studies are less likely to be employed.

Recruiting controls is especially troublesome in a study that involves a painful procedure. Gustavsson (1997) documents volunteer bias in a study of lumbar puncture to obtain cerebrospinal fluid.

In this study, subjects were asked to submit to a lumbar puncture in order to "examine the associations between personality traits and biochemical variables." Of the 87 subjects, 48 declined to participate. The authors were fortunate enough to have measures of personality on both those who participated in the study and those who did not participate.

Those who participated had scores roughly a half standard deviation higher on impulsiveness. They did not differ on other personality traits such as socialization and detachment.

The large difference in the impulsiveness measurement would obviously cloud any attempt to correlate personality traits and biochemical measurements in spinal fluids among those who volunteered.

Hughes et al (1997) point out the obvious fact that smokers who participate in smoking cessation studies are different from smokers in the general population.

Volunteers in survey studies.

A form of volunteer bias can also occur in survey studies. People who volunteer to return a questionnaire are frequently quite different from those who refuse to fill out the survey. In particular, the non-responders tend to be more apathetic. Return rates for surveys vary by the type of survey, but if fewer than half of the subjects returned the survey, any results are of very limited value. Again, look for efforts to minimize non-response and/or efforts to characterize the demographics of non-responders.

Stocks and Grunnell (2000) examined general practitioners who routinely failed to return mail surveys. A follow-up telephone call assessed demographic characteristics of this group. They were older, less likely to have postgraduate qualifications, and less likely to be involved with a teaching practice.

The use of email and the Internet to recruit and/or survey subjects is problematic, because not everyone owns or uses a computer. Etter and Perneger (2001) recruited cigarette smokers both by the Internet and by regular mail. Those subjects recruited by the Internet differed in age, education, degree of smoking, and desire to quit. The authors of this report, however, argue that in spite of these demographic differences, the trends and associations found in the Internet recruited group matched those of the other group. For example, in both groups, light smokers were more likely than heavy smokers to adopt a "taking control" self-change strategy and less likely to adopt a "risk assessment" strategy.

In 1976, Shere Hite published a study on female sexual attitudes that represented the responses of 3,019 surveys. While that sounds impressive, it was a small fraction of the 100,000 surveys that were sent out.

One can speculate on the characteristics of those who failed to respond, but it is a pretty good bet that many of them felt uncomfortable discussing aspects of their sex lives in a survey format. It's obvious that this tendency alone would tend to affect many of the responses in the survey.
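A short calculation shows how little a survey with this response rate can say about the people it was sent to. The sketch below is an assumption-laden illustration: the 3,019 and 100,000 figures come from the text, but the 70% "yes" share among responders is a made-up number, and the worst-case bounds simply suppose that every non-responder would have answered one way or the other.

```python
def response_rate(returned, sent):
    """Fraction of distributed surveys that came back."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return returned / sent


def too_low(rate, threshold=0.5):
    """Flag surveys where fewer than half of the subjects responded
    (the rule of thumb used in this section; the threshold is adjustable)."""
    return rate < threshold


def worst_case_bounds(share_among_responders, rate):
    """Range for the share in the full sample if the non-responders had
    all answered 'no' (lower bound) or all answered 'yes' (upper bound)."""
    lower = share_among_responders * rate
    upper = lower + (1 - rate)
    return lower, upper


hite_rate = response_rate(returned=3_019, sent=100_000)
print(f"Response rate: {hite_rate:.1%}")
print("Below the 50% rule of thumb:", too_low(hite_rate))

# Hypothetical finding: 70% of responders answer 'yes' to some question.
lower, upper = worst_case_bounds(share_among_responders=0.70, rate=hite_rate)
print(f"Share among everyone surveyed could lie between {lower:.1%} and {upper:.1%}")
```

With so few surveys returned, the worst-case range covers almost the whole interval from 0% to 100%, which is another way of saying that the answer depends almost entirely on what you assume about the people who stayed silent.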

What to look for in studies using volunteers.

Examine the incentives and disincentives for participation. Are any incentives or disincentives related to important prognostic factors?

Were the researchers able to characterize various aspects of those who did not volunteer? How similar were the volunteers and non-volunteers?

Do people volunteer themselves into specific treatment groups? If so, we have an observational study.

Some studies involve the use of volunteers who are subsequently randomized into two groups. In this case, some problems will diminish. Comparison between the two groups will be unbiased, but it may be difficult to generalize to a non-volunteer population.

4.2 Who dropped out during the study?

It is inevitable that some patients will drop out during the study. If the number is more than a few, this is a cause for concern. Dropouts often have a different prognosis than those who stay. Ignoring the dropouts will often paint a rosier picture of the outcome. Was there any effort (financial inducement, follow-up reminders) made to minimize dropouts? Were the authors able to characterize the demographics of the dropouts?

Were non-compliant patients excluded? Non-compliance is often associated with poor prognosis. Excluding these patients may also paint a rosier picture of the outcome. Patients should be analyzed in the groups they were randomized to. This is known as "intention to treat" analysis.

Consider a new surgical therapy which is being compared to a standard non-surgical therapy. Some patients randomized to the surgical therapy might die prior to receiving the therapy. This is the most extreme form of non-compliance. These patients should still be analyzed as part of the surgical therapy group. Otherwise the rapidly dying patients will be excluded from the treatment group, but not from the control group, leading to serious bias.
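The surgical example can be made concrete with invented numbers. In the Python sketch below, every count is hypothetical and chosen so that the two therapies are truly equally effective (28 deaths per 100 patients in each arm); the class and method names are likewise my own. The only difference is that 10 of the surgical patients die before they reach the operating room.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    randomized: int           # patients assigned to this arm
    died_before_therapy: int  # deaths before the assigned therapy was received
    died_after_therapy: int   # deaths after the therapy was received

    def intention_to_treat_mortality(self):
        """Everyone randomized stays in the numerator and denominator."""
        deaths = self.died_before_therapy + self.died_after_therapy
        return deaths / self.randomized

    def per_protocol_mortality(self):
        """Patients who died before therapy are dropped entirely."""
        treated = self.randomized - self.died_before_therapy
        return self.died_after_therapy / treated


# Hypothetical counts: both arms lose 28 of 100 patients overall.
# Non-surgical therapy starts at once, so no one dies "before therapy" there.
surgery = Arm(randomized=100, died_before_therapy=10, died_after_therapy=18)
standard = Arm(randomized=100, died_before_therapy=0, died_after_therapy=28)

for name, arm in (("surgery", surgery), ("standard", standard)):
    print(f"{name}: intention to treat {arm.intention_to_treat_mortality():.0%}, "
          f"per protocol {arm.per_protocol_mortality():.0%}")
```

Analyzed by intention to treat, the two arms have identical mortality. Dropping the early deaths makes surgery look better (18 of 90 versus 28 of 100) even though, by construction, it offers no advantage.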

Additional resources

Unjustified exclusion of elderly people from studies submitted to research ethics committee for approval: descriptive study. A. Bayer and W. Tadd. British Medical Journal 2000:321(7267);992-3.

Exclusion of elderly people from clinical research: a descriptive study of published reports. G. Bugeja, A. Kumar and A. K. Banerjee. British Medical Journal 1997:315(7115);1059.

Hold the Lard! The Atkins Diet still doesn't work. Michael Fumento. Accessed on 2002-12-06. A careful analysis of the recent research on the Atkins diet shows that there was a much higher drop out rate in that group, which could partially explain the promising results of this diet. www.reason.com/hod/mf120502.shtml

Participation in Research and Access to Experimental Treatments by HIV-Infected Patients. Allen L. Gifford, William E. Cunningham, Kevin C. Heslin, Ron M. Andersen, Terry Nakazono, Dale K. Lieu, Martin F. Shapiro, Samuel A. Bozzette and the HIV Cost and Services Utilization Study Consortium. N Engl J Med 2002:346(18);1373-1382. Background Although there is concern that minority groups and women are underrepresented in research involving patients with human immunodeficiency virus (HIV) infection, the available data are inconclusive. Methods We used nationally representative data from the HIV Cost and Services Utilization Study to determine the characteristics of the participants and nonparticipants in trials of medications for HIV infection and whether or not patients had access to experimental treatments. A probability sample of 2864 persons, representing all 231,400 adults with known HIV infection who are cared for in the contiguous United States, were interviewed on three occasions between 1996 and 1998. They were asked about participation in clinical research studies of medications and past receipt of experimental medications for HIV. Results We estimate that 14 percent of adults receiving care for HIV infection participated in a medication trial or study; 24 percent had received experimental medications; and 8 percent had tried and failed to obtain experimental treatments. According to multivariate models, non-Hispanic blacks and Hispanics were less likely to be participating in trials than non-Hispanic whites (odds ratio for participation among non-Hispanic blacks, 0.50 [95 percent confidence interval, 0.28 to 0.91]; odds ratio among Hispanics, 0.58 [95 percent confidence interval, 0.37 to 0.93]) and to have received experimental medications (odds ratios, 0.41 [95 percent confidence interval, 0.32 to 0.54] and 0.56 [95 percent confidence interval, 0.41 to 0.78], respectively). 
Patients who were cared for in private health maintenance organizations were less likely to participate in trials than those with fee-for-service insurance (odds ratio, 0.43 [95 percent confidence interval, 0.21 to 0.88]). Women were not underrepresented in research trials and had a similar likelihood of receiving experimental treatments. Conclusions Among patients with HIV infection, participation in research trials and access to experimental treatment is influenced by race or ethnic group and type of health insurance.

The exclusion of the elderly and women from clinical trials in acute myocardial infarction. J. H. Gurwitz, N. F. Col and J. Avorn. Jama 1992:268(11);1417-22. OBJECTIVE--To determine the extent to which the elderly have been excluded from trials of drug therapies used in the treatment of acute myocardial infarction, to identify factors associated with such exclusions, and to explore the relationship between the exclusion of elderly and the representation of women. DATA SOURCES--We conducted a systematic search of the English-language literature from January 1960 through September 1991 to identify all relevant studies of specific pharmacotherapies employed in the treatment of acute myocardial infarction. To accomplish this, we searched MEDLINE, major cardiology textbooks, meta-analyses, reviews, editorials, and the bibliographies of all identified articles. STUDY SELECTION--Only trials in which patients were randomly allocated to receive a specific therapeutic regimen or a placebo or nonplacebo control regimen were included for review. DATA EXTRACTION--Studies were abstracted for year of publication, source of support, performance location, drug therapies to which patients were randomized, use of invasive diagnostic tests or therapeutic procedures, exclusion criteria, size and demographic characteristics of the randomized study population, and principal outcome measures. DATA SYNTHESIS--A total of 214 trials met inclusion criteria, involving 150,920 study subjects. Over 60% of trials excluded persons over the age of 75 years. Studies published after 1980 were more likely to have age-based exclusions compared with studies published before 1980 (adjusted odds ratio, 4.92; 95% confidence interval, 2.33 to 10.54). Trials of thrombolytic therapy involving an invasive procedure were more likely to exclude elderly patients compared with other studies (adjusted odds ratio, 2.45; 95% confidence interval, 1.10 to 5.47). 
Studies with age-based exclusions had a smaller percentage of women compared with those without such exclusions (18% vs 23%; P = .0002), with the mean age of the study population significantly associated with the proportion of women participants (P = .0001, R2 = .29). CONCLUSIONS--Age-based exclusions are frequently used in clinical trials of medications used in the treatment of acute myocardial infarction. Such exclusions limit the ability to generalize study findings to the patient population that experiences the most morbidity and mortality from acute myocardial infarction.

Randomised study of long term outcome after epidural versus non-epidural analgesia during labour. C. J. Howell, T. Dean, L. Lucking, K. Dziedzic, P. W. Jones and R. B. Johanson. Bmj 2002:325(7360);357. OBJECTIVE: To determine whether epidural analgesia during labour is associated with long term backache. DESIGN: Follow up after randomised controlled trial. Analysis by intention to treat. SETTING: Department of obstetrics and gynaecology at one NHS trust. PARTICIPANTS: 369 women: 184 randomised to epidural group (treatment as allocated received by 123) and 185 randomised to non-epidural group (treatment as allocated received by 133). In the follow up study 151 women were from the epidural group and 155 from the non-epidural group. MAIN OUTCOME MEASURES: Self reported low back pain, disability, and limitation of movement assessed through one to one interviews with physiotherapist, questionnaire on back pain and disability, physical measurements of spinal mobility. RESULTS: There were no significant differences between groups in demographic details or other key characteristics. The mean time interval from delivery to interview was 26 months. There were no significant differences in the onset or duration of low back pain, with nearly a third of women in each group reporting pain in the week before interview. There were no differences in self reported measures of disability in activities of daily living and no significant differences in measurements of spinal mobility. CONCLUSIONS: After childbirth there are no differences in the incidence of long term low back pain, disability, or movement restriction between women who receive epidural pain relief and women who receive other forms of pain relief.

Do safety practices differ between responders and non-responders to a safety questionnaire? D. Kendrick, R. Hapgood and P. Marsh. Injury Prevention 2001:7(2);100-3. OBJECTIVE: To compare reported safety practices between responders and non-responders to a safety survey. DESIGN: Cross sectional survey at baseline compared with safety practices reported at subsequent child health surveillance checks. SUBJECTS: Parents of children aged 3-12 months registered with practices participating in a controlled trial of injury prevention in primary care that did, and did not, respond to the baseline survey and who subsequently attended child health surveillance checks. RESULTS: No difference in safety practices was found between responders and non-responders to the survey at the 6-9 month check. Responders were more likely to report owning a stair gate (odds ratio (OR) 2.75, 95% confidence interval (CI) 1.82 to 4.16) and socket covers (OR 2.16, 95% CI 1.53 to 3.04) at the 12-15 month check, and owning socket covers (OR 2.19, 95% CI 1.34 to 3.61) at the 18-24 month check. Responders were more likely to report greater than the median number of safety practices at the 18 month check. CONCLUSIONS: Non-responders to a safety survey appear to be less likely to report owning several items of safety equipment than responders. Further work is needed to confirm these findings. Extrapolating the results of safety surveys to the population as a whole may lead to over estimation of safety equipment possession.

Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. M. S. Lachs, I. Nachamkin, P. H. Edelstein, J. Goldman, A. R. Feinstein and J. S. Schwartz. Ann Intern Med 1992:117(2);135-40. OBJECTIVE: To determine if the leukocyte esterase and bacterial nitrite rapid dipstick test for urinary tract infection (UTI) is susceptible to spectrum bias (when a diagnostic test has different sensitivities or specificities in patients with different clinical manifestations of the disease for which the test is intended). DESIGN: Cross-sectional study. PATIENTS: A total of 366 consecutive adult patients in whom clinicians performed urinalysis to diagnose or exclude UTI. SETTING: An urban emergency department and walk-in clinic. MEASUREMENTS: After the patient encounter, but before dipstick test or culture was done, clinicians recorded the signs and symptoms that were the basis for suspecting UTI and for performing a urinalysis and an estimate of the probability of UTI based on the clinical evaluation. For all patients who received urinalysis, dipstick tests and culture were done in the clinical microbiology laboratory by medical technologists blinded to clinical evaluation. Sensitivity for the dipstick was calculated using a positive result in either leukocyte esterase or bacterial nitrite, or both, as the criterion for a positive dipstick, and greater than 10(5) CFU/mL for a positive culture. RESULTS: In the 107 patients with a high (greater than 50%) prior probability of UTI, who had many characteristic UTI symptoms, the sensitivity of the test was excellent (0.92; 95% CI, 0.82 to 0.98). In the 259 patients with a low (less than or equal to 50%) prior probability of UTI, the sensitivity of the test was poor (0.56; CI, 0.03 to 0.79). 
CONCLUSIONS: The leukocyte esterase and bacterial nitrite dipstick test for UTI is susceptible to spectrum bias, which may be responsible for differences in the test's sensitivity reported in previous studies. As a more general principle, diagnostic tests may have different sensitivities or specificities in different parts of the clinical spectrum of the disease they purport to identify or exclude, but studies evaluating such tests rarely report sensitivity and specificity in subgroups defined by clinical symptoms. When diagnostic tests are evaluated, information about symptoms in the patients recruited for study should be included, and analyses should be done within appropriate clinical subgroups so that clinicians may decide if reported sensitivities and specificities are applicable to their patients.

Comorbidity of chronic diseases in general practice. F. G. Schellevis, J. van der Velden, E. van de Lisdonk, J. T. van Eijk and C. van Weel. J Clin Epidemiol 1993:46(5);469-73. With the increasing number of elderly people in The Netherlands the prevalence of chronic diseases will rise in the next decades. It is recognized in general practice that many older patients suffer from more than one chronic disease (comorbidity). The aim of this study is to describe the extent of comorbidity for the following diseases: hypertension, chronic ischemic heart disease, diabetes mellitus, chronic nonspecific lung disease, osteoarthritis. In a general practice population of 23,534 persons, 1989 patients have been identified with one or more chronic diseases. Only diseases in agreement with diagnostic criteria were included. In persons of 65 and older 23% suffer from one or more of the chronic diseases under study. Within this group 15% suffer from more than one of the chronic diseases. Osteoarthritis and diabetes mellitus are the diseases with the highest rate of comorbidity. Comorbidity restricts the external validity of results from single-disease intervention studies and complicates the organization of care.

Sample size slippages in randomized trials: exclusions and the lost and wayward. K. F. Schulz and D. A. Grimes. Lancet 2002:359;781-785. Proper randomisation means little if investigators cannot include all randomised participants in the primary analysis. Participants might ignore follow-up, leave town, or take aspartame when instructed to take aspirin. Exclusions before randomisation do not bias the treatment comparison, but they can hurt generalisability. Eligibility criteria for a trial should be clear, specific, and applied before randomisation. Readers should assess whether any of the criteria make the trial sample atypical or unrepresentative of the people in which they are interested. In principle, assessment of exclusions after randomisation is simple: none are allowed. For the primary analysis, all participants enrolled should be included and analysed as part of the original group assigned (an intent-to-treat analysis). In reality, however, losses frequently occur. Investigators should, therefore, commit adequate resources to develop and implement procedures to maximise retention of participants. Moreover, researchers should provide clear, explicit information on the progress of all randomised participants through the trial by use of, for instance, a trial profile. Investigators can also do secondary analyses on, for instance, per-protocol or as-treated participants. Such analyses should be described as secondary and non-randomised comparisons. Mishandling of exclusions causes serious methodological difficulties. Unfortunately, some explanations for mishandling exclusions intuitively appeal to readers, disguising the seriousness of the issues. Creative mismanagement of exclusions can undermine trial validity.

Nonresponse bias and early versus all responders in mail and telephone surveys. J. Siemiatycki and S. Campbell. Am J Epidemiol 1984:120(2);p291-301. Mail and telephone survey methods, with or without follow-up by other methods, are cost-effective alternatives to the conventional home interview approach. However, it has long been thought that they are especially susceptible to nonresponse bias. The study addressed this issue in the context of parallel mail and telephone health surveys carried out in Montreal. The mail strategy among 1,555 adults achieved 68.5% response and follow-up by telephone and home interview increased response to 80.9%. Respondents were adequately representative of the entire sample with respect to socioeconomic status, number of adults in household, and ethnic distribution. The 68.5% initial stage respondents were similar to all respondents on the above variables as well as on age, sex, education and reported health status. Odds ratios of smoking and respiratory symptoms hardly differed between initial stage and all respondents. The telephone survey among 1,595 adults achieved 72.7% response and follow-up by mail and personal interview increased response to 88.2%. Comparisons between respondents and the entire sample and between initial stage respondents and all respondents gave similar results to those found in the mail strategy, although there was some change in a symptom-smoking odds ratio from the initial stage respondents to all respondents. In both survey strategies, there was no evidence of substantial nonresponse bias and estimates of morbidity and health care would not have differed much if the fieldwork had stopped at the initial mail or telephone stage.

Ozone Depletion, History and politics. Brien Sparling. Accessed on 2002-11-27. "Ground based measurements of Ozone were first started in 1956, in at Halley Bay, Antarctica. Satellite measurements of ozone started in the early 70's, but the first comprehensive worldwide measurements started in 1978 with the Nimbus-7 satellite. Nimbus-7 carried a TOMS (total ozone mapping spectrometer, and a SBUV(solar backscatter UV meter). The TOMS finally broke on May 7th,1993, but today there are several different satellites measuring concentrations of ozone and other atmosheric gases. Gases in the troposphere and lower stratosphere are sampled by weather balloons or by airplanes such as the ER-2 managed by NASA." www.nas.nasa.gov/About/Education/Ozone/history.html

Applying evidence to the individual patient. S. E. Straus and D. L. Sackett. Ann Oncol 1999:10(1);29-32.

The Effect of School Dropout Rates on Estimates of Adolescent Substance Use among Three Racial/Ethnic Groups. Randall C. Swaim, F Beauvais, EL Chavez and ER Oetting. American Journal of Public Health 1997:87(1);51-55. ABSTRACT: OBJECTIVES: This study examined, across three racial/ethnic groups, how the inclusion of data on drug use of dropouts can alter estimates of adolescent drug use rates. METHODS: Self-report rates of lifetime prevalence and use in the previous 30 days were obtained from Mexican American, White non-Hispanic, and Native American student (n = 738) and dropouts (n = 774). Rates for the age cohort (students and dropouts) were estimated with a weighted correction formula. RESULTS: Rates of use reported by dropouts were 1.2 to 6.4 times higher than those reported by students. Corrected rates resulted in changes in relative rates of use by different ethnic groups. CONCLUSIONS: When only in-school data are available, errors in estimating drug use among groups with high rates of school dropout can be substantial. Correction of student-based data to include drug use of dropouts leads to important changes in estimated levels of drug use and alters estimates of the relative rates of use for racial/ethnic minority groups with high dropout rates.

Physicians' reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. K. M. Taylor, R. G. Margolese and C. L. Soskolne. N Engl J Med 1984:310(21);p1363-7. We studied the reasons surgical principal investigators chose not to enter patients in a large, multicenter trial sponsored by a cooperative group. In 1976 the National Surgical Adjuvant Project for Breast and Bowel Cancers (NSABP) initiated a clinical trial to compare segmental mastectomy and postoperative radiation, or segmental mastectomy alone, with total mastectomy. Because the low rates of accrual were threatening to close the trial prematurely, we mailed a questionnaire to the 94 NSABP principal investigators, asking why they were not entering eligible patients in the trial. A response rate of 97 per cent was achieved. Physicians who did not enter all eligible patients offered the following explanations: (1) concern that the doctor-patient relationship would be affected by a randomized clinical trial (73 per cent), (2) difficulty with informed consent (38 per cent), (3) dislike of open discussions involving uncertainty (22 per cent), (4) perceived conflict between the roles of scientist and clinician (18 per cent), (5) practical difficulties in following procedures (9 per cent), and (6) feelings of personal responsibility if the treatments were found to be unequal (8 per cent). Further investigation into the behavioral aspects of the investigator-patient relationship is particularly pressing, since fear of change in this relationship was the most common reason given for not entering eligible patients in the trial.

Representation of older patients in cancer treatment trials. EL Trimble, CL Carter, D Cain, B Freidlin, RS Ungerleider and MA Friedman. Cancer 1994:74(7);2208-14. ABSTRACT: In 1990, the five leading causes of cancer death in men aged 65 and older were carcinomas of the lung, prostate, colon and rectum, and pancreas, and leukemia. For women in this age group, the five leading causes of cancer death were carcinomas of the lung, breast, colon and rectum, pancreas, and ovary. To determine the representation of the elderly in clinical trials, the 1992 accrual of the National Cancer Institute (NCI)-sponsored Clinical Cooperative Group treatment trials (which included more than 8000 elderly patients) for the aforementioned sites was compared with the 1990 incidence data from the NCI's Surveillance, Epidemiology, and End Results program. Of the male patients enrolled in the trials, an average of 39% were older than 65 (47.3% lung, 79.5% prostate, 47.5% colorectal, 45.6% pancreas, and 9.6% leukemia); whereas 25.9% of all women enrolled in trials were 65 or older (43.6% lung, 17.3% breast, 46.2% colorectal, 59.6% pancreas, and 35.4% ovary). With respect to incidence, older patients generally are underrepresented in cancer treatment trials. With the exception of the data on prostate cancer, each of the comparisons using the Z statistic gave probability values of less than 0.01. The most significant discrepancies between incidence and participation in cancer treatment protocols were noted for leukemia in males and breast cancer in females. 
Possible explanations for these findings include (1) a research focus on aggressive therapy, which may be unacceptably toxic to the elderly; (2) presence of comorbidity in the elderly; (3) fewer trials available specifically aimed at older patients; (4) limited expectations for long term benefits on the part of physicians, relatives, and the patients themselves; and (5) a lack of financial, logistic, and social support for the participation of elderly patients in clinical trials. Recognizing this situation, NCI recently sponsored a number of trials that specifically target the elderly. This paper describes the status of all major Phase II and III clinical trials that recently were closed, still are active, or now are in review that address the clinical care of this important segment of the U.S. population.

Are Subjects in Pharmacological Treatment Trials of Depression Representative of Patients in Routine Clinical Practice? M. Zimmerman, J.I. Mattia and Michael A. Posternak. American Journal of Psychiatry 2002:159(3);469-473. OBJECTIVE: The methods used to evaluate the efficacy of antidepressants differ from treatment for depression in routine clinical practice. The rigorous inclusion/exclusion criteria used to select subjects for participation in efficacy studies potentially limit the generalizability of these trials' results. It is unknown how much impact these criteria have on the representativeness of subjects in efficacy trials. This study estimated the proportion of depressed patients treated in routine clinical practice who would meet standard inclusion/exclusion criteria for an efficacy trial. METHOD: A total of 803 individuals, aged 16-65 years, who were seen at intake at an outpatient practice underwent a thorough diagnostic evaluation, including the administration of semistructured diagnostic interviews; 346 patients had current major depression. Common inclusion/exclusion criteria used in efficacy studies of antidepressants were applied to the depressed patients to determine how many would have qualified for an efficacy trial. RESULTS: Approximately one-sixth of the 346 depressed patients would have been excluded from an efficacy trial because they had a bipolar or psychotic subtype of depression. The presence of a comorbid anxiety or substance use disorder, insufficient severity of depressive symptoms, or current suicidal ideation would have excluded 86.0% (N=252) of the remaining 293 outpatients with nonpsychotic unipolar major depressive disorder from an antidepressant efficacy trial. CONCLUSIONS: Subjects treated in antidepressant trials represent a minority of patients treated for major depression in routine clinical practice.
These results show that antidepressant efficacy trials tend to evaluate a subset of depressed individuals with a specific clinical profile.

What are the characteristics of general practitioners who routinely do not return postal questionnaires: a cross sectional study. Nigel Stocks, David Gunnell. J Epidemiol Community Health 2000; 54:940-941.

Assessing the generalizability of smoking studies. Hughes JR, Giovino GA, Klevens RM, Fiore MC. Addiction 1997; 92:469-472.

Intention-to-treat principle. Victor M. Montori, Gordon H. Guyatt. CMAJ 2001;165(10):1339-41. http://www.cma.ca/cmaj/vol-165/issue-10/1339.asp

A comparison of cigarette smokers recruited through the Internet or by mail. Jean-Francois Etter and Thomas V Perneger. International Journal of Epidemiology 2001; 30:521-525.

Summary - Who was left out?

Exclusion of subjects can make the study biased or less generalizable.

4.1 Who was excluded at the start of the study? Excessively strict entry criteria in a research study can make it difficult to extrapolate to the types of patients that you normally see.

4.2 Who dropped out during the study? A large number of drop-outs during the course of a research study can bias the final conclusions.


How much did things change?

Introduction

It's not enough just to assess statistical significance in a study. You also need to make sure that the difference has a practical impact, that it represents a clinically relevant outcome, and that there was a sufficient number of patients to provide reasonable precision.

When you are looking at how much things changed, ask yourself the following questions:

  1. Did the authors measure the right thing?
  2. Did the authors measure the outcome well?
  3. Was the change clinically significant?
  4. Were there enough subjects?

Case study: Non-steroidal anti-inflammatory drugs

A 1987 study of non-steroidal anti-inflammatory drugs (NSAIDs) showed that patients who took these drugs were 50% more likely to develop upper gastrointestinal (UGI) bleeding. This increase was statistically significant at alpha=.05. UGI bleeding, however, was rare in both groups: only 1 case per thousand person years in the controls and 1.5 in the NSAID group. If you see 100 patients a year, you would have to wait two decades, more or less, in order to see one excess event of bleeding, on average.
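The "two decades" figure is easy to check for yourself. The short Python sketch below redoes the arithmetic from the rates quoted above. The function name is my own invention, and it makes two simplifying assumptions that are not in the original study: every one of your 100 patients is taking an NSAID, and each patient contributes one person year of exposure per calendar year.

```python
def excess_event_wait(control_rate_per_1000, relative_risk, patients_per_year):
    """Return the excess risk per 1000 person years and the expected
    number of years before one excess event is seen.

    Assumes (for illustration only) that every patient is exposed and
    contributes one person year of follow-up per calendar year.
    """
    exposed_rate_per_1000 = control_rate_per_1000 * relative_risk
    excess_per_1000 = exposed_rate_per_1000 - control_rate_per_1000
    # Convert the excess rate to expected excess events per calendar year.
    excess_events_per_year = excess_per_1000 / 1000 * patients_per_year
    years_to_one_excess_event = 1 / excess_events_per_year
    return excess_per_1000, years_to_one_excess_event


# Rates quoted in the case study: 1 per 1000 person years in controls,
# a 50% relative increase with NSAIDs, and a practice of 100 patients.
excess, years = excess_event_wait(1.0, 1.5, 100)
print(f"Excess risk: {excess:.1f} per 1000 person years")
print(f"Expected wait for one excess bleed: about {years:.0f} years")
```

Notice that the relative risk (50% more likely) sounds alarming, while the absolute difference (half a case per thousand person years) is what tells you how often you would actually see the problem in practice.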

In this article, the authors were up front about the very small increase in risk. Most authors, however, are so relieved to achieve statistical significance that they forget to consider whether the size of the difference will improve clinical practice.

This is summarized well in the following Gertrude Stein quote: "For a difference to be a difference, it has to make a difference."

Did the authors measure the right outcome?

There is a tendency to focus on intermediate measures that are easy to assess, but which may or may not be predictive of more important endpoints. Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks. If an intermediate endpoint is used, ask yourself whether there is an adequate link between this endpoint and something that is relevant to your patients.

Consider, for example, a study (Leeson et al 2001) that showed an association between duration of breast feeding and brachial artery distensibility at 20 to 28 years of age. Distensibility reflects arterial stiffness, and could be considered a surrogate marker for cardiovascular disease in mid and later life. Such a link is tenuous, and the authors themselves, as well as an accompanying editorial (Booth 2001), admit that no cause and effect relationship between breast feeding and heart disease has been shown.

Typically patients are interested in only three things: morbidity, mortality, and quality of life. They don't care about the concentration of homocysteine in their blood, or what their CD4 cell count is. They want answers to more fundamental questions like "will I die?" or "will I be able to walk up a flight of stairs unassisted?"

Unvalidated measures

Jadad and Gagliardi (1998) criticize instruments used to rate web sites for the quality of health information. There were 47 such instruments but only 14 discussed how they were created. None of them included measures of validity, which caused these authors to conclude that

"Many incompletely developed instruments to evaluate health information exist on the Internet. It is unclear, however, whether they should exist in the first place, whether they measure what they claim to measure, or whether they lead to more good than harm."

Validity is a loaded word that means different things to different people. A general consensus, though, is that a measure is valid to the extent that it measures the thing that it claims to measure and does not mix in things that are unrelated. There are several ways to measure validity, but most of these involve comparison to an external standard.

Short term measures

As noted in the introduction, a good measure of the effectiveness of an intervention for schizophrenia should be taken at least six months after the start of therapy. Unfortunately, the typical study lasted 6 weeks or less.

This is a problem for many studies where budgetary limitations force the researchers to focus on short term outcomes. The problem with this is that it is usually easier to get a short term change, especially with interventions that involve behavioral changes (e.g., weight loss through the use of diet and exercise). It is the long term change, however, which is relevant in most cases.

Other issues

Be careful that you don’t focus solely on the outcomes mentioned in the abstract. There is a tendency to report in the abstract only the outcome measures that were statistically significant, rather than the outcome measures of most interest to health care professionals.

Also, always consider whether the researchers adequately examined side effects.

Did the authors measure the outcome well?

Research is messy and difficult, so it is not always possible to obtain careful and precise measurements. To what extent are the measurements imprecise and subjective?

Measurement error

Measurement error is simply the inability to measure an important variable accurately. Measurement error in the outcome variable does not ordinarily cause bias, but measurement error in factors that can predict the outcome is of serious concern.
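If you want to see why errors in a predictor are more troublesome than errors in the outcome, a small simulation helps. The Python sketch below is purely illustrative: the data are simulated, not taken from any study, and it assumes a straight line relationship with a true slope of 2 and measurement errors that are random and unrelated to anything else. All of the variable names are my own.

```python
import random


def slope(xs, ys):
    """Ordinary least squares slope from regressing ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(2003)  # fixed seed so the run is repeatable
n = 20000
true_slope = 2.0

# True predictor and true outcome (simulated, hypothetical data).
x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0, 1) for xi in x]

# The same data, but with random error added to one variable at a time.
y_noisy = [yi + rng.gauss(0, 1) for yi in y]  # outcome measured badly
x_noisy = [xi + rng.gauss(0, 1) for xi in x]  # predictor measured badly

slope_clean = slope(x, y)
slope_noisy_outcome = slope(x, y_noisy)
slope_noisy_predictor = slope(x_noisy, y)

print(f"No measurement error:       {slope_clean:.2f}")
print(f"Error in the outcome only:  {slope_noisy_outcome:.2f}")
print(f"Error in the predictor only: {slope_noisy_predictor:.2f}")
```

Under these assumptions, noise in the outcome leaves the estimated slope close to 2 (it only makes the estimate less precise), while the same amount of noise in the predictor pulls the slope toward zero, to roughly half its true value in this setup. In other words, a poorly measured risk factor tends to look less important than it really is.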

There are several ways to assess dietary fat intake. The most accurate (and also the most costly) way is through the use of prospectively recorded food diaries.

Sometimes the cost limitations or the retrospective nature of a research study will require a less accurate assessment of dietary fat, such as through an interview. Shapiro (1997) points out that estimation of dietary fat using interviews tends to correlate poorly with estimation using prospective diaries. This would cast doubt, for example, on retrospective studies that tried to associate dietary fat intake with the risk of breast cancer.

Retrospective data

Retrospective data are data that are collected by looking backwards in time. We obtain these data by asking subjects to recall events that occur