How to Read a Medical Journal Article (August 1999 version).
This page represents my presentation as it appeared more or less in August 1999. I have made a few formatting changes, especially to the bibliography, to maintain consistency with the other pages on my web site.
"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.
The medical journals are filled with research on new medical therapies. What should you look for in this research? How do you gauge the strength of evidence? When should you change your medical practices?
0.1 What you should look for.
The answers lie not in how the research data was analyzed but in how it was collected. Simple factors like how the research subjects were recruited determine the strength of evidence in a research paper. When you are reading a journal article, just ask yourself five simple questions: Who did the choosing? Was there a plan? Who knew what when? Who was left out? How much did things change?
0.2 Learning Objectives.
In this presentation, you will learn how to:
assess the strength of evidence in a journal article.
identify potential problems with observational studies.
explain why "blinding" is important.
0.3 Important Disclaimer.
This presentation will review several published journal articles. The intent is to gauge how much evidence each article presents in favor of the efficacy of a new therapy. Some articles will provide a greater level of evidence and some will provide a lesser level of evidence. But articles which provide lesser levels of evidence are still valuable and important.
Nothing stated in this presentation about a particular journal article should be construed as a statement about the quality of that article. The very nature of research requires a series of steps from very preliminary and speculative levels of evidence to more definitive levels of evidence.
0.4 Outline.
Here are five questions you should ask yourself when reading a journal article.
1. Who did the choosing? (21 pages)
2. Was there a plan? (12 pages)
3. Who knew what when? (10 pages)
4. Who was left out? (11 pages)
5. How much did things change? (8 pages)
There are three additional sections in this presentation.
Case studies (10 pages)
Special guidelines for meta-analysis. (7 pages)
A resource list. (3 pages)
Furthermore, when I point out limitations in the evidence presented in a journal article, more often than not, the authors of the article delineate these same limitations in their discussion. But in general, you need to be aware of these limitations because not every journal author is going to be open and honest about the limitations of their research.
1. Who did the choosing?
How the subjects for the research study are assigned to the new and the standard therapies plays a critical role in assessing the quality of a research paper.
1.1 Was there a good comparison group?
If you want to judge how effective a new therapy is, you need a comparison group. The comparison group would be a group of subjects who receive either the standard therapy or, in some cases, no therapy (e.g., a placebo comparison).
The ideal comparison group should be similar in all respects to the new therapy group except for the therapy itself. For example, the two groups should have a similar range of ages and weights and should be composed of roughly the same proportions in gender and race/ethnicity. The groups should be evaluated concurrently.
1.1.1 Covariate imbalance.
Sometimes the groups are dissimilar on some important characteristics. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.
1.1.2 When there is no comparison group.
A prominent surgeon came to give a special lecture at the School of Medicine. He expounded about the great advance that he had made in a specific surgical procedure. At the end of the lecture he drew thunderous applause from the audience.
At first it seemed like there would be no questions, but then a young student in the front row raised her hand. "Did you use any controls?" she asked.
The surgeon seemed to be offended by this question. "Controls?" he asked. "Are you suggesting that I should have denied my surgical advance to half of my patients?"
The rest of the audience grew very quiet. But the young woman was not intimidated. "Yes," she said, "that's exactly what I meant."
The surgeon grew even angrier at this, slammed his fist on the podium and shouted "Why that would have condemned half of my patients to certain death!"
There was silence for a few seconds. Then the entire auditorium burst out in laughter when the young woman asked "Which half?"
1.1.4 Problems with a historical controls study.
Sometimes researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy. This type of study is sometimes called a historical controls study. The very nature of a historical controls study guarantees that there will be a major discrepancy in timing. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially?
1.1.5 The crossover design.
The crossover design represents a special situation where there is not a separate comparison group.
In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA).
Since the same subject receives both treatments, there is no possibility of covariate imbalance.
1.1.6 What to look out for in a crossover design.
When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?
Learning and fatigue effects are also potential problems in a crossover design.
1.1.7 Failure to randomize the treatment order.
Special problems arise when each subject receives the standard therapy first and then the new therapy (or vice versa). Many factors other than the change in therapy can cause a shift in the health of patients over time. Unless the researchers can point to other evidence that shows stability of the condition over time, information from this type of study is worthless.
Sometimes difficult circumstances (such as a general failure to respond to the standard therapy) will force the use of this type of design. Further discussion of lack of randomization or other issues with crossover designs can be found in Louis (1992).
1.2 Did the authors create the groups?
If the authors of the study decided who would get the new therapy and who would get the standard therapy, we have an experimental design. If the patient did the choosing, if the patient’s doctor did the choosing, or if the groups were intact prior to the start of the research, then we have an observational design.
The distinction between experimental and observational designs is critical. The greater control that is available in an experimental design generally leads to better quality results.
1.2.1 An example of an experimental design.
In Adkinson (1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo."
Since the researchers generated the sequence of random assignment, this is an experimental design.
1.2.2 A second example of an experimental design.
In Bullock (1989), "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)."
Since the researchers controlled the nature of the acupuncture, this is an experimental design.
1.2.3 An example of an observational design.
In Cardo (1997), 33 health care workers who became seropositive to HIV after percutaneous exposure to HIV-infected blood were compared to 665 health care workers with similar exposure who did not become seropositive.
Since the researchers did not control who became seropositive, this is an observational study.
1.2.4 A second example of an observational design.
In Hu (1997), 80,082 women between the ages of 34 and 59 years were followed for 14 years to look for instances of non-fatal myocardial infarction or death from coronary heart disease. These women were divided into low, intermediate, and high groups on the basis of their consumption of dietary fat. Since the women themselves controlled their diets, rather than having a diet imposed on them by the researchers, this represents an observational design.
1.2.5 Why an experimental design is better.
Information from an experimental design is generally considered more authoritative than information from an observational design. For example, randomization is possible when the authors control group membership. Randomization provides some level of assurance that the two groups are comparable in every way except for the therapy received.
Also, observational studies are more likely to require difficult statistical adjustments.
Nevertheless, much can be learned from observational designs. Almost everything we know about the risks of cigarette smoking came from observational designs (Gail 1996).
1.2.6 A specific critique of observational designs.
An editorial in the Journal of the American Medical Association (Sherwin 1997) tries to make sense of recent studies of the effect of dietary fat on obesity, heart disease, and stroke. After reviewing the results of numerous studies, the editorial comments:
"At present, most of this evidence in humans is observational and, consequently, an imperfect basis for causal inference. Large scale experimental studies that would provide more compelling data (such as the Women's Health Initiative) cost hundreds of millions of dollars and take decades to complete. Each study can only address the effects of a single nutritional change. Thus, it is still necessary to base advice to patients on dietary information that is less than certain and complete."
1.2.7 Limitations of an experimental design.
Experimental designs rely on the use of volunteers in a narrowly defined research setting. Such situations may not be reflective of how a typical patient behaves in a typical health care setting (Sackett 1997). In this particular aspect, a carefully planned observational design may provide a more relevant comparison.
Another problem with experimental designs is the limit to their size and scope. Rare but important side effects may be difficult to detect in an experimental design. Also, side effects that take a long time to manifest themselves may not be detected in an experimental design with a limited follow-up time. An observational approach like post marketing surveillance is more likely to be successful in these situations.
1.2.8 An additional limitation of experimental designs.
Experimental designs that study the potential harm caused by environmental exposures (such as lead based paint, second hand tobacco smoke, or electro-magnetic fields) are impossible to conduct because of logistical and ethical issues.
These exceptions, however, do not diminish the value of experimental designs. In situations where observational and experimental studies can both be conducted, most researchers will give greater weight to the evidence in an experimental study.
1.3 Was the assignment randomized?
The ideal for any research study is to ensure that there are no differences between the standard therapy group and the new therapy group other than the therapy itself. If the groups differ on other factors such as the age of the patients or the racial composition, then these other factors could change the outcome of the study, either masking a difference that actually exists, or creating an artefactual difference between the two groups that is unrelated to the differences in the two therapies.
1.3.1 What does randomization entail?
Randomization requires the use of a random device, such as a coin flip or a table of random numbers. Systematic allocation (e.g., alternating between treatments) is not the same as randomization.
The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each entry in the schedule, and then sort the schedule by the random number.
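The three steps just described can be sketched in a few lines of Python. This is only an illustration: the function name, the group labels, and the seed are my own choices, and the seed is there only so that the same schedule can be reproduced later.

```python
import random

def randomize_schedule(treatments, n_per_treatment, seed=None):
    """Lay out a systematic schedule, attach a random number to each
    slot, then sort the slots by that random number."""
    rng = random.Random(seed)  # seed is an illustrative convenience
    # Step 1: systematic (non-random) layout, e.g. A, A, A, B, B, B
    schedule = [t for t in treatments for _ in range(n_per_treatment)]
    # Step 2: pair every slot with its own random number
    keyed = [(rng.random(), t) for t in schedule]
    # Step 3: sort by the random number, then discard it
    keyed.sort(key=lambda pair: pair[0])
    return [t for _, t in keyed]

schedule = randomize_schedule(["standard", "new"], n_per_treatment=5, seed=1999)
print(schedule)
```

Each therapy still appears exactly five times, so the group sizes stay balanced; only the order in which subjects receive their assignments is random.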
1.3.2 Why is randomization important?
Randomization ensures that both measurable and unmeasurable factors tend to balance out across the standard and the new therapy, making for a fair comparison. It also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.
Randomization is not always possible or practical. When this is the case, we have to rely on observational data to draw any conclusions. But when randomization is possible, its use makes a research study more authoritative.
1.3.3 An example when randomization was needed.
Although I do not have a bibliographic citation for this example, I heard an amusing story about a study of water toxicants on fish. This research required that the fish be separated into five tanks, each of which would get a different level of the toxicant.
The researchers caught one fifth of the fish and put them in one tank, then caught an additional one fifth and put them in a second tank, and so forth. The outcome measurements were related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes in the first tank filled and the best outcomes in the last tank filled.
1.3.4 Why the tank that was filled first had the poorest outcome.
What happened was that the slow-moving, easy-to-catch fish were all allocated to the first tank. The fast-moving, hard-to-catch fish ended up in the last tank. It turned out that the sicker fish were also the slow-moving, easy-to-catch fish; the healthiest fish swam faster and avoided early capture.
A better way to design this experiment would have been to allocate the fish to tanks randomly. This would ensure that each tank got a fair share of the fast-and-healthy and the slow-and-sick fish.
1.9 Summary - Who did the choosing?
Assignment of subjects to the new and the standard therapy plays a critical role in the quality of the research.
1.1 Was there a good comparison group? The evaluation of the new and the standard therapy should occur concurrently. If the therapies are applied in sequence to the same group of subjects, beware of learning effects, fatigue effects, and carryover effects.
1.2 Did the authors create the groups? If the assignment of new versus standard therapy was not under the complete control of the authors, the study is observational, which is generally considered to be less authoritative.
1.3 Was the assignment randomized? Randomization ensures balance between the two therapy groups with respect to both measurable and unmeasurable factors.
2. Was there a plan?
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
2.1 Were there enough subjects?
Every research study, especially negative studies, should justify the sample size chosen. It is unethical to perform research on humans or animals without first demonstrating that the sample size you have chosen is appropriate.
Justification of sample size is particularly important for a negative study (one where no difference between the standard and new therapies was found) and in studies assessing the equivalence of two therapies.
2.1.1 How can you tell if the sample size is too small?
Ideally, the authors should provide justification of the sample size in the paper itself. The justification is considered better if it is made a priori (prior to the start of the data collection). If no justification of sample size (e.g., power calculations) is given, examine the width of the confidence intervals. Very wide intervals indicate an inadequate sample size.
2.1.2 There are many examples of studies with inadequate sample sizes.
A revealing study of inadequate sample size appears in Freiman (1992). In a series of 71 publications appearing between 1960 and 1977, the outcome was either percent mortality, percent complications, or a similar outcome that could be measured as a percentage. The authors examined power, the ability of the study to detect either a moderate improvement (25% relative reduction in the outcome) or a large improvement (50% relative reduction in the outcome). For example, if a study showed a 40% mortality in the controls, then a 30% mortality rate in the treated group would be considered a moderate improvement and a 20% mortality rate would be considered a large improvement.
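To get a feel for how power depends on sample size in this example, here is a rough sketch using the usual normal approximation for comparing two proportions. It is a textbook shortcut, not necessarily the calculation Freiman used, and the sample sizes in the loop are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_treated, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions
    (normal approximation, unpooled standard error)."""
    z = NormalDist()
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treated * (1 - p_treated) / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(p_control - p_treated) / se - z_crit)

p_control = 0.40
moderate = p_control * (1 - 0.25)  # 25% relative reduction, about 0.30
large = p_control * (1 - 0.50)     # 50% relative reduction, about 0.20

for n in (25, 50, 100, 400):  # hypothetical numbers of subjects per group
    print(n,
          round(approx_power(p_control, moderate, n), 2),
          round(approx_power(p_control, large, n), 2))
```

The chance of missing an improvement is one minus the power, so small values in the printed columns correspond to studies that are likely to miss a real effect.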
2.1.3 The results of the Freiman study.
The results of the Freiman study were very disappointing.
Of the 71 papers, 57 had greater than a 50% chance of missing a moderate improvement and 31 had a 50% or greater chance of missing a large improvement.
One wonders why anyone would undertake a study when there is such a high probability of failure. You should never initiate a study unless you know that the chance of missing a reasonable improvement is less than 20%.
2.1.4 Special issues in a study of equivalency.
Some studies attempt to show not that a new therapy is superior to the standard therapy, but that it is equivalent. Showing equivalence requires a very careful assessment of sample size.
An example of an equivalence study is when a drug company tests a generic drug and wishes to show equivalence with the (presumably more expensive) brand name drug.
2.1.5 Special issues in a study of equivalency.
If we applied the traditional testing approach, the company would have a strong disincentive to design the study with an adequate sample size. A small sample size is more likely to show equivalency under the traditional testing framework.
There are several modifications to the traditional testing framework for equivalency studies. The simplest approach uses a confidence interval for the ratio of the outcome under the new therapy to the outcome under the standard therapy. If both limits of the confidence interval are reasonably close to 1 (e.g., no less than 0.8 and no more than 1.25) then the two therapies are considered equivalent.
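A small sketch may help show why the confidence interval approach removes the incentive to use a small sample. I assume here that the outcome is a proportion and use a common large-sample interval for the ratio of two proportions, computed on the log scale; the counts are hypothetical and the function names are my own.

```python
from math import exp, sqrt
from statistics import NormalDist

def ratio_confidence_interval(events_new, n_new, events_std, n_std, level=0.95):
    """Large-sample confidence interval for the ratio of two proportions
    (new / standard), computed on the log scale."""
    ratio = (events_new / n_new) / (events_std / n_std)
    se_log = sqrt(1 / events_new - 1 / n_new + 1 / events_std - 1 / n_std)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ratio * exp(-z_crit * se_log), ratio * exp(z_crit * se_log)

def is_equivalent(lower, upper, low_limit=0.8, high_limit=1.25):
    """Both confidence limits must fall inside the equivalence range."""
    return low_limit <= lower and upper <= high_limit

# Hypothetical counts: identical observed success rates (70%) in both arms,
# first with 20 subjects per group, then with 1000 per group.
small = ratio_confidence_interval(14, 20, 14, 20)
large = ratio_confidence_interval(700, 1000, 700, 1000)
print(small, is_equivalent(*small))
print(large, is_equivalent(*large))
```

With the same observed rates, the small study produces an interval too wide to fit inside 0.8 to 1.25, so it cannot claim equivalence; only the larger study can.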
2.2 Did the research have a narrow focus?
A good research study has limited objectives that are specified in advance. Failure to limit the scope of a study leads to problems with multiple testing.
When a large number of comparisons are being made, the study is considered a fishing expedition. There is a saying in statistics circles: "If you torture your data long enough, it will confess to something."
2.2.1 When is multiple testing likely to occur?
Multiple testing often occurs when a researcher examines a large number of subgroups or a large number of endpoints (Howel 1994). Multiple testing problems also occur when a study examines multiple side effects.
When multiple tests are done simultaneously within a paper, there is an increase in the overall Type I error. If 100 tests were performed at alpha=.05, you would expect that 5 of those tests would be significant, even if there was nothing at all going on. There are statistical adjustments for multiple comparisons, but these are controversial. Significant results from a large number of unplanned comparisons are useful mostly just for setting future research priorities.
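The arithmetic behind this is simple enough to sketch. The second function assumes the tests are independent, which is a simplification that real studies rarely satisfy exactly.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant results when nothing is going on."""
    return n_tests * alpha

def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    print(n_tests,
          expected_false_positives(n_tests),
          round(chance_of_any_false_positive(n_tests), 3))
```

Even with only 20 independent tests, the chance of at least one spurious finding is well over one half.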
2.2.2 An example of multiple testing.
Studies of the effects of diet on health often have difficulties with multiple endpoints. An example is a study of the effect of cured and broiled meat consumption on childhood cancer (Sarasua 1994).
This study examined two types of cancer (acute lymphocytic leukemia and brain tumor). The authors examined five types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods). Finally, the authors looked at food consumption both of the child and of the mother during pregnancy.
2.2.3 An example of multiple testing.
While one direct comparison was statistically significant, it was selected from a total of 20 direct comparisons. Since the authors allowed for the traditional one-in-twenty chance of a Type I error, the finding of exactly one significant result can hardly be considered earth shattering.
The authors also examined the interacting effect of diet and vitamin consumption. Here the evidence was a bit more convincing. The authors examined 12 interactions, and six provided significant indications that infants with meat consumption and no vitamin supplements were at higher risk of cancer.
2.2.4 Optimal cut points and the problem with multiple comparisons.
Researchers will often simplify analysis of a continuous outcome measure by dividing that measure into two or more distinct groups on the basis of cut points. For example, a researcher might categorize his/her subjects as high or low blood pressure when they are above or below a certain value.
An abuse of this approach, called the minimum p-value approach, was noted by Altman (1994). Researchers would examine a variety of cut points and select the one that yielded the most favorable statistics.
2.2.5 Optimal cut points and the problem with multiple comparisons.
For example, some researchers have chosen the cut point from among a large number of possible cut points so as to make the difference in survival times between those patients above the cut point and those patients below the cut point as large as possible.
By examining multiple cut points, the chance of drawing a false conclusion (Type I error) is inflated from the traditional 5% value to a value as large as 40%.
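A small simulation can illustrate the inflation. This is a sketch under my own assumptions, not Altman's calculation: the marker and the outcome are generated independently (so any "significant" split is a false positive), the outcome is a simple normal variable rather than a survival time, and the groups are compared with a large-sample z test.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def min_p_over_cut_points(marker, outcome, cut_fractions):
    """Smallest two-sided p-value over all candidate cut points."""
    order = sorted(range(len(marker)), key=lambda i: marker[i])
    ranked_outcome = [outcome[i] for i in order]
    norm = NormalDist()
    best = 1.0
    for frac in cut_fractions:
        k = int(frac * len(ranked_outcome))
        low, high = ranked_outcome[:k], ranked_outcome[k:]
        se = sqrt(variance(low) / len(low) + variance(high) / len(high))
        z = abs(mean(low) - mean(high)) / se
        best = min(best, 2 * (1 - norm.cdf(z)))
    return best

def false_positive_rate(n_sims, n_subjects, cut_fractions, seed=0):
    """Fraction of null data sets where the best cut point gives p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        marker = [rng.gauss(0, 1) for _ in range(n_subjects)]
        outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]  # unrelated to marker
        if min_p_over_cut_points(marker, outcome, cut_fractions) < 0.05:
            hits += 1
    return hits / n_sims

one_cut = false_positive_rate(500, 100, [0.5])                        # median split only
many_cuts = false_positive_rate(500, 100, [i / 20 for i in range(2, 19)])  # 10% to 90%
print(one_cut, many_cuts)
```

The single pre-specified cut point should give a false positive rate near the nominal 5%, while picking the best of many cut points gives a rate several times larger. The exact figure depends on how many cut points are tried and how the test is done.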
2.2.6 The proper way to select cut points.
There are several objective ways to select a cut point. Perhaps the best way is to select the cut point prior to looking at the data. This would involve the use of medical judgment.
After the data has been collected, there are some neutral ways of selecting a cut point. The simplest is a median split. If you wanted to create a median split for blood pressure, you would combine the blood pressure data from both groups, and select a value so that half of the blood pressures are larger and half are smaller.
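As a sketch of the median split, the key point is to pool both groups before finding the median, so that the cut point does not depend on which therapy a subject received. The blood pressure values below are made up, and sending ties to the "low" group is an arbitrary choice of mine.

```python
from statistics import median

def median_split(group_a, group_b):
    """Return the pooled median and high/low labels for each group."""
    cut = median(list(group_a) + list(group_b))  # pool both groups first

    def label(values):
        # Values exactly at the cut go to "low" (an arbitrary tie rule).
        return ["high" if v > cut else "low" for v in values]

    return cut, label(group_a), label(group_b)

standard = [118, 126, 131, 140, 152]  # hypothetical systolic pressures
new = [110, 122, 135, 144, 160]
cut, standard_labels, new_labels = median_split(standard, new)
print(cut, standard_labels, new_labels)
```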
2.2.7 Another abuse of cut points.
Researchers should avoid the implicit or explicit use of more than one cut point. For example, in Sarasua (1994), the authors looked at the relationship between hamburger consumption in children and leukemia. In the table, they compare one or more hamburgers consumed per week to less than one per week. Although the estimated risk for leukemia is large, it does not achieve statistical significance. In the text, however, the authors point out that children who ate two or more hamburgers per week had a statistically significant higher risk of leukemia compared to children who ate one or fewer per week.
2.3 Did the authors deviate from the plan?
Not all research is predictable, so deviations from a pre-designed plan are sometimes necessary. Nevertheless, be cautious about any major deviation from the original research protocol. Some examples of deviations from the plan include:
Investigating end-points other than those originally specified.
Developing new exclusion criteria after the study has started.
You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.
2.3.1 An example of a deviation from the research plan.
An interesting deviation from the research plan occurs in a randomized, double-blind, controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal skin carcinoma and squamous skin carcinoma as the primary end points. The researchers also looked for signs of selenium toxicity.
In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol.
Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found.
Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance.
There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia.
Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what impact the early termination of the study had on the statistics. Early termination of a study can cause serious biases, unless specific rules for early termination are established at the start of the study.
2.4 Did the authors discard outliers?
You should be skeptical of any study that removes outliers. Inappropriate removal of outliers can seriously bias the study results.
Sometimes the outliers are more interesting than the bulk of the data themselves. You may gain more insight by trying to uncover the cause of an outlying observation than you would by examining the relatively small effects that occur with the rest of the data.
2.4.1 The best ways to treat outliers.
It is generally a bad idea to remove data points on the basis of their data values alone. If an investigation of an outlier leads to a discovery of a typing error or the inclusion of a subject who did not meet the pre-specified inclusion criteria, then correction or removal of the outlier is appropriate.
If there is no such justification, then the best solution is to leave the outlier alone. Another alternative is reporting data analysis results both with and without the outlier.
2.4.2 An example of inappropriate outlier deletion.
The NASA web site has an interesting example of outlier deletion. Researchers in the 1980s first published information about the hole in the ozone layer above Antarctica. These researchers were nervous because the results from the British Antarctic Survey did not match results from earlier years taken by an American satellite. The authors discovered, however, that the American satellite had a computer filter built in that automatically removed any large sudden changes in ozone concentration, which it treated as instrument errors. When this filter was removed, the authors were able to trace the development of the ozone hole all the way back to 1976.
Further details about the history of the ozone hole can be found at the NASA web site.
2.9 Summary - Was there a plan?
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
2.1 Were there enough subjects? Did the authors examine power prior to the start of the study? If not, are the confidence intervals sufficiently narrow?
2.2 Did the research have a narrow focus? A large number of comparisons limits the weight that you can place on any single conclusion. Results from a limited number of planned comparisons are considered more authoritative.
2.3 Did the authors deviate from the plan? While minor deviations are expected, be cautious about major deviations from the research plan, such as developing new exclusion criteria during the course of the study.
2.4 Did the authors discard outliers? Removing outliers without a sound scientific reason is dangerous. Sometimes the outliers provide evidence that is more interesting than the rest of the data.
3. Who knew what when?
Knowledge of group membership, either before or during the data collection, can bias the study.
3.1 During the study, did the patients know which group they were in?
In an experimental study, it is desirable (but not always possible) to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as "blinding." Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study.
Unfortunately, there are many situations where blinding is impossible. For example, if you are comparing oral versus rectal administration of a drug, that's pretty hard to conceal from the patient. In general, observational studies cannot be blinded, because the patient and/or their doctor selects the treatment group.
Unblinded studies are still useful, but they are considered less authoritative than blinded studies.
3.1.1 The placebo effect.
Positive effects of a treatment are sometimes due to a placebo effect. The placebo effect is a product of "belief, expectancy, cognitive reinterpretation, and diversion of attention" that can lead to psychological and sometimes physiological improvements in situations where the treatment is known to have no effect, such as sugar pills (Beyerstein 1997).
3.1.2 When is the placebo effect especially strong?
Johnson (1997) lists three specific situations where the placebo effect is of particular concern: when enthusiasm by the patient or the doctor for the new procedure is strong, when outcomes are based on the patient's self-assessment (e.g. quality of life studies), and when the treatment is primarily for symptoms. The placebo effect is less critical for objective outcomes like survival.
3.1.3 Blinding in surgical trials.
Surgical procedures are often difficult to completely blind. Nevertheless, Johnson (1997) suggests some partial steps at blinding that prevent some of the biases from creeping in.
If two surgical procedures use different types of incisions, identical blood or iodine stained opaque dressings could be used to keep the patients unaware of which operation was performed.
Also, although the surgeon cannot be blinded to the difference in surgery, those who evaluate the health of the patient after surgery could be kept unaware of the particular operation, so as to ensure that their evaluation of the patient is unbiased.
3.1.4 Partial blinding in an observational study.
As noted earlier, it is impossible to completely blind an observational study. Gail (1996), however, describes an observational study where some level of blinding was achieved.
In a study of the relationship between smoking and cancer, the people asking questions about smoking and other risk factors were unaware of whether they were interviewing lung cancer patients or controls. Thus, the interviewers could not subconsciously probe harder for smoking information among the lung cancer patients.
3.1.5 The problem with studies without blinding.
Two researchers have examined studies with and without blinding. These authors found that studies without blinding show an average bias of 11-17% (Schulz 1996; Colditz 1989). In other words, when an unblinded study was compared to a blinded study, the former study tended to estimate a treatment effect that was (on average) 11% to 17% higher than the latter.
3.1.6 Additional evidence about blinding.
Additional evidence of this problem appears in a meta-analysis of the effect of intermittent sunlight exposure and melanoma (Nelemans 1995). When nine studies without blinding were combined, they showed an odds ratio of 1.84, which was statistically significant (95% confidence interval 1.52 to 2.25). When the seven studies with blinding were combined, they showed a much smaller odds ratio (1.17, 95% confidence interval 0.98 to 1.39), which was not statistically significant. This is further evidence that unblinded studies are more likely to show statistical significance than blinded studies.
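As a rough illustration of how these intervals are read, an odds ratio is conventionally called statistically significant at the 5% level when its 95% confidence interval excludes 1.0 (no effect). The short Python sketch below applies that rule to the two pooled results quoted above. The function name is made up for this example.

```python
def excludes_no_effect(ci_low, ci_high, null_value=1.0):
    """Return True when the confidence interval does not contain the null value."""
    return not (ci_low <= null_value <= ci_high)

# Pooled odds ratios and 95% confidence intervals quoted from Nelemans (1995).
pooled = {
    "unblinded (9 studies)": (1.84, 1.52, 2.25),
    "blinded (7 studies)": (1.17, 0.98, 1.39),
}

for label, (odds_ratio, low, high) in pooled.items():
    verdict = "significant" if excludes_no_effect(low, high) else "not significant"
    print(f"{label}: OR {odds_ratio} ({low} to {high}) -> {verdict}")
```

The blinded interval only just includes 1.0, which is a reminder that the significant versus not significant label hides how close the call can be.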
3.1.7 Problems with keeping a treatment blinded.
Even though the placebo may look the same, sometimes the doctor may infer which group a patient belongs to, perhaps through noting a characteristic set of side effects. In an anonymous survey, more than half of the doctors participating in research studies admitted to breaking a blinded allocation (Schulz 1996). If you are worried about this, ask the doctors to try to identify which treatment group they believe each patient belonged to. If the percentage of correct guesses is significantly larger than 50%, then the allocation scheme was not sufficiently blinded.
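One way to make the "significantly larger than 50%" check concrete is an exact one-sided binomial test. The sketch below is only an illustration: the function name and the counts (70 correct guesses out of 100) are invented for the example and do not come from any of the cited studies.

```python
from math import comb

def p_value_better_than_chance(correct, total, chance=0.5):
    """One-sided exact binomial p-value: P(X >= correct) if guesses are random."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical example: doctors guessed the group correctly for 70 of 100 patients.
p = p_value_better_than_chance(70, 100)
print(f"p-value = {p:.6f}")  # a small p-value suggests the blinding was compromised
```

If the p-value falls below your chosen threshold (0.05 is a common choice), the doctors are guessing better than chance and the blinding probably leaked.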
3.1.8 An example of unsuccessful attempts at blinding.
Acupuncture is an example of a therapy that is difficult to blind. One study (Bullock 1989) of the effect of acupuncture on the prevention of recidivism among alcohol and other drug abusers used a placebo acupuncture that placed needles 5 mm away from the designated acupuncture point. Because of the nature of acupuncture, the acupuncturists were aware of which patients were which, making this a single-blind study.
3.1.9 How blinding might have been revealed.
A critique of this study (Sampson 1997) pointed out that there were significant interactions between the acupuncturists and the patients, with opportunities for indirect suggestion and nonverbal communication to occur. One indication that subjects became aware of who was in which group was the fact that there was a far greater tendency for control subjects to drop out of the study.
3.2 At the start of the study, did the patients know which group they were going to be in?
The randomization list should be concealed from those involved in recruiting subjects.
It is always possible to conceal the randomization list, even when the treatment itself cannot be blinded. Check all the exclusion criteria first and, if the subject qualifies, open a sealed envelope that identifies which group the patient belongs to. So, for example, it is impossible to use blinding when comparing a surgical to a non-surgical technique, but the selection of who gets the surgical technique could be hidden from both the patient and the surgeon until after all the selection and inclusion criteria are applied.
3.2.1 How knowledge of the treatment order can cause bias.
Knowledge of treatment order allows the doctors recruiting patients to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unblinded allocation schemes show an average bias of 30-40% (Schulz 1996).
3.2.2 Problems with systematic allocation.
Systematic allocations can also cause biases. For convenience, some researchers will allocate in a systematic (non-random) fashion, such as alternating regularly between the two treatments. This is a bad idea. Patients may arrive in a systematic order. Systematic allocations allow the doctors to guess which group the next patient is going to be allocated to. Systematic assignment causes an average bias of 15% (Colditz 1989).
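To see why alternation is predictable while a random list is not, consider the small simulation below. It is a sketch under simple assumptions: the function names, the seed, and the "A"/"B" group labels are invented for illustration, and a real trial would normally use a concealed list prepared by someone not involved in recruitment.

```python
import random

def alternating_allocation(n):
    """Systematic allocation: A, B, A, B, ... (fully predictable)."""
    return ["A" if i % 2 == 0 else "B" for i in range(n)]

def random_allocation(n, seed=None):
    """Simple randomization: each patient is assigned by an independent coin flip."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n)]

def guess_opposite_of_last(allocations):
    """Fraction of assignments predicted correctly by guessing 'not the previous group'."""
    pairs = zip(allocations, allocations[1:])
    hits = sum(1 for previous, current in pairs if previous != current)
    return hits / (len(allocations) - 1)

print(guess_opposite_of_last(alternating_allocation(1000)))        # 1.0, every guess is right
print(guess_opposite_of_last(random_allocation(1000, seed=1999)))  # roughly one half
```

A recruiter who can predict the next assignment every time has the same opportunity to steer patients as one who has seen the list.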
3.3 Did the authors rely on retrospective data?
Retrospective data are data collected by looking backward in time. We obtain these data by asking subjects to recall events that occurred earlier in their lives. We also get retrospective data when we review medical records, birth certificates, death certificates, or other sources of historical data. In contrast, data collected during the course of the study are known as prospective data.
3.3.1 The advantages and disadvantages of retrospective data.
Retrospective data are often inexpensive to collect, but you should be concerned about their accuracy. The ability of a subject to recall information is sometimes affected by which group they are in.
Women who have experienced miscarriages, for example, are more likely to search for and remember events that they feel might "explain" their miscarriage, much more so than a group of comparable control subjects. This differential level of reporting is known as recall bias.
In addition, historical data are often incomplete and it is sometimes difficult to verify their accuracy. Therefore, retrospective data are considered less authoritative than prospective data.
3.3.2 An example of recall bias.
An interesting review of the research process that helped establish that smoking causes lung cancer can be found in Gail (1996). One aspect of the research process was addressing the issue of recall bias.
Doll (1950) studied the association between tobacco smoking and cancer. They selected 709 patients with lung cancer and an equal number of matched controls. The authors were concerned about the retrospective assessment of smoking among patients in both groups. Would patients with lung cancer exaggerate the amount of smoking? Would the interviewers press harder for information about smoking among the cancer patients?
While it would be impossible to totally rule out recall bias, the authors did examine a third group, patients who were diagnosed with lung cancer and who later found out that they suffered from a different disease (false cases). If recall bias were the sole explanation of the difference in reported smoking, then the group of false cases should have reported a level of smoking similar to that of the lung cancer patients. Instead they reported a lower level of smoking. This helped to rule out the possibility that recall bias alone accounted for the higher reported smoking levels in the lung cancer patients.
3.9 Summary - Who knew what when?
Knowledge of group membership, either before or during the data collection can bias the study.
3.1 During the study, did the patients know which group they were in? While this is not always possible, it is preferred to use a blinded approach to remove the possibility of the placebo effect.
3.2 At the start of the study, did the patients know which group they were going to be in? Even when blinding is impossible, you can always hide the randomization plan through the use of sealed envelopes. This will ensure that the health professionals do not consciously or subconsciously influence group membership through the differential application of entry criteria.
3.3 Did the authors rely on retrospective data? Retrospective data are more likely to suffer from inaccuracy, incompleteness and bias.
4. Who was left out?
Exclusion of subjects can make the study biased or less generalizable.
4.1 Who was excluded at the start of the study?
Researchers, trying to minimize variation, will use exclusion criteria to create more homogeneous groups. While minimizing variability is good, too much homogeneity can backfire. It’s difficult to extrapolate results from a very tightly controlled and homogeneous clinical trial to the variation of patients seen in your practice. Ask yourself the question "How similar are my patients?"
For the study to be useful to us, we want the research subjects to be as similar as possible to the patients we see. Watch out for exclusion criteria that leave out large groups of patients. Also be aware that too many research studies exclude women unnecessarily.
Ask yourself whether the geographic location or the type of health care setting places restrictions on the type of patients seen. Tertiary care centers only see patients that are extremely ill. A study of Midwest hospitals will not have a representative number of Hispanic patients compared to the Southwest.
4.1.1 An example where the patient groups were narrowly drawn.
A study of secondhand smoke and cholesterol levels (Neufeld 1997) recruited 161 children who had been referred to the lipid program at Children's Hospital in Boston. That group of patients was divided into those who had exposure to secondhand smoke and those who did not. The first group had HDL cholesterol levels that were 10% lower than the second group. Since high levels of HDL cholesterol are considered protective against heart disease, the authors claimed that secondhand smoke may increase the risk for atherosclerosis in children.
While this is an interesting finding, note that these children, who were recruited solely from referrals to the hospital, represented children at the unhealthy end of the spectrum. The HDL levels of all of these children were significantly less than those of normal kids. It is difficult to extrapolate results from a set of unhealthy children to all children.
4.2 Who dropped out during the study?
It is inevitable that some patients will drop out during the study. If the number is more than a few, this is a cause for concern. Dropouts often have a different prognosis than those who stay. Ignoring the dropouts will often paint a rosier picture of the outcome. Was there any effort (financial inducement, follow-up reminders) made to minimize dropouts? Were the authors able to characterize the demographics of the dropouts?
Were non-compliant patients excluded? Non-compliance is often associated with poor prognosis. Excluding these patients may also paint a rosier picture of the outcome. Patients should be analyzed in the groups they were randomized to. This is known as "intention to treat" analysis.
4.2.1 An example why you should include non-compliant patients.
Consider a new surgical therapy which is being compared to a standard non-surgical therapy. Some patients randomized to the surgical therapy might die prior to receiving the therapy. This is the most extreme form of non-compliance. These patients should still be analyzed as part of the surgical therapy group. Otherwise the rapidly dying patients will be excluded from the treatment group, but not from the control group, leading to serious bias.
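A small worked example makes the size of this bias easier to see. All of the counts below are hypothetical, chosen only so that surgery has no real effect on mortality. The variable names are likewise invented for this sketch.

```python
# Hypothetical counts, chosen only to illustrate the bias.
randomized_to_surgery = 100
died_before_surgery = 10      # randomized to surgery but never operated on
died_after_surgery = 18
randomized_to_control = 100
died_in_control = 28

# Intention to treat: everyone is analyzed in the group they were randomized to.
itt_surgery = (died_before_surgery + died_after_surgery) / randomized_to_surgery
control = died_in_control / randomized_to_control

# Dropping the early deaths removes the sickest patients from one arm only.
early_deaths_excluded = died_after_surgery / (randomized_to_surgery - died_before_surgery)

print(f"Control mortality:              {control:.0%}")
print(f"Surgery, intention to treat:    {itt_surgery:.0%}")
print(f"Surgery, early deaths excluded: {early_deaths_excluded:.0%}")
```

With these made-up numbers the intention to treat analysis correctly shows identical mortality (28% in both arms), while excluding the early deaths makes surgery look better (20% versus 28%) even though it did nothing.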
4.3 Were volunteers used?
Quite often, the only patients we are able to study are those who volunteer to help out. The use of volunteers, however, may exclude important segments of the patient population.
4.3.1 Ways in which volunteers may differ from the overall patient population.
Volunteers may differ from the normal population on several critical factors. Volunteers for a study involving cash payments may come more often from economically challenged environments. If a free health check-up is included, volunteers may come more often from people worried about their health status. Volunteers for lengthy studies are less likely to be employed.
4.3.2 What to look for in studies using volunteers.
Examine the incentives and disincentives for participation. Are any incentives or disincentives related to important prognostic factors?
Were the researchers able to characterize various aspects of those who did not volunteer? How similar were the volunteers and non-volunteers?
Do people volunteer themselves into specific treatment groups? If so, we have an observational study.
4.3.3 The case of volunteers who are subsequently randomized.
Some studies involve the use of volunteers who are subsequently randomized into two groups. In this case, some problems will diminish. Comparison between the two groups will be unbiased, but it may be difficult to generalize to a non-volunteer population.
4.3.4 An example of volunteer bias.
Recruiting controls is especially troublesome in a study that involves a painful procedure. Gustavsson (1997) documents volunteer bias in a study of lumbar puncture to obtain cerebrospinal fluid.
In this study, subjects were asked to submit to a lumbar puncture in order to "examine the associations between personality traits and biochemical variables." Of the 87 subjects, 48 declined to participate. The authors were fortunate enough to have measures of personality on both those who participated in the study and those who did not participate.
4.3.5 An example of volunteer bias (continued).
Those who participated had scores roughly a half standard deviation higher on impulsiveness. They did not differ on other personality traits such as socialization and detachment.
The large difference in the impulsiveness measurement would obviously cloud any attempt to correlate personality traits and biochemical measurements in spinal fluids among those who volunteered.
4.3.6 Volunteers in survey studies.
An aspect of volunteering can occur in survey studies. People who volunteer to return a questionnaire are frequently quite different from those who refuse to fill out the survey. In particular, the non-responders tend to be more apathetic. Return rates for surveys vary by the type of survey, but if less than half of the subjects returned the survey, any results are of very limited value. Again, look for efforts to minimize non-response and/or efforts to characterize the demographics of non-responders.
Problems with volunteers are especially troublesome in surveys using 900 numbers and web-based surveys.
4.3.7 An example of a poor response rate.
In 1976, Shere Hite published a study on female sexual attitudes that represented the responses of 3,019 surveys. While that sounds impressive, it was a small fraction of the 100,000 surveys that were sent out.
One can speculate on the characteristics of those who failed to respond, but it is a pretty good bet that many of them felt uncomfortable discussing aspects of their sex lives in a survey format. It's obvious that this tendency alone would tend to affect many of the responses in the survey.
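The arithmetic behind "a small fraction" is worth doing explicitly. The sketch below computes the response rate from the two figures quoted above and compares it with the 50% rule of thumb from section 4.3.6. The variable names are invented for the example.

```python
surveys_sent = 100_000
surveys_returned = 3_019

response_rate = surveys_returned / surveys_sent
print(f"Response rate: {response_rate:.1%}")  # Response rate: 3.0%

# Rule of thumb from section 4.3.6: below 50%, results are of very limited value.
below_threshold = response_rate < 0.5
print("Below the 50% threshold" if below_threshold else "At or above the 50% threshold")
```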
4.9 Summary - Who was left out?
Exclusion of subjects can make the study biased or less generalizable.
4.1 Who was excluded at the start of the study? Excessively strict entry criteria in a research study can make it difficult to extrapolate to the types of patients that you normally see.
4.2 Who dropped out during the study? A large number of drop-outs during the course of a research study can bias the final conclusions.
4.3 Were volunteers used? Volunteers differ from others in ways that could affect the outcome.
5. How much did things change?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
5.1 Was there a quantitative measure of the size of the effect?
"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science." William Thomson Kelvin (Lord Kelvin)
Knowing that a new therapy is better is not enough information. You need to quantify how much better the new therapy is. In this respect, confidence intervals are better than p-values. A p-value tells you whether the new therapy is better. A confidence interval tells you whether the new therapy is better and by how much. A confidence interval allows you to balance the size of the improvement against the possibility of greater cost or more side effects. Many journals now require confidence intervals instead of p-values.
Statistical methods are sometimes able to detect differences that are so small as to be meaningless from any practical perspective. This is known as statistical significance without clinical significance. Always put the numbers into the perspective of your practice. Try to estimate how many of the patients you see within a year are likely to do better under the new therapy.
5.1.1 An example of statistical significance without clinical significance.
A study of non-steroidal anti-inflammatory drugs (NSAID) showed that patients who took these drugs were 50% more likely to develop upper gastrointestinal (UGI) bleeding (Carson 1987). This rate was statistically significant at alpha=.05. UGI bleeding, however, was rare in both groups: only 1 case per thousand person-years in the controls and 1.5 in the NSAID group. You would have to follow 100 patients for two decades in order to see one excess event of bleeding, on average.
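The "100 patients for two decades" figure follows directly from the two rates. The sketch below spells out the arithmetic. The variable names are invented for the example, and the rates are the ones quoted above.

```python
control_rate = 1.0 / 1000   # UGI bleeds per person-year among controls
nsaid_rate = 1.5 / 1000     # UGI bleeds per person-year among NSAID users

relative_increase = nsaid_rate / control_rate - 1            # relative scale
excess_per_person_year = nsaid_rate - control_rate           # absolute scale
person_years_per_excess_bleed = 1 / excess_per_person_year

patients = 100
years_needed = person_years_per_excess_bleed / patients

print(f"Relative increase in risk: {relative_increase:.0%}")
print(f"Excess bleeds per 1000 person-years: {excess_per_person_year * 1000:.1f}")
print(f"Person-years per excess bleed: {person_years_per_excess_bleed:.0f}")
print(f"Years of follow-up for {patients} patients: {years_needed:.0f}")
```

The same data look alarming on the relative scale (50% more bleeding) and reassuring on the absolute scale (half an extra bleed per thousand person-years), which is why you should look for both.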
In this article, the authors were up front about the very small increase in risk. Most authors, however, are so relieved to achieve statistical significance that they forget to consider whether the size of the difference will improve clinical practice.
This is summarized well in the following Gertrude Stein quote: "For a difference to be a difference it has to make a difference."
5.2 Could other factors account for this effect?
Always look for factors that can cause bias. Three major sources of bias are covariate imbalance, measurement error, and differential handling of the two groups.
Covariate imbalance occurs when the two groups differ on important prognostic factors. For example, one group may tend to have older children, or a greater proportion of smokers.
Measurement error is simply the inability to measure an important variable accurately. Measurement error in the outcome variable does not ordinarily cause bias, but measurement error in factors that can predict the outcome is of serious concern.
Differential handling occurs when the two groups are not treated equally, save for the difference between the two therapies. Does one group have greater access to medical care, or are they evaluated more frequently?
5.2.1 Covariate imbalance.
Almost all observational studies have problems with covariate imbalance because observational studies cannot use randomization to ensure balance between the two groups. But even in a randomized study, it is possible that some differences will crop up randomly, especially if there are a large number of variables that can vary between the two groups.
There are statistical methods to control for imbalance (e.g., matching) or to adjust for covariate imbalance (e.g., analysis of covariance).
Still, some variables may be difficult or impossible to measure, such as the patient's psychological state, presence of comorbid conditions, or underlying severity of the illness. These factors can be balanced out only if we randomize.
5.2.2 Covariate imbalance in a randomized study.
In a yet to be published research study here at Children's Mercy Hospital, pre-term infants were randomized either to a group that received normal bottle feeding while they were in the hospital or to a nasogastric (ng) tube feeding group. The researchers wanted to see if the latter group of infants, because they had not become habituated to bottle feeding, would be more likely to breastfeed after discharge from the hospital.
The randomization was only partially effective at preventing covariate imbalance. The infants had comparable birth weights, gestational ages, and Apgar scores. There were similar proportions of caesarian section and vaginal births in both groups. But the mothers in the ng tube group were older on average than the mothers in the bottle fed group.
Since older mothers are more likely to breast feed than younger mothers, we had to include mother's age in an analysis of covariance model so that the effect of ng tube feeding could be estimated independent of mother's age.
5.2.3 Measurement error.
There are several ways to assess dietary fat intake. The most accurate (and also the most costly) way is through the use of prospectively recorded food diaries.
Sometimes the cost limitations or the retrospective nature of a research study will require a less accurate assessment of dietary fat, such as through an interview. Shapiro (1997) points out that estimation of dietary fat using interviews tends to correlate poorly with estimation using prospective diaries. This would cast doubt, for example, on retrospective studies that tried to associate dietary fat intake with the risk of breast cancer.
5.2.4 Differential handling of the treatment groups.
Beware of situations where the two treatment groups are handled differently. An example of this would be the study of women who use oral contraceptives. These women visit a doctor at least every six months to get their prescriptions renewed. If these women are compared to women who do not use oral contraceptives, then the former group will probably be evaluated by a doctor more frequently. An increase in the prevalence of certain diseases may actually reflect the fact that these diseases are diagnosed earlier because of the more frequent visits.
Similarly, if a certain drug is suspected to have certain side effects, doctors may question more closely those patients who are on that medication, creating a self-fulfilling prophecy.
5.3 Were any important outcomes forgotten?
There is a tendency to focus on intermediate measures that are easy to assess, but which may or may not be predictive of more important endpoints. Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks. If an intermediate endpoint is used, ask yourself whether there is an adequate link between this endpoint and something that is relevant to your patients.
Be careful that you don’t focus solely on the outcomes mentioned in the abstract. There is a tendency to report in the abstract only the outcome measures that were statistically significant, rather than the outcome measures of most interest to health care professionals.
Also, always consider whether the researchers adequately examined side effects.
5.9 Summary - How much did things change?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
5.1 Was there a quantitative measure of the size of the effect? Look for a confidence interval and compare the size of the effect to what you would expect to see in your practice.
5.2 Could other factors account for this effect? Look for differences in demographics between the two groups and ask if these differences could explain the results of the research.
5.3 Were any important outcomes forgotten? Research results should focus on endpoints that are of interest to your patients.
6. Case studies.
In this section, we will apply the techniques discussed in the previous five sections to two research papers. The first paper is a study of Vitamin C as a treatment for advanced cancer. The second is a study of nicotine patch therapy in adolescent smokers.
6.1 Vitamin C therapy and cancer.
This example is highlighted in Chapter 1 of Observational Studies by Paul R. Rosenbaum. Cameron and Pauling published an observational study of Vitamin C as a treatment for advanced cancer. For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).
Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital." (Cameron 1976).
Nine years later, the Mayo Clinic conducted a randomized experiment which showed no statistically significant effect of Vitamin C (Moertel 1985).
6.1.1 What went wrong with the Cameron and Pauling study?
The problem with the Cameron and Pauling paper becomes obvious when you ask "Who did the choosing?" Controls were recruited from death certificates. The authors estimated survival time by retrospectively estimating the date at which a prognosis of terminal cancer was made. The Vitamin C group was recruited from people freshly diagnosed with terminal cancer. The authors estimated survival time prospectively from the time therapy was started.
The Cameron and Pauling study had two major flaws. First, the controls and the Vitamin C group differed on a major prognostic factor. No matter how grim the prognosis was in the Vitamin C group, it can’t compare to the prognosis of someone who has already died. Second, the controls were evaluated differently. Their survival times were estimated retrospectively; the Vitamin C survival times were estimated prospectively.
6.2 Case Study #2: Nicotine Patch Therapy in Adolescent Smokers.
The Children's Mercy Hospital Journal Club discussed a paper on nicotine patches (Smith 1996).
The authors recruited 22 volunteers from five public high schools in the Rochester, MN area for participation in a smoking cessation program involving behavioral counseling, group therapy, and nicotine patches. Researchers measured the number of cigarettes smoked, side effects, and blood levels of nicotine.
The purpose of the research was to evaluate "the safety, tolerance, and efficacy of 22 mg/d nicotine patch therapy in smokers younger than 18 years who were trying to stop smoking." The authors also listed a secondary goal, "to compare blood cotinine levels, nicotine withdrawal scores, and adverse experiences with those of adults obtained in previous patch studies." Cotinine is a metabolite of nicotine and provides a useful objective measure of cigarette smoking. It also allowed the authors to examine whether nicotine toxicity was an issue.
6.2.1 Who did the choosing?
Was there a good comparison group? If a comparison of smoking rates with historical controls had been done, it would have been problematic because of the timing issue.
Did the authors create the groups? There was not a well-defined control group in this study. Smoking cessation rates could have been compared to historical smoking cessation rates, but this was not done. Perhaps the cessation rates in this study were so poor that no comparison was necessary. Blood cotinine levels were compared to those of adults in an inpatient smoking cessation program. The groups failed to overlap on age and on patient status (all the adults were inpatients and all the teenagers were outpatients).
Was the assignment randomized? This design did not allow for randomization.
6.2.2 Was there a plan?
Were there enough subjects? Unfortunately, there was no assessment of adequacy of sample size, although the authors claimed there were no major side effects. Was this because the patch is safe, or because they did not study enough subjects to find major side effects?
The authors did provide confidence intervals for smoking cessation rates. At eight weeks, they computed a 14% success rate (95% CI 2.9 to 34.9). At six months, they computed a 4.5% success rate (95% CI 0.1 to 22.8). While these are not the narrowest intervals, they are sufficiently narrow to rule out the possibility of any large rates of smoking cessation. Thus, from the perspective of efficacy, the sample size was probably sufficient.
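If you want to check intervals like these yourself, an exact (Clopper-Pearson) binomial interval is one common choice for small samples. The sketch below assumes, as an inference from the reported percentages and not something stated in the paper summary above, that the two rates correspond to 3 of 22 and 1 of 22 subjects. It also assumes the authors used an exact method. The function names are invented for this example.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_decreasing(f, target):
    """Find p in [0, 1] with f(p) = target by bisection; f must decrease as p grows."""
    low, high = 0.0, 1.0
    for _ in range(100):
        mid = (low + high) / 2
        if f(mid) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

def exact_ci(successes, n, level=0.95):
    """Clopper-Pearson (exact) confidence interval for a binomial proportion."""
    alpha = 1 - level
    # Lower limit solves P(X >= successes) = alpha / 2.
    lower = 0.0 if successes == 0 else solve_decreasing(
        lambda p: binom_cdf(successes - 1, n, p), 1 - alpha / 2)
    # Upper limit solves P(X <= successes) = alpha / 2.
    upper = 1.0 if successes == n else solve_decreasing(
        lambda p: binom_cdf(successes, n, p), alpha / 2)
    return lower, upper

# Assumed counts: 3 of 22 quit at eight weeks, 1 of 22 at six months.
for successes in (3, 1):
    low, high = exact_ci(successes, 22)
    print(f"{successes}/22 = {successes / 22:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Compare the printed intervals with the ones quoted from the paper. Close agreement would support the assumed counts and method.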
Did the research have a narrow focus? In general, the authors did well to keep a narrow focus. The assessment of side effects is always troublesome, and here the authors noted three types of skin reactions (erythema only, erythema and edema, and erythema and vesicles), headaches, nausea and vomiting, tiredness, dizziness, arm pain, shortness of breath, pyelonephritis (kidney infection), abdominal pain, back pain, fever, cough, flu, diarrhea, shakiness, and depression. While this list of side effects is very long, it would be difficult to shorten it much. The authors did note that none of the reported side effects were serious.
Did the authors deviate from the plan? There are no stated deviations from the protocol.
Did the authors discard outliers? There were no efforts to exclude outliers from any statistical analysis.
6.2.3 Who knew what when?
Was the new therapy indistinguishable from the standard therapy? The study was not blinded, even though a blinded study (using a placebo patch) was possible. This is a major disappointment, but perhaps reflects the preliminary nature of this research. The lack of blinding implies that results presented on safety, tolerance, and efficacy could be accounted for by a placebo effect.
Was the randomization plan known prior to selecting patients? This design did not allow for randomization.
Did the authors rely on retrospective data? The authors did ask students each week to recall their cigarette smoking over the previous week. This time frame was short enough to avoid problems with recall bias. Furthermore, the authors also included exhaled carbon monoxide levels as an objective measure to validate the self-reported data.
6.2.4 Who was left out?
Who was excluded at the start of the study? The Rochester, MN location excluded minority students. All the subjects in this study were white. Subjects had to get parental permission, excluding smokers who wished to keep their habit secret from their parents. Subjects were also volunteers, and thus could be considered more motivated to quit than the typical teenage smoker.
Who dropped out during the study? The study had a serious dropout rate. Of the presumably thousands of teenage smokers in the Rochester, Minnesota area, only 71 volunteers responded to the initial call for subjects. Of the 71 volunteers, 55% met inclusion criteria. Of the remaining 39, 44% declined to attend the initial meeting. Of the remaining 22, 14% were non-compliant. Of the remaining 18, 39% failed to respond to the one year survey. Only 11 completed the entire study (50% of those who started the study; 28% of those meeting inclusion criteria; 15% of the initial volunteers).
Fortunately, noncompliant subjects were treated as if they were still smoking (intention to treat). The researchers also took the trouble to characterize the noncompliant subjects and showed that they did not drop out because of any side effects of the nicotine patch.
Were volunteers used? The subjects of the research study were all volunteers. Volunteers could be expected to be more motivated to quit than typical adolescent smokers.
6.2.5 How much did things change?
Was there a quantitative measure of the size of the effect? To their credit, the authors did provide confidence intervals. At eight weeks, they computed a 14% success rate (95% CI 2.9 to 34.9). At six months, they computed a 4.5% success rate (95% CI 0.1 to 22.8). But no attempt was made to compare these cessation rates with historical rates.
Could other factors account for this effect? No direct comparisons were made for efficacy. Indirect comparisons were made between side effects experienced by the teenagers and the side effects experienced by adults. If there were an age effect (e.g., the older you are, the more likely you are to report side effects), then this could be a problem.
Were any important outcomes forgotten? While the major outcome (smoking cessation rate) was not totally overlooked, it was subordinated to the study of side effects. 90% of the paper focused on the side effects data. Nowhere did the authors mention that the long term success rate (4.5%) was substantially less than what could be hoped for.
6.9 Summary of all five sections
The following questions will help you assess the quality of a journal article about a new therapy.
1. Who did the choosing?
1.1 Was there a good comparison group?
1.2 Did the authors create the groups?
1.3 Was the assignment randomized?
2. Was there a plan?
2.1 Were there enough subjects?
2.2 Did the research have a narrow focus?
2.3 Did the authors deviate from the plan?
2.4 Did the authors discard outliers?
3. Who knew what when?
3.1 During the study, did the patients know what group they were in?
3.2 At the start of the study, did the patients know what group they were going to be in?
3.3 Did the authors rely on retrospective data?
4. Who was left out?
4.1 Who was excluded at the start of the study?
4.2 Who dropped out during the study?
4.3 Were volunteers used?
5. How much did things change?
5.1 Was there a quantitative measure of the size of the effect?
5.2 Could other factors account for this effect?
5.3 Were any important outcomes forgotten?
8.0 Resources
Chapter 1
A case-control study of HIV seroconversion in health care workers after percutaneous exposure. Centers for Disease Control and Prevention Needlestick Surveillance Group. D. M. Cardo, D. H. Culver, C. A. Ciesielski, P. U. Srivastava, R. Marcus, D. Abiteboul, J. Heptonstall, G. Ippolito, F. Lot, P. S. McKibben, D. M. Bell. N Engl J Med 1997: 337(21); 1485-90.
Controlled trial of acupuncture for severe recidivist alcoholism. M. L. Bullock, P. D. Culliton, R. T. Olander. Lancet 1989: 1(8652); 1435-9.
A controlled trial of immunotherapy for asthma in allergic children. N. F. Adkinson, Jr., P. A. Eggleston, D. Eney, E. O. Goldstein, K. C. Schuberth, J. R. Bacon, R. G. Hamilton, M. E. Weiss, H. Arshad, C. L. Meinert, J. Tonascia, B. Wheeler. New England Journal of Medicine 1997: 336(5); 324-31.
Crossover and Self-Controlled Designs in Clinical Research. TA Louis, PW Lavori, JC Bailar, M Polansky. In: Bailar JC and Mosteller F ed. Medical Uses of Statistics, Second Edition. 1992; Boston MA: NEJM Books.
Dietary Fat Intake and the Risk of Coronary Heart Disease in Women. Frank B. Hu, Meir J. Stampfer, JoAnn E. Manson, Eric Rimm, Graham A. Colditz, Bernard A. Rosner, Charles H. Hennekens, Walter C. Willett. N Engl J Med 1997: 337(21); 1491-1499.
Evidence-based medicine and treatment choices. D. L. Sackett. Lancet 1997: 349(9051); 570; discussion 572-3.
Fat chance: diet and ischemic stroke [editorial; comment]. R. Sherwin, T. R. Price. JAMA 1997: 278(24); 2185-6.
Statistics in Action. M.H. Gail. Journal of the American Statistical Association 1996: 91(433); 1-13.
Studies Without Internal Controls. JC Bailar, TA Louis, PW Lavori, M Polansky. In: Bailar JC and Mosteller F ed. Medical Uses of Statistics, Second Edition. 1992; Boston MA: NEJM Books.
Chapter 2
Assessing cause and effect from trials: a cautionary note. D. Howel, R. Bhopal. Control Clin Trials 1994: 15(5); 331-4.
Cured and broiled meat consumption in relation to childhood cancer: Denver, Colorado (United States). S. Sarasua, D. A. Savitz. Cancer Causes Control 1994: 5(2); 141-8.
Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. D. G. Altman, B. Lausen, W. Sauerbrei, M. Schumacher. Journal of the National Cancer Institute 1994: 86(11); 829-35.
Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin. A randomized controlled trial. Nutritional Prevention of Cancer Study Group. L. C. Clark, G. F. Combs, Jr., B. W. Turnbull, E. H. Slate, D. K. Chalker, J. Chow, L. S. Davis, R. A. Glover, G. F. Graham, E. G. Gross, A. Krongrad, J. L. Lesher, Jr., H. K. Park, B. B. Sanders, Jr., C. L. Smith, J. R. Taylor. JAMA 1996: 276(24); 1957-63.
The Importance of Beta, the Type II Error, and Sample Size in the Design and Interpretation of the Randomized Control Trial. JA Freiman, TC Chalmers, H Smith Jr, RR Kuebler. In: Bailar JC and Mosteller F ed. Medical Uses of Statistics, Second Edition. 1992; Boston MA: NEJM Books.
Ozone Depletion, History and politics. Brien Sparling. Accessed on 2002-11-27. www.nas.nasa.gov/About/Education/Ozone/history.html
Chapter 3
An addition to the controversy on sunlight exposure and melanoma risk: a meta-analytical approach. P. J. Nelemans, F. H. Rampen, D. J. Ruiter, A. L. Verbeek. J Clin Epidemiol 1995: 48(11); 1331-42.
Blinding and exclusions after allocation in randomised controlled trials: survey of published parallel group trials in obstetrics and gynaecology. Kenneth F Schulz, David A Grimes, Douglas G Altman, Richard J Hayes. BMJ 1996: 312(7033); 742-744.
Controlled trial of acupuncture for severe recidivist alcoholism. M. L. Bullock, P. D. Culliton, R. T. Olander. Lancet 1989: 1(8652); 1435-9.
How study design affects outcomes in comparisons of therapy. I: Medical. GA Colditz, JN Miller, F. Mosteller. Stat Med 1989: 8(4); 441-454.
Inconsistencies and Errors in Alternative Medicine Research. W Sampson. Skeptical Inquirer 1997: 21(5); 35-38.
Removing bias in surgical trials. A. G. Johnson, J. M. Dixon. British Medical Journal 1997: 314(7085); 916-7.
Smoking and Carcinoma of the Lung: Preliminary Report. R. Doll, A.B. Hill. British Medical Journal 1950: 2(4682); 739-748.
Statistics in Action. M.H. Gail. Journal of the American Statistical Association 1996: 91(433); 1-13.
Why Bogus Therapies Seem to Work. Barry L. Beyerstein. Skeptical Inquirer 1997: 21(5).
Chapter 4
The healthy control subject in psychiatric research: impulsiveness and volunteer bias. J. P. Gustavsson, M. Asberg, D. Schalling. Acta Psychiatr Scand 1997: 96(5); 325-8.
Passive cigarette smoking and reduced HDL cholesterol levels in children with high-risk lipid profiles. E. J. Neufeld, M. Mietus-Snyder, A. S. Beiser, A. L. Baker, J. W. Newburger. Circulation 1997: 96(5); 1403-7.
Chapter 5
The association of nonsteroidal anti-inflammatory drugs with upper gastrointestinal tract bleeding. J. L. Carson, B. L. Strom, K. A. Soper, S. L. West, M. L. Morse. Arch Intern Med 1987: 147(1); 85-8.
Chapter 6
High-dose vitamin C versus placebo in the treatment of patients with advanced cancer who have had no prior chemotherapy. A randomized double-blind comparison. C Moertel. New England Journal of Medicine 1985: 312(3); 137-141.
Nicotine patch therapy in adolescent smokers. T. A. Smith, R. F. House, Jr., I. T. Croghan, T. R. Gauvin, R. C. Colligan, K. P. Offord, L. C. Gomez-Dahl, R. D. Hurt. Pediatrics 1996: 98(4 Pt 1); 659-67.
Observational Studies. PR Rosenbaum (1995) New York: Springer-Verlag.
Supplemental ascorbate in the supportive treatment of cancer: Prolongation of survival times in terminal human cancer. E. Cameron, L. Pauling. Proc Natl Acad Sci U S A 1976: 73(10); 3685-9.
Chapter 7
Cook DJ, Guyatt GH, Ryan E, Clifton J, Buckingham L, Willan A, McIlroy W, Oxman AD. "Should unpublished data be included in meta-analyses?" Journal of the American Medical Association, 269: 2749-2753 (1993).
Davies HTO, Crombie IK, Tavakoli M. "When can odds ratios mislead?" British Medical Journal, 316: 989-991 (1998).
Dickersin K. "The existence of publication bias and risk factors for its occurrence" Journal of the American Medical Association, 263: 1385-1389 (1990).
Glass GV, McGaw B, Smith ML. Meta-analysis in social research. pp.18-20. Newbury Park CA: Sage (1981).
Gøtzsche PC. "Reference bias in reports of drug trials." British Medical Journal, 295: 654-656 (1987).
Halvorsen KT, Burdick E, Colditz GA, Frazier HS, Mosteller F. "Combining Results from Independent Investigations: Meta-analysis in Clinical Research" pp. 413-426, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
Hu FB, Stampfer MJ, Manson JE, Rimm E, Colditz GA, Rosner BA, Hennekens CH, Willett WC. "Dietary Fat Intake and the Risk of Coronary Heart Disease in Women." New England Journal of Medicine, 337(21): 1491-1499 (1997).
Ravnskov U. "Cholesterol lowering trials in coronary heart disease: frequency of citation and outcome." British Medical Journal, 305: 15-19 (1992).
Sacks HS, Berrier J, Reitman D, Pagano D, Chalmers TC. "Meta-Analyses of Randomized Control Trials: An Update of the Quality and Methodology" pp. 427-442, in Medical Uses of Statistics: 2nd Edition, Bailar JC and Mosteller F (editors), Boston MA: NEJM Books (1992).
Shapiro S. "Is Meta-Analysis a Valid Approach to the Evaluation of Small Effects in Observational Studies?" Journal of Clinical Epidemiology. 50(3): 223-229 (1997).
Other resources
Discern Online. Quality Criteria for Consumer Health Information. Deborah Charnock, Sasha Shepperd. Accessed on 2003-09-15. www.discern.org.uk/
The Glossary of Mathematical Mistakes. Paul Cox. Accessed on 2003-06-10. www.mathmistakes.com/
Junk Science. Steve Milloy. Accessed on 2003-09-15. www.junkscience.com
Skeptical Inquirer. The Magazine for Science and Reason. CSICOP. Accessed on 2003-09-11. www.csicop.org/si/
Some Remarks on Wild Observations. William H. Kruskal. Accessed on 2002-11-27. www.tufts.edu/~gdallal/out.htm
This webpage was written by Steve Simon on 1999-08-01, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence