Children's Mercy Hospital

Stats #44: Things You Need to Know Before Starting a Research Project

Content: This class will introduce you to the statistical issues important in developing a research study. It combines material from classes #32, 42, and 52. This class is useful for anyone who participates in the planning of research. There are no pre-requisites for this class.

Teaching strategies: Didactic lectures and small group exercises.

Objectives: In this class you will learn how to:

  • identify various research designs and their limitations;
  • recognize factors that influence the sample size of a study;
  • assess how restrictions on your sample can hamper generalizability; and
  • identify ethical issues associated with randomization and blinding.

This class qualifies for 3 IRB Education Credits (IRBECs).

Contents

  • Overview of the STATS web pages
  • Consulting services that I provide
  • Where do research ideas come from? (Ronan Conroy)
  • Developing a research hypothesis
  • Getting IRB approval for your research
  • Three things you need for a power calculation
  • Statistical Evidence: Overview
  • Statistical Evidence: Apples or Oranges
  • Please fill out an evaluation form

Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.

Here are some examples of the services that we have provided:

  • setting up your research hypothesis,
  • selecting and justifying your sample size,
  • writing the statistical methods section for your grant,
  • preparing randomization tables for your study,
  • reviewing your surveys for content and quality,
  • developing a system for entering your data,
  • choosing an appropriate statistical model for your data,
  • establishing validity and/or reliability for your measurement scales,
  • checking for violations of statistical assumptions in your data,
  • producing graphs and tables for your research publication, and
  • providing references for new and unusual statistical methods.

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

  • Take the elevator of the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood and cross 23rd Street. Your destination is Building M 3, which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It’s best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of Building M 2 in order to get to our building’s entrance on its south side. Another route would be to exit the Hospital Hill Center Building on Holmes and then walk ½ block north to 23rd Street, cross 23rd Street, walk west to Kenwood, then north to Building M 3 (address: 2220 Kenwood).

This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Where do research ideas come from? by Ronan Conroy (September 20, 1999)

This is an HTML format version of an email by Ronan Conroy on April 9, 1999 to edstat-l, an Internet list and to sci.stat.edu, a USENET group. This email summarized a presentation he made about how to develop ideas for research. I have made some minor formatting changes (mostly the use of bolding, bulleting, and indenting to highlight the major themes), but all of the credit for writing up this summary belongs to Ronan Conroy. Part of this presentation represents a summary of discussions on edstat-l and sci.stat.edu. Here is the acknowledgement in Dr. Conroy's original email.

I'd like to thank the many people who took part in the discussion, or who wrote to me privately, and to stress that the quotes in it are often the person who made the point most memorably, rather than the only person who said it.

Many thanks to Chris Zorn, Gabriele Susinno, Giovanni S. Leonardi, the... inimitable dennis... roberts, Joe Ward, Michael Granaas, Roland Andersson, Jay Warner, Alex Heath, The Anthonys, Bob Frick, Jerry Dallal, Tjen-Sien Lim, Tim Cole, John F. Schnell, Robert C. Knodt, Paul Velleman and Joseph L. McCrary. Did I say Herman Rubin? I did now.

This material is reproduced with permission from Dr. Conroy. For what it's worth, I have included a copy of the original email. It's pretty clear that there was some formatting in Dr. Conroy's original document that got lost when it was translated to a text format for email.

Introduction

This paper tackles one of the questions that statisticians dread most: the most basic one of all. How do you start formulating a research project? It began life as a talk at a research seminar in the Rotunda Hospital, Dublin. Trying to write it up, I decided to mail the statistics lists that I subscribe to. This paper has been greatly enriched by the ideas and discussion generated on edstat-l, the statistics teaching list, as well as contributions from subscribers to the stata list and the UK statistics list allstat. Quotes are often attributed to the person who made the point most memorably, rather than the only person who made it; many of the ideas emerged repeatedly in different postings. I'd like to thank all those who took part in the discussion.

Exploring your environment

The first thing you need to do is identify your resources for research. This is often easier when you first arrive somewhere. After a while you begin seeing an environment as the place where you work or live or eat. You need to see it with a fresh eye to see it as a potential research environment.

Don't forget that your research environment includes not just your patients and your colleagues, but also includes any source of data, ideas or help that you have access to. Many of my own research projects have taken shape because my office is next door to the psychology department; a casual remark has often triggered a flurry of speculation, articles rooted out, contacts mentioned and so on.

The internet is also a valuable environment. Discussion lists abound, which can provide not just free advice but also an insight into current controversies and new directions in research. Simply subscribing to a list and reading the postings (the word for a person who does this is a lurker) without taking part in the discussions will often give you ideas.

General resources

How much time will you be able to devote to research? To what extent can you integrate it into your daily work?

Will colleagues help? For instance, if you need blood taken outside working hours, will the doctor on call oblige? Will nursing staff collaborate by collecting extra information?

Do you have access to a person, unit or department with a specific research interest? They can often be a useful source of ideas. Never underestimate the value of just going for coffee with someone who does a lot of research, or, better, a research team. The speed with which a bunch of researchers can take a vague idea and shape it into a research design is amazing. Most of these ideas go nowhere, but eavesdropping on the process can help you to do it yourself.

Giovanni Leonardi of the Environmental Epidemiology Unit at London School of Hygiene and Tropical Medicine put it like this: "There are many potential research ideas that never make it to becoming research projects, and the likelihood that a research idea will become a research project is heavily influenced by this idea having been selected and refined in an environment where potential ideas are routinely tested for their viability. Think of this as 'natural selection' of research ideas within the research environment."

Do you have access to a statistician, or someone who can advise you on study design and sample size?

What library facilities do you have access to? Skimming journals is a good ideas generator, which I will deal with in more detail later; but access to a good library, including literature searching and reprint ordering facilities, is a must. Add extra points for library staff who are willing to do literature searching with you looking over their shoulder to refine the search.

What computer facilities are available?

  • Ideally you should be able to write your own research papers and reports.
  • This is helped greatly by being able to create tables and graphs, so if you can't do this, learn.
  • The next real bonus is having a dedicated bibliographic package which you can use to store, organize and annotate your references, and to generate bibliographies for papers -- well worth investing in.
  • Finally, have you a statistical package, or access to someone who will do your statistics for you? Ideally, you should be able to examine your data using graphs and tables, rather than handing the whole analysis over to a number cruncher. Professional statisticians are best used to answer complex questions, but you should be able to do the simple statistics yourself, perhaps guided by a statistician. Not being able to do any statistics means that you lose contact with your data at the vital point when it is being investigated for patterns, anticipated and unanticipated.

What are they funding this year? This sounds like a cynical point, but if there are funds available for research in specific areas, make use of them. What charities are there who might be interested in your research area? Talk to colleagues; there is often no single listing of available research sponsors, and you have to rely on the grapevine.

Specific resources

Do you have access to information already collected which could be the basis for a research project? This information could have been collected as routine clinical information. Although you probably cannot do a research project solely on the contents of patients' charts, routinely collected information may allow you to

  • Identify patient groups that are interesting to study (and figure out if there are enough of them to be worth studying!)
  • Identify controls
  • Add already-collected clinical information to the data that you will collect in the course of your project.

Information may also be available as an offshoot of another research project. You may liaise with another research project and

  • Study a subgroup of their patients in more detail
  • Follow up a previously-studied group
  • Add a sub-study of your own to a study that is being planned or, not so easy, ongoing.

It is a good idea to talk to people who are doing research in the setting in which you work. They will be able to spot potential difficulties in proposals, and may also have useful ideas as to what they would do if they had access to your facilities.

Potential projects

Now all you need to do is to get an idea for a project which will be realistic, given the resources available to you. This is often a stumbling block. I had one person come into the office to discuss a research project with me. 'I have 24 patients with rapid cycling mood disorder' he said. And stopped, waiting for me to say something. The trouble is that 24 patients with rapid cycling mood disorder is no more a research project than 24 trout in a shoebox. What you need to ask yourself is 'what do we not know about rapid cycling mood disorder'?

One very important piece of advice that recurred frequently in the edstat-l discussion was the need to develop many ideas simultaneously. Christie Brown, Assistant Professor of Marketing at University of Michigan Business School, tells her students to imagine that inside them they have a large basket of research ideas, some better than others:

'I point students toward Donald Campbell's work on creativity. Campbell suggests one secret to generating better ideas lies in the QUANTITY of ideas generated. In other words you stand the best chance of pulling an idea from the "high" end of your good-idea basket if you make a lot of draws.' (Campbell, Donald T. "Blind variation and selective retentions in creative thought as in other knowledge processes." Psychological Review. 1960;67:380-400.)

Don't focus prematurely on a single idea. Develop a few together. It's like the process of conception: the chances of a child resulting from a single act of sexual intercourse are small. But the chances of a child not resulting from regular sexual intercourse are likewise small. Carry a notebook and write down every idea that you get, good or bad. You will learn from thinking about why the bad ones are bad as well as from why the good ones are good.
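The arithmetic behind "make a lot of draws" can be sketched in a few lines. This is only an illustration, not part of the original discussion: it assumes that each idea independently has the same small chance p of being a good one, and the value p = 0.05 and the function name are hypothetical choices made for the example.

```python
def chance_of_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent draws succeeds.

    p is the assumed (hypothetical) chance that any single idea is good.
    The chance that all n ideas fail is (1 - p) ** n, so the chance that
    at least one succeeds is the complement of that.
    """
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.05  # hypothetical: one idea in twenty is worth pursuing
    for n in (1, 5, 20, 60):
        print(f"{n:3d} ideas -> chance of at least one good one: "
              f"{chance_of_at_least_one(p, n):.2f}")
```

Under these assumptions a single idea rarely pays off, but the chance of having at least one good idea climbs steadily as the notebook fills up. Real ideas are neither independent nor equally promising, so treat the numbers as a rough guide only.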

Christie Brown again: 'Write down everything. Do not self-censor. Keep a log of your baby-ideas in case they end up being worth pursuing. Get in the habit of generating at least one idea based on everything you read in your domain and even out of it.'

Bob Frick, a cognitive scientist, actually forces students to develop a number of research ideas as a learning exercise. 'The assignment was to come up with three "kernels", and the students had about a month to do it. The notion was that they were supposed to find some original idea they had. It usually ended up being an original observation. Original to them -- it didn't have to be original to the field of psychology. Their original idea would then be a kernel that could be developed into an experiment. Most people have these, but they don't pay attention.'

Extending the ideas of others

Much of the discussion on edstat-l centred around where ideas for research projects come from. The sources of ideas divide into two:

  • Replicating/extending the work of others
  • Doing something original.

I'll take the easy one first!

Repeating research that has been done by others doesn't sound like an exciting task, but there are several important reasons why it needs to be done, and there are some other benefits too. The reasons why research needs to be replicated include:

Local research is needed to make sure that findings from other countries apply locally. Indeed, basic research is constantly needed to monitor local health needs and to evaluate the services being delivered.

  • Is the problem the same here? For instance, research has shown that liaison psychiatry in Ireland sees a very different spectrum of morbidity than its counterpart in the US. Planning services using US models is inappropriate. If you suspect that the service in which you work sees a spectrum of patients or problems that is different to what you would expect, research this. It can lead to a better understanding of the health service needs in your service.
  • Do the strategies worked out elsewhere apply? Many treatments and interventions have been researched in settings that may be quite different to your own. You might well ask if they are best for your setting.

All research needs extension to new contexts and development along an obvious line. Clinical trials are often done on homogeneous, idealised patient groups; they need extension to realistic groups such as those with comorbidity, or beyond the age range of the original research. Think of

  • Treatment of hypertension in the elderly. This is a classic case where treatment was wrong for many years because it was based on the results of studies of younger patients.
  • Use of treatments in neonates that have only been studied in older children or adults. This example has become topical recently, as paediatricians have realised that very young children are often excluded from treatment trials, yet the treatments end up being used to treat them.
  • Many treatments are evaluated on patients who have 'pure' diagnoses, while they are frequently applied to people who have more than one disorder.

Factors which have been identified in a disease may be present in other similar diseases. Since its role in peptic ulcer disease was uncovered, H pylori has been investigated for many other unsolved crimes.

  • Is it responsible for the digestive symptoms some pregnant women experience?
  • Is it linked to cancers?

Yes, there is a feeling of a bandwagon rolling along, but someone has to check out these questions.

You may spot an explanation which the original study failed to identify and test. This is, of course, classic 'stroke-of-genius' research. Just remember, though, that the explanations that are most often overlooked are the commonest, most familiar things.

You may not believe a piece of research. Not all research is good research. I have, several times, replicated and extended research because I didn't believe it. Incredible research deserves to be replicated. If you confirm the original findings, you have helped to overcome the resistance that they will find in being accepted. If you fail to confirm the findings, this in itself is interesting. Though try to make sure that the original author isn't asked to review your paper!

Even doing a straightforward replication of a previous study can be a very worthwhile exercise. As a first project, it means that you already have a 'canned' methodology, and you will learn a lot about running, analysing and presenting research. But there are often surprises too.

Chris Zorn of Emory University wrote: 'As a social scientist responsible for training grad students in statistics, one thing that I've always found useful is replications. While the main reason I use replications is to teach students statistics and/or software, these exercises often prompt them to extend the work they are replicating. These can range from the simple (e.g. testing for relationships in the data that the original investigators didn't look for) to the very involved. The result is often interesting, if a bit derivative, research projects, some of which have led to PhD theses, etc.'

Roland Andersson puts it simply: Dig where you stand. That is, make use of all the data that is already at hand and that nobody had time to analyse. Almost always there will be unexpected or unknown patterns in these data that can be detected if you analyse them with an open mind. You do not always need to have a research idea ready when you start. Ideas will come up when you try to formulate an explanation for the patterns that you find in your data.

Alex Heath, an economist from Australia, wrote: 'A good way to get started thinking about research questions for me is to find things which have been done overseas (usually the US or the UK) and adapt them to Australian data. I find that once you start replicating things you find interesting twists and turns which allow you to say something completely new.'

Although I have replicated several studies because I didn't believe them, this probably isn't the best spirit in which to replicate. But neither should you simply accept the original research as scripture. Paul Velleman, the person responsible for the DataDesk statistical package and ActivStats, a statistical teaching package, wrote in praise of an attitude of well-informed skepticism: 'This misses the most important part of the process -- an abiding skepticism. You must know your science before you can be intelligently skeptical about it, but just because you know what is common wisdom doesn't mean you should believe it. Indeed, if science is to progress, you must maintain a willingness to disbelieve. You don't do research by replicating previous results but by doubting them.'

Dennis Roberts, responding to this, said: 'a good replication study does not have to be done BECAUSE one doubts them but rather, to bolster the case that the research findings made ...'

I think that he and Paul really just differ in emphasis, with Dennis arguing that 'replication is very valuable ... we don't do enough of it ... ' while Paul cautions against literal-minded repetition. I think everyone would agree that the scientific idea of replication is doing something more intelligent than just looking for what the other guys already saw.

Paul makes the point, too, that it is hard to sit down and work carefully through a set of data without coming up with at least one pattern that needs further investigation. You may start by replicating a study, but this is almost guaranteed to act as a springboard to innovative questions of your own.

Getting a research idea by reading papers.

You can simply bury yourself in the library with a whole year's worth of your favourite journal and, starting from the most recent issue, use a series of filters to identify studies that you would be interested in and capable of extending. Even when I'm not in need of a research project, I often graze my way through a small stack of journals, picking up an interesting methodological approach here, or a useful measurement technique there. Many of my more prolific colleagues do this a lot. One, in particular, seems able to rummage out half a dozen relevant journal articles from her shelves on any topic in about five minutes.

If I am looking for a potential project, I look at each article in turn and ask:

  • Does the title sound interesting? If so, look at the abstract. There is little point in taking on a research project that doesn't sound interesting. It will be a lot of work, so it will have to hold your attention.
  • Can I extend this study to a relevant patient group? Think about the sorts of patients you have access to--would it be interesting to try this on them?
  • Can I extend the methods by assessing something not assessed in the original paper, or increasing quality of assessment? Remember that you are one of your own best research resources. Can you refine the study design by substituting a personal interview or a medical examination for a self-administered questionnaire?
  • Read the end of the discussion to see if the authors identify areas for future research. Usually the researchers will make a big play to inflate the importance of their area of interest, and do a paragraph of 'we-simply-must-know' stuff. They are usually either working on this themselves or they have applied for a grant already, so if you decide to follow up their suggestion, move fast.
  • If you find that an article catches your attention as a possible project, do a literature search to get a feel for the area (and look carefully through the references in the original article, of course; aside from identifying useful articles to read, you will get an idea of the sorts of journals that might take your paper, and you can target it accordingly).
  • Contact the authors. Authors love to hear from people who think that what they are researching is interesting. They often provide advice, access to unpublished manuscripts, and free consultation time, simply because that's the way research works. Be prepared to do the same in your turn! The easiest way to contact the authors is to get an email address. Search the website of their institution if the article doesn't have an email address, or try using an email search engine on the internet to find them. People reply faster to email, and you can bombard them with small queries which would be irritating in letters. And, of course, keep them in touch with your work. They may be a great help when it comes to analysing and writing up.
  • Finally, you should also think about plundering the abstracts from conferences. There may well be something there, and the work will be newer than published articles. If you are actually at a conference, seek out the people who are doing research that would interest you. Over coffee you can usually get ideas for several projects from them. One favourite ploy I use is to ask them how they would do their research if they could start all over again, knowing what they know now.

Getting your own ideas

This is an even harder subject to write about than extending and developing the ideas of others. (Did I say plagiarising? -- Never!). The secret seems to be keeping your eyes and ears open all the time. The observation doesn't have to be complicated. On the contrary, spotting an obvious question in an everyday event often has greater potential.

Jack Schnell of Department of Economics at the University of Alabama in Huntsville remembers simple advice he got as a student: 'look out of the window', meaning 'pay attention to what is happening out there in the world, look for issues that are ripe for investigation'. And since that time I have tried to do just that. For me, this has been more intellectually sustaining than, say, combing through some literature in the hopes of seeing a useful extension.

A simple observation can spark off a whole train of ideas. Roland Andersson, of the Department of Surgery in Joenkoeping, Sweden, said: "For me it started like this: I observed that we had had 12 patients with appendicitis during one week. The following weeks we had only one or two. I wondered: 'Had we had an epidemic of appendicitis?'. I happened to know about Knox space-time analysis and I started off from there and finally have written a thesis about 'Appendicitis - epidemiology and diagnosis'. Lots of new questions arise and I am now involved in a (as it seems) never ending project about aspects of appendicitis." (And please, don't worry if you have no idea what Knox space-time analysis is; the important point is that Roland brought together a specialised theoretical framework which he already knew and a common everyday observation. In other words, he applied the theory he knew to the world outside the window.)

But what frame of mind, what view of the world do you need in order to have productive research ideas? A lot of discussion focussed on this question. At one extreme was Robert Hamer, who very much doubted whether you could teach anyone how to look at the world in a questioning manner. I don't think that this is true, though. We are brought up in a way that does not encourage us to question the explanations we are given for things. Don't forget that all children are hungry to find things out, to know why things are so. This voice of hunger for knowledge and delight in figuring things out is much smaller and more timid by the time we have grown up, but with patience it can be called back. It takes time to rid ourselves of this learned uncuriosity.

The trick is doing what children do: asking lots of questions and teasing out the logical consequences of the answers. Paul Velleman again: "Dennis is right that the problem is nudging the mind. We need to start that process in childhood. We must cultivate in our children and our students a broad-based skepticism coupled with a sense that there *is* order in the universe."

These are the sorts of questions that scientists and other children ask.

  • How does this work?
  • What is the proposed biological explanation for the thing I am looking at?
  • How do we know?
  • Is there another possible explanation?

One must maintain an active and abiding skepticism about the explanations and models that have been proposed in science. Skepticism, which Paul Velleman identifies as a key attitude, doesn't involve simple disbelief, but rather being able to entertain a number of different explanations at once.

This struck a chord with Robert Knodt: "After being involved with masters and doctoral students for over thirty years and looking back for an answer to the original post, I find that the statement above applied to over 90% of those I helped... The first person I worked with was bothered by a statement in a 10th grade Biology book which said that trees were pruned in the fall in order to make them fill out areas and become more symmetrical. This still bothered him eight years later. He finally did his work on 'wound' hormones in plants."

Says who? Many pieces of medical knowledge are folkloric, and the evidence is slender. In particular

  • Clinical signs and diagnostic tests--are they really reliable and valid? The apex beat, for instance, is taught as a religion in some cardiology departments. Research into the repeatability of this clinical sign shows that it is hard for two observers to locate it in the same patient, much less in the same place.
  • Factors are supposed to be associated with prognosis or diagnosis: where's the evidence? Women with small feet, for instance, are supposed to be at higher risk of caesarian section because of the baby's head being too big for the pelvis. But this assertion rests on a couple of small and doubtful studies. Better research (already done--sorry!) has shown that this is just a medical myth.
  • Rituals of patient management (surgical masks, changing wound dressings) may have no evidence to support their usefulness.

I don't believe that! Always trust your disbelief. Often a trip to the library will put your mind at rest, but think about

  • Alternative explanations for things you have been told
  • Your feeling that what you see in your own practice is not what you have been told you will see

Why are we doing this? At every point in clinical practice there are decision forks. Some may be invisible (we always do X when Y happens) but these are the most interesting! For example

  • What is the most informative sequence of diagnostic tests?
  • What are the treatment options at a given moment?
  • How do we decide between them?

Why are they both right? Some disagreements in the literature are because no-one has yet spotted the reason why two different sets of investigators should have observed data that were seemingly contradictory.

Can we learn from the abnormal? We learn once from describing the normal--normal course of disease, normal range of variation etc. We learn a second time by examining cases that do not fit the general picture. Rare, pathological conditions can give us an insight into how more subtle, commonplace processes work.

  • Congenital homocysteinemia and CHD led to the investigation of homocysteine in CHD in 'normal' people
  • Research into doctor-patient interactions in hypochondriasis gave leads for investigating health anxiety as a dynamic in these interactions in normal patients.

Final thoughts

I don't know where ideas come from, but I do know that you get more ideas if you try to remember everything that happens that doesn't have a good explanation. I carry a little black notebook which can simply be used to note phone numbers and things I have to buy next time I go shopping, but it also means that I have a way of writing down an idea the instant I spot something interesting.

The last thing I want to say is based on my experiences teaching music to adolescents, as much as teaching research methods to medical students. The biggest obstacle you encounter is a feeling that you can't do this; that you aren't the sort of person who can sing, or make interesting observations or pose original questions. Just remember: this is what you did as a child, before you were taught any different. So you already know how to do this; just think of yourself as a little rusty.

The copyright for this page belongs to Ronan Conroy. This page was formatted by Steve Simon and was last modified on 2008-04-28. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page.


Developing a research hypothesis (August 18, 1999)

Dear Professor Mean, I want to do some research, but the hospital won't approve anything until I have a protocol with a research hypothesis. I'm not sure why a research hypothesis is important. Can you help? -- Little Linda

Dear Little,

Think of it as job security for your local statistician.

Short answer

A research hypothesis provides clarity. A problem has to be stated clearly before it can be solved. The research hypothesis will also provide direction for writing the rest of your protocol.

There are several steps that you should follow:

  1. Identify the four components that most research hypotheses have.
  2. Select between a one sided and a two sided hypothesis.
  3. Use your hypothesis to guide the writing of your research protocol.

Stating a hypothesis

Ideally, your research hypothesis should be specified prior to the collection of any data. An exception would be an exploratory study. For example, if you are investigating the cause of poor morale among health care providers, you may not have enough information to specify anything more specific than a whole range of factors that might influence morale.

In general, a hypothesis will have four major components. Not every hypothesis can be fit into this framework, of course, but knowledge of these four components might help you if you have an incompletely formed hypothesis.

The first component is the subject group. In other words, who are you interested in studying? Subjects could be patients, their parents, or the health care providers.

The second component is the treatment or exposure. In other words, what is being done to part or all of your subject group. A treatment implies an action on your part, such as providing information or applying a new therapy. An exposure, on the other hand, implies some action that you do not control, such as lead poisoning or premature birth.

The third component is the outcome measure. In other words, how or in what manner is the treatment or exposure going to be assessed. It is very important that the outcome measure be defined precisely and unambiguously. For example, if your outcome is breast feeding rates, you should use standard definitions of breast feeding, such as those provided by the World Health Organization.

The fourth component is the control group. In other words, who are you comparing to. It is important for the control group to be as similar as possible to those who receive a treatment or exposure.

As mentioned earlier, not every research hypothesis will have all four components. For example, a cross-over design involves applying both a new treatment and a standard treatment using the same patients. For this study, the hypothesis would not involve a separate control group. Correlational studies look at relationships within a single group, such as a study of the factors that cause medication errors. This type of study would not have a treatment/exposure. The structure mentioned here, however, is still useful for developing most research hypotheses.

One sided versus two sided hypotheses

During the planning of your research, you need to specify whether you plan to use a one sided or two sided hypothesis. A two sided hypothesis states that there is a difference between the treatment/exposure group and the control group, but does not specify in advance what direction you think this difference will be. A one sided hypothesis states a specific direction (e.g., increase).

If you expect that a change in either direction is possible and that changes in either direction are interesting, then you should use a two sided hypothesis.

If changes in one direction are uninteresting and unpublishable, then use a one sided hypothesis. Also if a change in the unexpected direction is equivalent in practice to no change, then use a one sided hypothesis.

The best example of this is when you are comparing a new therapy to an existing therapy and the new therapy is much more expensive. Your only concern is to show that the new therapy is better. If it turns out that the new therapy is equal to or worse than the standard therapy, you will not adopt it.
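To see what the choice means in practice, here is a minimal sketch in Python that turns a single test statistic into both kinds of p-value. It assumes a test statistic that follows a standard normal distribution, and the function name `p_values` and the value 1.8 are made up for illustration.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one_sided, two_sided) p-values for a z statistic.

    The one sided value tests for an increase only (treatment > control).
    Assumes the statistic is standard normal when there is no difference.
    """
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z)
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))
    return one_sided, two_sided

# Hypothetical test statistic, chosen only to illustrate the contrast.
one_sided, two_sided = p_values(1.8)
print(f"one sided p = {one_sided:.3f}, two sided p = {two_sided:.3f}")
```

With this made-up statistic, the one sided p-value falls below 0.05 and the two sided p-value does not. That is exactly why the choice has to be made during planning and not after you have peeked at the data.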

Some important issues involving the control group

With a treatment, where you intervene, it is often possible to select those patients who receive the treatment through the use of randomization. Randomization ensures comparability, because the random selection ensures that, on average, subjects who receive the treatment will be comparable to subjects who do not receive the treatment.
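Here is a minimal sketch of simple randomization in Python, in case it helps to see the idea concretely. The function name `randomize`, the subject numbers, and the seed are all invented for illustration. A real trial would normally use a schedule prepared ahead of time, so treat this as a sketch and not as a recipe.

```python
import random

def randomize(subject_ids, seed=None):
    """Shuffle the subjects and split them into two groups.

    The first half of the shuffled list goes to treatment and the
    rest to control, so the groups differ in size by at most one.
    """
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }

# Twenty hypothetical subjects, numbered 1 through 20.
groups = randomize(range(1, 21), seed=2002)
print(groups)
```

Because chance alone decides who lands in each group, neither you nor the patient can steer the sicker (or healthier) patients toward one arm.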

When you have an exposure instead, it is often difficult to ensure that the subjects without the exposure are comparable to the exposed subjects. Sometimes matching will help, but you should only use matching for very important prognostic variables. For example, birth weight plays a major role in infant mortality, so it is often helpful to match your exposure group to your control group on the basis of birth weight. Matching, however, will often present difficult logistics, especially when the pool of control subjects is not much larger than the pool of exposed subjects.

What are your next steps?

Other important issues to be considered in your protocol are

  1. determination of the sample size,
  2. identification of potential confounding variables, and
  3. what efforts at blinding will be used, if any.

Once you have a well defined research hypothesis, though, these details will fall into place. Hah, hah, did I really say that? The rest of the protocol is still pretty darn hard, but it would have been impossible if you didn't have that research hypothesis.

To determine an appropriate sample size, you need a research hypothesis, an estimate of the standard deviation of your outcome measure, and an assessment of how much change is considered clinically relevant. Hey, you're already a third of the way there! Finding a standard deviation requires either reviewing previous research on that outcome measure or running a pilot study. The clinically relevant difference is a judgement that is made solely on medical knowledge. Your statistician cannot tell you what a clinically relevant difference would be.

Confounding variables are those variables which are related to your outcome measure and which may differ between your treatment/exposure group and your control group. Assessment of potential confounding variables is especially important when you cannot randomize.

Blinding means hiding information about the treatment/exposure from the patients, their parents, and any health care professional who interacts with the patients and their parents. Blinding is useful when it can be done, but blinding is not always possible. For example, in a comparison of rectal administration of a drug to oral administration, the patient usually figures out quickly which group they are in. But even when the patients themselves know which group they are assigned to, you can sometimes still use blinding for laboratory personnel and for interviewers.

Summary

Little Linda needs to include a research hypothesis in her grant proposal, but doesn't know what it should say. Professor Mean explains that you should develop a hypothesis to give your research clarity. There are four components in most research hypotheses:

  1. a subject group,
  2. a treatment or exposure,
  3. an outcome measure, and
  4. a control or comparison group.

Other important issues to keep in mind while developing a research hypothesis:

  1. Use a one sided hypothesis when changes in the opposite direction are uninteresting.
  2. Randomization helps ensure that you have a comparable control group.
  3. Use the research hypothesis to guide the determination of sample size, the identification of confounding variables, and the efforts to blind information.

Annotated Bibliography

http://www.shef.ac.uk/~scharr/reswce/question.htm

This site provides information about evidence-based medicine, but much of the material is still relevant to developing research protocols. The four components to a research hypothesis come from this site.

Massey, V.H. (1995) Nursing Research, Second Edition, Springhouse PA: Springhouse Corporation

This book provides a "how to" framework for conducting research in an easy to skim outline format. The book includes topics on ethics, literature review, sampling techniques, data analysis, and presentation of research results. The sections that deal with planning are the best parts of this book.

Lang, T.A. and Secic, M. (1997) How to Report Statistics in Medicine. Annotated Guidelines for Authors, Editors, and Reviewers, Philadelphia, PA: American College of Physicians.

It seems ironic to recommend a book on writing the final results, but it helps to start out with your goal in mind. If you think about the information that belongs in your research paper, then you will have a good idea of what you need to specify during the planning stages of your research. This book also uses an easy to skim outline format, but it has significant narrative text under each outline element.

This webpage was written by Steve Simon on 2008-xx-xx, edited by Steve Simon, and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Ask Professor Mean, Category: Grant writing


Getting IRB approval for your research (October 9, 2002)

Dear Professor Mean: I am submitting a proposal to our Institutional Review Board. Is there anything you can do to help me get IRB approval? --Terrified Terri

Dear Terrified:

Why not bring a freshly baked batch of chocolate chip cookies to the IRB meeting? I'd be glad to sample the batch first to make sure it tastes okay.

Disclaimer

In a perfect world, everyone would listen when Professor Mean talks and they would decide things exactly the way he would. Alas, it's not a perfect world. Our IRB here at Children's Mercy Hospital uses criteria that differ from the guidance I give below, and your IRB probably does also. I'm working with our IRB to better understand the criteria they use and when I get a better understanding, I'll update these web pages accordingly.

But don't try the PMSS defense: You should approve this protocol because Professor Mean Said So. Sadly, it does not work.

By the way, if you serve on an IRB, I'd love some feedback from you on how your IRB assesses scientific validity.

Short answer

The IRB does look at a variety of issues, but the one with particular relevance to statistics is whether the study has scientific validity. It is unethical to expose research subjects to any risks, discomforts, or inconveniences if the study has dubious validity. The Declaration of Helsinki states

Medical research involving human subjects should only be conducted if the importance of the objective outweighs the inherent risks and burdens to the subject. www.wma.net/e/policy/17-c_e.html 

Justification for scientific validity also appears in the Nuremberg Code.

The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature. ohsr.od.nih.gov/nuremberg.php3

Good statistical design can touch on several aspects of scientific validity:

  1. Is your sample chosen appropriately?
  2. Is your sample size large enough?
  3. Are you measuring things well?
  4. Do you have a good plan for analysis of the data?

Make sure that you provide enough documentation in your proposal to convince the IRB that the answer is YES! to all these questions.

Is your sample chosen appropriately?

Who you choose to participate in your research study will say a lot about how easily you can generalize your results to the real world. No sample is perfect, and even just the process of asking for informed consent can hurt generalizability.

If you randomly select subjects and/or randomly assign them to treatment and control, that's good. But more important is the pool of subjects that you are drawing your sample from. Ideally, your pool of subjects should include the full spectrum of the rainbow. In practice, logistical constraints make this ideal impossible.

Watch out when you select subjects only when your research coordinator is on the clock, or only from a tertiary care center. These are examples where you may not have much success in extrapolating your findings to a more general group of patients. You can't generalize to all fruit when your sample is restricted to apples.

Sometimes there are hidden restrictions on your sample. Some studies may implicitly exclude patients if they:

  • speak English poorly,
  • move around a lot, or
  • lack a primary care physician.

The logistics of your research and limitations on your time and trouble may also place restrictions on your sample by excluding patients who arrive on weekends and evenings.

Sometimes these restrictions are trivial and sometimes not. It's best to acknowledge these implicit restrictions and be honest about the extent to which they hurt your ability to generalize.

Also, you need to be very careful about selecting your control group. The control group needs to be identical to the treatment group, except for the therapy or exposure being studied. If the control group differs on other factors, especially factors that affect prognosis, then you have problems. You need to control for these other factors, through randomization, matching, or covariate adjustment.

Is your sample size large enough?

The size of your sample plays a vital role in scientific validity. You can't ignore this issue. Every single research study, no matter what the type, should have an explicit justification of the sample size. Virtually every research area has identified and documented problems with inappropriate sample sizes. Failure to consider sample size represents one of the biggest problems with research today.

With a small sample size, you may not have enough precision to make any useful statements about your research data. This is a waste of research dollars, but it is also unethical. An inadequate sample size needlessly puts subjects at risk without any benefit to society.

The opposite problem can also occur. Some research studies include too many research subjects, but this problem is rarer. Including too many research subjects is also a waste of research money and it is also unethical. You are exposing more patients to the risks, discomforts, and inconveniences of the research study than you need to make precise statements about your data.

The justification of your sample size could take the form of a power calculation, if you have a formal research hypothesis. If your study will produce some simple descriptive statistics, then you should show that the confidence limits about these statistics will be reasonably narrow. Even if your study has a non-quantitative objective, you should still justify your sample size, possibly using a non-quantitative criterion.
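For a descriptive study, one way to show that your confidence limits will be reasonably narrow is to compute their expected width before you collect any data. The sketch below is only an illustration: it assumes a simple normal approximation for a proportion, and the function name `proportion_ci_half_width` and the guessed proportion of 30% are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci_half_width(p, n, confidence=0.95):
    """Approximate half-width of a confidence interval for a proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which is
    a rough planning tool and works poorly for very small n or for
    proportions close to 0 or 1.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)

# Hypothetical planning guess: about 30% of subjects have the outcome.
for n in (25, 100, 400):
    half_width = proportion_ci_half_width(0.30, n)
    print(f"n = {n:3d}: 30% plus or minus {100 * half_width:.1f} percentage points")
```

Notice that you have to quadruple the sample size to cut the width of the interval in half. If the interval at your planned sample size is too wide to be useful, you know that before you start.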

There are many complex formulas for determining sample size; here is some general advice.

First you need to think about the size of the difference you are trying to detect and compare that to the standard deviation of your outcome measurement. If you are trying to detect differences that are small relative to your standard deviation, then you need a very large sample size. Detecting a difference that is about one fifth of a standard deviation, for example, might require a sample size in the hundreds.

If you are trying to detect a difference that is very large relative to your standard deviation, then you can get by with a smaller sample size. Detecting a difference that is about the same size as a standard deviation would only require a few dozen subjects.
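You can check the rough figures in the two paragraphs above with a short calculation. This sketch uses a common normal approximation for comparing two group means, and it assumes a two sided test at an alpha of 5% and a power of 80%. The function name `n_per_group` is made up, and exact formulas based on the t distribution will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(std_difference, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two means.

    std_difference is the difference you want to detect divided by
    the standard deviation of the outcome measure. The formula is
    n = 2 * ((z_alpha + z_beta) / std_difference) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two sided test
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / std_difference) ** 2)

for d in (0.2, 0.5, 1.0):
    print(f"difference of {d} SD: about {n_per_group(d)} subjects per group")
```

A difference of one fifth of a standard deviation needs several hundred subjects per group, while a difference of a full standard deviation needs fewer than twenty per group.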

Be careful! You might be tempted to say that you are only looking for differences that are large relative to the standard deviation, but you may end up painting yourself into a corner. If you suspect that your control group is a full standard deviation or more away from the treatment group, then this difference would be large enough to be plainly visible.

For example, Jacob Cohen points out that 13 year old girls and 18 year old girls differ in average height by about 0.8 standard deviations. He also mentions that Ph.D. holders and college freshmen differ in average IQ by about the same amount. Do you really believe that your study will show such a large difference?

Second, when you are counting discrete events like deaths, it is the number of these events, not the total number of subjects studied, that determines the precision of your results. When the events are very rare, this means that you have to sample a large number of patients in order to accumulate enough events.

As a very rough guide, you should strive for at least 25 to 50 events per group. If your event occurs only 1% of the time, that means that you might need as many as 5,000 patients per group. If an event occurs one fourth of the time, you might be able to get by with one or two hundred patients per group.

Event rate    Recommended sample size (patients per group)
25%           100 to 200
5%            500 to 1,000
1%            2,500 to 5,000
0.2%          12,500 to 25,000
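The arithmetic behind this table is simple enough to sketch in a few lines of Python, which lets you plug in your own event rate. The function name `patients_per_group` is made up for illustration, and the 25 to 50 events target is the same very rough guide given above.

```python
from math import ceil

def patients_per_group(event_rate, target_events):
    """Patients needed per group to expect a given number of events.

    The quotient is rounded to six decimals before taking the ceiling
    so that floating point noise does not add a spurious extra patient.
    """
    return ceil(round(target_events / event_rate, 6))

for rate in (0.25, 0.05, 0.01, 0.002):
    low = patients_per_group(rate, 25)
    high = patients_per_group(rate, 50)
    print(f"{rate:.1%}: {low:,} to {high:,} patients per group")
```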

Finally, if the sample size you need is unattainable--you don't have the budget, perhaps, or the study would take too long--then consider redesigning your experiment. Find a way to reduce the variability of your outcome measure. A cross-over design, for example, will usually have much less variability because each patient serves as his/her own control. Sometimes intermediate measurements (often called surrogate measurements) will improve your sensitivity enough so you can attain a reasonable amount of precision with a limited sample size.

Sometimes research will have a qualitative rather than quantitative goal. We might be interested, for example, in the issues that children with sickle cell disease face, or teenagers' reasons for starting to smoke cigarettes. For qualitative studies, there is no mathematical formula that you can apply to justify your sample size.

The sample size needs to be large enough to ensure a rich and complete set of responses. Look for a sample size large enough to ensure that both ends of the spectrum (and the middle) are represented. If the population you are studying is very homogeneous, then as few as a dozen patients may be enough. You may also wish to depart from random sampling and use a purposive sample instead. You can also justify a small sample size if you use purposive sampling. A purposive sample deliberately looks for patients with certain characteristics and can ensure that you have included all relevant viewpoints and perspectives in your study.

Another way to assess the sample size is by saturation. Saturation occurs when the same themes get repeated over and over and no new ideas are generated.

Are you measuring things well?

There are a lot of scientific issues that I can't answer here. Is arterial distensibility a good marker of heart disease? What is the best way to determine gestational age? Should you measure blood pressure in the left arm or the right arm?

I can, however, ask some questions that will help you determine whether your measures are clinically relevant.

Is your measure valid and reliable?

Every discipline has slightly different definitions and standards for validity and reliability. As a general rule, the issues of validity and reliability become most important when you are measuring something abstract, like stress, or something subjective, like quality of life.

The easiest way to ensure validity and reliability is to use measures that have already been established in the peer reviewed literature. You can also hedge your bets by including several measures of the same outcome.

If you have concerns about validity and reliability, you might reserve a fraction of your sample (from 5% to 20% is a good starting point) for more thorough analysis. These patients might receive additional tests to verify that your simple outcome measure actually works well. Or you might have these patients evaluated by two different people and measure the level of agreement.
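One common way to measure the level of agreement between two people is Cohen's kappa, which compares the agreement you observed to the agreement you would expect by chance alone. Kappa is my choice for this illustration and is not the only option. The function name `cohens_kappa` and the ten ratings below are invented.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters classifying the same patients.

    kappa = (observed - expected) / (1 - expected), where expected is
    the agreement predicted from each rater's category frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of ten patients by two evaluators.
rater_a = ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

In this made-up example the two evaluators agree on eight of the ten patients, but some of that agreement would happen by chance, so kappa comes out well below 0.8.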

Be cautious about sources of information that are known to be imperfect. For example, in a study of 295 deaths from child maltreatment, only half were identified as such on the death certificates. The gender of the child, whether the perpetrator was a parent, and whether the child died in a rural or urban county, had a differential impact on ascertainment.

Do you define all your terms objectively?

Research must be repeatable, so you need to use terms that are defined well enough so that another expert could reproduce your work and come up with roughly comparable findings.

You need to provide operational definitions for any events that are subject to differing interpretations. For example, the Scottish Intercollegiate Guidelines Network defines life threatening asthma as:

"Features of life threatening asthma include agitation, altered level of consciousness, fatigue, exhaustion, cyanosis, and bradycardia. Air entry is often greatly reduced, which may lead to a 'silent chest'. The peak flow, if recordable, is usually less than 33% of best or predicted."  www.sign.ac.uk/guidelines/fulltext/38/section2.html

Up to 1992, the National Center for Health Statistics defined current and former smokers by asking the following two questions:

"Have you ever smoked 100 cigarettes in your lifetime?"
"Do you smoke now?"
www.cdc.gov/nchs/datawh/nchsdefs/currentsmoker.htm

The Social Security Administration defines blindness as:

"when your vision cannot be corrected to better than 20/200 in your better eye, or if your visual field is 20 degrees or less, even with corrective lens. Many people who meet the legal definition of blindness still have some sight and may be able to read large print and get around without a cane or guide dog." www.rcep7.org/socialsecurity/faq/blind/default.html

Is your outcome important to your patients?

Patients are usually interested in one of three things: morbidity (will I develop diabetes?), mortality (will I die?), or quality of life (will I be able to lift and carry a bag of groceries?). Ideally, you should try to measure one or more of these things directly. If you can't measure them directly, then does your indirect measurement (sometimes called a surrogate measurement) have a strong link with morbidity, mortality, or quality of life?

Also, are you focusing on a short term outcome because of your convenience, when your patients are most interested in long term outcomes? It is easy to get someone to quit smoking for a week, but it is much harder to get them to persist through a full year.

Do you have a good plan for analysis of the data?

It is important to have a plan. If you don't tell the IRB what you expect to do with your data, they won't be able to decide if the goal of your research is worth the risks, discomforts, and inconveniences of the patients in the study.

This does not have to be very detailed. If all you want to do is a descriptive study where you estimate a few means and proportions, then that's all you need to say. A lot of very valuable research does nothing more than this. Here's an example:

In this research study, we will study children with severe hearing loss in order to estimate the proportion who lose a hearing aid, and the average expense associated with these losses.

It's a myth that all research requires a hypothesis specified prior to the collection of the data. Most (but not all) qualitative research lacks a formal hypothesis. A descriptive study like the one described above does not have a research hypothesis. Some other examples of research without a formal hypothesis include:

  • pilot testing of a questionnaire,
  • studies assessing validity or reliability, and
  • exploratory or hypothesis generating research.

You can sometimes artificially contrive a hypothesis in these situations, but it is usually better to explicitly state that you don't have a research hypothesis. Instead identify the alternative goal you are trying to achieve or the question you are trying to answer. For example:

There is no research hypothesis for this pilot study. Our goal instead is to identify ambiguous language, missing categories, and other problems with the patient satisfaction questionnaire.

If you are testing a hypothesis, you need to specify that hypothesis as well as how you will test that hypothesis. This may appear difficult to you, but if you don't muck this up too badly, the IRB will probably give you a pass. You need to show enough detail so you don't appear totally incompetent.

If your data analysis plan is bad, it can still be fixed after the data are collected. In contrast, if you have a lousy control group or your sample size is grossly inadequate, you need to do something before you start collecting data.

So don't worry about the details too much. If you specify a Mann-Whitney test and you really needed to use a Kruskal-Wallis test instead, the IRB will probably still approve your study contingent on fixing that detail. Still, there are some statistical details that you need to worry about.

  • If your data are paired or matched, you must use a statistical approach that acknowledges this.
  • If some of your outcome variables are categorical and some of them are continuous, you have to use a different statistical model for each of these data types.
  • If you plan to remove outliers or possibly stop your study early, you need to be explicit about the rules and conditions for these actions.
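To see why the first point in this list matters, here is a small sketch that analyzes the same numbers two ways. The blood pressure values are invented, and the function names `paired_t` and `unpaired_t` are made up for illustration. The code computes only the t statistics, not p-values.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic based on the within-patient differences."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def unpaired_t(x, y):
    """Equal-variance two-sample t statistic, which ignores the pairing."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var * (1 / nx + 1 / ny))

# Hypothetical systolic blood pressures for five patients, measured twice.
before = [120, 135, 150, 165, 180]
after = [116, 132, 145, 161, 177]

print(f"paired t   = {paired_t(before, after):.2f}")
print(f"unpaired t = {unpaired_t(before, after):.2f}")
```

Every patient dropped by a few points, so the paired statistic is very large in magnitude. The unpaired statistic is close to zero because the large differences between patients swamp the small, consistent change within each patient.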

Specify what your alpha level is (usually 5%) and whether your hypothesis is one-sided or two-sided. A one-sided test looks at changes in a single direction. Changes in the opposite direction are considered either impossible or irrelevant. One-sided tests are often used when changes in the opposite direction would have the same implications as a null finding. For example, we might find that a new drug is equivalent to a placebo, or that it performs worse than a placebo. We would refuse to adopt the drug in either situation. So comparisons to a placebo are usually one-sided.

Contrast this with comparing a standard drug to a new drug. If the new drug performs worse, we would never use it, but if it is equivalent, then we would use it part of the time based on other factors like cost, convenience, and patient preference. Comparisons of two active drugs are usually two-sided. This might change, however, if the side effect profile of one drug is so harsh that you would only prescribe it when it is superior.

Further reading

  1. Assert: A standard for the review and monitoring of randomized clinical trials. Howard Mann. (Accessed on October 14, 2002). http://www.assert-statement.org/ Excerpt: "The ASSERT statement is the articulation of A Standard for the Scientific and Ethical Review of Trials. It proposes a structured approach whereby research ethics committees review proposals for, and monitor the conduct of, randomized controlled clinical trials. In order to ensure the ethical conduct of research involving human subjects, the ASSERT checklist comprises items that need to be addressed by investigators applying for approval to conduct a clinical trial. These items are chosen to enable fulfillment of certain universally applicable requirements for the ethical conduct of research: social and scientific value; scientific validity; fair subject selection; favorable risk-benefit ratio; and respect for potential and enrolled subjects."
  2. Content and quality of 2000 controlled trials in schizophrenia over 50 years. Thornley B and Adams C. British Medical Journal 1998:317(7167);1181-1184.
  3. Underascertainment of Child Maltreatment Fatalities by Death Certificates, 1990-1998. Crume TL, DiGuiseppi C, Byers T, Sirotnak AP and Garrett CJ. Pediatrics 2002:110(2);e18.

Very bad joke: How many IRB members does it take to screw in a light bulb?

As documented in 45 CFR 46.107(a), this review board must consist of five (5) or more members, and at least one of these members must possess a background in Electrical Engineering. In addition, at least one of the members must come from a home without any electricity. Any member of the IRB who owns stock in an electrical utility or who regularly pays bills to an electrical utility should recuse themselves from participation in the review of this research.

If the bulb should burn too brightly, burn too dimly, or flicker, then an adverse event report should be sent to the IRB (21 CFR 312.32). If the light bulb is dropped, then a serious adverse event report should be sent to the FDA by telephone or by facsimile transmission no later than seven (7) calendar days after the sponsor's initial receipt of the information.

If this is a multi-center light bulb trial, then a data and safety monitoring board (DSMB) may be needed (NIH Policy for Data and Safety Monitoring, June 10, 1998, http://grants.nih.gov/grants/guide/notice-files/not98-084.html, accessed on October 9, 2002). The DSMB should review any adverse event reports and interim results. If the clinical equipoise of the light bulb is lost, then the DSMB should terminate the study and provide all previously recruited light bulbs with the best available light bulb socket.

In order to maintain scientific integrity, the use of a placebo socket may be necessary. The placebo socket should have the same taste, appearance, and smell of a regular socket and the fact that this socket has no electricity should be hidden from the light bulb and from the person screwing in the light bulb. According to the 2000 revision of the Declaration of Helsinki, paragraph 29, the use of placebo sockets is acceptable where no proven prophylactic, diagnostic, or therapeutic socket exists.

A systematic review of all previous research into light bulbs must be presented so that the IRB can determine, per 45 CFR 46.111(a)(2), that the risks to the light bulb are reasonable in relation to anticipated benefits. The IRB should also ensure that the selection of light bulbs is equitable (45 CFR 46.111(a)(3)). If the light bulb has less than 18 watts of power, then additional requirements (45 CFR 46.401 through 409) apply.

The IRB must ensure that an informed consent document be prepared in language that the light bulb understands (45 CFR 46.116). This document should explain the expected duration of the light bulb's participation in the research, any reasonably foreseeable risks, and the extent to which the confidentiality of the light bulb will be maintained. This document should also emphasize that participation is voluntary and the light bulb can withdraw itself from the socket at any time without any penalty or loss of benefits.

The clipart on this page was courtesy of the clipsahoy web site: http://www.clipsahoy.com/index2.html. The remainder of the material is licensed under a Creative Commons license. This page was last modified on 07/08/08. Send feedback to ssimon at cmh dot edu. Category: Ask Professor Mean


Three things you need for a power calculation (November 8, 2001) Category: Ask Professor Mean, Category: Sample size justification

Dear Professor Mean, I want to do research. Is forty subjects enough, or do I need more? Didn't I hear you mention something about three things you need for a power calculation? -- Eager Edward

Dear Eager,

That reminds me of a cute joke. How many research subjects does it take to screw in a light bulb? At least 300 if you want the bulb to have adequate power.

Sorry, I was digressing. Is forty subjects an adequate sample size? That depends on a lot of factors. The basic idea, though, is to select a sample size which ensures that your study has adequate power. Power is the probability that your research study will successfully detect a difference, assuming that the treatment or exposure you are examining actually can cause an important difference. If you don't care whether your experiment is successful or not, then you can use just about any sample size.

Short answer

Power is to a research design like sensitivity is to a diagnostic test. A diagnostic test with good sensitivity is normally able to detect a disease when the disease is present. A research study with good power is normally able to detect a change when your treatment is indeed effective.

The actual calculation of power requires three pieces of information:

  1. your research hypothesis,
  2. the variability of your outcome measure, and
  3. your estimate of the clinically relevant difference.

Calculating power is sometimes difficult and it may require you to go to the time and expense of running a pilot study. But you should NEVER start a research project without knowing what your power is. That would be like using a diagnostic test with unknown sensitivity.

Research hypothesis

A research hypothesis will provide specific information that will determine what type of analysis is needed. A common structure for a research hypothesis is specification of the subject group you are testing, the treatment or exposure that this group will receive, the outcome measure, and the comparison or control group.

Some exploratory studies may not have a research hypothesis, of course, and for those studies you determine an appropriate sample size in a different way (for example, by ensuring that the estimates from this exploratory study have adequate precision).

Variability of your outcome measure

You also need to have an estimate of the variability of your outcome measure. I'm assuming here that your outcome measure is a continuous variable like birth weight or cholesterol level. If you are using a categorical outcome measure like mortality or cancer remission, then you need some estimate of the rate of mortality or remission in your control group.

Your literature review (you did do a literature review before you started this research, I hope), will usually provide you with an estimate of variability. Select a study that is reasonably similar to what you plan to do, and find out what that study reported for the standard deviation for your outcome measure.

Although I prefer a standard deviation, other estimates of variability are also acceptable. If the paper reports a variance, a standard error, a confidence interval, or a coefficient of variation, then there are simple formulas for converting these into standard deviations. If the study provides a range, then you can divide the range by four to get a good approximation for the standard deviation.
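The conversions mentioned above can be sketched in a few lines of Python. This is only a rough illustration: the function names are my own, the confidence interval version assumes a 95% interval for a single mean based on the normal distribution, and the coefficient of variation is assumed to be expressed as a fraction rather than a percentage.

```python
import math

def sd_from_variance(variance):
    # The standard deviation is the square root of the variance.
    return math.sqrt(variance)

def sd_from_standard_error(se, n):
    # Standard error of a mean = SD / sqrt(n), so SD = SE * sqrt(n).
    return se * math.sqrt(n)

def sd_from_confidence_interval(lower, upper, n, z=1.96):
    # Assumes a 95% interval of the form mean +/- z * SD / sqrt(n).
    return (upper - lower) / (2 * z) * math.sqrt(n)

def sd_from_cv(cv, mean):
    # Coefficient of variation (as a fraction) = SD / mean.
    return cv * mean

def sd_from_range(minimum, maximum, divisor=4):
    # Rule of thumb: the range is roughly four standard deviations wide.
    return (maximum - minimum) / divisor

# Hypothetical numbers, chosen only to show the calls.
print(sd_from_standard_error(0.3, 25))
print(sd_from_range(10, 14))
```

Check which quantity a paper actually reports before converting. A standard error for a difference between two groups, for example, needs a different formula than the one sketched here.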

Many of the people I see have a difficult time providing any estimate of variability. This area hasn't been studied before, so no one knows what the variability will be. But don't give up too easily.

First, keep in mind that you only need a crude estimate of variability. Power calculations are capable of determining if you are "in the right ball park." They are good at specifying your sample size down to an order of magnitude perhaps but not much more than that. In other words, they might tell you whether you need dozens of subjects instead of hundreds of subjects, or possibly whether you need thousands of subjects.

Second, although most research is innovative and therefore unique, this innovation is often in the treatment and not in the outcome measure. So look for studies that used the same outcome measure, even if the treatment is quite different than yours.

Third, try to characterize variability in your control group and we can try to extrapolate what the variability will be in the treatment group. A retrospective chart review, for example, will provide a rough estimate of variability of your outcome measure under the current standard of care.

Fourth, you may have to use a clearly flawed estimate, but a flawed estimate of variability may still be better than no estimate at all. An estimate of variability in adults, for example, may not be an ideal estimate for a pediatric study, but at least it tells you if your study will have adequate power assuming that the variation in a pediatric population is comparable to variation in an adult population. That's still better than having no idea whether your study has adequate power.

If you've tried and you still can't come up with an estimate of variability, then don't despair. A pilot study can provide you with an estimate of variability when all else fails. Usually 20 to 30 subjects produce a reasonably stable estimate of variability. A pilot study is also helpful for finding out how quickly you can recruit subjects. Furthermore, a pilot study will also identify any weaknesses in the logistics of your research. Finally, if the protocol remains substantially unchanged after the pilot study, you can usually include those pilot subjects in the final analysis.

Clinically relevant difference

Wow, that was exhausting! You're not done, though, until you can tell me what a clinically relevant difference would be for your outcome measure. This is a difference that is large enough to be considered important by a practicing clinician.

For just about every type of study, some differences are so small as to be clinically meaningless. From a theoretical viewpoint, perhaps, changes of any size might be interesting. But theory and practice are very different. If a six month diet program produces an average weight loss of three pounds, a fever medicine reduces average temperature by half a degree Fahrenheit, or a smoking cessation program helps an additional two percent to quit, who cares what the theoretical implications might be?

It's not easy but this is something that you have to do for yourself. The clinically relevant difference is determined by medical experts and not by statisticians. Hey, I'm still trying to understand the difference between good and bad cholesterol; I wouldn't even be able to start thinking about how much of a change in cholesterol is considered clinically relevant. You might start by asking yourself "How much of an improvement would I have to see before I would adopt a new treatment?" Also, try talking with some of your colleagues. And look at the size of improvements for other successful treatments.

Still, there are some general guidelines that might help. Try looking at the resolution of your measuring device, thinking in terms of relative changes, or specifying changes with respect to your standard deviation.

Average changes that are smaller than the resolution of your measuring instrument are probably not clinically relevant. For example, Apgar scores can take on any whole number between 0 and 10. Gestational age can only be measured accurately to within a week. In these contexts, it is clear that average changes should probably be greater than one unit in order to achieve relevance.

Still this is not a perfect rule. We can measure weights to within a gram, but changes in birth weight would have to be in the hundreds of grams or more to be meaningful. And while no family can have a fractional number of children, decreasing the average family size by 0.2 children can have a profound effect on society.

It also may help to think in terms of relative changes. If you can change something by 25 percent or 50 percent, that is considered relevant in most contexts. It becomes harder to argue clinical relevance for changes of less than 10 percent. Again, this is not a perfect rule.

Finally, you might find it easier to specify changes with respect to your standard deviation. This type of change is called an effect size. A common classification is that 0.2 standard deviations is considered a small effect size, 0.5 standard deviations is considered a medium effect size, and 0.8 standard deviations is considered a large effect size.

An effect size of 0.2 is small enough that there is no obvious visible separation between the two groups. The difference in average heights between 15 and 16 year old girls is 0.2 standard deviations. An effect size of 0.8 is clearly visible. The difference in average heights between 14 and 18 year old girls is 0.8 standard deviations.

It may be unrealistic to look for changes much smaller than 0.2 standard deviations because the sample sizes become prohibitively large. It may also be unrealistic to expect to see changes much larger than 0.8 standard deviations since this size change does not seem to occur too often in the published literature.
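To see how quickly the sample size grows as the effect size shrinks, here is a minimal Python sketch. It assumes a comparison of two independent groups with equal standard deviations, a two-sided alpha of .05, a power of .80, and the usual normal approximation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_for_effect_size(d, alpha=0.05, power=0.80):
    # d is the difference expressed in standard deviation units.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided Type I error
    z_beta = z.inv_cdf(power)            # power = 1 - Type II error
    # Normal approximation for two independent groups with equal SDs.
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.1, 0.2, 0.5, 0.8):
    print(d, n_per_group_for_effect_size(d))
```

Under these assumptions, an effect size of 0.2 needs 393 subjects per group, 0.5 needs 63, and 0.8 needs 25. Dropping to 0.1 pushes the requirement to 1,570 per group.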

Like the other two rules, this rule is also not perfect. In some animal experiments, for example, the similarity in the gene pool can often reduce variation to such an extent that changes of more than a full standard deviation are quite realistic. If you are trying to specify a clinically relevant difference, there is no substitute for a good understanding of the context of your research.

But I can't do it.

A lot of people tell me that they can't do this. They can't provide an estimate of variability or they can't determine what a clinically relevant difference is, even after I explain all of the above suggestions.

But you have to do it.

The CONSORT Guidelines require you to have an a priori justification of sample size for publication. If you don't do this now, you won't be able to publish the data in any journal that uses these guidelines. What's the point of doing the research if you can't publish it?

If your research requires an ethical review (e.g., through an IRB), they will require the same a priori justification. If the research involves animals, the appropriate animal care and use committee will require this justification.

The bottom line is that if you know so little about this avenue of research that you can't even come up with a preliminary estimate of the variability of your outcome variable, then you shouldn't be doing the research. You need instead to:

  • do a more thorough literature review,
  • collect some pilot data, or
  • switch to an outcome measure whose variability is known to some extent.

But do something, because your ability to perform the research and to publish your research depends on your justification of the sample size.

Example

In a study of two different skin barriers for burn patients, we are interested in three outcome measures: pain, healing time, and cost. We will randomly assign half of the patients to one skin barrier and half to the other.

For pediatric patients we usually measure pain with the Oucher, a five point scale that has been validated for children. A review of previous studies using the Oucher has shown that it has a standard deviation of about 1.5 units. We would be interested in seeing how large a sample size is needed to show a change of 1 unit, the smallest individual change attainable on the Oucher. We want to have a power of .80, or equivalently, the probability of a Type II error of .20.

The formulas for sample size vary from problem to problem. The sample size needed for a comparison of two independent groups is

$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left(\sigma_1^2 + \sigma_2^2\right)}{D^2}$$

We use the letter "z" to represent a standard normal distribution. Alpha represents the probability of a Type I error (usually .05). Beta represents the probability of a Type II error (we usually want this to somewhere between .05 and .20). Sigma represents the standard deviation, and this formula allows for the possibility of different standard deviations in group 1 and group 2. Don't forget that the formula requires you to square these standard deviations. Finally, D is the clinically relevant difference. In our example,

$$n = \frac{(1.96 + 0.84)^2 \left(1.5^2 + 1.5^2\right)}{1^2} = 35.28$$

We round up. So in order to achieve 80% power for detecting a one unit difference in the Oucher score, which has a reported standard deviation of 1.5, we would need to sample 36 patients in each group.
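As a rough check on the arithmetic, here is a minimal Python sketch of the formula as described above. The function name and argument names are my own, and I am assuming a two-sided alpha, which is the choice that reproduces the 36 per group quoted in the text.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sd1, sd2, difference, alpha=0.05, beta=0.20):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided Type I error (assumed)
    z_beta = z.inv_cdf(1 - beta)        # Type II error
    # Square the standard deviations, as the formula requires.
    n = (z_alpha + z_beta) ** 2 * (sd1 ** 2 + sd2 ** 2) / difference ** 2
    return math.ceil(n)                 # always round up

# Oucher example: standard deviation of 1.5 in both groups, difference of 1 unit.
print(sample_size_per_group(1.5, 1.5, 1.0))  # 36
```

This is a normal approximation, so treat the answer as a ball park figure, especially when the resulting sample size is small.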

Healing time is a more difficult endpoint to assess. Medical textbooks cite that the healing time for second degree burns has a range of 4 days (minimum 10, maximum 14). A study of healing times for a glove made from one of the skin barriers showed a healing time range of 6 (minimum 2 and maximum 8 days).

A rule of thumb is that the standard deviation is about one fourth to one sixth the size of the range. So we could have a standard deviation as small as 0.67 or as large as 1.5. An average change of one day in healing time would be considered clinically relevant.

If we use the largest possible estimate of standard deviation, we would get (coincidentally) the exact same sample size of 36 per group. If we used the smallest estimate of the standard deviation, we would need only 7 subjects per group.

For one type of skin barrier, a study of costs showed a range of $4.00 ($5.50 to $9.50). We would like to be able to detect a difference as small as $0.50 in costs.

Using the same rule of thumb, we get an estimate of the standard deviation of either 0.67 or 1.0. We would need 29 subjects per group using the smaller estimate of standard deviation, and 63 subjects per group using the larger estimate.

A sample size of 63 is untenable, so we decide that we can live with a study that could only detect a $1.00 change in costs. For this size difference, we would need 16 subjects per group using the larger standard deviation.

In summary, to achieve adequate power for all three endpoints, we would need 36 patients per group. This is larger than we need for the healing time endpoint. It is also larger than what we need for the cost endpoint, unless we wanted to detect a $0.50 change in costs. To detect such a small difference, we need a sample size of 63 subjects per group.
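You can also turn the question around and ask how much power a given sample size provides. The sketch below is only an approximation: it assumes equal standard deviations in the two groups, a two-sided alpha of .05, and a normal distribution, and it ignores the negligible chance of a significant result in the wrong direction. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def approximate_power(n_per_group, sd, difference, alpha=0.05):
    z = NormalDist()
    # Standard error of the difference between two group means.
    standard_error = sd * math.sqrt(2 / n_per_group)
    # Probability that the test statistic clears the critical value.
    return z.cdf(difference / standard_error - z.inv_cdf(1 - alpha / 2))

# Pain endpoint: 36 per group, standard deviation 1.5, difference of 1 unit.
print(round(approximate_power(36, 1.5, 1.0), 2))
# Cost endpoint: 36 per group, standard deviation 1.0, difference of $0.50.
print(round(approximate_power(36, 1.0, 0.5), 2))
```

With 36 patients per group, the power for the pain endpoint is just above .80, but the power to detect a $0.50 difference in costs (using the larger standard deviation) is only a little better than a coin flip.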

Summary

Eager Edward wants to know if forty subjects is enough to conduct a research study. Professor Mean explains that it is impossible to determine whether forty is an appropriate sample size without having these three things:

  1. a research hypothesis,
  2. a standard deviation for your outcome measure, and
  3. an estimate of the clinically relevant difference for this outcome measure.

Further reading

Jacob Cohen has an excellent discussion of effect sizes in Chapter 2 of his book, and the example of girls' heights comes directly from this book. Bernard Rosner incorporates a discussion of power and sample size issues into every section on statistical testing. Russ Lenth's PiFace software will provide more accurate power calculations than those presented here (or in Rosner's book), which is especially important when you are estimating power for small sample sizes. The range method for estimating standard deviations gives a more precise rule for converting a range into a standard deviation.

  1. Power and sample size page.
    Russell V. Lenth (Accessed on January 1, 2002).
    http://www.stat.uiowa.edu/~rlenth/Power/
  2. Range method for estimating standard deviation.
    (Accessed on October 2, 2000)
    http://www.uop.edu/cop/psychology/Statistics/range_method.html
  3. Statistical Power Analysis for the Behavioral Sciences, Revised Edition.
    Cohen J.
    New York NY: Academic Press (1977).
    ISBN: 0-12-179060-6.
  4. Fundamentals of Biostatistics, Third Edition.
    Rosner B.
    Belmont CA: Duxbury Press (1990).
    ISBN: 0-534-91973-1.

This page was written by Steve Simon and was last modified on 07/14/2008.


Statistical Evidence: Overview

This is a first draft of the overview for "Statistical Evidence."

"Still, it is an error to argue in front of your data. You find yourself insensibly twisting them around to fit your theories." Sherlock Holmes in The Adventure of Wisteria Lodge.

Reading medical research is hard work. I'm not talking about the medical terminology, though that is often quite bad (if I hear the word "emesis" one more time, I'm going to throw up!). The hard part is assessing the strength of the evidence. When you read a journal article, you have to decide if the authors present a case that is persuasive enough to get you to change your practice. This means assessing the strength of the evidence.

Some evidence is so strong that it stands on its own. Other evidence is weaker and requires support from other studies, from mechanistic arguments, and so forth. Still other evidence is so weak, that you should not consider any changes in your practice until the study is replicated using a more rigorous approach. I hope to elaborate on the criteria that you should use when assessing the strength of the evidence.

0.1. What should you look for?

When you are assessing the quality of the evidence, it's not how the data are analyzed that's important. Far more important is HOW THE DATA ARE COLLECTED. Don't agonize over technical details about the statistical analysis. After all, if you collect the wrong data, it doesn't matter how fancy the analysis is.

This is good news, because you don't need a lot of statistical training or a lot of mathematical sophistication to assess how the data are collected.

In this book, I want to show you what to look for and why. I will also highlight real research articles and use them as examples. Although all of the examples represent good and valuable research, some of the examples represent a level of evidence that by itself is less persuasive. It is helpful to understand why these examples are less persuasive.

0.2. Schizophrenic Research

Unfortunately, there is a lot of less than persuasive research out there. You don't have to look very hard to find solid empirical evidence of this. One of my favorite examples is a study by Ben Thornley and Clive Adams that appeared in the British Medical Journal in 1998. You can find the full text of this article on the web at bmj.com/cgi/content/full/317/7167/1181 and it is well worth reading. Thornley and Adams looked at the quality of clinical trials for treating schizophrenia. Since they work for the Cochrane Collaboration Group, a group that provides systematic reviews of the results of medical trials, they are in a good position to write such an article.

Thornley and Adams actually identified over 2500 studies of schizophrenia, but decided to summarize only the first 2000 that they uncovered. Perhaps they reached the point of sheer exhaustion. I am very impressed at the amount of work this must have taken.

The research covered fifty years, from 1948 through 1997. It also covered a variety of therapies: drug therapies, psychotherapy, policy or care packages, and physical interventions like electroconvulsive therapy.

What did Thornley and Adams find? It wasn't a pretty picture. First, researchers in schizophrenia studied the wrong patients. Most studies used institutionalized patients, who are easier to recruit and follow up with, but who do not provide a good representation of all patients with schizophrenia. Readers would probably be at least as interested in community based studies, but only 14% of the studies were community based.

Second, the researchers also did not study enough patients. Thornley and Adams estimated that a good study of schizophrenia should have at least 300 patients in each group. This would be based on rates of improvements that might be expected for an active drug compared to placebo effects. Even though the desired sample size was 300, it turns out that the average study had only 65. Only 3% of the studies had 300 or more patients.

Third, the researchers did not study the patients long enough. A good study of schizophrenia should last for six months or more; long term changes are more important than short term changes. Unfortunately, more than half of the studies lasted for six weeks or less.

Finally, the researchers did not measure these patients consistently. In the 2,000 studies, the researchers used 640 ways to measure the impact of the interventions. Granted, there are a lot of dimensions to schizophrenia and there were measures of symptoms, behavior, cognitive functioning, side effects, social functioning, and so forth. Still, there is no justification for using so many different measurements. Imagine how hard this makes it for anyone to summarize the results of this research. Failure to use and re-use a few standardized assessments has led to a very fragmentary (dare I say, schizophrenic) picture about schizophrenia treatments.

I don't wish to single out research in just this area. There are many reviews in other areas that also point out the flaws and shortcomings of research. Also keep in mind that research on schizophrenia is especially hard to do well. The take home message from Thornley and Adams is that just because the research is peer-reviewed does not mean that it is perfect. I hope to help you identify factors that limit the quality of peer-reviewed research.

0.3. Healthy Skepticism

Please don't panic. Research studies have many flaws but usually those flaws do not make the research wholly uninterpretable. These limitations should make you skeptical, perhaps, but not cynical.

The cynical attitude would be "you can prove anything with statistics" and leads to a nihilistic view that all research is garbage. The cynical attitude would lead you to nit pick a research paper, find a flaw here and a flaw there. Then use these flaws to disregard any research whose conclusions make you uncomfortable.

A skeptical attitude, on the other hand, would ask "how persuasive is this research" and would look at the strengths and the weaknesses of a research paper. It would place limits on how persuasive the research is. When the research was not sufficiently persuasive, a skeptical attitude would encourage you to think about what level of evidence would be enough to persuade you otherwise.

This webpage was written by Steve Simon on (unknown date), edited by Steve Simon and Linda Foland, and was last modified on 2008-07-08. This page needs minor revisions. Category: Statistical evidence


Apples or oranges?

1.0 Introduction

Almost all research involves comparison. Do women who take Tamoxifen have a lower rate of breast cancer recurrence than women who take a placebo? Do left handed people die at an earlier age than right handed people? Are men with severe vertex balding more likely to develop heart disease than men with no balding?

When you make such a comparison between an exposure/treatment group and a control group, you want a fair comparison. You want the control group to be identical to the exposure/treatment group in all respects, except for the exposure/treatment in question. You want an apples to apples comparison.

1.0.1 Covariate imbalance

Sometimes, however, you get an unfair comparison, an apples to oranges comparison. The control group differs on some important characteristics that might influence the outcome measure. This is known as covariate imbalance. Covariate imbalance is not an insurmountable problem, but it does make a study less authoritative.

Women who take oral contraceptives appear to have a higher risk of cervical cancer. But covariate imbalance might be producing an artificial rise in cancer rates for this group. Women who take oral contraceptives behave, as a group, differently than other women. For example, women who take oral contraceptives have a larger number of pap smears. This is probably because these women visit their doctors more regularly in order to get their prescriptions refilled and therefore have more opportunities to be offered a pap smear. This difference could lead to an increase in the number of detected cancer cases. Perhaps, though, the other women have just as much cancer, but it is more likely to remain undetected.

There are many other variables that influence the development of cervical cancer: age of first intercourse, number of sexual partners, use of condoms, and smoking habits. If women who take oral contraceptives differ in any of these lifestyle factors, then that might also produce a difference in cervical cancer rates. The question of whether oral contraceptives cause an increase in the risk of cervical cancer is quite complex; a good summary of all the issues involved appears on the web at www.jhuccp.org/pr/a9/a9chap5.shtml.

1.0.2 Case study: Vitamin C and Cancer

Paul Rosenbaum, in the first chapter of his book, Observational Studies, gives a fascinating example of an apples to oranges comparison. Ewan Cameron and Linus Pauling published an observational study of Vitamin C as a treatment for advanced cancer (Cameron 1976). For each patient, ten matched controls were selected with the same age, gender, cancer site, and histological tumor type. Patients receiving Vitamin C survived four times longer than the controls (p < 0.0001).

Cameron and Pauling minimize the lack of randomization. "Even though no formal process of randomization was carried out in the selection of our two groups, we believe that they come close to representing random subpopulations of the population of terminal cancer patients in the Vale of Leven Hospital."

Ten years later, the Mayo Clinic (Moertel 1985) conducted a randomized experiment which showed no statistically significant effect of Vitamin C. Why did the Cameron and Pauling study differ from the Mayo study?

The first limitation of the Cameron and Pauling study was that all of their patients received Vitamin C and were followed prospectively. The control group represented a retrospective chart review. You should be cautious about any comparison of prospective data to retrospective data.

But there was a more important issue. The treatment group represented patients newly diagnosed with terminal cancer. The control group was selected from death certificate records. So this was clearly an apples versus oranges comparison because the initial prognosis was worse in the control group than in the treatment group. As Paul Rosenbaum says so well: one can say with total confidence, without reservation or caveat, that the prognosis of the patient who is already dead is not good. (page 4)

When the treatment group is apples and the control group is oranges, you can't make a fair comparison.

1.0.3 Apples or oranges: What to look for.

To ensure that the researchers made an apples to apples comparison, ask the following questions:

Did the authors use randomization? In some studies, the researchers control who gets the new therapy and who gets the standard (control) therapy. When the researchers have this level of control, they almost always will randomize the choice. This type of study, a randomized study, is a very effective and very simple way to prevent covariate imbalance.

If randomization was not done, how were the patients selected? Several alternative approaches are available when the researchers have control of treatment assignment, but minimization is the only credible alternative. When researchers do not have control over treatment assignments, you have an observational study. The three major observational designs (cohort designs, case-control designs, and historical controls) all have weaknesses, but may represent the best available approach that is practical and ethical.

Did the authors use matching to prevent covariate imbalance? Matching is a method for selecting subjects that ensures a similar set of patients for the control group. A crossover design represents the ideal form of matching because each subject serves as his or her own control. Stratification ensures that broad demographic groups are equally represented in the treatment and control groups.

Did the authors use statistical adjustments to control for covariate imbalance? Covariate adjustment uses statistical methods to try to correct for any existing imbalance. These methods work well, but only on variables that can be measured easily and accurately.

1.1 Did the authors use randomization?

Randomization is the assignment of treatment groups through the use of a random device, like the flip of a coin or the roll of a die, or numbers randomly generated by a computer.

Example: In a study of allergy shots (Adkinson 1997), 121 children with moderate-to-severe asthma were "randomly assigned to receive subcutaneous injections of either a mixture of seven aeroallergen extracts or a placebo."

Example: In a study of acupuncture (Bullock 1989) "80 severe recidivist alcoholics received acupuncture either at points specific for the treatment of substance abuse (treatment group) or at nonspecific points (control group)."

In both studies the researchers decided who got what. This is a hallmark of a randomized design and it only can occur when the patients and/or their doctors have no say in the assignment.

1.1.2 How does randomization help?

Randomization helps ensure that both measurable and unmeasurable factors are balanced out across both the standard and the new therapy, assuring a fair comparison. Used correctly, it also guarantees that no conscious or subconscious efforts were used to allocate subjects in a biased way.

There are situations where covariate imbalance can appear, even in a well randomized study (Roberts 1999). Just as you have no guarantee that a flip of 100 coins will yield exactly 50 heads and 50 tails, you have no guarantee that covariate imbalances cannot creep into a randomized study once in a while. This is not just a theoretical concern. One article (Mann 2002) argues that a difference in baseline stroke severity in a randomized trial of tPA produced an incorrect assertion of the effectiveness of this treatment.
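The coin flip analogy is easy to check directly. The sketch below uses exact binomial probabilities for a fair coin; the function names are hypothetical, and the choice of a gap of 20 is just an illustration of a lopsided split.

```python
from math import comb

def prob_exact_heads(flips, heads):
    # Exact binomial probability for a fair coin.
    return comb(flips, heads) / 2 ** flips

def prob_gap_at_least(flips, gap):
    # Probability that the counts of heads and tails differ by at least `gap`.
    favorable = sum(
        comb(flips, k) for k in range(flips + 1) if abs(2 * k - flips) >= gap
    )
    return favorable / 2 ** flips

print(prob_exact_heads(100, 50))    # chance of a perfect 50/50 split
print(prob_gap_at_least(100, 20))   # chance of 60/40 or worse
```

A perfect 50/50 split happens only about 8% of the time, while a split of 60/40 or worse turns up in a bit more than 5% of all sets of 100 flips. Some imbalance is the norm, and a large imbalance is uncommon but far from impossible.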

Randomization relies on the law of large numbers. With small sample sizes, covariate imbalance may still creep in. A study examining the probability of covariate imbalance (Hsu 1989) showed that total sample sizes less than 10 could have a 50% chance or higher of having a categorical covariate with levels twice as large in one group than the other. This study also showed that total sample sizes of 40 or greater would have very little chance of such a serious imbalance, and a total of 20-40 subjects would be acceptable if there were only one or two important covariates.

1.1.3 A fishy story about randomization

I was told this story but have no way of verifying its accuracy. A long, long time ago, the U.S. Environmental Protection Agency wanted to examine a pollutant to find concentration levels that would kill fish. This research required that 100 fish be separated into five tanks, each of which would get a different level of the pollutant. The researchers caught the first twenty fish and put them in the first tank, then the next twenty fish and put them in a second tank and so forth. The last twenty fish went into the fifth tank. Each fish tank got a different concentration of the pollutant. When the research was done, the mortality was related not to the dosage, but to the order in which the tanks were filled, with the worst outcomes being in the first tank filled and the best outcomes in the last tank filled. What happened was that the slow-moving, easy-to-catch fish (the weakest and most sickly fish) were all allocated to the first tank. The fast-moving, hard-to-catch fish (the strongest and healthiest fish) ended up in the last tank.

Failure to randomize in this study ruined the entire effort. The huge imbalance caused by putting the sickest fish in the first tank and the healthiest fish in the last tank overwhelmed any differences in mortality caused by varying levels of the pollutant.

1.1.4 The mechanics of randomization

Random assignment means that the choice is left to some device that is inherently random and unpredictable. A flip of a coin is one approach, but usually a table of random numbers or a random number generator is more practical.

The simplest way to randomize is to lay out the treatment schedule in a systematic (non-random) fashion, generate a random number for each value in the schedule and then sort the schedule by the random number. Sorting by a random number is effectively the same thing as putting the list in a random order.
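The sort-by-random-number approach can be sketched in a few lines. The function name, the treatment labels, and the optional seed are my own choices for illustration; a real trial would follow its own documented procedure.

```python
import random


def randomize_schedule(n_per_group, treatments=("new therapy", "standard therapy"), seed=None):
    """Build a systematic schedule, attach a random number to each entry,
    and sort by that number to put the schedule in random order."""
    rng = random.Random(seed)  # a seed makes the list reproducible for auditing
    # Step 1: systematic (non-random) layout, e.g. all of one arm, then the other.
    schedule = [t for t in treatments for _ in range(n_per_group)]
    # Step 2: generate one random number per entry.
    keyed = [(rng.random(), t) for t in schedule]
    # Step 3: sort by the random number.
    keyed.sort(key=lambda pair: pair[0])
    return [t for _, t in keyed]


if __name__ == "__main__":
    for position, arm in enumerate(randomize_schedule(5, seed=2024), start=1):
        print(position, arm)
```

Because the schedule starts out with equal numbers in each arm, the sorted list keeps the groups the same size while making the order unpredictable.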

1.1.5 Concealing the randomization list

Another important aspect of randomization is concealed allocation, which is the concealment of the randomization list from those involved with recruiting subjects. This concealment occurs until after subjects agree to participate and the recruiter determines that the patient is eligible for the study. Only then is a sealed envelope opened that reveals the treatment status. Concealed allocation can also be done through an 800 number that the doctor calls to discover the treatment status.

Please note that concealing the randomization list is not the same as blinding the study (a topic I discuss later in this book). Certain treatments, such as surgery, cannot be blinded but the allocation list can still be concealed. Consider, for example, a randomized trial comparing laparoscopic surgery to traditional surgery. After the fact, the patient can tell by the size of the scar what type of surgery they received. But the choice as to what type of surgery the patient receives could be made as the patient is being sedated. There is an example of a research study where a sterilized coin was flipped in the operating room to decide which surgery would be used.

If the randomization list is not concealed, doctors have the ability to consciously or unconsciously influence the composition of the groups. They can do this by applying exclusion criteria differentially or by delaying entry of a certain healthier (or unhealthier) subject so he/she gets into the "desirable" group. Unconcealed allocation schemes show an average bias of 30-40% (Schulz 1996).

There are many stories of physicians who have tried and succeeded in recruiting a patient into a preferred group. If the treatment allocation is hidden in sealed envelopes, they can hold it up to a strong light. If the sealed envelopes are not sequentially numbered, they can open several envelopes at once. If the allocation is controlled by a central operator, they can call and ask for the allocation of several patients at once.

When a doctor has an overt preference to enroll a patient into one group over another, it raises ethical issues and perhaps the doctor should not be participating in the trial. You should only participate in a research study if you believe there is genuine uncertainty about whether the new therapy or the standard therapy is better. If not, you have no business participating in a study where some of your patients will be randomized to a treatment that you consider inferior. Unfortunately, some doctors will continue to participate in these trials but will try to skew the enrollment of some or all of the patients towards a favored therapy.

Concealed allocation only makes sense for a truly randomized study. If patients are assigned in an alternating fashion, concealed allocation is like buying a fancy burglar alarm and leaving the front door wide open. You already know that alternating assignments is a bad idea, but it is even worse because the doctors will immediately recognize which group the next patient is going to be allocated to. This makes it easy for them to preferentially recruit to a specific treatment if they want to.

1.1.6 Randomization: what to look for.

If a study is randomized, look for the following features:

  • Was there a description of the source of randomness? Did the researchers use a table of random numbers? Did they use a computer to generate random numbers?
  • Did the researchers conceal the randomization list from the doctors during the recruitment of patients?

1.2 If randomization was not done, how were the patients selected?

Randomization is not always used. There are three alternatives to randomization when the control of treatment assignment is under the power of the researchers. When practical or ethical issues prevent researchers from controlling treatment assignment, you have an observational study.

1.2.1 Minimization.

An alternative, when the researchers have sufficient control, is to allocate the assignments so that at each step, the covariate imbalance is minimized. So if the treatment group has a slight surplus of older patients and the next patient to join the study is also older than average, then that patient would be assigned to the control group so as to reduce the age discrepancy.

Example: In a study of behavioral counseling (Steptoe 1999), twenty general practices were allocated either to use behavioral counseling based on the stages of change model for all their patients, or to offer no counseling other than their current standard of care. These practices were assigned using minimization to ensure balance on three factors: the degree of underprivileged patients being served, the patient to nurse ratio of the practice, and fund holding status.

Minimization is a good approach if there are one or two covariates which are especially important and which are easily measured at the start of the study. It will perform better than randomization on those factors, although there is no guarantee of covariate balance for other covariates not used in the minimization. Minimization also cannot control for unmeasured covariates.

There is more effort required in setting up a study with minimization. You need a computer to be available at the time and location of the recruitment of each patient because you can't just print a list ahead of time. Another difficulty is that minimization is open to possible abuse because doctors might be able to predict what the next assignment would be.
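Here is a simplified sketch of how minimization could work in software. It is my own illustration and not the procedure used in any of the studies cited: each new patient goes to whichever group leaves the smallest total imbalance across the chosen covariates, with ties broken at random. The class name, the group labels, and the covariate names are all hypothetical.

```python
import random
from collections import defaultdict


class Minimizer:
    """Assign each new patient to the group that minimizes covariate imbalance."""

    def __init__(self, groups=("treatment", "control"), seed=None):
        self.groups = groups
        self.rng = random.Random(seed)
        # counts[group][(covariate, level)] = patients already assigned
        self.counts = {g: defaultdict(int) for g in groups}

    def assign(self, patient: dict) -> str:
        scores = {}
        for candidate in self.groups:
            total = 0
            for item in patient.items():  # item is a (covariate, level) pair
                hypothetical = [
                    self.counts[g][item] + (1 if g == candidate else 0)
                    for g in self.groups
                ]
                total += max(hypothetical) - min(hypothetical)
            scores[candidate] = total
        best = min(scores.values())
        # Break ties at random so the first assignment is not predictable.
        choice = self.rng.choice([g for g in self.groups if scores[g] == best])
        for item in patient.items():
            self.counts[choice][item] += 1
        return choice


if __name__ == "__main__":
    m = Minimizer(seed=7)
    for p in [{"age": "older"}, {"age": "older"}, {"age": "younger"}]:
        print(p, "->", m.assign(p))
```

Notice that the second older patient is always sent to the opposite group from the first. That predictability is exactly the weakness mentioned above: someone who knows the running totals can often guess the next assignment.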

1.2.2 Alternating assignments.

Another approach used in place of randomization is to alternate the assignment, so that every even patient is in the treatment group and every odd patient is in the control group.

Alternate assignment was popular in trials before World War II; it was felt that researchers would not understand and not tolerate randomization (Yoshioka 1998).

[Insert a recent example of alternating assignment]

Alternating assignment seems on the surface to be a good approach, but it can sometimes lead to trouble. This is especially true when consecutive patients can influence one another. You may have seen this level of influence if you grow vegetables in a garden. If you have a row of cabbages, for example, you will often see a pattern of big cabbage, little cabbage, big cabbage, little cabbage, etc. What happens, usually if the cabbages are planted a bit too closely, is that one of the cabbages will grow just a bit faster at first. It will extend into the neighboring cabbage's territory, stealing some of the nutrients and water, and thus growing even faster at the expense of the neighbor. If you assigned a fertilizer to every other cabbage, you would probably see an artificial difference because of the alternating pattern in growth within a row.

This alternating pattern can also occur in medicine. Consider, for example, a study of how much time doctors spend with their patients. If the first patient takes longer than expected, the doctor will probably rush a bit with the second patient in order to keep from falling further behind schedule. On the other hand, if the first patient finishes quickly, then the doctor will feel more relaxed and might tend to take a bit more time with the next patient.

In some situations, alternating assignment would be tolerable, but there is no good reason to prefer this over random assignment. You should be skeptical of this approach because studies with alternating assignment will tend, on average, to overstate the effectiveness of a new therapy by 15% (Colditz 1989).

1.2.3 Haphazard assignment.

Another choice that researchers will make is to base assignments on some arbitrary value. For example, patients born on days which are even numbers would be assigned to the treatment group and those born on odd days would be assigned to the control group.

Example: In a study of heparinized saline to maintain the patency of patient catheters (Kulkarni 1994), patients admitted on odd-numbered dates received heparinized saline and patients admitted on even-numbered days received normal saline.

In some situations, haphazard assignment might be tolerable, but there is no good reason to use this approach. The study mentioned above was excluded from a meta-analysis of heparinized saline (Randolph 1998) because the reviewers felt the quality level was too low.

1.2.4 Observational studies

There are many situations where randomization is not practical or possible. Sometimes patients have a strong preference for one particular treatment and would not consider the possibility of being randomized into a different treatment. Surgery is one area with strong patient preferences especially for newer approaches like laparoscopic surgery (www.symposion.com/nrccs/lefering.htm).

Sometimes we are studying noxious agents, like second hand cigarette smoke, noisy workplaces, or boring statistics teachers like me. It would be unethical to deliberately expose people to any of these agents, so we have to collect data on those people who are unavoidably exposed to these things.

Sometimes, the sample sizes required or the duration of the study make it difficult to use randomization. Diseases like cancer that have a long latency period are especially hard to study with a randomized design.

Retrospective studies, studies where the outcome of interest has already occurred and you are looking at factors in the past that might have caused this outcome, are also impossible to randomize, unless you have a time machine.

Sometimes, the groups being studied existed prior to the start of the research. Genetic conditions like Down's syndrome cannot be randomly assigned to half of the patients in your study.

Sometimes researchers just do not want to go to the effort of randomizing. It is usually faster and cheaper to use existing non-randomized databases, and these are often helpful in evaluating the feasibility of then performing a large randomized study.

When randomization is not possible, then you are looking at an observational study. There are three major flavors for observational studies: cohort studies, case control studies, and historical controls studies.

1.2.5 The cohort study.

In a cohort study, a group of patients has a certain exposure or condition. They are compared to a group of patients without that exposure or condition. Does the exposed cohort differ from the unexposed cohort on an outcome of interest?

Example: In a study of dietary fat (Hu 1997), 80,082 women between the ages of 34 and 59 years were followed for 14 years to look for instances of non-fatal myocardial infarction or death from coronary heart disease. These women were divided into low, intermediate, and high groups on the basis of their consumption of dietary fat. This is an observational study because the women chose the type of diets they ate, not the researchers. This particular observational study is a cohort design, with the three levels of fat consumption representing three different exposure groups.

Cohort studies are intuitively appealing and selection of a control group is usually not too difficult. You have to be very wary of covariate imbalance, but other observational designs are likely to have even more problems. Don't worry about every possible covariate imbalance. You should look for large imbalances, especially for covariates which are closely related to the outcome variable.

When you are studying a very rare outcome, the sample size may have to be extremely large. As a rough rule of thumb, you need to observe 25 to 50 outcomes in each group in order to have a reasonable level of precision. So when a condition occurs only once in every thousand patients, a cohort study would require tens of thousands of patients.
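The arithmetic behind this rule of thumb is easy to sketch. The function name is my own; the inputs are the 25 to 50 outcomes and the one-in-a-thousand rate from the paragraph above.

```python
def cohort_size_per_group(outcomes_needed: int, patients_per_outcome: int) -> int:
    """Patients needed in each group to expect the desired number of outcomes,
    when the outcome occurs once in every `patients_per_outcome` patients."""
    return outcomes_needed * patients_per_outcome


if __name__ == "__main__":
    low = cohort_size_per_group(25, 1000)
    high = cohort_size_per_group(50, 1000)
    print(f"Roughly {low:,} to {high:,} patients per group")
```

These are expected counts only. They ignore dropout and any difference in rates between the groups, so treat them as a rough starting point rather than a formal sample size calculation.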

You want to avoid "leaky groups" in a cohort design. If the exposure group includes some unexposed patients and the control group includes some exposed patients, then any effect you are trying to detect will be diluted. Be especially aware of situations where one group is more leaky than the other.

For example, many studies will classify people into various levels of caffeine exposure on the basis of how much coffee they drink. Although coffee is the major source of caffeine for most people, failure to ask about other sources of caffeine consumption can lead to large underestimates of caffeine intake, which can obscure relationships to various diseases (Brown 2001).

Dietary studies will sometimes rely on household food surveys, but these need adjustment for the varying consumption of individual family members. For example, within the same family, males (especially boys aged 11-17 years) will have higher average intakes of calories and nutrients (Nelson 1986).

1.2.6 The case control study

A case control study selects patients on the basis of an outcome, such as development of breast cancer. These patients are compared to a group of patients without that outcome. Do the cases differ from the controls in some exposures?

Example: In a study of HIV infection (Cardo 1997), 33 health care workers who became seropositive to HIV after percutaneous exposure to HIV-infected blood were compared to 665 health care workers with similar exposure who did not become seropositive. This is an observational study, since the researchers did not control who became seropositive. This particular observational study is a case-control design because patients were selected on the basis of the outcome, seroconversion.

A case-control study is very efficient in studying rare diseases. With this design, you round up all of the limited number of cases of the disease and then find a comparable control group. By contrast, a cohort design has to round up far more exposures to ensure that a handful of them will develop the rare disease.

Case-control studies do not perform well when you are evaluating a diagnostic test. They are easy to set up, because you have a group of patients with the disease and you estimate the probability of a positive result for the diagnostic test in this group (sensitivity). You also have a control group and you estimate the probability of a negative result for the diagnostic test in this group (specificity). Unfortunately, the case control design usually has a collection of very obviously diseased patients among the cases and very obviously healthy patients among the controls. This is an example of spectrum bias, a lack of patients in the ambiguous middle of the spectrum. A study with spectrum bias will often overstate the sensitivity and specificity of a diagnostic test.
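For readers who want to see the two definitions as formulas, here is a minimal sketch. The counts in the example are made up purely for illustration and do not come from any study.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of diseased patients (cases) who test positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Proportion of healthy patients (controls) who test negative."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # Hypothetical counts: 100 cases and 100 controls.
    print("sensitivity:", sensitivity(true_positives=90, false_negatives=10))
    print("specificity:", specificity(true_negatives=80, false_positives=20))
```

Both quantities are computed within a single group, which is why the choice of who counts as a case or a control matters so much. If the cases are all obviously sick and the controls all obviously healthy, both numbers will look better than they would in routine practice.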

[Include reference on spectrum bias.]

Because the outcome in a case control study has already occurred, this study is always retrospective. Retrospective studies usually have more problems with data quality because our memory is not always perfect. What's worse is that sometimes the ability to remember is sharply influenced by the outcome being studied. People who experience a tragic event like a miscarriage will have a strong desire to try to understand why this has happened and will search their pasts for risk factors that have been highly publicized in the press (Bryant 1989). They don't make things up, but the problem is that the people in the control group only seem to remember about half the things that have happened in their past. This selective underreporting in the control group is known as recall bias and it can lead to some seriously faulty findings.

If you have "leaky groups" in a case-control design, this can cause problems also. Do some of the disease outcomes get left out of the cases? It might be harder, for example, to identify the less serious examples of disease, and this can lead to serious problems. You can avoid this problem if there is some type of registry that allows the researchers to identify every possible case. Watch out also for situations where healthy people or people with the incorrect disease are accidentally classified as cases.

The other major problem with this type of study is that it is so hard to find a good control group. You want to find controls that are identical to the cases in all aspects except for the outcome itself. When there is a roster of all potentially eligible subjects (subjects who would be classified as cases if they developed the disease), then selection of a good quality control group is easy (Wacholder 1995). Most studies would not have such a roster. In this case, the controls are often patients admitted to the hospital for outcomes unrelated to the study. So if cases represent newly diagnosed lung cancer, then the controls might be patients admitted for a bone fracture. Other times, you might ask the case to bring a friend with them or to identify a relative.

Finally, the case-control design just does not sit well with your intuition. You are trying to find factors that cause an outcome, yet you are sampling from the effects, while a cohort design samples from the causes. Don't let this bother you too much, though. The mathematics that justify the case control design were developed half a century ago by Jerome Cornfield (JNCI 1951, 11: 1269-75) and careful use of the case-control design has helped establish the use of aspirin as a cause of Reye's syndrome (Monto 1999).

1.2.7 The historical controls study.

In a historical controls study, researchers will assign all of the research subjects to the new therapy. The outcomes of these subjects are compared to historical records representing the standard therapy.

Example: In a study of the rapid parathyroid hormone test (Johnson 2001), 49 patients undergoing parathyroidectomy received the rapid test. These patients were compared to 55 patients undergoing the same procedure before the rapid test was available. This is an observational study because the calendar, not the researchers, determined which test was applied. This particular observational study is a historical controls design because the control group represents patients tested before the availability of the rapid test.

The very nature of a historical controls study guarantees that there will be a major covariate imbalance between the two groups. Thus, you have to consider any factors that have changed over time that might be related to the outcome. To what extent might these factors affect the outcome differentially? For the most part, historical controls are considered one of the weakest forms of evidence. The one exception is when a disease has close to 100% mortality. In that situation, there is no need for a concurrent control group, since any therapy that is remotely effective can be detected readily. Even in this situation, you want to be sure there is a biological basis for the treatment and that the disease group is homogenous (www.pharmafile.com/Pharmafocus/Features/feature.asp?fID=354).

1.2.8 Non-randomized studies, what to look for.

When a study was not randomized, look for the following features.

For a study using minimization:

  • Which covariates were used to assess balance?
  • Were any important covariates ignored?

For studies using alternating assignments or haphazard assignments:

  • Did the authors provide a justification for this approach?
  • What possible artificial patterns in the assignments might interfere with the treatment assignment?

For studies using a cohort design:

  • Are you studying a rare exposure?
  • Is the method for determining the exposure and control groups objective and accurate?
  • Some covariate imbalances are inevitable, but are any of them serious?

For studies using a case-control design:

  • Are you studying a rare disease?
  • Excluding the disease outcome itself, does the control group have similar features to the cases?
  • Were some outcomes missed or were some healthy people accidentally included as cases?
  • Is there a tendency for cases to have better recall of exposures than controls?

For studies using a historical controls design:

  • Did the authors provide a justification for this approach?
  • In the time between the collection of the control group data and the treatment data, what other factors might have changed?

1.3 Did the authors use matching?

To ensure an apples to apples comparison, researchers will often use matching. Matching is the systematic selection, for every subject in the treatment/exposure group, of a control subject with similar characteristics. For example, in a study of fetal exposure to cocaine, you would select infants born to a mother who abused cocaine during pregnancy for your exposure group. For every such infant, you would select for your control group an infant who was unexposed to cocaine in utero, but who had the same sex, race, and socio-economic status.

Example: In a study of home versus hospital delivery (Ackerman-Liebrich 1996), 489 women who planned to deliver their babies at home were matched with women who planned to deliver at the hospital. Matching was based on age category (5 categories), parity category (3 categories), category of gynecological and obstetric history (24 categories or none), category of medical history (12 categories or none), social class (5 categories), and nationality. Because the matching criteria were so elaborate, they were only able to find a matched hospital delivery for about half of their home deliveries.

Matching will prevent covariate imbalance for those variables used in matching. It will also reduce covariate imbalance for any variables closely related to the matching variables. It will not, however, protect against all covariate imbalance, especially for those covariates that are difficult to measure.

Matching often presents difficult logistical issues, because a matching control subject may not always be available. The logistics are especially difficult when there are several matching variables and when the pool of control subjects that you can draw from is not substantially larger than the pool of treatment/exposed subjects.
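The logistical problem can be seen in a simple sketch of exact matching. This is my own illustration, with hypothetical field names, and it uses a greedy first-available rule rather than anything more sophisticated.

```python
def match_controls(exposed, controls, keys):
    """Pair each exposed subject with the first unused control that agrees
    on every matching variable. Returns (pairs, unmatched_exposed)."""
    available = list(controls)  # copy so the caller's list is not modified
    pairs, unmatched = [], []
    for subject in exposed:
        for i, candidate in enumerate(available):
            if all(subject[k] == candidate[k] for k in keys):
                # Each control can be used only once.
                pairs.append((subject, available.pop(i)))
                break
        else:
            unmatched.append(subject)  # no suitable control was left
    return pairs, unmatched


if __name__ == "__main__":
    exposed = [
        {"id": "E1", "sex": "F", "ses": "low"},
        {"id": "E2", "sex": "F", "ses": "low"},
        {"id": "E3", "sex": "M", "ses": "high"},
    ]
    controls = [
        {"id": "C1", "sex": "F", "ses": "low"},
        {"id": "C2", "sex": "M", "ses": "low"},
    ]
    pairs, unmatched = match_controls(exposed, controls, keys=("sex", "ses"))
    print([(e["id"], c["id"]) for e, c in pairs])
    print([e["id"] for e in unmatched])
```

With only two controls to choose from, two of the three exposed subjects go unmatched. Every extra matching variable and every reduction in the control pool makes this more likely.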

Matching is usually reserved for those variables that are known to be highly predictive of the outcome measure. In a cancer study, for example, matching is usually done on smoking. Many neonatology studies will match on gestational age.

1.3.1 Matching in a case control design

When you are selecting patients on the basis of disease and looking back at what exposure might have caused the disease, selection of matching control patients (patients without disease) can sometimes be tricky. You need to find a control that is similar to the case, except for the disease of interest. There are several possibilities, but none of them works perfectly.

  • You could recruit controls from undiseased members of the same family.
  • You could ask each case to bring a friend with them. Their friend would be likely to be of similar age and socioeconomic status.
  • You could find a control that lives in the same neighborhood as the case.

Example: In a study of early onset myocardial infarction (Danesh 1999), 1122 survivors of heart attacks between the ages of 30-49 were matched with people of the same age and gender who did not have heart attacks. These controls were recruited from a pool of subjects related to the cases. A second analysis used 510 survivors and their siblings, if the sibling was the same sex and within five years of age. All of the cases and the controls had blood tests to look for Helicobacter pylori infection, which was more commonly found in the cases than the controls.

[Insert additional discussion here]

1.3.2 Matching in a randomized design

In some randomized studies, matching will be used as well. Partly, this is a recognition that randomization will not totally remove covariate imbalance, just like a flip of 100 coins will not always result in exactly 50 heads and 50 tails. More importantly, however, matching in a randomized study will provide extra precision. Matching creates pairs of subjects who will have greater homogeneity and therefore less variability.

Example: In a study of tinnitus (Drew 2001), 1,121 patients were randomly assigned to either Ginkgo biloba or a placebo. The researchers also identified 489 pairs of subjects (978 total) who were the same sex, similar age (within 10 years) and similar duration of tinnitus (within 5 years) to try to improve the precision of this study.

1.3.3 Matching can sometimes backfire

In the tinnitus study mentioned above, although there were 1,121 patients, 143 of them did not have a close match in the data and were excluded from the matched analysis. There was also some attrition in the study, which caused a greater loss in the matched analysis. If one of the patients in a pair dropped out, the other patient's data could not be used in the matched analysis. So the analysis of improvement after 4 weeks included only 414 pairs and the analysis after 14 weeks included only 354 pairs. Although the loss in sample size was probably offset by the added precision from the matching, the authors do acknowledge that this was probably "an unnecessary and disadvantageous complication."

In a case control design, matching can sometimes remove the very effect you are trying to study. You should avoid matching when the matching variable is caused by the exposure or is a similar measure of exposure; otherwise you might "over match" the data and remove the effect of the exposure. In a study examining radiation exposure and the risk of leukemia at a nuclear reprocessing plant (Marsh 2002), there were 37 workers diagnosed with leukemia (cases) and they were each matched to four control workers. Each of the four control workers had to work at the same site, have the same gender, have the same job code, be born within two years of the case, and be hired within two years of the hire date of the case.

Unfortunately, there was a strong trend between hire date and exposure. Exposures were highest early in the plant's history and declined over time. So both hire date and exposure were measuring the same thing. When the data was matched on hire date, it artefactually controlled the exposure and pretty much ensured that the average radiation exposure would be the same among both the cases and the controls. This led to an estimate of the effect of radiation exposure that was actually slightly negative and not statistically significant. When the data was rematched using all the variables except for hire date, the effect of radiation dose was large and positive and came close to statistical significance.

1.3.4 The crossover design

The crossover design represents a special type of matching. In a crossover design, a subject is randomly assigned to a specific treatment order. Some subjects will receive the standard therapy first, followed by the new therapy (AB). Others will receive the new therapy first, followed by the standard therapy (BA). Since the same subject receives both treatments, there is no possibility of covariate imbalance.

Example: In a study of electronic records (Brown 2003), ten physicians were asked to code patient records with two separate systems: Clinical Terms Version 3 and the Read Codes 5 Byte Set. Half of the physicians were randomly assigned to code using Clinical Terms Version 3 first and then later with the Read Codes 5 Byte Set. The other half coded using the Read Codes 5 Byte Set first.
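Randomly assigning the treatment order, with half of the subjects in each order, can be sketched as follows. The function name, the subject labels, and the use of a seeded shuffle are my own assumptions and not a description of how the study above actually did it.

```python
import random


def assign_crossover_orders(subjects, seed=None):
    """Randomly give half the subjects order AB and the other half order BA."""
    rng = random.Random(seed)
    half = len(subjects) // 2
    # With an odd number of subjects, the extra subject goes to order BA.
    orders = ["AB"] * half + ["BA"] * (len(subjects) - half)
    rng.shuffle(orders)
    return dict(zip(subjects, orders))


if __name__ == "__main__":
    physicians = [f"physician {i}" for i in range(1, 11)]
    for name, order in assign_crossover_orders(physicians, seed=3).items():
        print(name, order)
```

Balancing the two orders means that any effect of going first or second falls equally on both treatments.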

When therapies are applied in sequence, timing effects are of great concern. Are the therapies set far enough apart so that the effect of one therapy is unlikely to carry over into the other therapy? For example, if the two therapies represent different drugs, did the researchers allow enough time so that one drug was fully eliminated from the body before they administered the second drug?

The washout period can sometimes cause ethical concerns. If you are treating patients for depression, an extensive amount of time during the washout would leave the patient without any effective treatment and increase the chances of something bad happening, like the patient committing suicide.

Learning effects are also a potential problem in a crossover design. You can't use a crossover design, for example, to test alternative training approaches. Imagine the instructions for this study (now forget everything we just told you; we're going to teach it a different way). I guess that would work for the classes I teach; the only things my students remember are the jokes.

Also watch out for the possibility that a subject may get tired or bored. This could lead to the second treatment assigned being worse than the first. Or if the outcome involves skill, maybe "practice makes perfect" leading to the second treatment assigned being better than the first.

If there are timing effects, randomization is critical. Even with randomization, though, timing effects are a problem because they increase uncertainty by adding an extra source of variation.

Special problems arise when each subject always receives one therapy first and it is always followed by the other therapy. Many factors other than the change in therapy can cause a shift in the health of patients over time. If you cannot randomize the order of treatments, you have all the problems of a historical controls study.

1.3.5 Stratification

Stratification is a method similar to matching that tries to achieve covariate balance across broad groups or strata. The selection of subjects in both the treatment group and the control group is constrained to have identical proportions in each stratum. This guarantees covariate balance for the strata themselves and any other factors closely related to the strata.

Example: In a study of medical records (Fine 2003), 54 records selected from each of 10 cardiac surgery centers were examined for accuracy and completeness. To ensure a good balance, the 54 records at each site were allocated evenly to six different predefined risk strata (nine in each stratum).

Example: In a study of retention of doctors in rural Australia (Humphreys 2002), a random sample of 1400 doctors was sent a questionnaire. The doctors were selected in strata defined by the size of the town they lived in, to keep the proportion in each stratum equal to the corresponding proportion in the entire population of Australian doctors.

Another use of stratification is to ensure that the sample has numbers in each stratum that are proportional to the numbers in that stratum for the entire population of interest. This helps ensure that the sample is generalizable to the entire population.
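The arithmetic behind proportional sampling is simple. Here is a minimal sketch, assuming you know the population count in each stratum. The function name and the population counts are hypothetical; the counts are not taken from Humphreys 2002, and only the total sample size of 1400 echoes that example.

```python
def proportional_allocation(population_counts, sample_size):
    """Split sample_size across strata in proportion to population_counts."""
    total = sum(population_counts.values())
    exact = {s: sample_size * c / total for s, c in population_counts.items()}
    # Start with the whole-number part of each stratum's share.
    alloc = {s: int(x) for s, x in exact.items()}
    # Hand any leftover subjects to the strata with the largest fractions.
    leftover = sample_size - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical population counts by town size.
population = {"small town": 2000, "medium town": 3000, "large town": 5000}
sample = proportional_allocation(population, 1400)
# sample == {"small town": 280, "medium town": 420, "large town": 700}
```

When the shares do not come out as whole numbers, the sketch rounds using the largest fractional parts, which is one reasonable choice among several.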

The strata are usually broadly drawn. If there were only a small number of possible patients within each stratum, the logistics would become too difficult. So for example, stratification by age will usually involve large intervals such as 21-30 years, 31-40 years, etc.

You cannot stratify on factors that you cannot measure or on information that is not immediately available at the start of the study. And like matching, stratification only works when you have a large pool of subjects to draw from.

Stratification can add precision to a randomized study. A separate randomization list would be drawn up for each stratum. This would ensure that each stratum would have perfect balance between the treatment group and the control group.
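Here is a minimal sketch of what separate randomization lists might look like. It assumes permuted blocks within each stratum, which is one common way to get balance but is my choice for illustration and not something the text above specifies. The function name, the age strata, the block size, and the seed are all hypothetical.

```python
import random

def stratified_randomization_lists(strata, n_per_stratum, block_size=4, seed=None):
    """Build one randomization list per stratum using permuted blocks.

    Each block holds equal numbers of "treatment" and "control" in random
    order, so a stratum is exactly balanced whenever n_per_stratum is a
    multiple of block_size.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equal-sized groups")
    rng = random.Random(seed)
    lists = {}
    for stratum in strata:
        assignments = []
        while len(assignments) < n_per_stratum:
            block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
            rng.shuffle(block)
            assignments.extend(block)
        # Truncating a partial block can leave a small imbalance.
        lists[stratum] = assignments[:n_per_stratum]
    return lists

# Hypothetical age strata with eight subjects in each.
lists = stratified_randomization_lists(
    ["21-30 years", "31-40 years", "41-50 years"], n_per_stratum=8, seed=44
)
```

In practice you rarely know in advance how many subjects will show up in each stratum, so the lists are usually made longer than needed and used from the top.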

1.3.6 Things to look for in a study with matching

When a study uses matching, look for the following features.

For a study using matching (stratification):

  • Did the researchers match (stratify) on the most important covariates?
  • Were the matching (stratification) variables measured accurately?
  • Were any important variables not considered in the matching (stratification)?

For studies using a cross-over design:

  • Were there any carry-over effects?
  • Were there any fatigue effects?
  • What possible artificial patterns in the assignments might interfere with the treatment assignment?

1.4 Did the researchers use statistical adjustments?

Statistical adjustments represent one way of correcting for covariate imbalance. While matching and stratification try to prevent covariate imbalance before it occurs, statistical adjustment corrects for the imbalance after the fact.

Example: A study (Smith 1997) examined the relationship between frequency of orgasm and ten year mortality among male residents of Caerphilly, South Wales. They divided the men into low, medium, and high frequency. Low frequency meant less than monthly and high frequency meant twice a week or more often. This is a study which would have been impossible to randomize--the men (and presumably their wives) determined which group they belonged to. As you might expect, there were demographic differences in the three groups. Age was significantly associated with frequency of orgasm. Men in the low, medium, and high frequency groups were 54, 52, and 50 years old, on average. The job categories also differed, with the proportion of non-manual labor being 29%, 42%, and 42% among the three groups. For other variables (height, body mass index, systolic blood pressure, cholesterol, existing coronary heart disease, and smoking status), the differences were smaller and less important. The adjustments used a combination of regression approaches and weighting. After adjustment, there was a strong trend in mortality, with men in the low frequency group having an adjusted mortality rate that was twice as large as that of the high frequency group. Both the article itself and a subsequent letter to the editor (Batty 1998) mentioned, however, that additional unmeasured variables could have influenced the outcome.

Example: In a breastfeeding study here at Children's Mercy Hospital (Kliethermes 1999), pre-term infants were randomized either to a group that received normal bottle feeding while they were in the hospital or to a nasogastric (ng) tube feeding group. The researchers wanted to see if the latter group of infants, because they had not become habituated to bottle feeding, would be more likely to breastfeed after discharge from the hospital. The randomization was only partially effective at preventing covariate imbalance. The infants had comparable birth weights, gestational ages, and Apgar scores. There were similar proportions of caesarian section and vaginal births in both groups. But the mothers in the ng tube group were older on average than the mothers in the bottle fed group. Since older mothers are more likely to breastfeed than younger mothers, we had to include mother's age in an analysis of covariance model so that the effect of ng tube feeding could be estimated independent of mother's age. From a regression model, we discovered that older mothers breastfeed for longer periods of time, on average, than younger mothers. In fact, for each year of age, the duration of breastfeeding increased by 0.25 weeks on average. So we would adjust the difference between the two groups by 0.25 weeks for every year of discrepancy between the average mothers' ages.
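The adjustment described above is just a subtraction. Here is a minimal sketch, assuming a single covariate with the same slope in both groups. Only the slope of 0.25 weeks per year comes from the example; the group means and ages are made-up numbers for illustration and are not results from Kliethermes 1999.

```python
def adjusted_difference(mean_outcome_a, mean_outcome_b,
                        mean_cov_a, mean_cov_b, slope):
    """Remove the part of a group difference explained by a covariate.

    The raw difference in outcomes is reduced by the slope times the
    difference in the average covariate between the two groups.
    """
    raw = mean_outcome_a - mean_outcome_b
    return raw - slope * (mean_cov_a - mean_cov_b)

# Hypothetical means: weeks of breastfeeding and mother's age in years.
diff = adjusted_difference(
    mean_outcome_a=20.0,  # ng tube group, weeks (made up)
    mean_outcome_b=16.0,  # bottle fed group, weeks (made up)
    mean_cov_a=29.0,      # ng tube group, average mother's age (made up)
    mean_cov_b=25.0,      # bottle fed group, average mother's age (made up)
    slope=0.25,           # weeks per year of age, from the text
)
# The raw difference of 4 weeks shrinks to 3 weeks after adjustment.
```

A real analysis of covariance estimates the slope and the group difference together and also gives you a confidence interval, so treat this as the arithmetic only.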

1.4.1 Imperfectly measured covariates

Some covariates can be measured, but only crudely. If the covariate itself is difficult to measure accurately, then any attempts to make statistical adjustments will only be partially successful. Your measurement may only capture half of the information in the covariate. The half of the covariate that is unaccounted for remains, leading to an unfair comparison. This is sometimes called residual confounding.

Example: In a study of factors influencing Down syndrome (Chen 1999), smoking had a surprisingly protective effect. This could be explained by the age of the mother. Older mothers smoke less and are also at greater risk for birth of a Down syndrome child. The unadjusted odds ratio for this effect was 0.80 and was borderline statistically significant (95% CI 0.65 to 0.98). A crude adjustment for age used the categories <35 years and >=35 years. With this adjustment, the odds ratio was still small (0.87) and borderline (95% CI 0.71 to 1.07). But when the exact year of age was used to adjust, and race and parity were also included in the adjustment, there was no association (odds ratio=1.00, 95% CI 0.82 to 1.24). This shows that an imperfect adjustment can produce an incorrect conclusion.

Self report measures are often measured imperfectly, and are especially troublesome if they require the patient to recall events from the distant past.

Smoking is an important covariate for many studies, and it would be better to ask current smokers how much they smoke. For smokers who have quit, you might also like to know how recently they quit. For both groups it might also help to know when they started. But often, the only question asked is a yes/no question like "do you smoke cigarettes?"

Some covariates like blood cholesterol levels are inherently variable. In an ideal world, these covariates would be measured at a second time and the two measures could be averaged to reduce some of the uncertainty. But this is not always possible or practical.

1.4.2 Unmeasured covariates

You can only adjust for those things that you can measure. Unfortunately, there are many things such as a patient's psychological state, presence of co-morbid conditions, and initial severity of the disease that are so difficult to assess that they are often just not measured.

1.4.3 Other alternatives to covariate adjustment

If there is covariate imbalance in the entire sample, perhaps there may be a subgroup where the covariate is balanced. If you can find such a subgroup and it produces results similar to the entire sample, you can have greater confidence in the findings of the entire sample.

Example: In a study of the effect of men's age on time to pregnancy (Hassan 2003), older men tended to have a longer time to pregnancy. These older men, though, also had older wives, on average. This creates an unfair comparison, since the wife's age would probably also influence time to pregnancy. To produce a fairer comparison, the researchers conducted a separate analysis looking at men of all ages who married young wives.

Of course, it is not always possible to find a subgroup without covariate imbalance. And when you do find such a subgroup, the smaller sample size may lead to an unacceptable loss of precision. Furthermore, the subgroup may be somewhat unusual, making it difficult for you to generalize the findings.

Another way to restore balance in a study is the use of weights. Suppose the treatment group includes 25 males and 75 females, but in the population we know that there should be a 50/50 split by gender. We could re-weight the data, so that each male has a weighting factor of 2.0 and each female has a weighting factor of 0.67. This artificially inflates the number of males to 50 and deflates the number of females to 50. The control group might have 40 males and 60 females. For this group, we would use weights of 1.25 and 0.83.
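The weights in this example can be computed directly. Here is a minimal sketch, assuming each weight is the count you want in a category divided by the count you actually have. The function name is made up for illustration; the counts are the ones from the paragraph above.

```python
def balancing_weights(sample_counts, target_proportions):
    """Weight for each category = desired count / observed count."""
    n = sum(sample_counts.values())
    return {
        group: target_proportions[group] * n / count
        for group, count in sample_counts.items()
    }

target = {"male": 0.5, "female": 0.5}
treatment = balancing_weights({"male": 25, "female": 75}, target)
control = balancing_weights({"male": 40, "female": 60}, target)

# Rounded to two decimals, the weights match the ones quoted in the text.
treatment_rounded = {g: round(w, 2) for g, w in treatment.items()}
control_rounded = {g: round(w, 2) for g, w in control.items()}
# treatment_rounded == {"male": 2.0, "female": 0.67}
# control_rounded == {"male": 1.25, "female": 0.83}
```

Multiplying each count by its weight gives 50 males and 50 females in both groups, which is the balance we were after.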

The statistical analysis gets a bit tricky with weights, but nothing that a professional statistician can't handle. Weights can also improve the generalizability of a study. If the overall sample has a skewed demographic, weights can help bring it back in line with the population of interest.

1.4.4 Matching and adjustments: what to look for

If a study uses covariate adjustments, look for the following things:

  • Did the study adjust on variables that are truly important to the outcome?
  • Were the variables used in adjustment measured accurately?
  • Were there unmeasured covariates that could have influenced the outcome?

1.5 Summary -- Is randomized better than observational?

Can matching and/or statistical adjustments in an observational study provide a comparison as fair and as persuasive as a randomized study? This is an unfair question, because sometimes a randomized study is just not possible. Also, there are so many different types of observational studies that it would be difficult to come up with a good general answer. Still, some people have tried to answer this question.

An empirical study of observational and randomized studies of the same topic (Concato 2000) found that there was a high level of consistency between the two. This contradicted the previously held belief that observational studies tended to overstate the effectiveness of a new treatment. The debate about this finding continues to rage, but perhaps the quality of the design and the sophistication of the adjustments used in observational studies place them on a level comparable to randomized studies. Another study published on the web (www.symposion.com/nrccs/koch.htm) showed that a large non-randomized registry provided data that was comparable to that collected in randomized studies.

In spite of this research, information from a randomized study is usually considered a stronger form of evidence. Randomization provides a greater level of assurance that the two groups are comparable in every way except for the therapy received. An editorial in the Journal of the American Medical Association (Sherwin 1997) noted the weakness of observational studies while trying to make sense of recent studies of the effect of dietary fat on obesity, heart disease, and stroke. After reviewing numerous studies, the editorial comments:

"At present, most of this evidence in humans is observational and, consequently, an imperfect basis for causal inference. Large scale experimental studies that would provide more compelling data (such as the Women's Health Initiative) cost hundreds of millions of dollars and take decades to complete. Each study can only address the effects of a single nutritional change. Thus, it is still necessary to base advice to patients on dietary information that is less than certain and complete."

Randomized studies do have some weaknesses. The very process of randomization will create an artificial environment that does not represent how medicine is normally practiced (Sackett 1997). When you go to your doctor for assistance with birth control, you do not expect him/her to randomly assign you to a particular method. And if your doctor said you had a 50% chance of getting a placebo contraceptive, you would probably switch doctors. Because an observational study does not have to cope with the intrusion of the randomization process, it can often study medicine in an environment much closer to reality.

Another problem with randomized designs is the limit to their size and scope. The logistics of randomization make it more expensive than a comparable observational study. Thus effects that require a very large sample size to detect (such as rare side effects) or effects that take a long time to manifest themselves (such as the progression of many types of cancer) cannot be examined in a randomized experiment. An observational approach like post marketing surveillance is more likely to be successful in these situations.

Furthermore, the use of a placebo in a randomized trial creates an artificial situation where patients are more likely to drop out and less likely to report side effects (Rochon 1999).

Studies of the potential harm caused by environmental exposures (such as lead based paint, second hand tobacco smoke, or electro-magnetic fields) are often impossible to randomize because of logistical and ethical issues.

On the other hand, observational studies often require either matching or statistical adjustments. While both matching and adjustments can help to some extent with covariate imbalance, these approaches do not work as well as randomization. In particular, some of the covariate imbalance may be due to factors that are difficult to measure like the psychological state of the patient, initial severity of the disease, and/or the presence of comorbid conditions. All of these factors can influence the outcome, but if you can't measure them easily, matching or adjustment is not possible.

Generally, the advantages of a randomized design outweigh the disadvantages. All other things being equal, a randomized study provides a higher standard of evidence than an observational study. Nevertheless, much can be learned from observational studies. Even though observational studies provide weaker evidence, if you can bring other data to bear on the problem, as through replication or the establishment of a scientific mechanism, you can gain quite persuasive evidence from observational data. Almost everything we know about the risks of cigarette smoking, for example, came from observational designs. The identification of Reye's syndrome and its link to aspirin was also established solely through observational data.

This webpage was written by Steve Simon on 2003-07-03, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Statistical evidence


Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.