First use of comparison groups in a clinical experiment
First application of mathematical analysis to test a hypothesis; introduction of the concept of confounding (i.e., patients' response may vary for reasons other than treatment)
First systematic use of statistics in medicine
Development of systematic data collection
Calculation of rates of morbidity and mortality, based on hospital intake and discharge data
Development of statistical methods: accounting for the role of chance in scientific studies; development of experimental study design
First randomized clinical trial (RCT)
U.S. 6th Circuit Court of Appeals grants RCTs status as a standard of evidence in the regulatory authority of the Food and Drug Administration
Principles of evidence-based medicine delineated and published
James Lind: experiments to treat scurvy in British sailors
P.C.A. Louis: observations of the effect of timing of bloodletting on pneumonia outcomes
Florence Nightingale presented data on mortality in field hospitals during Crimean War, leading to fundamental changes in patient hygiene
Ronald Fisher, as described in his book Design of Experiments (1935)
Sir Austin Bradford Hill assessed the use of streptomycin in treating tuberculosis
Working group led by Gordon Guyatt, McMaster University
Louis was well ahead of his time in his use of standardized data collection and in his framing of research questions, but perhaps more importantly to the history of evidence-based medicine, Louis clearly recognized the limitations of his work. He wrote of the possibility that alternative, unmeasured factors (besides bloodletting) could explain his findings. He understood that his patients may have differed for reasons unrelated to treatment and that these differences might have had a more important influence on their outcomes than the treatment itself.3 That is, Louis questioned the cause-and-effect relationship between bloodletting and increased survival, whether or not he articulated it as such.
The first historic example of the use of comparison control groups in clinical investigation comes from the well-known story of James Lind, an 18th-century British physician (1716-1794) who addressed the issue of scurvy in the British Navy. The value of fresh fruit in treating and preventing scurvy had been suggested by an earlier scholar, but Lind was the first to apply an experimental design to the investigation of this hypothesis. In 1747, he selected 12 patients/seamen on board a navy vessel, as he said, "as similar as I could have them," and then assigned 2 each to various treatments, one of which was to eat two oranges and one lemon per day (others were given cider or seawater). He found, of course, that the 2 who received the citrus fruits recovered the best, with those taking cider recovering next best. Although not a randomized design (he stated that "two of the worst" received the course of seawater), Lind at least attempted to start with a homogeneous group, reflecting his intent to reduce the effect of confounding. Although there was no formally declared untreated group, and each treatment group was quite small, the systematic, prospective construction of comparison groups was new to medical science.4
Florence Nightingale (1820-1910) collected data on the mortality experience of soldiers injured in the Crimean War (1853-1856). Her presentation of statistics on the vast improvement in patient outcomes following the introduction of hygienic practices into the field hospital led to widespread reforms in military medicine. Nightingale found, based on careful record keeping and comparisons to civilian populations, that infection among soldiers led to a doubling of expected mortality; this required development of new statistical methods. From this experience, Nightingale collaborated with other scientists to develop a systematic method for collecting data on disease and mortality in hospitals. The key data elements collected in this system were the counting of all patients entering and leaving the hospital, and the mean duration of stay, thus providing denominators for the reporting of true rates of morbidity and mortality.5
After scientists of the 19th century (such as Nightingale) developed their work in vital statistics, the growth of statistical theory, including ideas about randomization, flowered in the first half of the 20th century.6 Ronald Fisher, an agricultural scientist, pioneered the theory and use of randomization in experiments. Fisher asserted that a "properly designed" experiment is one about which one can say that "Chance would so rarely cause such a large difference in outcome that I shall attribute the observed difference to the treatments," and that the only two possible explanations are chance and the treatments, that is, not bias or confounding. Another key feature of the randomized design is to vary the essential conditions only one at a time. In summary, the two main principles of the experimental method are numerical balance (equal numbers of subjects in the test and control group) and randomization of all the factors that are not being tested.7
In the area of observational medical science, or epidemiology, Austin Bradford Hill developed a set of criteria to evaluate cause-and-effect relationships in disease. Around the same time, Bradford Hill launched the first randomized clinical trial, investigating the efficacy of streptomycin in treating tuberculosis, which introduced the only clinical study design able to assess the question of cause and effect directly (more on this design follows). The introduction of the randomized trial to clinical cancer research followed a few years thereafter.
The design of clinical trials evolved in the 1950s and 1960s through the many trials that were initiated as a result of demand from the pharmaceutical industry, which wished to introduce to the market new drugs that met the standards of rigorous clinical testing. Despite struggles by research clinicians against rigorous randomized clinical trial designs (typically in the interest of providing all patients the opportunity of palliation or preventing disease progression), some important trials proceeded. In the 1970s, the U.S. 6th Circuit Court of Appeals granted randomized trials status as a standard of evidence toward the U.S. Food and Drug Administration's regulatory authority over the pharmaceutical industry.8 Trials of chemotherapeutic agents and analgesics often gave disappointing results; however, a landmark randomized trial published by Bernard Fisher and colleagues in the National Surgical Adjuvant Breast and Bowel Project9 showed segmental mastectomy with axillary node dissection for breast cancer to be as effective as total mastectomy with respect to long-term survival, and it produced a demonstrable breakthrough in breast cancer care. At the same time, the results overturned centuries-old assumptions about the biology of breast cancer and how it spreads.
Evidence-Based Medicine as a Tool for Clinical Decision Making
Dr. A, a first-year oncology resident, sees patient X, a 47-year-old woman referred to the service with a 2-cm ductal carcinoma in situ found on a screening mammogram. How does Dr. A approach the management of patient X?
The classic approach Dr. A might take is to consult someone who has treated similar patients before. She can also call on her knowledge from coursework. Finally, she can consult the literature, which might well be a daunting task in itself. The primary literature consists of scores of thousands of original research articles; the secondary literature consists of thousands of review articles.
The first stage of evidence-based decision making is to look closely at the information available. When a clinical scenario is written down, the practitioner can scan it with regard to a set of central clinical issues to identify gaps in knowledge that need to be filled with additional clinical information, or by turning to the literature, or both. These issues are (1) clinical findings, (2) etiology, (3) clinical manifestations of disease, (4) differential diagnosis, (5) diagnostic tests, (6) prognosis, (7) therapy, (8) prevention, (9) patient experience and meaning, and (10) practitioner self-improvement.2 Then, the questions can be formulated, and the questions should comprise four components: description of the patient or target disorder of interest, intervention, comparison intervention (relevant for therapy questions), and outcome.10 Using the previous example, the clinician might ask, "For a 47-year-old woman with ductal carcinoma in situ, what is the likelihood that lumpectomy followed by radiation, compared to lumpectomy alone, will prevent recurrence?"
Central to the idea of evidence-based medicine is the idea that there is a hierarchy of quality of evidence that is related to the design and conduct of the study or studies from which it arose. It should be kept in mind as well that different study designs on the same topic often answer rather different questions from one another. The hierarchy of study designs is illustrated in the pyramid in Figure 1.1, which also reflects the relative numbers of studies in each category.
As described previously, P.C.A. Louis' writings were well ahead of his time in suggesting the possibility of alternative explanations for his findings, that is, the concept of confounding. A confounder is a factor that tends to co-occur with the predictive (presumed causal) factor under study and that also tends to co-occur with the outcome. Louis, for example, found that bloodletting later in the disease process was associated with longer survival. To our knowledge, Louis did not make note of his patients' diet. It is possible that those who consumed more calories would have on the one hand received the bloodletting intervention later in their disease course, because they looked healthier to begin with, and on the other hand would have survived their disease better because in fact they were healthier. Diet would thus have been a confounder of the apparent association between timing of bloodletting and improved survival.
Unmeasured confounding is a central reason for distortion of study results, although it arises in different ways, which we discuss as they appear throughout this chapter. One of those is selection bias: in the case of Louis' work, patients may have been selected for late treatment as opposed to the early-treatment group for reasons other than random chance, thus leaving open the possibility that factors other than the treatment (that is, the confounders) influenced the outcome.
Random allocation to a treatment or control group is the basis of all experimental design, and it is the only way to isolate the effect of a single factor under study on a given outcome and thus avoid the distorting effects of confounding. Even though potential confounding variables still exist among study subjects, randomization is designed to distribute them evenly between the test and control groups, thus removing their effects. The power of the randomized design is that it should provide equal balance not only of known confounding factors but also of unknown potential confounders. We discuss a few refinements to the randomized study design in the remainder of this chapter, but an entire chapter of this book is dedicated to randomized trials in cancer as well (see Chapter 8).
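The balancing property of randomization described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a trial design: the function name, the cohort size, and the 30% prevalence of the confounder are all hypothetical values chosen for illustration.

```python
import random

def simulate_allocation(n_subjects=10_000, confounder_prevalence=0.3, seed=0):
    """Randomly allocate subjects to two arms; return confounder prevalence per arm.

    All parameter values are illustrative assumptions, not data from the text.
    """
    rng = random.Random(seed)
    # Each subject either carries the (possibly unmeasured) confounder or does not.
    has_confounder = [rng.random() < confounder_prevalence for _ in range(n_subjects)]
    # Random allocation: shuffle subject indices, then split them in half.
    order = list(range(n_subjects))
    rng.shuffle(order)
    half = n_subjects // 2
    test_arm, control_arm = order[:half], order[half:]
    prev_test = sum(has_confounder[i] for i in test_arm) / len(test_arm)
    prev_control = sum(has_confounder[i] for i in control_arm) / len(control_arm)
    return prev_test, prev_control

prev_test, prev_control = simulate_allocation()
print(f"confounder prevalence: test arm {prev_test:.3f}, control arm {prev_control:.3f}")
```

Because allocation never looks at the confounder, the two arms should end up with nearly the same prevalence of it. The balance is approximate and improves with sample size; in small trials, chance imbalances remain possible.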
Another central (and related) tenet of scientific inquiry is the idea of a comparison or control group. In clinical research, in the absence of a control group similar in every way to the test group receiving an investigational intervention, it is impossible to discern how many subjects benefited from treatment as opposed to improving on their own.
An important factor to consider when evaluating oncology research, particularly studies of cancer therapy, is the choice of endpoints. Endpoints include health outcomes (total mortality, cause-specific mortality, quality of life) or indirect surrogates for any of these. Examples of surrogate endpoints are disease-free survival, progression-free survival, or tumor response rate. Studies of surrogate endpoints represent weaker, more indirect, evidence; however, a clinician may weigh studies differently depending on patient values.
The bulk of our understanding of risk factors and preventive factors for cancer comes from observational studies (that is, those described here). Recent research has moved toward testing hypotheses generated by observational studies in the context of clinical trials, sometimes with unexpected results, such as a study of prevention of cancer by beta-carotene in smokers that found that beta-carotene increased lung cancer incidence compared with placebo.11 This finding stood in contrast to results from nonrandomized observational studies, which had suggested a benefit.
A cohort study identifies a group of subjects on the basis of their naturally occurring exposure to an agent or agents of interest and follows them in time to observe their experience (incidence) of disease. Data can be collected from the present into the future (prospective cohort study) or use historic data, such as records of occupational exposure, to look from the present into the past (retrospective cohort study). For cancer, the cohort study design tends to be inefficient: because cancer is a relatively rare outcome, rather large initial populations must be followed over time to observe statistically meaningful results. Most prospective cohorts that study cancer were constructed to study other diseases, such as heart disease (for example, the Framingham Heart Study), or a range of diseases (for example, the Nurses' Health Study). The benefit of a prospective cohort design is that exposures are evaluated in individuals before their diagnosis of disease; thus, the disease cannot distort the measurement of exposure as in cross-sectional or retrospective study designs. Often cohort studies collect baseline exposure information in great detail that is highly useful (although imperfect) in controlling for confounding.
A drawback of prospective cohort studies is the attrition of participation over time, or loss to follow-up. Cohort studies often involve the repeated filling out of lengthy, detailed questionnaires on diet and other lifestyle factors, clinical examinations, and/or telephone interviews. Subjects remaining in such studies tend to be healthy relative to those who drop out, or the most motivated by health concerns, and may differ from those who drop out in other respects that are difficult to measure, which can result in confounding. Other reasons for attrition are illness, changing residence, and other changes in life circumstances that may be associated with unmeasured characteristics which differ between those who remain in a cohort study and those lost to follow-up and that are associated with disease risk. Such differences can distort the results or make study results less representative of the original target population.
In addition, cohort studies can be extremely expensive. Designing and implementing a cohort study that provides adequate richness of data and study outcomes and avoids issues of loss to follow-up is resource intensive, as are maintenance and analysis of the data, keeping track of the details of protocols, and many other administrative tasks.
A nested case-control study selects, from subjects in a cohort study, subjects who have the disease of interest (called case subjects) and a sample of subjects who are disease free at the time of sampling as controls. Similar to a conventional case-control study, this study design is efficient for studying rare diseases. It is a useful design if it is too expensive to measure a risk factor for every subject within a cohort study. It also shares with cohort studies the advantage of having subjects selected at baseline from the same population, so that case and control subjects chosen later are likely to be more comparable than those in a conventional case-control study.
If the exposure of interest is not measured before subjects in the nested case-control study are selected, its measurement will be essentially retrospective (for example, if a detailed dietary frequency questionnaire is administered after selection), with the same limitations as a conventional case-control study. However, this design is useful for studying biologic markers of exposure that are measurable in blood or other tissue samples and that remain stable in storage, especially freezing. Often blood samples are taken from all subjects of a cohort at baseline and frozen for later analysis. As an example, the Kaiser Permanente Health Maintenance Organization collected and froze serum samples for all subjects on enrollment. Their record-keeping system provided longitudinal data on patients, including data on disease incidence. Researchers who wished to investigate the association between exposure to the pesticide DDT (dichlorodiphenyltrichloroethane) and breast cancer first selected, from the Kaiser cohort, breast cancer case subjects and a sample of control subjects who were disease free when the case was diagnosed. They then retrieved frozen serum samples that, for the case subjects, were taken at least several years before diagnosis with breast cancer, and measured the concentration of DDT in the samples to compare the concentrations in case and control subjects.12
Many life-threatening diseases studied by epidemiologists are relatively rare. Consider, for example, the likelihood of being diagnosed with breast cancer in a given year, compared with that of coming down with a cold. If you only had the resources to study 500 women aged 50, with average risk factors, over a 5-year period, only 5 or 10 of them would be expected to be diagnosed with breast cancer during that time, whereas a large majority of them are likely to come down with a cold at least once in 5 years. The number of breast cancer cases is simply not adequate to allow valid comparisons among different hypothesized risk factors between women who are diagnosed with breast cancer and those who are not. On the other hand, if you set out to identify 250 women newly diagnosed with breast cancer (case subjects) in a given population, and simultaneously identified a suitable comparison group of 250 women (control subjects), you could substantially improve the statistical power (also called efficiency) of the study.
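The arithmetic behind the "5 or 10" cases can be sketched as follows. The annual risks of 0.2% and 0.4% are hypothetical values chosen only so that the result spans the range quoted above; they are not incidence figures from the text, and the calculation assumes a constant annual risk with no loss to follow-up.

```python
def expected_cases(n_subjects, annual_risk, years):
    """Expected number of subjects diagnosed at least once during follow-up.

    Assumes a constant annual risk and complete follow-up (illustrative only).
    """
    cumulative_risk = 1 - (1 - annual_risk) ** years
    return n_subjects * cumulative_risk

# Hypothetical annual risks (assumptions, not figures from the text).
low = expected_cases(500, 0.002, 5)
high = expected_cases(500, 0.004, 5)
print(f"expected cases among 500 women over 5 years: {low:.1f} to {high:.1f}")
```

Under these assumptions a cohort of 500 yields roughly 5 to 10 cases, whereas a case-control design that starts from 250 identified cases has 25 to 50 times as many cases available for comparison.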
A study in which subjects are identified on the basis of their disease status has what is known as a case-control design. Once you identify the subjects, you can then interview them about hypothesized disease risk factors such as diet, pharmaceuticals, sun exposure, pregnancy history, and so on. This is the bread-and-butter design of the bulk of cancer epidemiology. However, it is particularly prone to sources of bias and confounding that randomized controlled trials and cohort studies are not.
The most important downside of case-control studies is potential bias from errors in recalling and reporting risk factors. For example, many people tend to underestimate or underreport their alcohol consumption. If people with and without a disease under study underestimate their consumption to a similar degree, then a true association between alcohol consumption and the disease would be more difficult to observe. In contrast, if someone diagnosed with a disease believes that his or her past alcohol consumption may have played a role in the disease, that consumption could either be further underreported or overreported relative to the true exposure. In any case, it is quite possible that the recall is different from that of someone without the disease. Just as it is difficult to predict which way someone is likely to misreport exposure, it is by extension difficult to predict the effect of such misreporting on estimates of disease association or risk.
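A worked example may make the two reporting scenarios concrete. The sketch below simplifies alcohol consumption to a yes/no exposure and uses made-up counts and reporting probabilities; the function names and all numbers are assumptions for illustration only.

```python
def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio from a 2x2 table of exposure by case/control status."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)

def reported_counts(true_exp, true_unexp, sensitivity):
    """Counts after a fraction (1 - sensitivity) of exposed subjects report no exposure."""
    reported_exp = true_exp * sensitivity
    reported_unexp = true_unexp + true_exp * (1 - sensitivity)
    return reported_exp, reported_unexp

# Hypothetical true counts (exposed, unexposed).
cases = (60, 40)
controls = (40, 60)
true_or = odds_ratio(*cases, *controls)

# Similar underreporting in both groups: 70% of exposed subjects report exposure.
similar_or = odds_ratio(*reported_counts(*cases, 0.7), *reported_counts(*controls, 0.7))

# Different recall: cases report more completely (90%) than controls (70%).
differential_or = odds_ratio(*reported_counts(*cases, 0.9), *reported_counts(*controls, 0.7))

print(f"true OR {true_or:.2f}, similar underreporting {similar_or:.2f}, "
      f"differential recall {differential_or:.2f}")
```

With these numbers, equal underreporting pulls the estimate toward 1 (no association), while more complete recall among cases pushes it above the true value. If cases instead underreported more than controls, the bias would run the other way, which is why the direction is hard to predict in practice.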
Another drawback of case-control studies is the fact that exposures reported to occur at a given point in time might not represent the exposures that actually cause the disease. This consideration is important in diseases such as cancer that have a long latency period, that is, the time between a causal exposure and the diagnosis of disease. Conceivably, cancer itself could alter dietary or lifestyle patterns in the period before diagnosis, thus reversing the cause-and-effect sequence. In addition, if a marker of a hypothesized disease risk factor is measured in a body tissue, such as the concentration of a pesticide in blood, it is possible that the measurement could be affected by the disease, resulting in a spurious association or masking a true association.
Selection of an appropriate control group is particularly important, but also particularly difficult, for case-control studies. Bias, used here to mean systematic error in an estimate, can arise if case subjects and control subjects arise from populations with different underlying baseline characteristics. The more the case and control populations differ from one another, the more difficult it is to ensure that observed differences in risk factors are not due to extraneous, unmeasured factors, or confounders, that are associated with the factors under study and with the disease. For example, a study might find that people with lung cancer are more likely to drink alcohol than a group of control subjects similar in age. Rather than assuming that this finding indicates that alcohol is a risk factor for lung cancer, it is prudent to consider whether alcohol consumption is related to smoking, an established cause of lung cancer.
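The alcohol and lung cancer example can be sketched numerically. The counts below are invented so that alcohol has no association with lung cancer within either smoking stratum; the Mantel-Haenszel estimator is used here as one standard way to combine strata, not as a method named in the text.

```python
def odds_ratio(a, b, c, d):
    """a: exposed cases, b: unexposed cases, c: exposed controls, d: unexposed controls."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across a list of (a, b, c, d) strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical counts: (drinker cases, nondrinker cases, drinker controls, nondrinker controls).
smokers = (80, 20, 40, 10)
nonsmokers = (10, 40, 30, 120)

# Ignoring smoking means collapsing the two strata into one crude table.
crude = tuple(s + n for s, n in zip(smokers, nonsmokers))

print(f"crude OR: {odds_ratio(*crude):.2f}")
print(f"OR among smokers: {odds_ratio(*smokers):.2f}")
print(f"OR among nonsmokers: {odds_ratio(*nonsmokers):.2f}")
print(f"smoking-adjusted (Mantel-Haenszel) OR: {mantel_haenszel_or([smokers, nonsmokers]):.2f}")
```

In this constructed data set the crude odds ratio is well above 1 even though the odds ratio is exactly 1 within each stratum, because drinkers are more likely to smoke and smokers are more likely to be cases. Adjustment of this kind is only possible when the confounder has been measured.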
In summary, case-control studies are often more convenient to assemble than the preceding study designs when a rare disease or outcome is being studied, and they are less expensive than cohort (follow-up) or experimental studies. However, because exposure is assessed retrospectively, errors in recall are often a problem. In addition, it is sometimes difficult to generalize the results or to avoid bias from confounding because of the ways in which patient and control groups are selected. For these reasons, case-control studies tend to provide a lower level of evidence than cohort studies or experimental studies.
A cross-sectional study estimates the prevalence of disease (the proportion of a population that has the disease) and of possible disease risk factors in a given population at one point in time. Such studies are most usefully conducted by random sampling, which helps ensure that their results are representative of the larger population of interest. A special case of the cross-sectional design, which is, technically, a survey follow-up study because it is repeated on a regular basis, can be found in the Behavioral Risk Factor Surveillance System (http://www.cdc.gov/brfss/), an annual national survey that selects a representative sample of U.S. residents and interviews them about such behavioral factors as exercise, human immunodeficiency virus (HIV) awareness, drug use, and smoking. Statistics describing the prevalence of these factors can then be compared from year to year, and a new set of subjects is sampled each year.
Similar to cohort studies, subjects in cross-sectional studies are not selected for study on the basis of their disease status. Similar to case-control studies, however, measurement of risk factors is either nondirectional or retrospective, and the presence of risk factors cannot be shown to precede disease; that is, the temporality requirement for declaration of cause and effect is lost. Therefore, although cross-sectional studies are sometimes used to evaluate associations between risk factors and disease and to generate hypotheses, their ability to support evidence of causation is more limited than that of other observational studies.
In studies of ecologic design, the number of people with an exposure is known, as is the number of people with a disease or outcome (mortality, for example), but the number of people with both the exposure and the disease is not known. In general, relevant information on individuals in the population is unknown. A study that is at least partly ecologic in design may be the only feasible option in the case of an environmental exposure experienced by an entire population. A well-known example of an ecologic study stemmed from the observation that the number of deaths in London increased sharply relative to average death rates during a period of particularly heavy smog and was closely proportional to the ambient temperature during that period.
Generally, however, the level of evidence provided by ecologic studies is considered quite weak, primarily because of the studies' inability to correct for other variables, that is, confounders, at the individual or aggregate level that could explain the observed associations. Indeed, such confounding remains a possibility in the London smog example, in which the agent that caused the excess deaths cannot be known with certainty. A commonly encountered example of ecologic data is the observation that rates of certain kinds of cancer, especially breast cancer, are high in countries with high consumption of dietary fat and that cancer rates are low in those countries reporting low dietary fat consumption. Such observations are useful in generating hypotheses for further study; however, as in the example of dietary fat and breast cancer, epidemiologic studies based on individual-level data with the ability to adjust for confounding factors often show little or no association between dietary fat and breast cancer.
Reports of individual cases and case series represent the earliest known method of accumulating medical knowledge on most diseases. Although their importance is lower today given the availability of controlled studies, particularly clinical trials, they remain a popular mode of publication by clinicians of their investigations and observations. As evidence, however, case series and reports pose a number of problems and should be interpreted with substantial caution.
Many case series are collected retrospectively from medical records, and recording of information may be selective and subject to incompleteness or other forms of error compared to information collected according to a predefined plan. Selection bias can occur when the series is not representative of the general population, in particular when subjects with similar prognosis are selectively lost to follow-up.
Case series based on medical records are also likely to lack adequate (if any) information on confounders. Finally, the decision as to which data to report may be selective, particularly if eligibility criteria are not established in advance. For example, striking results may lead to report of a case or a series of cases, distorting the sense of what would be expected in general. Unfortunately, for some exceedingly rare diseases, clinical knowledge rests on case series and case reports for lack of sufficient numbers to support more robust study designs (J. Lau, unpublished observation).
Laboratory studies using immortalized cell lines, whole tumors, or some other system below the level of the organism are important in basic oncology, but their purpose is to isolate small subsets of the complex tumor biology machinery to elucidate mechanisms (Table 1.2). Rarely should they be taken in isolation as evidence for or against a given treatment strategy. They do represent a level of control that may never be attainable or ethical in whole humans; on the other hand, it is their very lack of organismal context that makes them unreliable to extrapolate to humans. Despite a tendency of some researchers and the media to tout breakthroughs in biomedical research on the basis of laboratory studies, they should be seen by practicing clinicians for their intent: mechanistic, preliminary, and hypothesis generating in relation to medical practice.
The toxic or beneficial effects of drugs, environmental agents, and foods are typically evaluated using laboratory rodents or other small mammals, according to stringent experimental and statistical analytic protocols. These protocols allow statistically efficient estimates of beneficial, safe, or toxic doses of chemicals in genetically homogeneous animals. Laboratory animals may also be used for mechanistic studies, for example, using gene knockout models. It is important to be able to test chemicals with uncertain safety on nonhumans. However, because mice and rats are not humans, assumptions must be made regarding the extrapolation of results to humans and again should not be used by the clinician in isolation for clinical decision making.
"Until modern times, 'facts' were deduced by arguments from premises approved by tradition and authority, without appeal to experimental validation. Even when observation ran counter to 'facts,' it was still believed that in some mysterious way authority must still be correct, particularly at a time in history when the fabric of society was such as to frown upon the challenge of authority. The modern therapeutic trial offers an alternative by relying upon impartial observance without regard for authoritarianism. Such an approach provides the foundation of scientific medicine."
In other words, the evidence of expert opinion is only as strong as the empiric evidence from which it is derived. As Albert Einstein pointed out: "Propositions arrived at purely by logical means are completely empty as regards reality. Because Galileo saw this, and particularly because he drummed it into the scientific world, he is the father of modern physics—indeed of modern science altogether." (From Ideas and Opinions, Modern Library, 1994)