The Psychometrics and Science of the Standardized
Field Sobriety Tests
Steve Rubenzer, Ph.D.,
Clinical and Forensic Psychologist
NHTSA Trained in
Administration of the SFSTs
Article Provided Here with express permission
of Steve Rubenzer, Ph.D
The National Highway Traffic Safety Administration
(NHTSA) standardized field sobriety tests (SFSTs) came under intense
scrutiny by the defense community when they went into widespread use
in the 1980’s. At that time, the scientific literature to support their
use was limited to two NHTSA-sponsored laboratory studies[1]
and two very modest field studies.[2]
Both the NHTSA researchers and critics pointed out that the tests had
not proven themselves in the field and that studies done under roadside
conditions were badly needed. Several groups of critics trenchantly
derided the SFSTs and their supporting empirical base and detailed other
significant problems.[3]
In the past seven years, three large-scale field studies have been conducted
that potentially address some of the problems noted earlier. Indeed,
Marcelline Burns, a primary researcher in the development of the SFSTs,
has stated the initial laboratory studies have limited relevance to
understanding the use and accuracy of the SFSTs twenty-five years later
in field settings.[4]
Have the subsequent Colorado, Florida, and San Diego SFST field studies
rectified the earlier problems? What about research by other researchers
or agencies? This paper will review the NHTSA SFST field studies and
related works, appraise their impact on the research base for the SFSTs,
and review the SFSTs’ standing as psychological tests in light of current
standards.
The NHTSA SFST Field Studies
The original NHTSA
laboratory studies examined field sobriety tests as applied to volunteers
in indoor, well-lighted conditions. For Horizontal Gaze Nystagmus (HGN),
examiners had the benefit of equipment to stabilize the subject’s head
and a protractor for measuring the angle of onset of nystagmus. Could
officers obtain usable or valid results under traffic stop conditions?
The field studies were designed to address this question.
The first such study was completed in 1981, but encountered
such poor cooperation from participating officers that the data were
deemed unsuitable for analysis.[5]
Presumably because of this initial negative experience,
subsequent field testing locations were chosen largely based on the
cooperation and support of the administration and officers that would
carry out the testing (“…only agencies that could assume an extremely
high level of cooperation and commitment would be recommended for participation.”[6]).
The officers that would perform SFSTs in the new generation of studies
were not reluctant draftees, but volunteers,[7]
SFST instructors,[8]
or exhibited “genuine interest in the study and eagerness to
be selected.”[9]
The three major NHTSA field studies consist of investigations
carried out in Colorado, Florida and San Diego in 1995, 1997 and 1998,
respectively.[10]
The designs of the studies were highly similar, so they will be discussed
together. In each, actual traffic stops using the SFSTs were investigated.
Police officers were recruited to participate in the study from agencies
that supported the research efforts. Officers had previous training
and experience in the SFSTs (in the Florida study, all 16 were SFST
instructors) and received “refresher” training before beginning data
collection. In the Colorado and Florida studies, observers from the
study (either researchers or participating police officers) monitored
about half of the stops to ensure they observed the study protocols
(no use of portable breath tests [PBTs] until after the SFSTs were given
and scored) and that the SFSTs were administered correctly. In the Colorado
and Florida studies, researchers obtained PBTs on the majority of drivers
who were tested but released. This allowed an estimate of false negatives—failures
to make an arrest when warranted. The studies investigated
the SFSTs’ performance at BAC levels of .05% (Colorado) and .08% (Florida and San Diego). All three studies
reported that 90% or more of arrest decisions based on the SFSTs were correct,
with two of the three (Colorado and Florida) reporting higher levels of false negatives (erroneous
releases).
Table 1
Percent of Decisions Correct in the Three NHTSA SFST Field Studies

| Study (BAC criterion) | Arrest Decisions | Release Decisions | Total Decisions |
|---|---|---|---|
| Colorado (.05%) | 93 | 64 | 86 |
| Florida (.08%) | 95 | 82 | 93 |
| San Diego (.08%) | 90 | 94 | 91 |

In all three studies,
the proportion of drivers arrested to those tested was quite high—well
over 50%. Mean BAC levels of those arrested were .138% (San Diego), .150%
(Florida), and .152% (Colorado). In the Colorado study, HGN was scored
differently than in all other studies, as scores for left and right
eyes were not distinguished, and the scores ranged only from 0-3.
There is no indication of what instructions were given for the WAT
and the OLS in the Colorado and San Diego studies, while the
instructions used in the Florida study differ substantially from the
2000 NHTSA student manual. The Colorado study reported that only 13
errors of administration and 6 errors in instructions were observed
in 305 SFST administrations (only 41% were observed). No errors
were observed in the 313 SFST batteries given in the Florida study,
although only two-thirds of the administrations were monitored.
The NHTSA Student Manual,[11]
the official SFST training guide for police officers, provides cutoff
scores for each test to optimally classify a person as above or below
.10%. It appears that the NHTSA-suggested decision rules for the SFSTs
were not used in the Colorado and Florida studies—officers had
access to test scores but used their own best judgment as the final
criterion for arrest. Officers’ failures to follow the recommended SFST-decision
rules were cited as a significant problem in the San Diego study. In
the Colorado study, incorrect arrest decisions were attributed to officers
focusing on poor Walk and Turn (WAT) and One-Leg Stand (OLS) performance
when the suspects’ HGN performance was normal.
Other Studies
Several investigations besides the three NHTSA field studies examined
the performance of SFSTs in detecting BAC levels. Two optometrists analyzed
the results from 2429 administrations of the HGN conducted during normal
traffic stops in Ohio.[12]
They reported results, in the form of a table, that suggest high levels
of accuracy (92%) for HGN—the other SFSTs were not examined. However,
all of the suspects were arrested (even those that passed the HGN),
and 92% of them had a BAC of above the .10% standard used in Ohio. In
other words, the officers would be right 92% of the time by arresting
everybody (which they did) or by randomly arresting suspected drunk
drivers: the test added nothing.[13]
The authors report very few details of the data collection, there were
no observers present, and there is no indication whether PBTs were used.
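The base-rate argument can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the two 92% figures are the ones quoted above for the Ohio data, treated here as exact.

```python
def best_no_test_accuracy(base_rate: float) -> float:
    """Accuracy of the better trivial rule: arrest everyone or release everyone."""
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be between 0 and 1")
    return max(base_rate, 1.0 - base_rate)


def gain_over_base_rate(test_accuracy: float, base_rate: float) -> float:
    """How much a test's accuracy exceeds what could be achieved without it."""
    return test_accuracy - best_no_test_accuracy(base_rate)


# Figures quoted in the text for the Ohio HGN data (assumed exact here).
ohio_base_rate = 0.92      # share of arrested suspects above the .10% BAC standard
ohio_hgn_accuracy = 0.92   # reported accuracy of HGN

print(f"Accuracy with no test: {best_no_test_accuracy(ohio_base_rate):.2f}")
print(f"Gain from HGN: {gain_over_base_rate(ohio_hgn_accuracy, ohio_base_rate):.2f}")
```

A test earns its keep only when the gain is clearly positive, which is why accuracy figures should always be read against the proportion of suspects who were actually over the limit.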
The only NHTSA-sponsored sobriety test studies that
have been published in peer reviewed journals detail the development
of a standardized boating sobriety test[14]
and an investigation of various sobriety tests at detecting BAC at .04%[15].
The marine environment is unique because the motion of a watercraft
makes the WAT and OLS unsuitable for on-the-spot testing. As in the 1981
SFST study, both laboratory and field observations were made. HGN and
three other tests were identified as most promising based on their correlation
with BAC.
In the field portion of the boating study, the Maryland
Department of Natural Resources Police administered the four SBST candidate
measures. Officers all had been certified on the SFSTs, were described
as “highly experienced” regarding DUI/BUI, and were given an additional
day and a half of training before beginning the study. Officers were
instructed not to obtain PBT readings until after recording the SBST
results, but no observers monitored this or administration and scoring
procedures. HGN was found to be the best individual test, correlating .77
with BAC in the field stops. Using HGN scores alone resulted in 100% correct
classification of BAC above .10% and 90% correct classification below .10%.
Two tests used in the field battery, “saying the alphabet” and “hand-pat,”
showed respectable correlations with BAC but did not improve upon decisions
based on HGN alone. The authors nonetheless recommended the full battery
because the latter tests provide some measure of performance impairment
(vs. BAC level), whereas HGN does not.[16]
A very recent investigation[17]
found that only HGN was effective at distinguishing persons above or
below a BAC of .04%, a standard sometimes applied to drivers of commercial
vehicles and, in some states, to drivers younger than 21. Both laboratory
and simulated field conditions were investigated, and several variations
of HGN and the OLS were tried. The variations did not matter much, but
the optimum cut-score for HGN was two clues rather than four. Even so,
the observed accuracy level obtained was lower than for higher BAC levels:
79% of those above .04% were correctly identified, while 38% of those
below .04% were wrongly classified.
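What these two percentages mean for an individual driver depends on how common BACs above .04% are among the people tested. The following is a minimal sketch, assuming the 79% and 38% figures above are the test's hit rate and false-alarm rate; the base rates in the loop are hypothetical and are not taken from the study.

```python
def positive_predictive_value(sensitivity: float,
                              false_positive_rate: float,
                              base_rate: float) -> float:
    """Probability that a person who fails the test is really above the limit."""
    true_positives = sensitivity * base_rate
    false_positives = false_positive_rate * (1.0 - base_rate)
    if true_positives + false_positives == 0:
        raise ValueError("nobody fails the test under these inputs")
    return true_positives / (true_positives + false_positives)


def overall_accuracy(sensitivity: float,
                     false_positive_rate: float,
                     base_rate: float) -> float:
    """Share of all classifications (above or below the limit) that are correct."""
    return sensitivity * base_rate + (1.0 - false_positive_rate) * (1.0 - base_rate)


SENSITIVITY = 0.79          # from the text: above .04% correctly identified
FALSE_POSITIVE_RATE = 0.38  # from the text: below .04% wrongly classified

for base_rate in (0.25, 0.50, 0.75):  # hypothetical shares of drivers above .04%
    ppv = positive_predictive_value(SENSITIVITY, FALSE_POSITIVE_RATE, base_rate)
    acc = overall_accuracy(SENSITIVITY, FALSE_POSITIVE_RATE, base_rate)
    print(f"base rate {base_rate:.2f}: PPV {ppv:.2f}, overall accuracy {acc:.2f}")
```

The lower the proportion of tested drivers who are truly above the limit, the larger the share of test failures that are false positives.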
Critique of the SFST Field Studies
A scientific study should evaluate the effect of
a variable, or a test, controlling for the effects of extraneous variables
as much as possible.[18]
In the case of the SFSTs, a rigorous test of their validity would be
to examine the correct classification rate (i.e., BAC > .08%) using
only information from the test(s)—not from the suspect’s driving performance,
demeanor, smell, previous arrest record, etc. Accomplishing this level
of control would probably require videotaping only the relevant (officially
scored) aspects of SFST performance. The test performance would be scored
by officers who had no other information regarding the suspects and
no opportunity to observe, smell, or talk to them. A rigorous study
of HGN, probably only feasible in a laboratory study, would involve
partial masking of the eyes, so eye redness, glassiness, and eyelid
droop could not be observed.
Ideally, subjects in an experiment are randomly assigned
to a treatment or experimental group. In this way, differences between
the groups are minimized. The original NHTSA laboratory studies assigned
subjects to a target BAC group based on their drinking history. In the
field studies, there were no experimentally created groups—just drivers
stopped for one reason or another. Therefore, the NHTSA field studies
are quasi-experiments, not experiments.[19] All
the officers employed the SFSTs and no control group was used. A control
group is considered a near-essential feature of a rigorous study because
it duplicates all the relevant factors that might account for the results
in the experimental group except for the variable under study. In the
case of the SFSTs, adjacent jurisdictions might be compared—one department
using the SFSTs and another not. Or some members of the department might
be trained in the SFSTs, others given other DWI-detection training.
Without such a control group, the results observed are ambiguous. Is
90-95% a better accuracy rate than without the SFSTs?[20]
Was the high accuracy rate due to the quality of the officers? Their
sensitization to DWI detection because of their recent training? The
fact that they were observed by researchers and supervisors?
Significant defects of the SFST field studies as
rigorous scientific studies can be summarized in the following five
points:
1.
The field studies validated the arrest decisions of the officers in
the studies, not the SFSTs.
Because officers had access to driver behavior and demeanor, the field
studies did not specifically test the accuracy of the SFSTs as stand-alone
tests. They were not conducted “blind,” much less double blind. As stated
in the Colorado study, “Some of the information underlying an officer’s
decision is not documented and cannot be examined.”[21]
In the San Diego and the boating studies, officers may have also had
use of PBTs, which would contaminate the test with the criterion—a fatal
flaw. Even in the other two studies, large proportions of the stops
were unobserved, so officers could have used PBTs before scoring the
SFSTs. In sum, the officers’ judgments of intoxication and arrest decisions
were not solely due to the SFSTs, and cannot provide solid evidence
for SFST validity.
2.
The police officers and the degree of supervision in the field studies
were not representative of typical DWI stops.
In each study, participating officers were highly
motivated, highly experienced volunteers. In two studies, they were
monitored by either civilian research observers or their colleagues.
It is well known that people who are watched tend to perform better—in
social psychology this is known as the Hawthorne Effect. Supervision
likely made officers more attuned to accurate administration and recording
than an officer working on his own would be. The very low rate of administration
errors reported for the Colorado and Florida studies attests to this,
and contrasts greatly with the experience of many DUI attorneys.[22]
3.
The studies are insufficiently documented
for scientific papers, a point made in U.S. v. Horn.[23] For
example, two of the SFST studies do not specify the instructions used
to administer the tests (the instructions have changed considerably
since the initial 1977 study). None of the studies examined the combination
of HGN and WAT that is referenced in the NHTSA manuals, or examined
interrater reliability (how well different observers agreed on scoring
or arrest decisions) or internal reliability (how well the different
scoring clues agreed). There is no discussion of the weaknesses or limitations
of the studies, as is customary in the discussion section of a published
paper. Instead, the Florida study ends with an astonishingly strong
conclusion: “There appears to be little basis for continuing legal challenge
(to the SFSTs).”[24]
4.
The authors did not report the accuracy
of arrest decisions for stops that were observed vs. those that were
not, or for SFSTs performed under adverse climate conditions vs. those
that were not. This is surprising, since the latter issue was
one of the primary goals of the Colorado study.
5.
None of the SFST field studies have been
published in peer-reviewed scientific journals. The reports were
submitted to state DOT agencies or simply “written up.” Peer review
exposes the work to the criticism of other researchers and authors who
may not share the same beliefs and purposes, and who have training and
experience in valid experimental design. The scrutiny that this process
brings is crucial to detecting error and bias.
Because of the limitations of the field studies cited
above, it could be argued that the 1981 laboratory study, and a similar
work by non-NHTSA authors,[25]
remain the primary evidence of SFST reliability and validity. Supporting
this claim, NHTSA continues to cite the accuracy figures from the 1981
study in the student manual[26]
rather than much higher figures obtained in the field studies. Although
the laboratory studies were rigorous in some respects, they have several
significant limitations: 1) subjects had no reason to fear detection/arrest,
2) testing was conducted during the day rather than night, when most
DWIs occur, 3) officers were able to observe, talk to, and smell the
subjects, 4) for the NHTSA study, subjects were recruited from the state
employment office and
are not representative of the general population, and no attempt was
made to justify this source as representative of DWI stoppees, and 5)
the same subjects were used to create the cutoff scores for the test
and to evaluate the accuracy of these cutoff scores. This procedure
will lead to inflated estimates of accuracy, because the test decision
rules are tailored to the subjects on which it was calibrated.[27]
The cutoff rules from the first group should be cross-validated on a
new group of subjects. The accuracy level achieved in the second group
will be an unbiased estimate of the accuracy when applied to a new group
of similar subjects, such as DWI suspects, assuming the base rates (frequency)
of intoxicated persons are similar in both groups.
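The inflation that comes from calibrating and evaluating a cutoff on the same subjects can be seen in a small simulation. This is only a sketch under invented assumptions: the score distributions, the sample sizes, and the 0 to 8 clue scale are hypothetical and are not drawn from the NHTSA data.

```python
import random


def simulate_subject(rng: random.Random) -> tuple[int, bool]:
    """Return (clue_score, over_limit) for one hypothetical subject."""
    over_limit = rng.random() < 0.5
    mean_score = 5.0 if over_limit else 3.0  # assumed group means, not real data
    score = max(0, min(8, round(rng.gauss(mean_score, 2.0))))
    return score, over_limit


def accuracy(sample: list[tuple[int, bool]], cutoff: int) -> float:
    """Share of subjects correctly classified by 'score >= cutoff means over the limit'."""
    correct = sum((score >= cutoff) == over for score, over in sample)
    return correct / len(sample)


def best_cutoff(sample: list[tuple[int, bool]]) -> int:
    """Cutoff that maximizes accuracy on the sample it is fitted to."""
    return max(range(0, 10), key=lambda c: accuracy(sample, c))


rng = random.Random(0)
calibration = [simulate_subject(rng) for _ in range(40)]   # small development sample
holdout = [simulate_subject(rng) for _ in range(4000)]     # fresh subjects

cutoff = best_cutoff(calibration)
print(f"chosen cutoff: {cutoff}")
print(f"accuracy on calibration sample: {accuracy(calibration, cutoff):.2f}")
print(f"accuracy on new subjects:       {accuracy(holdout, cutoff):.2f}")
```

The calibration figure is the best any cutoff can do on that particular sample, so it tends to be optimistic; only the figure from new subjects estimates how the rule will perform in use. Any single run may or may not show a large gap, which is why the second sample is needed.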
A Comment on HGN
HGN has repeatedly been found in NHTSA-sponsored studies to be the
best psychophysiological test to estimate BAC.[28]
Conducted by medical or optometry personnel in laboratory conditions
with healthy, rested subjects, there is little doubt that HGN can be
a good indicator of BAC. However, most police officers lack in-depth
training, and estimating a 45-degree angle is a poor substitute for
laboratory apparatus that can measure angles to a tenth of a degree. Data
from the 1981 study indicate that most officers had difficulty accurately
estimating 45 degrees,[29]
which the authors stated “is a critical factor in making accurate decisions
from sobriety test battery performance.”[30]
Officers were deemed proficient if they could estimate an angle within
3 degrees with use of a protractor.[31]
Thus, even when officers are freshly trained and use an apparatus to
assist in their observations, a six-degree range of error is expected.
One of the clues for HGN is onset of nystagmus before 45 degrees of
lateral deviation. If a six-degree spread is acceptable, one officer
may estimate 45 degrees at 42 degrees, another at 48. If the officers
are consistent in their scoring, the first officer will score this clue
much less often than the second will.
Difficulties can arise in several other ways when
interpreting HGN. Are a subject's eye movements smooth pursuit movements
with nystagmus or natural saccadic movements? At least one board certified
ophthalmologist wrote that NHTSA’s recommended “smooth pursuit” administration
(two seconds across each eye) invites saccadic movements because it
requires the eye to move too fast.[32]
The 1981 study authors acknowledged that as many as 50% of people show
some nystagmus at maximum deviation in at least one eye.[33]
In New Hampshire v. Dahood, the court reported “Drs. Citron (an
ophthalmologist) and Rizzo (a neuro-ophthalmologist) were adamant in
their opinion that the distinct nystagmus at maximum deviation clue
should be eliminated from the HGN test.”[34]
Recently, it has been reported that fatigue can induce nystagmus at
maximum deviation in 50% of people, and that nystagmus persists after
BAC levels have fallen to zero.[35]
Lastly, the Maryland court of appeals in Shultz v. State recognized
thirty-five causes of nystagmus in addition to alcohol.[36]
Two recent court opinions have held that HGN does
not meet Daubert[37]
standards to be admissible as direct evidence of intoxication or impairment.
In U.S. v. Horn, the court held HGN is not generally accepted
among psychologists.[38]
In New Hampshire v. Dahood,[39]
the trial court, on remand from the Supreme Court of New Hampshire on
the issue of admissibility, cited an inability to determine error rates
and concluded HGN is not generally accepted among ophthalmologists.
On appeal, however, the New Hampshire Supreme Court held that HGN does
meet the four Daubert criteria, and reaffirmed other state court opinions
that the relevant professional communities for HGN include behavioral
psychology, highway safety, neurology, and criminalistics in addition
to optometry and ophthalmology.[40]
However, it maintained that HGN is only circumstantial evidence of impairment
and cannot be introduced at trial to estimate BAC.
The SFSTs as Standardized Tests
SFSTs are quite similar to neuropsychological
tests, which detect brain damage and assess sensory, motor, and cognitive
impairment. To the extent that the SFSTs are standardized tests, they
should meet the relevant professional standards. Standards for Educational
and Psychological Testing[41] is
an authoritative guide that enumerates many criteria for test construction,
reliability, validity, documentation, and implementation, and provides
a useful introduction to these issues. Some of these are directly relevant
for the SFSTs. For example, Standard 1.10 states “When interpretation
of performance on specific items, or small subsets of items, is suggested,
the rationale and relevant evidence in support of such interpretations
should be provided.” This is not addressed in the SFST literature. The
following table lists the standards that are most relevant for examination
of the SFSTs. The next sections address problem areas regarding standardization,
reliability, and validation.
Table 2
Selected Standards for Psychological Tests

| Standard # | Text of Standard |
|---|---|
| 1.10 | When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretations should be provided. |
| 1.17 | If test scores are used in conjunction with other quantifiable variables (e.g., driving errors, odor of alcohol) to predict some outcome or criterion (i.e., BAC), regression (or equivalent) analysis should include those additional relevant variables along with test scores. |
| 3.5 | Relevant experts external to the testing program should review the test specifications. |
| 3.6 | Test content should be chosen to ensure that intended inferences from test scores are equally valid for members of different groups. |
| 3.9 | The process by which items are selected and data used for item selection, such as item difficulty, item discrimination, and/or item information, should be documented. |
| 3.23 | Scorer reliability and potential drift over time in the scorer’s rating standards should be evaluated and reported… |
| 4.19 | … the rationale and procedures used for establishing cut scores should be clearly documented. |
| 5.2 | Modifications or disruptions of standardized test administration procedures or scoring should be documented. |
| 5.9 | When test scoring involves human judgment, scoring rubrics should specify criteria for scoring. Adherence to established scoring criteria should be monitored and checked regularly. Monitoring procedures should be documented. |
| 6.5 | The test manual/documentation should include the standard error of measurement. |
| 6.7 | Test documents should specify qualifications that are required to administer a test and to interpret the test scores accurately. |
| 7.2, 7.3 | If age or other demographic variables affect test performance, these issues should be studied and the test used only for those subgroups for which evidence indicates valid inferences can be drawn from test scores. |
| 9.3 | (Tests) generally should be administered in the test taker’s most proficient language… |
| 10.1 | Test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics… |

Deficiencies
of the SFSTs as Psychological Tests
Standardization Problems – As
the name implies, the SFSTs gain their special status because they have
been standardized,
meaning specific rules for administering, scoring, and interpretation
have been specified and researched. Standardization is crucial if research
findings are used to support the validity of the tests, since a test
that is modified is no longer the same test. As NHTSA states, “If any
one of the standardized field sobriety test elements is changed, the
validity is compromised.”[42]
A number of courts have held that if not properly administered, the
SFSTs are not admissible.[43]
The following problem areas are organized in the
chronological order that the SFSTs are administered and scored.
1.
Screening questions
for possible medical problems and conditions should be standardized
and validated. The
NHTSA student manual states the officer should ask about certain topics,
but does not specify the form of the questions. The wording of a question,
and how it is asked, are crucial to obtaining valid data. Screening
questionnaires are used in a variety of medical fields. A good screening
test should identify virtually everyone who has the condition being
queried about—and should be demonstrated to do so. In the case of the
SFSTs, the questions should uncover relevant conditions that could invalidate
or affect SFST performance. No research has been conducted on this issue.
2.
The SFST instructions
have changed repeatedly from the initial laboratory studies to the field
studies to the current NHTSA student manual used to train police officers.
3.
SFST training does
not emphasize rigorous adherence to the standardized instructions.
Psychologists routinely administer standardized tests. Many, like the
Wechsler intelligence tests, come with materials that direct the examiner
to read the instructions verbatim. This was my expectation when
I learned the SFSTs. Although the NHTSA instructions are given in quotation
marks, suggesting they should be delivered verbatim, this level of proficiency
is not specifically endorsed. Consequently, students and instructors
do not seem to aspire to it. Some training films actually demonstrate
inaccurate delivery.[44]
4.
SFST training materials
do not address how instructions are to be delivered (attitude, speed,
and tone). Should
the officer be polite? Authoritative? Commanding? Is it OK to be impatient,
surly, and condescending? How does this affect performance? What about
speed of delivery? Should the officer’s demeanor facilitate maximum
performance? That is the usual standard for neuropsychological tests.[45]
In contrast, some officers appear to make the tests harder by delivering
instructions in a rapid, bored, monotone voice. It is unlikely that
the officers in the laboratory studies, using volunteers and monitored
by the researchers, adopted the hostile, impatient demeanor sometimes
displayed by officers during SFST administrations. To the extent that
arresting officers behave differently than the officers in the NHTSA
studies (which was not recorded), the validation evidence is diminished.
5.
For the Walk and
Turn, a variety of line situations are permitted.
There is no research on the effect of using an imaginary line, a crooked
line, an offset line, or a line that creates an uneven surface.
6.
What constitutes
“demonstrates understanding”?
For the WAT and OLS,
officers are directed to determine that the suspect understands the
instructions. A “yes” or “no” question often suffices. If a suspect
equivocates, the officer may become impatient and demand an answer.
Clearly, this is not an adequate assessment. The tests are designed
to test ability to follow directions and perform after the instructions
are understood. (Standard 9.3)
7.
Scoring rules are
often inadequately specified.
What constitutes an “inappropriate turn?” In HGN, the examiner must
make two passes for each eye to assess each of the three signs. Does
the clue have to occur on both passes, or just one? If it occurs on
just one, should the examiner administer another pass and make a decision
based on two out of three?
8.
It is unclear, both
in the studies and the student manual, what the criteria are for failing
the SFST battery.
The student manual
provides cutoff scores for each test, plus a decision grid for the combination
of the HGN and WAT. What it does not say is what criterion is
primary. Thus, a suspect apparently can fail at least four ways (from
each of the three tests and from the combination of the HGN and WAT).
If the defendant is given multiple chances of failing, the risk of a
false positive finding will accumulate with each additional test unless
credit is given for those tests passed.
9.
Officers are not
specifically directed to record their observations immediately.
Failure to do so encourages a tendency to assign scores consistent with
the officer's arrest decision and, for example, to remember seeing a
particular cue in both eyes rather than one. As the authors of the 1981
laboratory study stated, “…many of the advantages of standardized scoring
are lost when the scoring is left to memory.”[46]
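The accumulation of false-positive risk described in point 8 can be illustrated with a simple calculation. The sketch below rests on two loudly stated assumptions: each criterion has the same false-positive rate for a sober driver, and the criteria are statistically independent. Neither is established for the SFSTs (the tests overlap, so the true figure is probably lower), and the 10% rate is purely hypothetical.

```python
def chance_of_any_failure(per_criterion_false_positive: float, n_criteria: int) -> float:
    """Probability that a sober person fails at least one of n independent criteria."""
    if not 0.0 <= per_criterion_false_positive <= 1.0:
        raise ValueError("false-positive rate must be between 0 and 1")
    if n_criteria < 0:
        raise ValueError("n_criteria must be non-negative")
    return 1.0 - (1.0 - per_criterion_false_positive) ** n_criteria


HYPOTHETICAL_RATE = 0.10  # assumed false-positive rate for each criterion

# Four ways to fail: HGN, WAT, OLS, and the HGN/WAT decision grid.
for n in range(1, 5):
    risk = chance_of_any_failure(HYPOTHETICAL_RATE, n)
    print(f"{n} criteria: chance of at least one false positive = {risk:.3f}")
```

Under these assumptions the risk grows with every added criterion, which is the reason a decision rule needs to state which criterion is primary or how passes offset failures.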
Reliability and Validity
Problems
1)
The SFSTs have not
been subjected to a rigorous “blind” assessment of their validity.
As discussed above, none of the studies of the SFSTs have been truly
double blind, as expected in medical research. The laboratory studies
came close; the field studies do not. (Standard 1.17)
2)
The effects of fatigue,
sleepiness, circadian rhythm, driver stiffness or roadside conditions
on SFST performance have not been adequately investigated.
(Standard 10.1) The angle of onset of nystagmus was
found to advance five degrees in the hours after midnight, while the
other laboratory studies were conducted during daytime hours.[47]
In the 1981 study, the authors stated that exercise, sleep loss, elevated
temperatures, and antihistamines are associated with increased body
sway.[48]
Strobe and emergency lights, gusts of wind from passing traffic—all
have unknown effects on SFST performance and validity given the limitations
of the field studies.
3)
Drivers suspected
of DWI and subjected to the SFSTs may be highly anxious, which alone
or in combination with small amounts of alcohol, may influence their
performance.
In the laboratory studies, subjects were volunteers who had no reason
to be anxious, aside from possible self-consciousness. There
are theoretical reasons to believe that fear, anxiety, or stress may
affect performance on the WAT and OLS,[49]
and no study has demonstrated these factors are not relevant.
4)
The clues for the
WAT and OLS lack documentation of their individual validity and reliability.
The validation and reliability data focus solely on the total scores,
not the individual clues. It is possible that all eight clues are valid—or
that half of them are not. Since there is no published data on this
issue, it cannot be assumed that the clues your client failed are valid
ones. (Standard 1.10)
5)
Reliability data
are lacking or below accepted standards for psychological tests used
for making decisions about individuals.
Reliability refers to the consistency with which
a test produces results across conditions that can change, such as testing
at different times or by different evaluators. Authorities recommend
such tests show “a bare minimum” reliability of .90, with .95 “considered
the desirable standard.”[50]
None of the reliability figures for the SFSTs are this high, and most
are much lower. Different raters scoring the same subject at the same
time show reliability coefficients between .62 and .74 on the SFSTs,
and lower figures (.58-.59) for their decisions about whether the person
is impaired and should be arrested. Other NHTSA researchers assessed
the SFSTs to be quite low on “Ease of Scoring,” providing ratings on
a 1-100 scale of “5” for HGN, “25” for WAT, and “30” for OLS.[51]
No figures have been reported to assess the internal reliability (coherence)
of the SFST items. This is a standard, expected piece of information
for a psychological test. The following table displays the only figures
that have been reported.[52] Reflecting
on these figures, the authors candidly admitted, “… the interrater reliability
for the nystagmus score is not as high as expected…”[53]
Table 3
Reliability Coefficients for the SFSTs

| Test | Test-retest coefficient | Interrater coefficient |
|---|---|---|
| HGN | .66 | .62 |
| WAT | .72 | .74 |
| OLS | .61 | .70 |

The reliability coefficients are estimates
of how much of the variation in test scores is reliable: a reliability coefficient
of .70 indicates that 70% of the score variance is reliable and 30% is error. However,
each reliability coefficient reflects only some of the potential sources
of error: The observed score is a function of the quality that is being
measured (intoxication) plus numerous sources of error, including who
administered the test, the particular occasion and conditions it was
administered under, and the quality of the items composing the test.
Unfortunately, you cannot simply add up the errors from the different
reliability estimates. However, one dramatic illustration of the role
of multiple sources of error comes from the 1981 study: The test-retest
coefficient for the WAT scored by a different rater is .34,
as opposed to .61 when scored by the same rater. The moderate reliability
figures cast doubt on the high accuracy rates reported in the field
studies, since high reliability is a prerequisite for high validity.[54]
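The link between reliability and validity can be made concrete with a standard result from classical test theory: a test's correlation with any criterion cannot exceed the square root of the product of the two reliabilities. The following is a minimal sketch, not a calculation taken from the NHTSA reports. The function name is illustrative, and the default assumes a perfectly reliable criterion, which is the case most favorable to the test.

```python
import math


def max_validity(reliability, criterion_reliability=1.0):
    """Upper bound on a validity coefficient under classical test theory.

    The bound is sqrt(r_xx * r_yy); criterion_reliability defaults to 1.0,
    an assumption that gives the test the benefit of the doubt.
    """
    return math.sqrt(reliability * criterion_reliability)


# Coefficients quoted in Table 3, plus the .34 different-rater
# test-retest figure quoted in the text for the WAT.
coefficients = {
    "HGN interrater": 0.62,
    "WAT interrater": 0.74,
    "OLS interrater": 0.70,
    "WAT test-retest, different rater": 0.34,
}

for label, r in coefficients.items():
    print(f"{label}: reliability {r:.2f}, validity ceiling {max_validity(r):.2f}")
```

Under these assumptions, a reliability of .62 caps the validity coefficient at roughly .79, and a reliability of .34 caps it at roughly .58.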
6) Standard errors of measurement (SEM) are not provided.
(Standard 6.5) The standard error of measurement is the average amount
of error in the typical measurement for that test. The SEM is used to
create confidence intervals around an observed score to show how precise
the estimate (observed score) is. For example, a 95% confidence interval
around a score of 4 on the HGN might be 2 to 6. But NHTSA studies do
not include basic descriptive statistics of the data (means and standard
deviations) that would allow calculation of these values.
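To show what such a calculation would look like if the descriptive statistics were available, here is a minimal sketch using the conventional formula SEM = SD × sqrt(1 − reliability). The standard deviation below is purely hypothetical, chosen only for illustration, because the NHTSA reports do not provide one. The reliability is the HGN test-retest figure from Table 3.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score, sd, reliability, z=1.96):
    """Symmetric interval around an observed score (z=1.96 for about 95%)."""
    half_width = z * standard_error_of_measurement(sd, reliability)
    return score - half_width, score + half_width


HYPOTHETICAL_SD = 1.7    # assumed value; not reported in the NHTSA studies
HGN_TEST_RETEST = 0.66   # from Table 3

low, high = confidence_interval(4, HYPOTHETICAL_SD, HGN_TEST_RETEST)
print(f"95% interval around an HGN score of 4: {low:.1f} to {high:.1f}")
```

With a different standard deviation the interval would widen or narrow accordingly, which is why the missing descriptive statistics matter.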
7) SFSTs have not been normed on sober people.
As acknowledged in the 1981 study, “Balance tests
of various sorts show large individual differences in the performance
of sober individuals…”[55]
When most psychological tests are developed, they are tested on a large
sample to determine what is “normal.” The
Personality Assessment Inventory is a self-report test designed to assess
psychopathology. Before it was published, the author administered it
to some twelve hundred psychiatric patients—the intended population
for the test. But he also administered it to over twelve hundred volunteers
from around the country. Then, volunteers were dropped in order to obtain
a census-projected nationally representative sample in terms of age,
race, and education.[56]
The SFSTs have never been administered to a large, representative group
of sober people. We don’t know what is a normal score.
8) There is very limited data on the SFSTs for people under 21 or over 50-55.
(Standard 3.6) Only 3.1% of the NHTSA 1981 study sample used to standardize,
calibrate, and validate the SFSTs were older than 55. Reporting of age
groups is inconsistent across the field studies, but in all three, people
above 50-60 made up a very small portion of the sample. There have been
no comparisons made of the validity of the SFSTs for younger vs. older
groups. (Standards 7.2, 7.3, 10.1)
9) SFSTs have questionable validity for those who are elderly, in poor physical
condition, or overweight.
If the SFSTs are of questionable validity for people more than 50 pounds
overweight,[57]
what about short people who are 45 or 40 pounds over the ideal? Proportionately,
a person who is 4’8” and 40 pounds overweight is likely to be more physically
impaired than someone 6’3” and 51 pounds overweight. Why does the test
suddenly become invalid when one goes from 50 to 51 pounds over the
ideal? Obviously, the impediment due to weight is likely to be gradual.
The same issue applies to people in their late 50’s vs. the arbitrary
cutoff of 60[58]
or 65.[59]
Physical health and condition are likely to be more important than age.
(Standards 7.2, 7.3, 10.1)
10) Even NHTSA claims the SFSTs, when optimally used, are only 80% accurate.[60]
This is perhaps the most direct and compelling evidence of the SFST validity
problems. Although a 20% error rate may be acceptable in a test
used for evidence of probable cause of a BAC of .08% or more,
it seems insufficient when the SFSTs are used to establish, beyond
a reasonable doubt, intoxication or impairment. Further, consider that
the SFSTs were 1) evaluated by the tests’ developers, 2) evaluated
under laboratory conditions, 3) tested on a sample in which only a fraction
of subjects were in the critical .05-.15% BAC range, and 4) assessed for
accuracy on the same subjects used to calibrate the tests. Given all of these
potential biases in their favor, a hit rate of 80% is unimpressive.
Another perspective on SFST accuracy is provided
by using a bathroom scale as an analogy. Even a cheap scale might be
expected to be accurate within a few pounds. Yet, the NHTSA authors state
“…it is unrealistic to attempt to use behavioral tests to discriminate
BACs in a +.02% margin around a given level.”[61]
This is equivalent to a one hundred pound woman stepping on a scale,
seeing a reading of 120, and being told the scale is functioning within
its design limits. And this is under ideal conditions. But how well
can police officers actually estimate individuals’ BACs? In the 1981
laboratory study, police officers’ estimates of BAC (measured by Intoximeters)
were incorrect by an average of .03%,[62] meaning
approximately half the errors were larger than this.
Psychologists often calculate confidence intervals
to communicate that a given score, like an IQ, is an imprecise measurement.
For example, an IQ of 100 may have a confidence interval of 94-106.
If someone obtained an IQ of 100 on one occasion, it is likely that
he or she would obtain a score within the confidence interval if tested
again. Confidence intervals are not absolute, but based on probability.
The most common probability used is 95%, meaning that on 95 of 100 retests,
the new score would fall within the confidence interval created from
the first score.
Let’s return to the analogy of a one hundred-pound
woman stepping on a bathroom scale using the SFST BAC estimation errors.
Using the most conservative average error reported (.03%), and using
standard tools to create a confidence interval,[63]
we find that a 100-pound woman would observe a scale reading of between
25 and 175 pounds on 95 of 100 trials. The other five percent of readings
would be more inaccurate. In the 1981 field study,
officers’ average BAC estimates were off by an incredible .077% before
training and a whopping .0537% after training.[64]
Creating a 95% confidence interval from the “before training” figure
(.077%) means our 100-pound woman will weigh anywhere from –93 to 293
pounds on our SFST bathroom scale—95% of the time.
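The method behind footnote 63 is not reproduced in the text, so the following is only a sketch of one set of assumptions that yields figures matching those quoted above. It assumes estimation errors are normally distributed, converts the average (mean absolute) error to a standard deviation by multiplying by sqrt(π/2), uses an interval of plus or minus two standard deviations, and maps a BAC of .10% onto 100 pounds. All names and parameters are illustrative.

```python
import math

LEGAL_LIMIT_BAC = 0.10   # percent BAC, mapped onto the woman's true weight
TRUE_WEIGHT_LB = 100.0


def scale_interval(mean_abs_error_bac, k=2.0):
    """Approximate 95% interval on the bathroom-scale analogy.

    Assumes normal errors, so SD = mean absolute error * sqrt(pi / 2),
    and uses +/- k standard deviations (k=2 is a common 95% shortcut).
    """
    sd_bac = mean_abs_error_bac * math.sqrt(math.pi / 2.0)
    half_width_lb = k * sd_bac / LEGAL_LIMIT_BAC * TRUE_WEIGHT_LB
    return TRUE_WEIGHT_LB - half_width_lb, TRUE_WEIGHT_LB + half_width_lb


for label, error in [("laboratory average error (.03%)", 0.03),
                     ("field, before training (.077%)", 0.077)]:
    low, high = scale_interval(error)
    print(f"{label}: {low:.0f} to {high:.0f} pounds")
```

Other reasonable choices, such as a multiplier of 1.96 instead of 2, shift the endpoints by a few pounds but do not change the picture.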
Miscellaneous Issues
1. The SFSTs have been evaluated primarily by NHTSA-supported researchers, with
no rigorous evaluation by disinterested researchers in field settings.
Replication by impartial researchers is the sine qua non of reliable scientific
knowledge.
2. SFSTs have usually been evaluated in high base rate settings where up to 92%
of the persons tested were legally intoxicated.
Base-rates have a major effect on the confidence that can be given to
a test result.[65]
In both the laboratory and field studies, the majority of subjects or
drivers tested were intoxicated. Aside from the other problems with
the studies, generalization to settings (sobriety checkpoints or daytime
stops) where the incidence of DUI is much lower is not warranted. An
earlier NHTSA study[66]
showed high rates of false positives when the frequency of intoxicated
(BAC > .10%) drivers was experimentally set to 48%. The following table
from that paper illustrates that HGN, either alone or in combination
with observations of driver behavior and appearance, showed false positive
rates of up to 75% for those with a BAC between .05% and .09%. Officers
who received only three hours of training in administration of HGN[67]
assessed 24% of those in the .00-.04% BAC range as impaired—and the
great majority of these were probably completely sober.
Table 4
Percentage of Drivers Judged Impaired At Different
BAC Levels Using a Test Battery Including HGN or HGN Alone
| Group | BAC .00-.04% | BAC .05-.09% | BAC .10-.15% |
|-------|--------------|--------------|--------------|
| Briefly Trained Officers | 24% | 75% | 89% |
| Fully Trained Officers | 8% | 52% | 100% |
| HGN only (both groups) | 15% | 64% | 95% |
Figures in the .00-.04% and .05-.09% columns (set in bold in the original) are
false positives: drivers who would have been falsely arrested.
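The effect of base rates on the confidence a test result deserves follows from Bayes' rule. The sketch below holds a test's accuracy fixed and varies only the proportion of intoxicated drivers. The sensitivity and false positive rate are hypothetical values chosen for illustration, not figures from any NHTSA study. The .92 and .48 base rates come from the text above, and the .05 base rate is an assumed stand-in for a low-incidence setting such as a checkpoint.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """Probability that a driver judged impaired really is over the limit."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


SENSITIVITY = 0.90          # hypothetical: share of intoxicated drivers flagged
FALSE_POSITIVE_RATE = 0.20  # hypothetical: share of sober drivers flagged

for base_rate in (0.92, 0.48, 0.05):
    ppv = positive_predictive_value(base_rate, SENSITIVITY, FALSE_POSITIVE_RATE)
    print(f"base rate {base_rate:.0%}: "
          f"{ppv:.0%} of 'impaired' judgments are correct")
```

With identical test accuracy, the share of correct "impaired" judgments falls sharply as intoxicated drivers become rarer, which is why results from high base rate studies do not carry over to low base rate settings.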
2) SFST scoring is potentially biased by the officer’s suspicion of intoxication.
The SFSTs require subjective judgment to score, as acknowledged by Marcelline
Burns,[68]
NHTSA reviewers,[69]
and as indicated by their moderate inter-rater reliability coefficients.
An officer could easily decide a WAT turn is improper based, in part,
on how the driver smelled and his clarity of speech. When these biases
seep in, the test has been contaminated.
3) The SFSTs may be harder than driving.
The WAT and OLS are unfamiliar and probably strain many sober people’s
abilities, especially those who are not in good physical condition.
To quote the NHTSA student manual, “Tests that are difficult for a sober
person to perform have little or no evidentiary value.”[70]
A recent survey of British police surgeons found about half expressed
concern about the SFSTs being too difficult or the grading too harsh.
Amongst those with advanced credentials (a Diploma of Medical Jurisprudence
or Diploma of Forensic Medicine) over 60% of respondents expressed reservations
for the Walk and Turn and One Legged Stand.[71]
4) Although the SFSTs were not designed as indications of driving impairment and
have undergone little validation for this purpose, they are still frequently
admitted as evidence for establishing the driver was impaired.
The SFSTs were expressly developed and validated
to distinguish between BACs of above and below .10%—not driving impairment.
Marcelline Burns has emphasized this distinction,[72]
but NHTSA materials[73]
and court decisions[74]
wrongly equate the two terms. While the SFSTs attempt to gauge BAC,
NHTSA plainly states “Impairment varies widely am |