Speaker identification evidence: its forms, limitations, and roles
University of Cambridge
Just as an artefact carries traces of its production – a carving, the marks of the chisel, or a painting, the brushstrokes, and both of them the style of the artist – so a sample of speech carries the imprint of its originator. This is self-evident from our everyday experience. We are frequently able to identify familiar speakers without seeing them, for instance when they are speaking outside the door before knocking, or to recognise a voice on the telephone as one we have heard before, even though we may not know who the speaker is. Most people, if they were to be asked whether it is possible to identify speakers from their speech, would readily answer ‘yes’. It’s common sense.
The identity of a speaker is quite often at issue in court cases. A crime victim may have heard but not seen the perpetrator, but claim to recognise the perpetrator as someone whose voice was previously familiar, or as that of a suspect; or there may be recordings of a criminal whose identity is unknown or disputed which are available for comparison with the voice of a suspect. Common sense tells us that either of these should be feasible. Furthermore, in the second sort of case, sending the recordings of criminal and suspect to a ‘voice expert’ must surely allow a reliable ‘scientific’ determination to be made.
In this paper I will give an overview, from the perspective of a phonetician (but without including a lot of technical phonetic information – a more technical introduction can be found in Nolan (1997)), of the ways in which speaker identification evidence can be used in criminal and civil court cases, and suggest how its limitations can best be kept in view. Common sense, as usual in science, and in the law, cannot be our only guide.
2. Individuals and their voices
We tend to think of people as having a ‘voice’, but this is an oversimplification, firstly because a person’s voice is far from constant, and secondly because it is not yet clear how far the variations in a person’s speech cause their voice to overlap the ranges of other members of the speech community. In dealing with both these issues I will use fingerprints as a reference.
We think of the fingerprint as the ‘benchmark’ for identification of individuals. Any doubts raised about the reliability of fingerprint identification tend to concern inadequacies in the procedure of examination; it has not been suggested that a person’s fingerprint pattern is other than a constant (short of damage or destruction of the skin of the fingertip). Why, in contrast, can we not rely on a constant relation between a voice and its ‘owner’? The answer is that while a fingerprint is the direct trace of a virtually invariant physical characteristic, the voice is the product of two mechanisms which exhibit considerable flexibility. I have sometimes referred to this variability of the mechanisms behind speech as ‘plasticity’ (Nolan 1983). The mechanisms in question are the speech organs and language.
The various speech organs have to be flexible to carry out their primary functions such as eating and breathing as well as their secondary function of speech, and the number and flexibility of the speech organs results in a high number of ‘degrees of freedom’ in the machine producing speech. These ‘degrees of freedom’ may be manipulated at will, as when someone ‘puts on a voice’, or may be subject to variation due to external factors such as stress, fatigue, health, and so on. The net result of this plasticity of the vocal organs is that no two utterances from the same individual are ever, strictly speaking, identical in a physical sense.
In addition to this, the linguistic mechanism – language – driving the vocal mechanism is itself far from invariant. We are all aware of changing the way we speak, including the loudness, pitch, emphasis, and rate of our utterances; aware, probably, too, that style, pronunciation, and to some extent dialect, vary as we speak in different circumstances. To be a speaker who commands only one register is to be an impoverished member of a linguistic community. Most of us will vary our language, often accommodating our regional or social variety to that of our interlocutor, and our ‘tone of voice’ to the circumstances.
Speaker identification thus involves a situation where neither the physical basis of a person’s speech (the vocal organs) nor the ‘software’ driving it (language) is constant. If we were to consider, in comparing a pair of recordings, just two observable parameters, for instance the average pitch of the voice and whether (in English) the speaker(s) produce a glottal stop or a /t/ in the middle of words such as ‘better’, the inherent plasticity of speech production would mean we had a very poor basis for determining anything about identity. There are very many speakers who would say ‘be’er’ in their casual style, but switch in a more careful style to ‘better’, as well as speakers who invariably say ‘be’er’ and speakers who always say ‘better’. Average pitch, too, is far from constant; it varies with psychological stress, time of day, type of utterance, speaking volume, and so on. There is every potential for the observed behaviour of a speaker, on just two parameters, to be distinct on two occasions, and also to overlap the behaviour of another. If we observe ten parameters, we will be better off, but we still run the risk of one speaker coinciding with others in terms of the values observed on each parameter. The more parameters are observed, the nearer we approach a reliable discrimination of one speaker from the rest of the population; but as yet we do not have the large-scale population studies which would tell us how many different parameters we would have to consider to be confident of discriminating every speaker in the population – in the way we believe fingerprints allow us to discriminate every individual. Indeed, given the inherent variabilities of speech, we do not know whether ‘absolute discrimination’ is even theoretically attainable.
Each speaker occupies not a point in the notional multidimensional space of our different observed parameters, but an area of variation; and we do not know for sure whether, even with a large number of parameters, each individual’s area is discretely separate from all others’ areas. For this fundamental reason, we should always approach speaker identification evidence of whatever kind with caution.
3. Types of speaker identification evidence
There are two broad categories of speaker identification evidence, ‘naïve’ and ‘technical’ (Nolan 1983: 7). Naïve speaker identification involves the application of our natural abilities as human language users to the identification of a speaker. Given the sophistication of these abilities, the term ‘naïve’ is perhaps inappropriate. The term emphasises, however, the lack of specific training on the part of the person making the decision. There are five main circumstances which may give rise to such evidence. A witness to a crime may claim to identify a voice heard (‘it was X’s voice making the bomb threat’); a witness may recognise a voice heard without being able to identify it (‘it was the anonymous caller who rang twice yesterday’); or a witness may be asked to listen to a ‘voice parade’ or ‘voice line-up’ containing the voice of a suspect and a number of foils and pick out one as the perpetrator’s voice. Alternatively, a person investigating a crime where a voice has been recorded may identify the voice as that of ‘X’, being a person known to the investigator. One further circumstance in which naïve speaker identification comes into play in the legal process is when tapes are played to a jury or other judicial body.
Technical speaker identification is defined by the employment of any trained skill or any technologically-supported procedure in the decision-making process. This applies almost exclusively when there is an incriminating recording (a bomb hoax, a fraudulent bank deal, a wire tap, and so on) and a recording of a suspect. An expert, normally a phonetician, is then asked to assess the likelihood that the suspect is heard on the incriminating tape. The expert, ideally, will apply both auditory skills acquired through phonetic training, and techniques for acoustic visualisation and measurement.
I will deal below in more detail with both these major types of speaker identification.
4. Naïve speaker recognition
There are two inherent limiting factors on the reliability of naïve speaker recognition, and a large number of contingent limiting factors. The inherent limiting factors are the potential overlap of the voices of different speakers, as discussed above, and the performance of the human perceptual, storage, and retrieval mechanisms. Not all acoustic features of a sample of speech can be discriminated perceptually; to some extent it is advantageous for our perception to ‘blur’ characteristics which are mere noise from the point of view of extracting the message, but may constitute part of the ‘imprint’ of the producer (see e.g. Nolan 1994: 336-344). Human memory, particularly long term memory (as opposed to short term, or ‘echoic’ memory), is not tape-recorder like, but selective and stores information in a processed and encoded manner. And not all that is stored can be retrieved accurately at will, as when we know a word but can’t recall it.
Contingent factors affect performance. Performance of subjects in experiments replicating ‘earwitness’ tasks has been found to depend on a large number of factors, being improved for instance by earwitnesses’ prior familiarity with one or more voices, by longer samples, shorter time elapsed between exposure and identification, by the ‘recognisability’ of the target voice (recognisability presumably consisting in more extreme values on a number of parameters), and, perhaps surprisingly, by becoming familiar with the voice in a stress-inducing situation or by interacting with the speaker rather than just overhearing (for a summary and references see Hollien, Huntley, Künzel, and Hollien (1995) and Nolan and Grabe (1996: 75-77)). Style shifting on the part of the speaker has also been shown to lead to misidentifications (Bahr and Pass 1996). All experiments, including those involving ‘closed’ tasks where (unlike in most forensic situations) the target voice is known to be present among the samples heard, yield accuracies below 100%, often very dramatically below. ‘Open’ tasks, where (as in real life) the experimental subject (or witness) has to decide whether the target voice is present in the line-up, allow a further category of error and reduce performance correspondingly.
The lesson to be drawn from the by now extensive body of experimental evidence on earwitness-related tasks is that mistaken identity is every bit as real a risk in speaker identification as in visual identification. Some experiments (e.g. Rose and Duncan 1995) have shown that mistakes are made even in the identification of close friends and relatives, so even a claim to have identified a voice with which the earwitness was previously familiar may not be accurate. In my view, no prosecution can rely predominantly on earwitness identification of a prior known voice, or subsequent identification of a suspect.
With subsequent identification, the evidential value depends on the care with which the identification task is presented. To play the witness a tape of the suspect and ask ‘is this the man/woman you heard?’ (a ‘line-up of one’) provides no safeguard against false identification. As with visual identification, a proper ‘parade’ or ‘line-up’, by placing the suspect in a group of others, means that the witness cannot merely say ‘yes’ in an attempt to be co-operative, and, in the worst case where the witness insists on making an identification but is only able to guess, affords at least a probabilistic protection to an innocent suspect – if the line-up is properly constructed, he or she has only a one-in-eight (or so) chance of being picked.
The theory of line-ups, visual or auditory, is far from established, however. Even the basic matter of selection of foils is open to discussion (e.g. Wells 1993: 563-4). Should they resemble the suspect? The reductio ad absurdum of this strategy is that the perfect line-up would consist of the suspect and nine identical clones, which clearly would present an impossible task. The alternative, that the foils should merely satisfy the characteristics of the perpetrator described by witnesses, is probably more logically defensible, but has the disadvantage that the worse the witnesses’ descriptions, the more variation the line-up could contain, and the more likely an inadvertent resemblance between an innocent suspect and the perpetrator is to result in false identification.
When it comes to auditory line-ups, the technique is still in its infancy; though this may have the advantage that the police are more likely to seek help when constructing a voice line-up than a visual line-up. The multidimensionality of voices means that the definition of ‘similar’ is far from trivial, and requires expert advice. Nolan and Grabe (1996) describe in detail a case in which the line-up was subjected to two pre-tests with listeners to ensure that the suspect’s voice neither stuck out from the foils, nor was identifiable as stereotypically that of a sex-offender (the relevant crime). The witness identified the suspect with a high degree of confidence, and although the resultant line-up was challenged by the defence on some grounds, I believe it was essentially a fair test of the witness’s ability to identify the suspect’s voice as that of her attacker, and a conviction resulted. I was also recently consulted by a police force who had, with considerable thought but without expert phonetic advice, put together two voice line-ups. They played them to me, and in each case, to the policemen’s dismay, I was able with little difficulty to tell which the suspect’s voice was. This was because all samples except the suspects’ samples were clearly read speech. I advised them to collect their foil samples by conducting mock police interviews (the suspect’s sample is standardly extracted from recorded police interviews in the British legal systems). This is a relatively simple strategy for police who wish to construct a line-up themselves to adopt, and I believe it would be the single most effective step towards making a line-up of this type fair.
There is, more widely, considerable scope for the development of a rigorous set of procedures, including perhaps a library of foil voices, for voice line-ups. Many of the issues are dealt with in Broeders (1996) and Broeders and Rietveld (1995), the latter including a set of recommended procedures to be followed in the construction of line-ups.
As for crime investigators identifying recorded voice samples, this is subject to all the reservations about naïve speaker identification, plus the serious concern that there may be a predisposition to identify known criminals. In a case in which I have recently been involved at the appeal stage, part of the evidence that the voice on the incriminating tape was that of the defendant consisted of identifications by two policemen who knew the defendant. During the course of the appeal and review, it emerged as very likely that the policemen, by the time they listened to the incriminating tape and made their identifications, had already been aware of other evidence linking the defendant to those who committed the crime. This, in my view, rendered their purported identification, already highly dubious, completely worthless.
The last situation in which naïve speaker identification comes into play, when for instance a jury is invited to compare for themselves a tape containing an incriminating recording and a tape (or evidence given live) by the defendant, is highly undesirable. The jury cannot approach the task unbiased; and, as with a ‘line-up of one’, the procedure constitutes no test of the jury’s ability to do the task. There will be no control even over basic factors such as jury-members’ hearing acuity, let alone more sophisticated matters such as their familiarity with the language-variety in question (there is a serious danger that samples in an unfamiliar dialect will sound more similar than equivalent samples in a familiar dialect). In my view, the jury should not be allowed to ‘be their own experts’, even where expert phonetic witnesses have disagreed. If phoneticians have disagreed, the jury should take with them to the jury room the message that speaker identification is problematic.
5. Technical speaker recognition
Technical speaker recognition can be called on when a recording exists of an unknown speaker committing a crime (a threat, a kidnap demand, etc.), or talking about committing a crime (a discussion of a forthcoming drug deal recorded secretly, etc.), and the voice of one or more suspects is available. The latter is usually also recorded. In some jurisdictions, such as England and Wales, the statutory recording of a police interview may be used as a reference recording for the suspect, but this is not permitted in all jurisdictions. Only if a recording is made specifically for comparison (with the suspect’s consent) can the linguistic content be controlled to be the same as that on the unknown recording, and even then the impossibility of replicating the context of the unknown recording makes exact linguistic equivalence impossible – quite apart from the risk of deliberate alteration of the voice. More commonly, the comparison has to be made on different material. This seriously restricts the potential to apply fully automatic techniques of speaker comparison, such as those used in Automatic Speaker Verification (where, for instance, access can be controlled by comparing the voice of an individual to the voice of the person whose identity is being claimed). There are two broad categories of speaker identification method which are commonly applied: those relying on human auditory perception, and those using acoustic analysis.
The auditory technique undoubtedly builds on the naïve ability to recognise speakers discussed above, but beyond that draws on an explicit analysis of sounds provided by traditional phonetics. The framework of traditional phonetics, embodied most prominently in the International Phonetic Alphabet and the theory behind it (see International Phonetic Association 1999), provides a method for recording the audible details of sounds. With appropriate training, which consists in intensive practice in discriminating and producing sounds within and beyond his or her native language, the phonetician can describe in quite some detail fine differences in pronunciation. This concerns mostly the pronunciation of vowels, consonants, and prosodic characteristics such as intonation. In my view, the phonetician has a considerable advantage in bringing to consciousness, and being able to organise, evaluate, and communicate, delicate distinctions of pronunciation.
I have however pointed out in a number of publications that traditional phonetic training generally does not include the systematic analysis of personal voice quality. This is understandable, since most phonetic training takes place in the context of the analysis of language. There does exist one comprehensive framework, at least, for the analysis of voice quality, that of Laver (1980). This captures long term tendencies in a speaker’s speech, such as a tendency to whispery voice, or heavy nasality. It has been applied in speech pathology as well as in linguistic description. It has not, however, been widely used in forensic phonetics, perhaps because to use it reliably really requires the practitioner to have been trained in its use. Most comments on personal voice quality which I have seen in the forensic context are couched in rather informal and impressionistic terms.
This means that the usual contribution of auditory phonetics to the forensic process is a fine-grained analysis of the vowels and consonants. Now, there is a difference of view over what this achieves (see e.g. Nolan 1991, reviewing Baldwin and French 1990). Some practitioners argue that we each have our own personal accent, and therefore that a fine-grained analysis of accent will partition the population to the point where a single individual is pin-pointed. My view is that this is unlikely in principle, and it has certainly not been demonstrated scientifically to be feasible in practice. The proper role of auditory phonetics, therefore, is to determine whether two samples are spoken in the same accent, and also to check for signs of inconsistency which might betray a speaker trying to disguise his own accent or adopt another.
If two samples are radically distinct in terms of accent, this points strongly in the direction of different speakers. There is, however, still the possibility that a speaker is bi-dialectal. It is unusual for a speaker to have command of two discretely different varieties, and so there would need to be strong reasons in this situation for rejecting what would be the default conclusion, that the samples are spoken by different speakers. More commonly, of course, speakers command continua of variation: sociolects within an urban dialect, or degrees of approximation to a national standard from a regional basis, often correlated with stylistic variation. In detail, therefore, the phonetician’s analysis has to be informed by good sociolinguistic knowledge. Differences of pronunciation between two samples indicate different speakers unless they can be explained by a coherent model of variation, principally a sociolinguistic one.
If the samples are not detectably distinct in terms of accent, it is reasonable to conclude, on linguistic grounds, that the possibility remains open that they are spoken by the same speaker. This is not, of course, the same as ‘making a positive identification’. It is now that the purely auditory phonetician should recognise the limitation of the method, having elucidated the ‘imprint’, primarily, of the linguistic system.
At this point the acoustic phonetician can take the matter further. The imprint of the vocal mechanism can be heard, of course, in aspects of the voice that we can describe auditorily in terms such as ‘a deep voice’, ‘a resonant voice’, and so on, but the detailed description of the imprint is only possible through acoustic analysis.
Acoustic analysis allows us to see, and measure, a number of parameters which reflect a person’s vocal organs, including their fundamental frequency (pitch), and formant (resonant) frequencies. Given the plasticity of the vocal mechanism the values on these parameters are not, of course, absolutely determined by the vocal mechanism – a person can choose to speak on a higher pitch, or to lower resonances by protruding their lips – but in general they will be within a characteristic range. In particular, details of the dynamics of resonances as the vocal tract changes to make and release particular sounds may constitute particularly telling imprints of the vocal mechanism, reflecting both the dimensions of the vocal tract and its pattern of movement. Naturally, the more values or patterns are found which coincide, the more compatible the two samples are with the hypothesis that they are from the same speaker. Values or patterns which are different point in the direction of different sources of the speech signal, unless they can be explained (as in the case of accent differences) with reference to established models.
For instance, the vowel /e/ might have different average formant values in two samples. But if these were measured from ten occurrences of the word ‘deménted’ in sample A, and ten occurrences of ‘tórment [noun]’ in sample B, we need to consider whether the difference is compatible with what is known about the effects of the different degrees of stress on that vowel in the two words. If the difference is compatible, our model tells us that this is not evidence for ‘different speakers’. This is one simple instance of the fact that machines and measurements are of little value without the complex process of interpretation which the phonetician brings to speaker identification.
I have written as though auditory and acoustic analysis are carried out in sequence, but I have done this only for the purpose of outlining what can be told from each. In practice a phonetician will start off by listening, but quite soon enter an interactive cycle of auditory and acoustic analysis. A suspected auditory difference can be explored and confirmed or disconfirmed acoustically, and an acoustic observation can bring to attention a pronunciation feature previously missed. The two forms of analysis are symbiotic. Sometimes the quality of a recording is too poor to allow sensible acoustic analysis. In this case a limited opinion on accent may be feasible, but it is vital to appreciate that the degradation of the signal which prevents acoustic analysis also impairs auditory analysis. The ear cannot magically restore information which has been lost – not reliably restore, anyway. I have argued strongly (Nolan 1994: 336-344) that the information provided by auditory and acoustic analysis is complementary, and there is no justification for setting one aside in favour of the other if both are feasible. This, I believe, is now the consensus widely, though not universally, held among practitioners in this field.
6. Expressing an opinion
Much ink has been spilt on the controversies surrounding forensic speaker identification: over the methods, over the reliability attainable, and over whether phoneticians should offer opinions at all in the absence of authoritative scientific evaluation of the reliability of speaker identification. However, I incline more and more to the view that, real as these issues are, the crux of the matter is how phoneticians express their opinion. The expression of the opinion is both an outward sign of the way they conceptualise the task in which they are engaged, and an interface between science and the needs of lawyers. Commonly, in the British legal systems, the opinion will be of the kind ‘it is highly likely / likely that the speaker heard on the incriminating recording is the defendant.’
In the mid 1990s I chaired a committee of the International Association for Forensic Phonetics which was given the tasks of researching the ‘scales of opinion’ used by speaker identification practitioners, and exploring the possibility of devising an agreed scale for use by members of the Association. Such scales of opinion generally run from some term such as ‘highly probable…the same’ through ‘probably…the same’ to ‘possibly…the same’ to either ‘possibly…different’ and ‘probably…different’ or to ‘unlikely to be the same’.
I duly sent out a questionnaire and tried to tabulate the results, attempting judgments of Solomon as to whether, for instance, one person’s top opinion ‘My firm opinion…the same’ indicated a degree of certainty intermediate between another person’s top opinion ‘Without any reasonable doubt…the same’ and their second rank ‘With very great probability… the same’. Even before I had started, I had sensed that I was the wrong person to push through an agreed ‘scale of opinion’, being sceptical of the notion, and my analysis of the results of the survey convinced me that the honest chaos which reigned better reflected the state of the art than an agreed scale. An agreed scale would risk clothing divergent behaviour and conceptualisations of the problem with a cloak of spurious uniformity. In the end I left the ‘agreed scale’ task with those who believed in it – and there the matter has rested, as far as I know.
My scepticism at that time stemmed in part from the feeling that, exceptional circumstances apart, the strongest identification I would normally want to give would be ‘it remains fully possible that the two samples contain the same speaker’. This arises from thinking of the task as comparable to the scientific testing of a null hypothesis. The hypothesis to be tested is that the two samples are from the same speaker (the null hypothesis), but the samples must be subjected to a large number of tests, each of which has the potential to throw up a difference. Each difference, unless it can be explained with reference to an established model of variation (sociolinguistic, prosodic, etc.), has the capacity to cause the null hypothesis to be rejected (in favour of the alternative hypothesis that two speakers were involved). If a suitably large number of tests is carried out, and no otherwise explicable differences arise, we conclude that we have not disconfirmed the null hypothesis, and, indeed, it is fully possible within the boundaries of our method that the samples come from the same speaker.
This is not a wishy-washy opinion, in reality; it is a very strong piece of evidence – though not, quite properly, strong enough on its own to convict. What it does not say is that no-one else could possibly, or reasonably, have made the call – which I take to be the implication of formulations such as ‘With very great probability… the same’. The leap from ‘fully possible’ to any kind of ‘probable’ is a leap from examining two samples for inconsistencies to evaluating those samples against the whole relevant population. As Broeders (1999: 232) also notes, approaching the continuum of ‘possible’ to ‘probable’ from a slightly different perspective:
In semantic terms, probability and possibility are separate modalities that are not on an equal footing. Both are ways of expressing one’s view of the factuality of actions, states or events but probability is clearly subordinated to possibility: it is pointless to speak of the probability of things or events that are impossible; in order for things to be even remotely probable they must first be possible.
If the internal evidence suggests that it is ‘fully possible’ that two samples were spoken by the same individual, we can only go further, that is, make the leap towards ‘probability’, if their shared phonetic characteristics are unlikely to be found in combination in any other individual in the relevant population.
With hindsight, and more particularly with the insight provided by a very accessible book on evidence and probability, Robertson and Vignaux’s (1995) Interpreting Evidence, I see that I was unwittingly groping towards a Bayesian conceptualisation of the technical speaker identification task. Commentators such as Robertson and Vignaux and Evett (1995) have for some time been making the case that forensic evidence in various fields should be presented in a way compatible with Bayesian statistical theory. The central concept is the likelihood ratio:
$$\frac{P(E \mid H_1)}{P(E \mid H_2)}$$
This means the ratio of the probability of getting the observed evidence (E) given the hypothesis (H1) that the suspect left the evidence to the probability of getting E on some other hypothesis (H2) – the latter could be as general as that anyone else in the population left it, or as specific (in the case that only one other person could conceivably have committed the crime) as that one other individual committed the crime.
As an example, suppose that a bloodstain is compared on some test to the blood of the suspect, and found to be of the same type. The probability P(E | H1) is 1; that is, if the suspect leaves a bloodstain, it must perforce be of that type. Suppose H2 is merely that someone else in the population left the bloodstain. The probability of E given this hypothesis depends on how many people in the population have the same blood type. If, of 60 million people, 30 million have this type, the probability of encountering this blood type given H2 is 30/60 = 0.5. The likelihood ratio is 1.0/0.5 = 2. On the other hand, if the blood group is less common, shared by only 1 million people, the probability of E given H2 is 1/60 ≈ 0.017. The likelihood ratio is now 1.0/(1/60) = 60.
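The arithmetic of the blood-type example can be checked with a short sketch. It assumes the figures given in the text (a population of 60 million, with 30 million or 1 million people sharing the type); the function and variable names are illustrative only and are not drawn from any forensic software.

```python
from fractions import Fraction


def likelihood_ratio(p_e_given_h1, p_e_given_h2):
    """Return P(E | H1) / P(E | H2)."""
    if p_e_given_h2 == 0:
        raise ValueError("P(E | H2) must be non-zero")
    return p_e_given_h1 / p_e_given_h2


POPULATION = 60_000_000  # population size used in the example

# P(E | H1): if the suspect left the stain, it must be of the suspect's type.
p_h1 = Fraction(1)

# P(E | H2): share of the population with the matching blood type.
p_h2_common = Fraction(30_000_000, POPULATION)  # 1/2
p_h2_rare = Fraction(1_000_000, POPULATION)     # 1/60

lr_common = likelihood_ratio(p_h1, p_h2_common)
lr_rare = likelihood_ratio(p_h1, p_h2_rare)
print(lr_common, lr_rare)  # prints: 2 60
```

Exact fractions are used so that the rarer type yields exactly 60 rather than the slightly lower figure obtained by dividing by the rounded value 0.017.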
How are these outcomes to be integrated into the court case? Whatever the odds in favour of the defendant being present at the scene of the crime based on all the other evidence, the blood type evidence makes them 2 times higher, if the blood group is the common one, or 60 times higher if it is the rarer one. If the odds based on other evidence are only 2 to 1, and the blood type is the common type, then the combined odds are 4 to 1 (which means there is a one in five chance that the defendant was not present at the crime, and a conviction would be highly dangerous). If, on the other hand, the odds based on the other evidence are already 100 to 1, and the blood type is the rarer one, the combined odds are 6000 to 1.
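The combination of prior odds and likelihood ratio is a simple multiplication, and odds can be converted to a probability where that is easier to grasp. The sketch below assumes the figures of the imaginary example; the helper names `posterior_odds` and `odds_to_probability` are introduced here purely for illustration.

```python
from fractions import Fraction


def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' theorem: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds of 'n to 1' in favour into a probability: n / (n + 1)."""
    return odds / (1 + odds)


# Weak other evidence (2 to 1) combined with the common blood type (ratio 2).
weak_case = posterior_odds(Fraction(2), Fraction(2))

# Strong other evidence (100 to 1) combined with the rarer blood type (ratio 60).
strong_case = posterior_odds(Fraction(100), Fraction(60))

p_weak = odds_to_probability(weak_case)
p_strong = odds_to_probability(strong_case)

print("Weak case odds:", weak_case, "probability:", float(p_weak))
print("Strong case odds:", strong_case, "probability:", float(p_strong))
```

Odds of 4 to 1 correspond to a probability of 0.8 that the defendant was present, leaving a probability of 0.2 that he or she was not; odds of 6000 to 1 leave a residual probability of 1 in 6001.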
There are complications, of course, in applying this approach to forensic speaker identification. Unlike this simple (and imaginary) blood group example, the phonetician is dealing with a large number of independent and mutually dependent parameters. Worse, for very few of these do we have the population statistics which would be required for a quantitative application of the Bayesian approach. For these reasons, perhaps, phoneticians have shied away from it, both in practice, and even in principle; their reluctance reinforced by a legal profession which is probably loath to abandon familiar formulations, as reflected in the responses collected from trainee lawyers by Sjerps and Biesheuvel (1999). A notable exception is Rose (forthcoming), who provides worked examples of how a quantitative Bayesian approach can be applied to acoustic phonetic data. Nonetheless the conceptualisation of the problem (as opposed to the quantitative application of the Bayesian formula) is of absolute relevance to technical speaker identification. As soon as we venture into the realms of probability, we are implicitly claiming to know, even if only intuitively, the value of the lower term P(E | H2) of the likelihood ratio, and we should adopt an appropriate formulation for conclusions.
The failure to appreciate the Bayesian approach can have very damaging consequences. As I have said in a number of reports for defence counsel, an expert can tot up matching phonetic values between two samples to his or her heart’s content, but such a list tells us nothing about the probability of the two samples coming from the same speaker unless those features have the power to discriminate among members of the relevant population. This may seem self-evident, but it is a point which some practitioners repeatedly fail to understand. Challenged with it in connection with an opinion of the form ‘completely satisfied… the same’, the response of one phonetic expert has been
I was not asked to compare one voice with an entire population of voices. I was asked to give my opinion and to compare [the incriminating sample] with that of [the suspect]. So the question of whether voices could be similar to each other in the general population is not relevant.
This completely misses the point. It matters absolutely whether the values found matching between two samples are vanishingly rare, or sporadic, or near universal in the general (relevant) population.
Where does this leave us? Forensic phoneticians have to move on, I think, from saying things like ‘I am completely satisfied that the defendant spoke the incriminating sample’, however much prosecuting authorities might find this convenient. Instead, we should stick to the facts. If these allow us to point out that no significant differences exist between the two samples, and that for instance it is very rare to encounter in combination a stutter of exactly these characteristics, a lisped /s/, and an extremely low pitch, the jury will be able to draw the inference that the probability of the defendant having spoken the incriminating sample is quite high. If, on the other hand, the matching available characteristics are all commonplace, the jury will not convict, unless the prior odds (on the basis of the other facts of the case) are already weighted heavily in favour of guilt. This is as it should be; in very few cases is it safe for voice evidence to constitute the primary evidence.
As phoneticians, we are not in a position to use the Bayesian approach quantitatively in the way it can be used, say, in DNA evidence; and the phonetic signal is so multi-faceted that we may never be able to do so straightforwardly. Broeders (1999:240) suggests:
For disciplines like handwriting, speech and language analysis then, the value of the Bayesian approach lies perhaps not in its direct applicability but in providing a conceptual framework within which our thinking about the identification process can take place.
The way of thinking this framework dictates, with its due emphasis on how likely the observed evidence is given the alternative hypothesis, must be incorporated into forensic phonetics. When it is, much of the heat will be taken out of the debate between the doubters and the believers, those who feel phoneticians have little to offer the courts and should stay clear, and those who willingly (and sometimes recklessly) engage in the forensic process.
In that debate I count myself a believer; I do consider that phoneticians have a very important contribution to make to the judicial process in the realm of forensic speaker identification. This is true whether the task is to assist in the construction of a fair voice line-up, or to assess the similarity of an incriminating recording and a recording of a suspect. The need to know whether a particular person spoke on a particular occasion will not go away, and if phoneticians adopt a purist view and refuse to offer expertise, lawyers will have recourse to self-styled experts whose knowledge is less appropriate – sound engineers, dialect enthusiasts, mineralogists (!) (Braun and Künzel 1998: 19), and so on. This argument in favour of ‘engagement’ is well articulated by Braun and Künzel (1998), who rightly note that after my predominantly sceptical stance expressed in Nolan (1983) my own view has evolved in the direction of engagement, partly because of these dangers. Language and speech form an immensely complex and plastic system, and understanding the ways in which a speaker’s identity imprints itself on a sample of speech requires a firm foundation in linguistics, dialectology, sociolinguistics, phonetics, and acoustics, at the very least. Usually the person best qualified in this area is the phonetician (or speech scientist).
Nonetheless it must be admitted that the story of forensic technical speaker identification is not an uplifting one for the discipline of phonetics. Too often, practitioners of technical speaker identification have let their enthusiasm – or that of the judicial representatives instructing them – get the better of them, and have given opinions which far outstrip what can be scientifically justified. What is needed is an understanding that two questions are involved: how likely are the characteristics of the incriminating sample if the suspect produced it, and how likely are the characteristics of the incriminating sample under the relevant alternative hypothesis. One day we may have the population statistics on acoustic values, phonetic values, sociolinguistic values, and so on, to provide a quantified Bayesian expression of how the voice evidence affects the overall probability of guilt, but for the present we have to rely to a large extent on the judgment of the expert. If these two likelihoods are presented cogently to the court, properly signalled as estimates, the jury will be able to give the voice evidence the weight it deserves in balancing the evidence.
References
Baldwin, J. and J.P. French (1990) Forensic Phonetics. London: Pinter.
Braun, A. and H. Künzel (1998) Is forensic speaker identification unethical – or can it be unethical not to do it? Forensic Linguistics 5(1), 10-21.
Broeders, A. (1996) Earwitness identification: common ground, disputed territory and uncharted areas. Forensic Linguistics 3(1), 3-13.
Broeders, A. (1999) Some observations on the use of probability scales in forensic identification. Forensic Linguistics 6(2), 228-241.
Broeders, A. and A. Rietveld (1995) Speaker identification by earwitnesses. In J-P. Köster and A. Braun (eds), Studies in Forensic Phonetics. Trier: Trier University Press.
Evett, I.W. (1995) Avoiding the transposed conditional. Science and Justice 35(2), 127-131.
Hollien, H., R. Huntley, H. Künzel and P.A. Hollien (1995) Criteria for earwitness lineups. Forensic Linguistics 2(2), 143-153.
Huntley Bahr, R. and K.J. Pass (1996) Forensic Linguistics 3(1), 24-38.
International Phonetic Association (1999) Handbook of the International Phonetic Association. Cambridge: Cambridge University Press.
Laver, J. (1980) The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.
Nolan, F. (1983) The Phonetic Bases of Speaker Recognition. Cambridge: Cambridge University Press.
Nolan, F. (1991) Forensic phonetics. Journal of Linguistics 27, 483-493.
Nolan, F. (1994) Auditory and acoustic analysis in speaker recognition. In J. Gibbons (ed.), Language and the Law. London: Longman.
Nolan, F. (1997) Speaker recognition and forensic phonetics. In W.J. Hardcastle and J. Laver (eds), A Handbook of Phonetic Sciences. Oxford: Blackwell.
Nolan, F. and E. Grabe (1996) Preparing a voice line-up. Forensic Linguistics 3(1), 74-94.
Robertson, B. and G.A. Vignaux (1995) Interpreting Evidence. London: Wiley.
Rose, P. (forthcoming) Forensic speaker identification. To appear in H. Selby et al. (eds), Expert Evidence.
Rose, P. and S. Duncan (1995) Naive auditory identification and discrimination of similar voices by familiar listeners. Forensic Linguistics 2(1), 1-17.
Sjerps, M. and D.B. Biesheuvel (1999) The interpretation of conventional and ‘Bayesian’ verbal scales for expressing expert opinion: a small experiment among jurists. Forensic Linguistics 6(2), 214-227.
Wells, G.L. (1993) What do we know about eyewitness identification? American Psychologist 48(5), 553-571.