## Modeling Perception and Judgment

Once again let's imagine the case of a robot, only now what the robot is thinking about is perception, not action. The robot has just made a perceptual mistake. It saw a straight object that it took to be bent. It stuck a stick into a pool of water and observed the stick change shape. However, after doing various experiments, such as feeling the object as it entered the water, it decides that the stick never actually bends; it just appears to.

This story sounds plausible, because we've all experienced it ourselves, one way or another. Actually, there is no robot today that could go through this sequence of events, for several reasons. First, computer vision is not good enough to extract arbitrary, possibly surprising information from a scene. A typical vision system, if pointed at a stick in a tub of water, would probably misinterpret the highlights reflected from the water surface and fail to realize that it was looking at a tub of water with a stick thrust into it. Assuming it didn't stumble there, and assuming it was programmed to look for sticks, it might fit a line to the stick boundary and get a straight stick whose orientation was halfway between the orientation above the water level and the orientation below. Or it might see one half of the stick, or two sticks.

Even if we look forward to a time when computer vision systems work a lot better than they do now, there are still some gaps. There has been very little work on "cross-modality sensor fusion," which in this case means the ability to combine information from multiple senses to get an overall hypothesis about what's out there. No robot now is dexterous enough to feel the shape of a stick, but even if it were there would still be the problem of combining the input from that module with the input from the vision module to get an overall hypothesis. The combination method has to be clever enough to know when to reject one piece of information completely; taking some kind of average of the output of each sense will not be useful in the case of the unbent stick.
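The contrast between averaging and outright rejection can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the `Reading` type, the `reliability` scores, and the `conflict_deg` threshold are hypothetical, and no real robot is claimed to fuse its senses this way.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Reading:
    sense: str          # which module produced the estimate, e.g. "vision"
    bend_deg: float     # estimated bend angle of the stick, in degrees
    reliability: float  # hypothetical trust score in [0, 1] for this situation


def fuse(readings, conflict_deg=10.0):
    """Combine readings; on conflict, reject outright instead of averaging."""
    estimates = [r.bend_deg for r in readings]
    if max(estimates) - min(estimates) <= conflict_deg:
        # The senses roughly agree, so averaging is harmless.
        return fmean(estimates), []
    # The senses conflict: keep the most reliable reading, reject the rest.
    winner = max(readings, key=lambda r: r.reliability)
    rejected = [r.sense for r in readings if r is not winner]
    return winner.bend_deg, rejected


readings = [
    Reading("vision", bend_deg=30.0, reliability=0.4),  # misled by refraction
    Reading("touch", bend_deg=0.0, reliability=0.9),
]
estimate, rejected = fuse(readings)
naive = fmean(r.bend_deg for r in readings)
print(estimate, rejected)  # 0.0 ['vision']
print(naive)               # 15.0
```

The naive average reports a stick bent by 15 degrees, which neither sense supports. The rejecting version reports a straight stick and records that vision was overruled.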

Even if we assume this problem can be solved, we still don't have the scenario we want. Suppose the robot is reporting what it senses. It types out reports like this:

Stick above water

Stick goes into water; stick bent

(Feels)

Stick straight

The question is, How does this output differ from the case where the stick was really bent, then straightened out? For some applications the difference may not matter. Suppose the robot is exploring another planet, and sending back reports. The humans interpreting the output can realize that the stick probably didn't bend, but was straight all along. Let's suppose, however, that the robot actually makes the correct inference, and its reports are more like

Stick above water

Stick goes into water; stick bent

(Feels)

Correction: stick never bent

We're still not there; we still don't have the entire scenario. The robot isn't in a position to say that the stick appeared to be bent. Two elements are missing: The first is that the robot may not remember that it thought the stick was bent. For all we know, the robot forgets its earlier report as soon as it makes its new one. That's easy to fix, at least in our thought experiment; as long as we're going far beyond the state of the art in artificial intelligence, let's assume that the robot remembers its previous reports. That leaves the other element, which is the ability to perceive the output of sensory systems. As far as the robot is concerned, the fact that it reported that the stick was bent is an unexplained brute fact. It can't yet say that the stick "appeared to be" anything. It can say, "I concluded bent and then I concluded straight, rejecting the earlier conclusion." That's all.

This may seem puzzling, because we think the terms in which we reason about our perceptions are natural and inevitable. Some perceptual events are accessible to consciousness, while others are not, because of the very nature of those events. But the boundary between the two is really quite arbitrary. For instance, I can tell you when something looks three-dimensional, whether it is or not. I know when I look through a stereoscope that I'm really looking at two slightly different two-dimensional objects; but what I see "looks" three-dimensional. If someone were paying me money to distinguish between 3-D and 2-D objects, I would disregard the strong percept and go for the money. Another thing I know about stereo vision is that it involves matching up tiny pieces of the image from the left eye with corresponding pieces of the image from the right eye, and seeing how much they are shifted compared with other corresponding pieces. This is the process of finding correspondences (the matching) and disparities (the shifts). But I am completely unaware of this process. Why should the line be drawn this way? There are many different ways to draw it. Here are three of them:

1. I could be aware of the correspondences and disparities, plus the inference (the depths of each piece of the image) that I draw from it. In the case of the stereoscope I might continue to perceive the disparities, but refuse to draw the inference of depth and decide that the object is really 2-D.

2. I could be aware of the depths, but, in the case of the stereoscope, decide the object is 2-D. (This is the way we're actually built.)

3. I could be unaware of the depth and aware only of the overall inference, that I'm looking at a 2-D object consisting of two similar pictures.

It's hard to imagine what possibilities 1 and 3 would be like, but that doesn't mean they're impossible. Indeed, it might be easier to build a robot resembling 3 than to build one resembling us.
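One way to picture the three options is as three choices about which intermediate results of a stereo pipeline are exposed to the system as a whole. The sketch below is illustrative only: the function names and camera parameters are hypothetical, the features are assumed to be matched already, and the depth formula is the standard pinhole-camera relation rather than anything claimed about human vision.

```python
def stereo_pipeline(left_x, right_x, focal_px=800.0, baseline_m=0.0625):
    """Toy stereo stage: take shifts between matched features, infer depths.

    left_x, right_x: horizontal pixel positions of already-matched features.
    focal_px and baseline_m are made-up camera parameters.
    """
    disparities = [l - r for l, r in zip(left_x, right_x)]
    # Standard pinhole relation: depth = focal length * baseline / disparity.
    depths = [focal_px * baseline_m / d for d in disparities]
    return {"disparities": disparities, "depths": depths}


def cognizant_of(stages, design, verdict):
    """Return only what the system as a whole can report under each design."""
    if design == 1:  # disparities, plus the depths inferred from them
        return {**stages, "verdict": verdict}
    if design == 2:  # depths and the overall judgment
        return {"depths": stages["depths"], "verdict": verdict}
    if design == 3:  # only the overall judgment
        return {"verdict": verdict}
    raise ValueError("design must be 1, 2, or 3")


stages = stereo_pipeline([420.0, 310.0], [400.0, 300.0])
for design in (1, 2, 3):
    view = cognizant_of(stages, design, "two flat pictures in a stereoscope")
    print(design, sorted(view))
```

The same computation runs in all three cases. The designs differ only in where the boundary of reportable results is drawn, which is the sense in which the boundary is arbitrary.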

Nature provides us with examples. There are fish called "archer fish" that eat insects they knock into the water by shooting them with droplets. These fish constantly look through an air-water boundary of the kind we find distorting. It is doubtful that the fish find it so; evolution has no doubt simply built the correction into their visual systems. I would guess that fish are not conscious; but if there were a conscious race of beings that had had to judge shapes and distances of objects through an air-water boundary throughout their evolutionary history, I assume their perceptual systems would simply correct for the distortion, so that they could not become aware of it.

The difference between people and fish when it comes to perception is that we have access to the outputs of perceptual systems that we don't believe. The reason for this is fairly clear: Our brains have more general mechanisms for making sense of data than fish have. The fish's brain is simple, cheap, and "designed" to find the best hypothesis from the usual inputs using standard methods. If it makes a mistake, there's always another bug to catch (and, if worse comes to worst, another fish to catch it). People's brains are complex, expensive, and "designed" to find the best hypothesis from all available inputs (possibly only after consultation with other people). The fact that a perceptual module gave a false reading is itself an input that might be useful. The next time the brain sees that kind of false reading, it may be able to predict what the truth is right away.

Hence a key step in the evolution of intelligence is the ability to "detach" modules from the normal flow of data processing. The brain reacts to the output of a module in two ways: as information about the world, and as information about that module. We can call the former normal access and the latter introspective access to the module. For a robot to be able to report that the stick appeared to be bent, it must have introspective access to its visual-perception module.
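The two kinds of access can be sketched as two uses of the same module output. This is a minimal illustration under stated assumptions: the `Robot` class and its methods are hypothetical, and the rule that a later reading simply overrides an earlier one stands in for whatever real method would decide between the senses.

```python
class Robot:
    def __init__(self):
        self.beliefs = {}     # normal access: what the world is taken to be like
        self.module_log = []  # introspective access: what each module reported

    def receive(self, module, thing, value):
        # The same output is used twice: as information about the world,
        # and as information about the module that produced it.
        self.beliefs[thing] = value  # assumption: the latest reading wins
        self.module_log.append((module, thing, value))

    def report(self, thing):
        current = self.beliefs[thing]
        lines = [f"The {thing} is {current}."]
        for module, logged_thing, value in self.module_log:
            if logged_thing == thing and value != current:
                lines.append(f"To {module}, the {thing} appeared {value}.")
        return lines


robot = Robot()
robot.receive("vision", "stick", "bent")
robot.receive("touch", "stick", "straight")
print(robot.report("stick"))
# ['The stick is straight.', 'To vision, the stick appeared bent.']
```

Without `module_log`, the robot could only say that it concluded "bent" and then "straight". With it, the rejected output remains available as a fact about the vision module, which is what "appeared bent" requires.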

So far I have used the phrase "aware of" to describe access to percepts such as the true and apparent shape of a stick. This phrase is dangerous, because it seems to beg the question of phenomenal consciousness. I need a phrase to use when I mean to say that a robot has "access to" a representation, without any presupposition that the access involves phenomenal consciousness. The phrase I adopt is "cognizant of."2

There is another tricky issue that arises in connection with the concept of cognizance, and that is who exactly is cognizant. If I say that a person is not cognizant of the disparities between left and right eye, it is obvious what I mean. But in talking about a robot with stereo vision, I have to distinguish in a non-question-begging way between the ability of the robot's vision system to react to disparities and the ability of the robot to react to them. What do I mean by "robot" over and above its vision system, its motion-planning system, its chess-playing system, and its other modules? This is an issue that will occupy us for much of the rest of this book. For now, I'm going to use a slightly dubious trick, and assume that whatever one might mean by "the robot as a whole," it's that entity that we're talking to when we talk to the robot. This assumption makes sense only if we can talk to the robot.

I say this trick is dubious for several reasons. One is that in the previous chapter I admitted that we are far from possessing a computational theory of natural language. By using language as a standard part of my illustrations, I give the impression of a huge gap in the theory of robot consciousness. I risk giving this impression because it is accurate; there are several huge gaps in the theory, and we might as well face up to them as we go. Another risk in bringing language in is that I might be taken as saying that without language a system cannot be conscious, which immediately makes animals, infants, and maybe even stroke victims unconscious. In the long run we will have to make this dependence on language go away. However, I don't think the linkage between language and consciousness is unimportant. The fact that what we are conscious of and what we can talk about are so close to being identical is in urgent need of explanation. I will return to this topic later in the chapter.

Let's continue to explore the notion of cognizance. In movies such as Westworld and Terminator, the director usually feels the need, when showing a scene from a killer robot's point of view, to convey this unusual perspective by altering the picture somehow. In Westworld this was accomplished by showing a low-resolution digital image with big fat pixels; in Terminator there were glowing green characters running down the side of the screen showing various ancillary information. In figure 3.1 I have made up my own hypothetical landscape adorned with both sorts of enhancements; you may imagine a thrilling epic in which a maniacal robot is out to annihilate trees. What's absurd about these conventions is the idea that vision is a lot like looking at a display. The visual system delivers the information to some internal TV monitor, and the "mind's eye" then looks at it. If this device were used in showing human characters' points of view, the screen would show two upside-down images, each consisting of an array of irregular pixels, representing the images on the backs of their retinas. The reason we see an ordinary scene when the movie shows a person's point of view is that what people are normally cognizant of is what's there. The same would, presumably, be true for a killer robot.

Figure 3.1

How Hollywood imagines a robot's visual experience

What's interesting is the degree to which people can become cognizant of the pictorial properties of the visual field. Empiricist psychologists of the nineteenth century often assumed that the mind had to figure out the correct sizes and shapes of objects starting from the distorted, inverted images on retinas. A young child, seeing two women of the same height, one further away, would assume the further one was smaller (and upside down), because her image was. He would eventually learn the truth somehow, that is, learn to infer the correct size and orientation of an object, and then forget he was making this inference; indeed, he would find it hard to become aware of it. Nowadays we know that the perception of the sizes of objects at different distances is an innate ability (Baillargeon et al. 1985; Banks and Salapatek 1983). What's remarkable, in fact, is that with training people can actually become cognizant of the apparent sizes of images on the retina.3 This is a skill artists have to acquire in order to draw images that look like actual images instead of schematic diagrams. Not everyone can do it well, but apparently anyone can grasp the idea. Look at two objects that are about the same size, two people, for instance, who are at different distances from your eye. Mentally draw two horizontal lines across the scene, one touching the top of the nearby object, the other the bottom. The faraway object can fit comfortably between the two lines with space to spare. If it doesn't, move your head up or down until it does, or find a different faraway object.
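The geometry behind this exercise is simple enough to sketch. The function below uses the standard formula for the visual angle subtended by an object viewed head-on; the heights and distances are made-up values chosen only for illustration.

```python
import math


def angular_size_deg(height_m, distance_m):
    """Visual angle, in degrees, subtended by an object seen head-on."""
    return math.degrees(2 * math.atan(height_m / (2 * distance_m)))


# Two people of the same height (hypothetical values) at different distances.
near = angular_size_deg(1.7, 2.0)
far = angular_size_deg(1.7, 10.0)
print(f"near: {near:.1f} deg, far: {far:.1f} deg")
print("far figure fits between the lines:", far < near)
```

The two people are the same size in the world, but the farther one takes up a much smaller angle in the visual field. It is that angle, not the real height, that the artist learns to attend to.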

I could draw a picture of this to make it easier to visualize, but that would defeat the point I'm trying to make, which is that the average person, with training and practice, can view his or her visual field as if it were a picture. In doing this operation you are using the space of visual appearances in a way that is quite different from the way it is normally used. It is easy to imagine a race of beings with vision as good as ours who are incapable of carrying these operations out. They might simply be unable to see the visual field as an object at all. (Most animals presumably can't.) Or they might be able to draw imaginary lines across their visual field, but might be able to conceive of them only as lying in three-dimensional space. Asked to draw a horizontal line in the direction of a faraway person, touching the head of a nearby person, they might invariably imagine it as touching the head of the faraway person, as a horizontal line in space actually would. It is reasonable to suppose that there is some evolutionary advantage in having the kind of access that we have, and not the kind this hypothetical race would have.

Note that current vision systems, to the extent that they're cognizant of anything, are not cognizant of the two-dimensional qualities of the images they manipulate. They extract information from two-dimensional arrays of numbers (see chapter 2), but having passed it on, they discard the arrays. It would be possible in principle to preserve this feature of the introspective abilities of robots, even if they became very intelligent. That is, they could use visual information to extract information very reliably from the environment, but would never be able to think of the image itself as an object accessible to examination.

So far, so good; we have begun to talk about the way things appear to a perceptual system, but we still haven't explained phenomenal consciousness. That appears only in connection with certain kinds of introspection, to which we now turn.