M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 487. Work originally completed April 30, 1998.

Recognizing Movement using Motion Histograms

James W. Davis

MIT Media Laboratory, 20 Ames Street, Cambridge, MA 02139
[email protected]

Abstract

In this paper, we present a real-time computer vision approach to recognizing human movements based on patterns of motion. The underlying representation is a Motion History Image (MHI) which is characterized by multiple histograms of the local motion orientations. The approach is adapted to accommodate movements of different durations by using a simple iterative technique. Quantitative results are given showing discrimination between different human movements using the approach. An extension addressing occlusion and distractor motion is also presented within this framework.

1 Introduction

The recognition of human motion and action using computer vision has widespread interest, ranging from surveillance applications to entertainment systems. Being able to recognize the presence of human motion is desirable because not every little change or movement in the environment is consequential. Monitoring applications, for instance, may wish to signal only when a person is seen in a particular area (perhaps inside a dangerous or secure area). Thus only those motions belonging to human activity are of importance. In the entertainment domain, interest in "people watching" systems is growing. Here the systems watch for gestures made by participants, which control or drive the program or interaction [7, 3]. Thus one requirement (or demand) of such machine vision systems is their ability to perform in real-time. It would not be of much use for a monitoring system to report that a person entered a dangerous area an hour after the fact. Also, systems incorporating human gestures for input must recognize and respond quickly to the user, without noticeable lag, to give a sense of immersion and actual control. The quickness of response is paramount. In this paper, we present a real-time computer vision approach to recognizing human movements. In earlier work [2], we described a representation of movement

known as a Motion History Image (MHI). The MHI is a compact representation of temporal movement and is simple to compute. In this paper, we present a new method for recognizing movement which relies on localized regions of motion derived from the MHI. By gathering and matching multiple, overlapping histograms of the motion orientations from the MHI, we offer a real-time solution to recognizing various human movements. The remainder of this paper first examines the related research on which this work is based (Section 2). Next we present the approach of using motion histograms (Section 3). This section is sub-divided into discussions of the underlying representation (Section 3.1), the calculation of the motion orientations (Section 3.2), and the histogram generation (Section 3.3). We then present a simple recognition method (Section 4) along with some quantitative results (Section 4.2). A method for handling variable-length movements is also described (Section 4.3). We then address the notions of occlusion and distractor motions within an extension of this framework (Section 5). Lastly, we conclude with a brief summary of the approach (Section 6).

2 Related work

In previous work [2], we presented a real-time approach for representing and recognizing simple human movements. The motivation for the approach was based on how easily people can recognize common human movements (like sitting) from low-resolution imagery where the image features are basically not perceivable, thus showing the importance of the motion information. The approach relies on "patterns of motion" rather than structural features as the representation for identifying various human movements. In that method, the space-time volume is collapsed into a single 2-D template form, where the representation still perceptually captures the essence of the movement and its temporal structure. This template is referred to as a Motion History Image (MHI). The MHI can be compared to many famous stroboscopic images [5, 1] and comic strip panels showing character motion, where time is collapsed and represented in a single static frame. For recognition, seven higher-order moments are used as global shape descriptors and energy localizers for the motion template. These descriptors are then statistically matched to stored examples of different movements. Though this method has shown promising results, the main problem is with the recognition approach, where the discrimination between motion templates is based upon global properties and is therefore susceptible to region-based errors such as the addition or removal of motion. Another limitation is that the recognition was token (label) based, and cannot yield much information other than recognition matches (i.e., it cannot report that a lot of "up" motion is happening in a particular area). We therefore wish to develop better methods of representation and recognition which can account for various motion regions around the body by retaining and analyzing a more localized form of the motion. In this paper, we develop such a method using multiple, overlapping histograms of the MHI motion orientations.

The work most closely related to our motion histogram approach is Freeman and Roth's work on recognizing hand gestures from orientation histograms [6]. In their approach, they use a single histogram of edge orientations of the user's hand to recognize various static gestures. Though they address dynamic gestures, the problem of finding the start and stop times of a gesture was not considered. Thus all input sequences to their system were fixed-length, and the resultant representation was basically a concatenation of individual orientation histograms from each image in the sequence. Their method is simple, fast, and robust against certain illumination changes. Our method in this paper recasts their orientation histogram approach to "motion orientation" histograms, where the directions of motion are accumulated in a histogram format and used for recognition. By using the motion template representation from [2] to generate the motion information and using a simple iterative matching technique, we can account for movements of various lengths while still retaining real-time performance.

3 Motion histograms

The method for generating the motion histograms is developed by extracting directional motion information from the movement sequence's MHI representation. We then cluster this motion information into overlapping histogram regions to represent the movement more locally.

3.1 Motion history images

We use the MHI representation described in [2] as the basis for the motion histograms. Currently, we generate the motion between frames by differencing successive binary silhouette images of the person. The reason for this is two-fold. First, we believe that strict optical flow methods are still too brittle for real imagery of people moving (due to noise, shadows, textures, and rate of movement) and generally computationally taxing (i.e., not real-time).¹ Image differencing continues to be a fairly robust method for cheaply locating the presence of motion. One of the main problems with image differencing, though (as opposed to optical flow), is that one cannot tell the magnitude or direction of the motion, only its presence. Thus it is hard to remove spurious unwanted motion purely from the differencing result. But as we will show, the accumulation of image differences can yield directional motion information. The second reason for differencing silhouettes is that much of the clothing texture frequently signals unwanted motion, which can cause problems when using motion for recognition. For this reason, we chose to extract the silhouette form of the person (thus masking the clothing texture). A side effect of using silhouettes is that no motion inside of the silhouette can be seen. For example, a camera facing the person would not show the hands moving "in front of" the body in the silhouette. One possibility to help overcome this problem is to use multiple cameras (the approach here easily extends to multiple views). Therefore, image differences (we used the union of differences at both normal and low resolutions) show only boundary motion of the silhouettes, but still yield quite useful motion information for many movements. To acquire the full-body silhouette of the person, we developed a robust and precise real-time silhouette extraction process based on spectral selectivity [4]. To generate the MHI for the movement, we weight and layer the successive silhouette image differences. In the MHI, each pixel value is a function of the temporal history of motion at that point from all the frames in the movement sequence. We currently use a simple replacement and decay operator based on time-stamping (the previous method was based on frames rather than time):

$$
\mathrm{MHI}(x, y) =
\begin{cases}
\tau & \text{if there is current motion at } (x, y)\\
0 & \text{else if } \mathrm{MHI}(x, y) < (\tau - \delta)
\end{cases}
$$

where τ is the current time-stamp, and δ is the maximum time duration constant (pixels matching neither case retain their previous time-stamp). The time-stamps allow for an easier port of the system between various faster and slower platforms (time is constant where frame rate is not). The above function is called to update the MHI each time a new image difference result is calculated. By linearly normalizing the MHI time-stamps to values between 0 and 255, we can see that more recently moving pixels are brighter than pixels belonging to older motion. The result of the above process is shown in Figure 1 for the movement of raising both arms.

¹The overall approach outlined in the paper, though, is general enough to also be used on optical flow data if desired.
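To make the update rule concrete, the following is a minimal sketch of the replacement-and-decay operator, assuming NumPy arrays of per-pixel time-stamps; the function names and the silhouette-differencing helper are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def silhouette_motion(prev_sil, cur_sil):
    # Boundary motion from differencing successive binary silhouettes:
    # True wherever the silhouette changed between frames.
    return prev_sil != cur_sil

def update_mhi(mhi, motion_mask, tau, delta):
    """Replacement-and-decay MHI update (time-stamp based).

    mhi         -- float array of per-pixel time-stamps (0 = no history)
    motion_mask -- boolean array, True where current motion was detected
    tau         -- current time-stamp in seconds
    delta       -- maximum time duration constant in seconds
    """
    mhi[motion_mask] = tau                           # stamp current motion
    mhi[(~motion_mask) & (mhi < tau - delta)] = 0.0  # decay old motion
    return mhi                                       # other pixels unchanged

def mhi_for_display(mhi, tau, delta):
    # Linearly normalize time-stamps to 0..255 so that more recent
    # motion appears brighter (as in Figure 1d).
    out = np.clip((mhi - (tau - delta)) / delta, 0.0, 1.0)
    out[mhi == 0] = 0.0
    return (out * 255).astype(np.uint8)
```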

Figure 1: Generation of the MHI for the raising-arms movement. (a) Sample silhouette of the person with their arms raised near the end of the movement. (b) A difference of silhouettes early in the sequence. (c) A difference of silhouettes later in the sequence. (d) Resulting MHI of layered silhouette differences (normalized for display).

3.2 Gradient of motion

The MHI layers the silhouette differences in such a way that motion from the silhouette boundary can be perceived in the gradient of the MHI. This is very similar to the concept of normal flow. Notice that as the arms are raised up in Figure 1(d), the intensity fading (from dark to light) gives the impression of motion in the direction of the arm movement. It can be said that the MHI "visually encodes" some motion information from the silhouette boundary. We see the direction of movement clearly, but the magnitude is not as accessible. Our goal is to use this directional motion information for recognition. The local gradient orientations of the MHI directly show the direction of the silhouette boundary movement. Therefore, we can convolve classic gradient masks with the MHI to extract the directional motion information. For this work, we union the convolution at two resolutions (the original and a lower resolution) to handle more widespread gradients (due to differing speeds of movement). Sobel gradient masks were used for the convolution:

$$
F_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad
F_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}
$$

With the gradient vectors calculated, it is a simple matter to get the gradient orientation for a pixel by:

$$
\theta = \arctan\frac{F_y}{F_x}.
$$

We must be careful when calculating the gradient information because it is only valid at particular locations within the MHI. The boundaries of the MHI should not be used because non-moving (zero-valued) pixels would be included in the gradient calculation, thus corrupting the result. Only MHI interior motion pixels should be examined. Additionally, we must not use gradients of MHI pixels which have too low or too high a contrast in their local neighborhood. A small contrast does not give a reliable measure of the gradient direction, and a large contrast signifies a large temporal disparity between pixels, which makes the directional information biased and unusable. The results of the motion orientation using gradient masks are shown in Figure 2.

Figure 2: Directions of motion from MHI gradients. (a) MHI for the raising-arms movement. (b) Result of convolving gradient masks with the MHI. The gradient directions show the approximate motion of the arms.
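A sketch of the gradient-orientation step under the same NumPy assumptions; np.arctan2 is used in place of the arctan above so that the full 360-degree direction is recovered, and the contrast thresholds are assumed tuning parameters rather than values from the paper.

```python
import numpy as np
from scipy.ndimage import convolve, maximum_filter, minimum_filter

# Sobel gradient masks, exactly as given above.
FX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
FY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def motion_orientations(mhi, min_contrast, max_contrast):
    """Per-pixel motion direction (radians) plus a validity mask."""
    fx = convolve(mhi, FX, mode="constant")
    fy = convolve(mhi, FY, mode="constant")
    theta = np.arctan2(fy, fx)   # full-circle gradient orientation

    # Valid only at interior motion pixels: the whole 3x3 neighborhood
    # must carry nonzero time-stamps (no MHI boundary pixels).
    lo = minimum_filter(mhi, size=3)
    hi = maximum_filter(mhi, size=3)
    interior = lo > 0

    # Reject neighborhoods whose temporal contrast is too small
    # (unreliable direction) or too large (large temporal disparity).
    contrast = hi - lo
    valid = interior & (contrast >= min_contrast) & (contrast <= max_contrast)
    return theta, valid
```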

3.3 Histogram hierarchy

Previously in [2], we performed recognition on the MHI using a set of global moment-based features. Though these moments are excellent shape descriptors and energy localizers, they are still global computations and do not describe the motion characteristics in different regions around the body (e.g., "a lot of upward motion in the left side of the body"). Here, we present a more characterizing and local approach for representing the motion information. A simple means of localizing the motion for recognition is to separately pay attention to different regions around the body. One way of doing this is to divide the MHI into various regions (or windows) and then characterize each region. A method of characterization is to use a histogram of the motion orientations for a region. Thus we can divide the MHI motion pattern into regions, each being represented by a histogram of local motion orientations. We define the center point of the window configuration based upon the centroid of the current silhouette of the person, and also define the boundaries (or extent) of the regions from a bounding box over the MHI motion pixels. Thus the windowing can adapt to the location of the person and the size of their recent movement. The window placement yields X-Y translation invariance, and the boundary helps in achieving scale invariance during recognition. One possible set of histogram windows is shown in Figure 3. This set of nine overlapping regions basically divides the body into windows covering the whole body, left side, right side, upper, lower, and four surrounding quadrants. So instead of having just one large window, we have a hierarchy of additional support windows to help characterize the motion.

Figure 3: Overlapping windows for generating motion histograms. The dark areas represent areas which are included in that window's histogram; white areas are ignored. The first window (top window) covers the entire motion region within the MHI. The windows below cover progressively smaller regions of the motion.

To generate the histograms for these window regions, we first quantize the gradient directions from the MHI into multiples of 30 degrees, resulting in histograms with 12 bins each (mainly for speed during recognition). Since the number of bins is relatively small, we chose not to smooth the histograms as done in [6]. To handle changes in scale between different-sized people (or due to changes in depth), we need to normalize these histograms with respect to some measure of the person or motion. One possibility is to normalize each histogram by the number of entries in that histogram (e.g., normalizing for a probability distribution). This approach is highly susceptible to problems arising when only small motions are present in a window. A better method is to normalize each window by the sum of all the motion orientation pixels found in the gradient map (also the number of entries in the overall window histogram #1). We note that the total number of motion pixels should be greater than some minimum threshold to be effective. The result of this normalization is that the histograms are no longer overly sensitive to small motions. Figure 4 shows the nine overlapping, normalized histograms for the left-arm-up movement. Notice the large response for window #3, which registers the left motion. The concentration of entries around bins 7-10 (180-90 degrees, respectively) shows that there is much upward and sideways motion on the left side of the body. The histogram for the opposite side of the body (window #2) hardly shows any motion. Having a collection of histograms also lets us employ, if desired, the motion orientations directly to get a finer sense of how motion occurred within the different regions around the person (rather than just the presence of motion, or a labeled movement). For example, the localized directions of motion may be useful for interaction or control mapping. We now present a simple recognition method which uses these motion histograms to recognize various body movements.
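A sketch of the quantization and normalization just described, assuming the window masks (built from the silhouette centroid and the MHI bounding box) are supplied as boolean arrays; the minimum-motion threshold is an assumed parameter.

```python
import numpy as np

N_BINS = 12  # gradient directions quantized into multiples of 30 degrees

def orientation_histogram(theta, valid, window_mask):
    """Unsmoothed 12-bin histogram of directions inside one window."""
    angles = np.degrees(theta[valid & window_mask]) % 360.0
    bins = (angles // 30.0).astype(int)
    return np.bincount(bins, minlength=N_BINS)[:N_BINS].astype(float)

def window_histograms(theta, valid, window_masks, min_motion=100):
    """All window histograms, each normalized by the total number of
    motion-orientation pixels (the entries of overall window #1)."""
    total = float(valid.sum())
    if total < min_motion:       # too little motion to be effective
        return None
    return [orientation_histogram(theta, valid, m) / total
            for m in window_masks]
```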

4 Recognition

The result of generating motion histograms for the body movement is a collection of nine 12-bucket histograms. There are many possible ways of using this data for recognition. The simplest approach, and the one taken here, is to concatenate the histograms into a single column vector (108 × 1) and use the Euclidean distance between an input and a stored model vector as a measure of closeness for recognition. An effect of the histogram normalization method (based on the total amount of motion) is that histograms with larger amounts of motion end up being more heavily weighted in the Euclidean distance than histograms with smaller amounts of motion. Intuitively, we believe that this is a desirable effect (and not problematic) because a tiny motion region should not carry as much weight as a larger, more encompassing (or expressive) area.
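In code, the closeness measure is just the norm of the difference of concatenated histograms; a minimal sketch under the same NumPy assumptions:

```python
import numpy as np

def feature_vector(histograms):
    # Nine 12-bin histograms -> one 108-dimensional column vector.
    return np.concatenate(histograms)

def closeness(input_hists, model_vec):
    # Euclidean distance between input and stored model vector.
    return float(np.linalg.norm(feature_vector(input_hists) - model_vec))
```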

Figure 4: Motion histograms for raising the left arm movement (nine panels, win-1 through win-9; each plot shows normalized count versus direction bin 1-12). Most of the motion is localized to the left (win-3) and top side (win-4, win-8) of the body. The clustering of directions around bins 9 and 10 (approx. 90 degrees) shows the upward motion, and the clustering around bins 7 and 8 (approx. 180 degrees) shows the left side motion.

4.1 Movement model

To generate a model for a particular movement, we gather multiple examples of a person (or people) performing the movement. The motion histograms for each move are generated and stored (in vector form) when each example is completed (by manual selection during training). For a simple model of this movement, a set of mean motion histograms is formed by averaging together the histograms (in vector form) of the training data. The training examples are then used to find the mean and variance of the Euclidean distance from the training vectors to the newly generated mean vector. By collecting a mean and variance measure of the Euclidean distance using multiple training examples, we can select a recognition threshold based on the variability of the data measures from training. We could have examined variances within each of the histograms and used these measurements as weighting factors in a new distance metric (e.g., weighted Euclidean distance), but for simplicity we chose to measure only the change in the global Euclidean distance. The vector mean, distance mean, and distance variance of the training motion histograms are stored as the model for that particular movement. This process is repeated for each of the different movements that we wish to model.
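A sketch of these model statistics, assuming each training example has already been reduced to a 108-dimensional vector; the MovementModel container is a hypothetical name, not from the paper.

```python
import numpy as np

class MovementModel:
    """Vector mean plus Euclidean-distance mean/variance from training."""
    def __init__(self, training_vectors):
        V = np.stack(training_vectors)             # (n_examples, 108)
        self.mean_vec = V.mean(axis=0)
        dists = np.linalg.norm(V - self.mean_vec, axis=1)
        self.dist_mean = float(dists.mean())
        self.dist_var = float(dists.var())
```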

4.2 Matching

As for matching new input, we simply calculate the Euclidean distance between the input motion histogram vector and the model mean vector. Using the model's distance mean and variance, we then calculate the Mahalanobis distance [8] for the new vector's Euclidean measure. This gives a measure of how many standard deviations the input's Euclidean distance is away from the model statistics. We can threshold this value, based on these statistics, to declare whether a match was found. This process can easily be repeated to seek a match against all the stored movement models without much computational expense. Table 1 shows the Mahalanobis distances for a set of new input examples (which were not used in training) matched against the stored models (silhouette key-frames for the different movements are shown in Figure 5). The training process used in generating Table 1 employed only ten training examples for each model, which is generally too small a training set to gather good variances, but sufficient to show the discrimination power of the approach. The table shows the correct classification, with the Mahalanobis distances being considerably smaller for the correct target as opposed to the other movements. We see that this recognition method clearly discriminates between this set of moves using only the location and direction of motion from the histograms.

Figure 5: Silhouette key-frames for the movements to recognize. (a) left-arm-fan-up. (b) right-arm-fan-up. (c) squat-with-two-arm-fan-up. (d) two-arm-fan-up. (e) crouch-down.

       lfan     rfan     angel    fan      crouch
T1     5.99     105.73   97.45    289.94   176.77
T2     179.27   2.93     59.89    230.33   135.29
T3     119.63   137.35   8.80     217.49   138.28
T4     190.83   98.18    42.11    10.24    396.34
T5     58.98    82.95    54.14    265.74   3.83

Table 1: Matching results. Entry [i, j] reports the Mahalanobis distance between test input i (Ti) and model j. The models in this example are T1 = left-arm-fan-up, T2 = right-arm-fan-up, T3 = squat-with-two-arm-fan-up, T4 = two-arm-fan-up, and T5 = crouch-down. The diagonal entries are the minimum distances for each input.
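The matching step reduces to a one-dimensional Mahalanobis test on the Euclidean distance. A sketch follows, with the acceptance threshold an assumed parameter (the paper thresholds based on the training statistics without fixing a value):

```python
import numpy as np

def mahalanobis(input_vec, model):
    # Standard deviations between the input's Euclidean distance and
    # the model's training-distance statistics [8].
    e = np.linalg.norm(input_vec - model.mean_vec)
    return float(abs(e - model.dist_mean) / np.sqrt(model.dist_var))

def best_match(input_vec, models, threshold=3.0):
    """models: list of (name, MovementModel) pairs.
    Returns the closest model name under the (assumed) threshold."""
    scored = [(mahalanobis(input_vec, m), name) for name, m in models]
    d, name = min(scored)
    return (name, d) if d <= threshold else (None, d)
```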

4.3 Variable length movements

Since a set of movements to recognize most likely contains gestures of different time lengths (durations), we need a recognition mechanism that can handle variable-length movements. The main problem to overcome is that we basically need to generate an MHI for each model movement to recognize, or more precisely, for each model movement that has a different time duration. If we know a priori the minimum and maximum durations of all the movements to recognize, then all recognizable movements have a duration within that time window. With the simple replacement and decay operation used in generating the MHI (which generates an MHI for a specific time duration), we have an inexpensive iterative means of achieving multiple simulated MHIs from only a single MHI. We begin by always generating an MHI with the time duration constant δ being the maximum duration found from all the movements to recognize. Thus the MHI is generated for the movement(s) which have the longest time duration. During the recognition process, we can iteratively lower the time duration constant (from the maximum to the minimum) for the MHI, which thresholds and removes older motions; we then look for a match with the new simulated MHI (of a smaller time duration) and its updated motion histograms (updated by removing those values deleted from the MHI); the process is repeated until the time duration reaches the minimum value for the recognizable movements. This method progressively removes older motion from the MHI (and histograms) in such a way that all possible MHIs (and histograms) that could have been generated within the movement duration window (maximum and minimum movement times) are in fact quickly created for examination of a match during the iteration phase. The newly created MHIs are not approximations to the true MHIs, but are the genuine re-creations. This approach can also be coded in such a way that information is only updated (removed) rather than being totally recalculated (e.g., it is too time-consuming to recalculate all the motion gradients during the iteration process). This makes for a faster algorithm.
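A sketch of the iterative duration sweep; for clarity it recomputes each simulated MHI by thresholding, whereas (as noted above) a faster implementation would incrementally remove entries from the MHI and histograms. The match_fn callback and step size are assumptions.

```python
def sweep_durations(mhi, tau, delta_min, delta_max, step, match_fn):
    """Match against progressively shorter simulated MHIs.

    The MHI is built with delta = delta_max; lowering delta simply
    thresholds away older time-stamps, so every MHI in the duration
    window [delta_min, delta_max] is an exact re-creation, not an
    approximation.
    """
    results = []
    delta = delta_max
    while delta >= delta_min:
        simulated = mhi.copy()
        simulated[simulated < tau - delta] = 0.0  # drop older motion
        results.append((delta, match_fn(simulated)))
        delta -= step   # e.g. 0.067 s (15 Hz) steps, per Section 4.4
    return results
```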

4.4 Computational specifics

The entire matching process results in real-time performance at speeds greater than 20 Hz on average on an SGI R10000 O2 platform with no special computer hardware. Algorithm specifics for the above speed include:

- Image resolution of 160 × 120.
- Fast silhouette extraction [4].
- Five model movements to match against. (Having more models will not significantly affect the speed.)
- A minimum/maximum time duration window of [1.0, 2.5] seconds for iterative matching (δ = 2.5).
- Iterative step size removing time-stamped motion in 0.067 second (15 Hz) intervals. (The minimum step size is bounded by the digitizing rate, typically 30 Hz.)

Figure 6: New motion window regions generated by combining the four primary windows. The dark areas represent areas which are included in that window's histogram; white areas are ignored.

5 Extensions for occlusions and distractor motions

With the above methodology for localizing and recognizing motion information, we also have an opportunity to address occlusions and distractor motions. Though the current matching technique is simple and does not directly model or explain any form of occlusion or large amounts of noise motion, the windowing process does offer a framework for handling such problems. In both the occlusion and distractor motion cases, we want to remove those troublesome regions from the matching function. We will mainly discuss occlusions here, but the discussion clearly extends to distractor motions as well. For simplicity, let's consider an example occlusion contained and bound by one of the four motion histogram quadrants (the four quadrants are shown in the bottom row of Figure 3). This occlusion propagates back through the hierarchy and results in four of the total nine motion histograms containing this occlusion region; therefore almost half of the motion histograms are corrupted by the occlusion. With the current matching process, this would certainly cause problems, given that each motion histogram contributes to the overall match (though smaller windows can carry less weight, due to the current normalization method). What is more desired is to be able to remove this occlusion region from its membership in the window hierarchy, and to match based on some function of the remaining regions. Thus the windowing hierarchy as previously shown may not necessarily be the most appropriate for resisting or discounting the occluded data. By considering a new bottom-up hierarchy built from the four quadrants, we can derive a different set of windows where the possible regions of occlusion can be explicitly modeled and discounted. Given four "primary" windows (the four quadrants), we can combinatorially generate fifteen combination windows,

$$
\binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 2^4 - 1 = 15,
$$

which are shown in Figure 6. If smaller primary windows were used, many more window combinations would be generated. This set has no constraints on continuity of the original primary windows. If we did impose continuity, say 4-way connectedness, then there would be only thirteen windows (fifteen minus the two diagonal-only pairs). Other constraints could also be imposed. For example, in addition to the primaries, requiring combinations of only even numbers of primary windows along with 4-way connectedness generates the original set of windows shown in Figure 3. Given this new bottom-up window set, we now have additional region information (e.g., 15 windows instead of only 9, including cross-connected regions) and explicit occlusion regions (or regions to discount) modeled into the window set.

Having a multiplicity of motion windows does not directly solve the occlusion problem, though. A question of how to discount the regions must be addressed. Consider a movement where one primary window is occluded. Then eight of the fifteen windows will be corrupted by propagation of that region through the hierarchy (every combination containing the occluded primary), but the remaining seven windows still encode a large variety of valid regions. We need to retain these valid regions and remove the occluded regions. One possibility is to collect model statistics for each window histogram from training data and use this information to compute plausibility for the new input on a window-by-window basis. A question of normalization follows. Recall that normalization of the histograms is needed with the current method to achieve scale invariance during matching. Normalization also allows a simple Euclidean measure to be used for matching. Previously, all histograms in the hierarchy were normalized by the largest overall motion window (encompassing the entire motion of the body). This will no longer be applicable if we perform verification on the windows in a window-by-window fashion, because the largest window will contain the occlusion if one exists, and thus flaw the overall normalization. Therefore, we need a new means of normalization for the motion histograms. We could easily revert to self-normalization by the amount of motion in each window, but as previously stated, this becomes problematic when the size of the window or the size of the motion becomes small. A method more closely related to the previous overall normalization approach is to use the plausible input window with the most motion to normalize all the remaining plausible sub-windows. To "verify" the plausibility, we might be able to use the individual window counts to "roughly" match for plausibility, then later normalize based on the largest plausible histogram. Other means of verification are possible, which may include looking at the maximum direction of motion, or the spread of the motion information.

This process will work only if we retain the original (un-normalized) model histograms and re-normalize them to match the test input normalization. The motion histograms for the input are individually verified against the target model. The largest, most encompassing input histogram which is verified is then used to re-normalize the remaining verified histograms (the model is also normalized using the corresponding histogram in its collection). Once the normalization process is complete, a finer recognition method using only the valid histograms can be used to determine a match (as in the Euclidean match described above). This is a simple computation at virtually no expense during recognition. The process can easily be re-applied to all target models for recognition against multiple movements.

When matching against several models, we must also consider the case where the input matches well on a small number of verified regions of move A and matches not quite as well on a larger set of verified regions of move B. Here there is a tradeoff between the precision of a window match and the number of verified windows. There are several issues involved in making this decision. We could simply state that the model with the most "verified" windows is the best match, because to verify means to be acceptable given some training measurements. But on the other hand, a solid match on few windows may be more descriptive of the actual activity, given that the verification method will most likely be loose in its acceptance criteria. It is not clear which method of recognition is more desirable, but perhaps the goal and context of the situation may determine which method is preferable.
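A sketch of enumerating the fifteen combination windows from the four primary quadrant masks, assuming boolean NumPy arrays as in the earlier sketches:

```python
import numpy as np
from itertools import combinations

def combination_windows(quadrants):
    """All nonempty unions of the four primary windows:
    C(4,1) + C(4,2) + C(4,3) + C(4,4) = 2**4 - 1 = 15 masks."""
    windows = []
    for k in range(1, len(quadrants) + 1):
        for combo in combinations(quadrants, k):
            mask = np.zeros_like(quadrants[0], dtype=bool)
            for q in combo:
                mask |= q
            windows.append(mask)
    return windows
```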

6 Summary

In this paper, we presented a real-time computer vision approach to recognizing human movements based on patterns of motion. The underlying representation is a Motion History Image (MHI) which is characterized by multiple, overlapping histograms of the motion orientations. These histograms separate and localize regions of motion for a better description of the movement. Quantitative results show that the method can easily discriminate between different human movements, and is extendible to variable-length motions. Occlusion and distractor motions are also addressable within this framework.

References

[1] Braun, M. Picturing Time: The Work of Etienne-Jules Marey (1830-1904). University of Chicago Press, 1992.

[2] Davis, J. and A. Bobick. The representation and recognition of human movement using temporal templates. In Proc. Comp. Vis. and Pattern Rec., pages 928-934, June 1997.

[3] Davis, J. and A. Bobick. Virtual PAT: a virtual personal aerobics trainer. MIT Media Lab Perceptual Computing Group Technical Report No. 436, MIT, 1997.

[4] Davis, J. and A. Bobick. SIDEshow: A silhouette-based interactive dual-screen environment. MIT Media Lab Perceptual Computing Group Technical Report No. 457, MIT, 1998.

[5] Edgerton, H. and J. Killian. Moments of Vision: The Stroboscopic Revolution in Photography. MIT Press, 1979.

[6] Freeman, W. and M. Roth. Orientation histograms for hand gesture recognition. In Int'l Workshop on Automatic Face- and Gesture-Recognition, 1995.

[7] Freeman, W., Tanaka, K., Ohta, J., and K. Kyuma. Computer vision for computer games. In Int'l Workshop on Automatic Face- and Gesture-Recognition, 1996.

[8] Therrien, C. Decision Estimation and Classification. John Wiley and Sons, Inc., 1989.
