
Visuo-Gestural Interaction with a video wall

Frédérick Gianni, Patrice Dalle

 

Abstract

This article introduces a design methodology for free-hand gesture-based interaction in an environment composed of a wide variety of information sources and a video wall. The article is organised in three parts. The first part defines the characteristics and singularities of our experiment. The second part describes the methodology used to create the gesture language. Finally we present the current image-processing results and conclude with the evaluation perspectives for this interface.

Keywords:

Gesture Interaction, Gesture Interpretation, Computer Vision

1. Introduction

Human-machine interaction is considered a major challenge by the computer-vision community, particularly with the aim of driving large screens with gestures. One of the stakes in this context is to elaborate solutions at several levels, from the language used down to the video processing to implement. We first present the conception of the gestural language, based on two Wizard of Oz experiments carried out with several participants; this is a means to collect unconstrained gestures realised in different ways. We then define the framework of use, which implies certain operating conditions. The computer-vision processing is next detailed to explain how the speaker is tracked and his gestures interpreted. We point out the use of several models that allow us to retrieve relevant information in order to interpret gestures with a monocular vision system. Finally, we present the solutions provided to interpret the gestures of the defined command language.

 


Ill.1 : the room environment

2. Framework of use and constraints

2.1 The environment

The studied environment is characterised by a collective display system, a video wall on which several sources of information, coming from local computers, can be displayed simultaneously. The vision system is based on a colour camera with three CCDs. The purpose of the proposed interaction is to organise the display surfaces in the room: the video wall and the spatially distributed computers (see illustration 1). This special room can be used for work presentations, design meetings, or as a crisis room, where information must be shown as fast as possible. These use cases motivate the use of gestures as commands, and they define the operating conditions and the constraints imposed on the vision system.

2.2 Constraints

One of the problems arising when designing a visuo-gestural interaction system is the precise definition of the contextual constraints to which it will be subjected. These constraints strongly influence the performance of the system and the definition of the gestural expressions. We try here to synthesise the main criteria that make it possible to define the framework of a visuo-gestural interaction system, and for each criterion we state our positioning. These criteria were established by extending the work of Moeslund (2001, 231-268) and Krahnstoever (2002, 203-207).

Real time - interactivity: the tracking and image-processing algorithms must allow an interaction whose reaction time does not exceed one second. This constrains the speed of execution and the complexity of the gestures that can be handled.

Occlusions: the hands may cross or pass behind the body. We decided to use interactions without hand occlusion, as far as possible.

Environment variations: the environment can often be reconfigured (furniture can be moved) and, because of light wells letting in natural light, the luminosity can vary with the time of day and the season.

Image resolution: the required image resolution depends on the position of the user in the room and on the size of the body parts performing the gesture. Here the user is free to move in the area; however, the different configurations of the hand are not considered in the current system, the resolution on the hand being too low.

Physical characteristics of the user: the colour of the user's clothes should differ from his complexion, and long sleeves are preferred. We pose no constraints, however, on the appearance of the user (skin colour, height, hairiness). The system can adapt to different contexts.

Sequence of gestures: in our case, interactions are embedded in the context; the user performs commands in an isolated and specific way, but amid other parallel and parasitic actions.

Segmentation: from the flow of gesture sequences, it is necessary to determine the markers of command gestures to allow a temporal segmentation. At the moment, we have considered several solutions which are currently being validated (gaze tracking, orientation, use of key gestures, immobilisation or temporal markers, multi-modal or vocal markers).

Initialisation and calibration: the system can initialise automatically and is not user-dependent.

Multiple access: single-user system or asynchronous multi-user.

General interaction features: dynamic gestures evolving in real time; the user can use his whole body (raise and bend the arms), as well as two-arm interactions.

3. Language conception

It is not easy to define a gesture vocabulary that is sufficiently intuitive to allow fast training, yet sufficiently discriminative to allow interpretation without confusion. As underlined by Moeslund (2001, 231-268), it is difficult to create adequate gesture metaphors. Most experiments impose a vocabulary on their users. On the contrary, we decided to adopt a process that involves the users in the specification of the command gestures. The language was defined in three stages. We first defined a list of five commands that the user will be able to perform: display, move, remove, resize, zoom. This is the definition of the statement of the command.

We then wondered how users would translate these statements into gestures. To this end we organised an experiment in the form of a Wizard of Oz. We invited eight volunteers to perform one scenario using all the commands, in various contexts, without constraints on the realisation of the gestures. The effects of the gestural commands were simulated by an operator present in the room. Each session was recorded by two cameras: a contextual one and another focused on the user. We could then evaluate the realisations and choose the gestures according to their intuitiveness (the most used), complexity (the simplest) and singularity (uniqueness). Our estimate of the complexity and discriminability of the observed gestures was established on partly empirical criteria. We thus obtained a generic corpus of gestures corresponding to the defined commands.

In the third and last step we wanted to evaluate the intuitiveness of our corpus and to note possible interpretations of formal descriptions of the gestures. We organised a new Wizard of Oz around two new presentation scenarios putting these gestures in extreme situations. Nine new users took part in this session and nine others observed and evaluated the use of the corpus. From this new session we refined the corpus, based on the study of the performed gestures and the parasitic gestures. Here is the list of the commands and their realisations:

- display: point at a computer, point at a part of the video wall, the hand returns to a rest position

- move: point at a window, move it within the area of the video wall, the hand returns to a rest position

- remove: point at a window, move it out of the video wall bounds, the hand returns to a rest position

- resize: point at a window, indicate the scale ratio with two hands (distance between the two hands), the hands return to a rest position

- zoom: point at a window, move the hand closer to the body to zoom in, further away to zoom out, the hand returns to a rest position

We defined three separate cases of command production. The first is the case where commands are emitted one at a time; we will call this an "isolated command". In this situation the command statement is composed of three parts:

- the pre-command, where the hand leaves its rest position and starts to move to indicate the object to which the command will be applied

- the command, which is the action to perform

- the post-command, where the hand moves back to its rest position.

The second case is where several commands are issued with the same subject (i.e. a window to which the commands will be applied); here the statement has the form: pre-command, command, another command, post-command. The last case is where multiple commands are performed on several subjects; the statement is then: pre-command, command, pre-command, command, ..., post-command. These two last cases will be called "chained commands".
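The three production cases above can be sketched as a small recogniser over a token stream. This is a minimal sketch; the token names (PRE, CMD, POST) and the parsing strategy are illustrative, not part of the paper's actual system.

```python
def parse_statement(tokens):
    """Accept statements of the form PRE (CMD)+ (PRE (CMD)+)* POST.

    Covers the three cases: isolated command (PRE CMD POST),
    chained commands on one subject (PRE CMD CMD ... POST),
    and chained commands on several subjects (PRE CMD PRE CMD ... POST).
    Returns the number of commands recognised, or None if invalid.
    """
    i, n = 0, len(tokens)
    commands = 0
    while i < n and tokens[i] == "PRE":
        i += 1
        if i >= n or tokens[i] != "CMD":
            return None  # a pre-command must be followed by a command
        while i < n and tokens[i] == "CMD":
            commands += 1
            i += 1
    # the statement must end with exactly one post-command
    if i == n - 1 and tokens[i] == "POST" and commands > 0:
        return commands
    return None
```

For example, `parse_statement(["PRE", "CMD", "POST"])` accepts an isolated command, while a stream lacking the pre-command is rejected.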

 

4. The vision system

 

Ill. 2: background model

Ill. 3: pixels detected as foreground (in red and green)

Ill. 4: after adaptive threshold

Ill. 5: skin pixels in the foreground

Ill. 6: detection of the user's head

Ill. 7: detection and identification of the right hand

Ill. 8: initialisation, positioning of the video wall in the user space

Ill. 9: interpretation of the display command: the hand has been detected in the area of a computer (in purple), and is now detected in the bottom-right part of the video wall (in blue)

Given the identified constraints and the complexity of the vocabulary, the image-processing operators can be implemented. We need several models as a priori knowledge: the background model of the scene, the skin-colour model of the user and an anthropometric model of the user. We now present the whole sequence of operators used for user detection, segmentation, tracking and gesture interpretation.

4.1 Initialisation

The purpose of our first operator is to separate the background from the foreground in an image sequence produced by a fixed camera. For this, we model each pixel of a background image sequence statistically, in the HSV (hue, saturation, value) colour space, in order to learn its luminosity and chromatic variations (illustration 2). After applying an adaptive threshold, we can retrieve the silhouette of the user (see illustrations 3 and 4). Using only the chrominance information, we can eliminate the silhouette's shadows. This background model can be recalculated at any time, allowing adaptation to variations of the environment. As specified above, this step frees us from the obligation of a uniform background (Wren 1997, 780-785). Other methods, such as modelling the background by mixtures of Gaussians (Stauffer 1999, 246-252), show certain limitations: the number of Gaussians is arbitrarily set and their initialisation remains a problem. Han (2004) proposes, still with the aim of finding a person, to use the mean-shift algorithm, but it is limited by its execution time and memory consumption, even in its optimised version.
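As an illustration of such a per-pixel statistical background model, here is a minimal sketch. It models a single value per pixel rather than the full HSV triple, and the deviation factor `K` and the `min_sigma` floor are assumed values, not taken from the paper.

```python
import statistics

K = 3.0  # assumption: a pixel deviating by more than K standard deviations is foreground

def learn_background(frames):
    """frames: list of same-sized 2-D lists of pixel values.
    Returns a per-pixel (mean, std) model learned from a background sequence."""
    h, w = len(frames[0]), len(frames[0][0])
    model = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            samples = [f[y][x] for f in frames]
            model[y][x] = (statistics.mean(samples), statistics.pstdev(samples))
    return model

def foreground_mask(model, frame, min_sigma=1.0):
    """Adaptive threshold: a pixel is foreground when it deviates from its
    learned background statistics by more than K standard deviations."""
    return [
        [abs(v - mu) > K * max(sigma, min_sigma)
         for v, (mu, sigma) in zip(row, mrow)]
        for row, mrow in zip(frame, model)
    ]
```

The threshold adapts per pixel: pixels whose background fluctuates more (e.g. near a light well) tolerate larger deviations before being classified as foreground.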

  
 

 

Next we locate the head of the user: within his silhouette we look for skin-coloured pixels. We use a probabilistic skin model to segment skin-tone pixels. It has been shown (McKenna 1997, 140-151) that human skin tones can be characterised by a multivariate normal distribution in the HSV colour space.

Given a human skin image, we estimate the distribution in parametric form as a Gaussian model to obtain the skin-colour model. Using this model we compute the likelihood that any pixel of the silhouette belongs to the skin model (see illustration 5). We then select the most probable pixels and group them into regions by connectivity. The biggest region is identified as the head of the user (see illustration 6). The regions of the left and right hands are identified in the same way, with one additional model: an anthropometric model used to validate the position of the hands relative to the head. This last model allows us to compute the expected size of the hands and to define the search areas around the head in the silhouette (see illustration 7).
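The skin-likelihood and connectivity steps can be sketched as follows. A diagonal Gaussian over two channels stands in for the paper's full multivariate normal in HSV, and the 0.1 likelihood threshold is an assumption.

```python
from collections import deque
import math

def skin_likelihood(px, mean, var):
    """Likelihood of a (hue, saturation) pixel under a diagonal Gaussian skin
    model (unnormalised; sufficient for thresholding)."""
    return math.exp(-sum((p - m) ** 2 / (2 * v) for p, m, v in zip(px, mean, var)))

def largest_skin_region(image, mean, var, threshold=0.1):
    """Threshold the skin likelihood, group pixels by 4-connectivity, and
    return the largest region (identified as the head in the paper)."""
    h, w = len(image), len(image[0])
    skin = [[skin_likelihood(image[y][x], mean, var) > threshold
             for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    best = []
    for y in range(h):
        for x in range(w):
            if skin[y][x] and not seen[y][x]:
                # flood-fill one connected region of skin pixels
                region, q = [], deque([(y, x)])
                seen[y][x] = True
                while q:
                    cy, cx = q.popleft()
                    region.append((cy, cx))
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and skin[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(region) > len(best):
                    best = region
    return best
```

The hands would be found by running the same grouping inside the anthropometric search areas around the detected head.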

   

 

 

4.2 Hands tracking

When a new image arrives at time t, we have already detected and identified the head and the left and right hands of the user. We use the position of the previous bounding box, at time t-1, to re-detect the hand and update its trajectory. More than one skin-coloured region may be found in this bounding box; we select the one minimising the variations of size, direction and speed. When the hand passes between the video camera and the head of the user, a merge situation arises, followed by a split situation: the pixel blob representing the hand merges with the blob representing the head, and then splits from it again.
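The candidate-selection rule (minimising the variations of size, direction and speed) could look like the following sketch; the cost weights and the candidate data layout are assumptions for illustration.

```python
import math

def track_cost(prev, cand, weights=(1.0, 1.0, 1.0)):
    """prev and cand are dicts with 'pos' (x, y), 'size', 'velocity' (vx, vy).
    Lower cost means smaller variation in size, direction and speed."""
    w_size, w_dir, w_speed = weights
    dx = cand["pos"][0] - prev["pos"][0]
    dy = cand["pos"][1] - prev["pos"][1]
    pvx, pvy = prev["velocity"]
    # relative size variation
    size_var = abs(cand["size"] - prev["size"]) / max(prev["size"], 1)
    # speed variation: displacement magnitude vs. previous speed
    speed_var = abs(math.hypot(dx, dy) - math.hypot(pvx, pvy))
    # direction variation: angle between previous velocity and displacement
    dir_var = 0.0
    if (dx or dy) and (pvx or pvy):
        cos_a = (dx * pvx + dy * pvy) / (math.hypot(dx, dy) * math.hypot(pvx, pvy))
        dir_var = math.acos(max(-1.0, min(1.0, cos_a)))
    return w_size * size_var + w_dir * dir_var + w_speed * speed_var

def select_candidate(prev, candidates):
    """Pick the skin-coloured region that best continues the hand track."""
    return min(candidates, key=lambda c: track_cost(prev, c))
```

A candidate that keeps the previous size, heading and speed gets cost zero, so it is always preferred over a blob that jumped or changed size.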

4.3 Gestures interpretation

Our method for interpreting gestures relies on the spatial nature of our command language. During an initialisation phase the user has to point out the corners of the objects of interest (see illustration 8). The objects of interest are the computers providing the information to be displayed and the video wall. A command is interpreted once the corresponding sequence of actions has been performed. For instance, the display command is recognised once a hand has been detected in the area of a computer and then in the area of the video wall, with a smooth trajectory from the computer to the video screen (see illustration 9). In order to differentiate the co-verbal gestures produced during a work presentation from the command gestures, the user has to use temporal markers: each time a designation is produced, he has to hold his gesture in the area of the object for a few seconds.
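The spatial interpretation of the display command, including the dwell-time marker, can be sketched as follows. The area rectangles and the dwell duration are illustrative values, not the paper's calibration.

```python
DWELL_FRAMES = 25  # assumption: ~1 s at 25 fps; the paper says "a few seconds"

def in_area(pos, area):
    """area is an axis-aligned rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = area
    return x0 <= pos[0] <= x1 and y0 <= pos[1] <= y1

def detect_display(track, computer_area, wall_area):
    """track: list of (x, y) hand positions over time.
    Returns True if the hand held its designation in the computer area
    (the temporal marker) and then entered the video wall area."""
    dwell = 0
    designated = False
    for pos in track:
        if in_area(pos, computer_area):
            dwell += 1
            if dwell >= DWELL_FRAMES:
                designated = True  # the computer has been designated
        else:
            dwell = 0
            if designated and in_area(pos, wall_area):
                return True  # display command completed
    return False
```

The other commands (move, remove, resize, zoom) would follow the same pattern with their own area sequences; a smoothness check on the trajectory, mentioned in the text, is omitted here for brevity.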

  

 

5. Conclusion

In this paper we have presented the design process of an ambitious visuo-gestural interaction system. We have proposed a set of criteria that define the constraints and singularities of this kind of system compared to existing research. We have also presented the elaboration steps of an intuitive gestural language, followed by the presentation of the image-processing operators. These operators were selected for their capacity to extract the data within the temporal constraints required for the interaction. An architecture allowing other image-processing operators to be added in a simple way is being designed. This will allow the image-processing system to evolve in order to possibly modify the command language and the interactive context. The next step in this study is to validate the system in the chained-commands situation. It will be interesting, in particular, to study how the realisation of the command gestures evolves during use.

Bibliography

Han, B., Comaniciu, D. & Davis, L. (2004). Sequential kernel density approximation through mode propagation: application to background modelling. Proceedings of the Asian Conference on Computer Vision 2004.

Krahnstoever, N., Schapira, E., Kettebekov, S. & Sharma, R. (2002). Multimodal Human Computer Interaction for Crisis Management Systems. Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, (pp. 203-207).

McKenna, S., Gong, S. & Raja, Y. (1997). Face recognition in dynamic scenes. Proceedings of the British Machine Vision Conference, (pp. 140-151).

Moeslund, T.B. & Granum, E. (2001). A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3), 231-268.

Stauffer, C. & Grimson, W.E.L. (1999). Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (pp. 246-252).

Wren, C., Azarbayejani, A., Darrell, T. & Pentland, A. (1997). Pfinder: Real-time Tracking of the Human Body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 780-785.